GUI Testing Powered by Deep Learning

Deep Learning (DL) is transforming fields such as computer vision, natural language processing, and machine translation, and it is making its way into many science-driven products and technology companies, including eBay. DL is now taking its first strides in eBay’s Quality Engineering (QE) space, and it has already shown that it can outperform veteran testers and industry-grade test applications.


Current methods of Graphical User Interface (GUI) testing range from Functional Testing (focusing on a system’s external behavior or its elements) to Structural Testing (focusing on internal implementation). These methods are sensitive to change and usually involve extensive automation efforts. Cross-screen testing, as in testing desktop Web together with mobile Web or a mobile App, accentuates these risks and costs. Testing across multiple operating systems, devices, screen resolutions, and browser versions quickly becomes a huge challenge that is difficult to execute and govern. Quality risk-control measures, such as coverage-based or usage-based testing, address some of these uncertainties, but only to a certain degree, as they come at a cost to overall quality.

Testing methodologies for web interfaces are mostly browser-dependent, while those for mobile app interfaces are platform-dependent. In both cases, the GUI and its detailed implementation are validated with test applications that can interact with the GUI under test. Examples include Selenium WebDriver for testing HTML-based web pages and the Espresso test framework for testing View-based Android applications. While product developers wrap up the GUI implementation, quality engineers begin breaking down the screen into its elements, identifying locators for each UI component, and writing large amounts of code to assert the elements’ attributes, such as dimension, position, and color, to make sure the GUI implementation matches the design. Even a slight design change or refactoring of product code could end up failing the regression suites and may involve significant rework for QE to fix the automation code.

Some testing tools, like the ones mentioned above, call for a developer skill set and intimate know-how of the hosting platforms. Such prerequisites introduce a technical-proficiency dependency and compel QE to master multiple test applications, frameworks, and operating systems, including tools such as TestNG, Selenium, Appium, iOS Driver, and Selendroid. As a result, writing and maintaining test suites and scripts for multiple platforms take considerable time and effort and come at the risk of reducing the test scope.

Recent developments in DL can potentially unlock efficiencies in GUI testing and across the software lifecycle. A recent pilot, described below, showed this approach to be realistic and practical.

Deep Learning Technology

DL simulates the human way of finding errors or anomalies. Humans are driven by past experience and conditioning to make decisions. With the proper training or conditioning, machines can detect errors with a precision that surpasses that of humans.

We begin by treating DL as part of a broader class of algorithms called supervised machine learning. A supervised learning algorithm takes a set of labeled examples, called the training data, and uses them to learn a desired function. We then validate what the algorithm has learned against a separate set of examples, called the test data. This process of learning from training data and validating against test data is called modeling.

Fig. 1. General ML system
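The modeling loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the system used in the pilot: the toy data, the one-dimensional threshold "model," and the function names `split_dataset`, `train_threshold`, and `accuracy` are all assumptions made for the example.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_threshold(train):
    """'Learn' a decision threshold: the midpoint between the two class means."""
    passing = [x for x, label in train if label == 1]
    failing = [x for x, label in train if label == 0]
    return (sum(passing) / len(passing) + sum(failing) / len(failing)) / 2


def accuracy(threshold, data):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum((x > threshold) == bool(label) for x, label in data)
    return correct / len(data)


# Hypothetical toy data: one feature value per example, label 1 = pass, 0 = fail.
examples = [(i / 100, int(i >= 50)) for i in range(100)]

train, test = split_dataset(examples)
threshold = train_threshold(train)  # learning from the training data
print(f"threshold={threshold:.2f}, test accuracy={accuracy(threshold, test):.2f}")
```

The important part is the separation: the threshold is learned from `train` only, and `test` is used only to check how well that learned function generalizes.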

Neural Nets (NN)

An NN is a group of logically connected entities that transfer a signal from one end to another. These entities, called perceptrons, are connected much like the brain cells, or neurons, that give the human brain its cognitive powers. They allow signals to move across the network to perform formidable computations, for example, differentiating a lily from a lotus or understanding different traffic signals. These operations become possible when we expose the NN to a significant amount of data. A deep neural net (DNN) stacks multiple layers of perceptrons, arranged as shown in Fig. 2. This mathematical lattice is the core structure that drives autonomous vehicles and other inventive applications, such as eBay’s image search.

Fig. 2. NN example, 2 layers deep
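To make the idea of a signal moving through layers concrete, here is a minimal sketch of a forward pass through a two-layer network. The weights are hand-picked for illustration rather than learned, and the sigmoid activation and the names `layer` and `forward` are assumptions of this example, not details taken from the pilot.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: each unit computes sigmoid(w . x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def forward(inputs, network):
    """Pass the signal through every layer in order."""
    for weights, biases in network:
        inputs = layer(inputs, weights, biases)
    return inputs


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit.
network = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[2.0, 2.0]], [-2.0]),
]

output = forward([0.5, 0.5], network)
print(output)  # [0.5]
```

In a real DNN the weights are learned from training data and the layers are far wider, but the signal still flows through the network in exactly this layer-by-layer fashion.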

Deep Learning for GUI testing

DL can be utilized to contribute to the efficiency of GUI testing and reduce the churn associated with this work.   


Fig. 3. Process outline

The suggested methodology begins with capturing the entire webpage as an image (see Fig. 3). This image is then divided into multiple UX components. Dividing the page into UX components, or groups of components, helps in generating training and test data to feed our model. Once the model is ready, we can test any new UX component across browsers, resolutions, and additional test dimensions by feeding the model an image of the UX component rendered for the desired specification. The model then classifies whether the UX component under test passes the desired quality criteria. This process of assigning each image to one of the classes (passing or failing the quality criteria) is called classification.
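The capture-and-divide step can be sketched as follows. To keep the example self-contained, the screenshot is modeled as a plain 2D list of pixel values; in practice an imaging library would do the cropping. The region names and coordinates are hypothetical.

```python
def crop(image, box):
    """Return the sub-image inside box = (left, top, right, bottom).

    The right and bottom edges are exclusive, as in Python slicing.
    """
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def split_into_components(image, regions):
    """Cut a full-page capture into one image per named UX component."""
    return {name: crop(image, box) for name, box in regions.items()}


# A tiny 6x8 stand-in for a page screenshot; each value encodes row and column.
page = [[r * 10 + c for c in range(8)] for r in range(6)]

# Hypothetical component regions, e.g. taken from the design wireframe.
regions = {
    "header": (0, 0, 8, 2),
    "popular_destination": (2, 2, 6, 5),
}

components = split_into_components(page, regions)
print(components["popular_destination"][0])  # [22, 23, 24, 25]
```

Each cropped component then becomes an independent input to the classifier, which is what lets a single page yield many test cases.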

Training data and test data creation

We create the training and test data by automated modification of UX components taken from the webpage wireframes. Based on the design guidelines and the test variations, we introduce potential flaws in direct correlation to the design input. These flawed design mockups are rendered as images. Proper labeling of these images ensures proper organization of the test data. Once we have a minimal set of labeled images, we are ready to train our model.
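A rough sketch of this kind of automated flaw injection is shown below. It assumes, purely for illustration, that a mockup is a 2D list of pixel values, that label 1 means "meets the design" and label 0 means "flawed," and that two flaw types (a blanked-out image and a shifted layout) are enough to convey the idea. The function names are hypothetical.

```python
import random

BACKGROUND = 0  # assumed pixel value of the page background


def blank_region(image, box):
    """Simulate a missing image by filling a box with background pixels."""
    left, top, right, bottom = box
    flawed = [row[:] for row in image]  # copy so the mockup stays untouched
    for r in range(top, bottom):
        for c in range(left, right):
            flawed[r][c] = BACKGROUND
    return flawed


def shift_right(image, pixels):
    """Simulate a layout issue by shifting every row to the right."""
    return [[BACKGROUND] * pixels + row[: len(row) - pixels] for row in image]


def make_dataset(mockup, image_box, n_flawed, seed=0):
    """Return (image, label) pairs: the clean mockup plus flawed variants."""
    rng = random.Random(seed)
    samples = [(mockup, 1)]
    for _ in range(n_flawed):
        if rng.random() < 0.5:
            flawed = blank_region(mockup, image_box)
        else:
            flawed = shift_right(mockup, rng.randint(1, 2))
        samples.append((flawed, 0))
    return samples


mockup = [[1] * 6 for _ in range(4)]  # stand-in for a rendered design mockup
dataset = make_dataset(mockup, image_box=(1, 1, 4, 3), n_flawed=5)
print(len(dataset), [label for _, label in dataset])
```

Because every variant is generated from the mockup by a known transformation, the label comes for free, which is what makes it feasible to produce thousands of labeled images.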


Based on the training data and the complexity of our scenarios, different models, such as Convolutional Neural Nets (CNNs), Support Vector Machines (SVMs), or Random Forests (RFs), can be chosen. Once the model type is chosen, we can train it to capture GUI defects.

Proof of concept at eBay

Following the procedure described above, we implemented our own process for one of the new home page modules, called “Popular Destination.” Using the mockups created by the Design team, we generated 10,000 images that included different defects. We introduced intentional design flaws by modifying images, text, and layout to simulate real-world scenarios and issues.

The following are some examples of the defects we emulated.

1.  Missing images

Fig. 4. Baseball Cards

2.  Layout issues

Fig. 5. Smart Home

The system provides a classification score between 0 and 1. A score closer to 0 signifies a model prediction of a potential test-case failure, which may imply that our model detected a GUI imperfection. A score closer to 1 signifies a prediction that the test case meets its quality criteria. To turn the score into a decision, we establish a cutoff threshold: a score below the cutoff signifies that the module has a potential GUI defect. This cutoff varies for different modules.
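The per-module cutoff can be expressed as a small lookup. The module names, the cutoff values, and the default of 0.5 below are hypothetical placeholders; the actual thresholds used in the pilot are not stated here.

```python
DEFAULT_CUTOFF = 0.5  # assumed fallback when a module has no tuned cutoff

# Hypothetical per-module cutoffs; real values would be tuned per module.
MODULE_CUTOFFS = {
    "popular_destination": 0.6,
    "daily_deals": 0.45,
}


def verdict(module, score):
    """Map a classification score in [0, 1] to a pass/defect decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    cutoff = MODULE_CUTOFFS.get(module, DEFAULT_CUTOFF)
    return "pass" if score >= cutoff else "potential defect"


print(verdict("popular_destination", 0.55))  # potential defect
print(verdict("daily_deals", 0.55))          # pass
```

The same score of 0.55 leads to different outcomes for the two modules, which is the practical effect of letting the cutoff vary by module.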

Using our model, we were able to capture defects with 97% accuracy. During testing, we also found real production bugs.

For example, we captured the UX component with an Internet Explorer 11 browser and found the production issue below, where thin lines appear across circular images in the Popular Destination module in this specific browser version. Automated testing would never have captured it, and manual testers would probably need Steve Austin’s bionic eye and a whole lot of time and patience to even notice this artifact in the vast continuum of their test matrix.

Fig. 6: Production bug for Popular Destination in IE 11

Key learnings and benefits

Our lessons and insights came from testing our GUI in eBay’s top two domains: Homepage (HP) and Advertising (Ads). Both teams wanted a test tool and methodology that would enable them to conduct Ads testing using new approaches and tools that differ from their traditional validation and verification applications.

  1. Traditional approaches and tools come at a high cost to the individual engineer. Ramping up on some test applications can take more than a week, and proficiency takes much longer. ML calls for a different developer skill set, which reduces the need to master a great deal of traditional validation and verification techniques and tools, such as Selenium WebDriver or iOS and Android drivers.

  2. The new approach eliminated the need for deep and intimate domain knowledge. A new eBay intern was able to ramp up in a day or two and start generating test data for training an ML model. Previously, some QE teams would require a few weeks of daily work to become familiar with the domain’s specifics and the intricate details of our webpages.

  3. The QE teams witnessed a quick set-up time for ML-based testing: a single engineer was able to prepare test automation to run against their main UX components in a day or two. Usually, the teams invest multiple weeks to achieve such test coverage.

  4. Our experiment kicked off the quality assurance process early on, using design mockups. Training our model with these wireframes allowed us to begin the QE work potentially before the substantial development phase had even started. Implementation details thus become largely irrelevant to this QE process.

  5. Some findings stood out because the defects detected by the model would have been practically impossible to capture by any other means of manual or automated testing.

  6. The model produced a classification score per asserted output. Such results allowed the QE to focus their attention on GUI artifacts with the highest probability of having a fault.

  7. Maintaining test suites and scripts for several platforms takes a considerable amount of time and effort. It comes at a high risk of reducing test scope when time is of the essence. Even a slight design change or refactoring of product code could end up failing the regression suites and may involve significant rework for QE to fix the automation code. Our ML process became agnostic to implementation details and less sensitive to the platforms it runs on.

  8. Teams were excited about using innovative techniques and unleashing their potential. It inspired engineers to hone their skills and learn new tools and approaches.

Ongoing and future work

In addition to the benefits listed above, we are looking to adopt other useful ML-based approaches, enhancements, and optimized processes. Some of the ideas described below are works in progress, and some are exciting future concepts we are toying with.

Attribute-based assertion

ML can be applied to detect abstract components, such as images, shapes, and text, and to extract those components and their respective positions. When focusing on a narrower Region of Interest (ROI), we can build an attribute-based assertion to compare with the wireframes. For example, a layout-related assertion can check an image position extracted by our ML model against predefined characteristics. Such an approach would provide a more granular validation mechanism. This work is beneficial for scrutinizing specific elements or areas of elements against certain predetermined, anchored attributes.
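As a sketch of what such an attribute-based assertion might look like, the example below compares a detected bounding box with the one specified in the wireframe. The `Box` type, the `layout_mismatches` helper, the pixel tolerance, and all coordinates are assumptions made for illustration; the detection step itself is taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Position and size of a component, in pixels."""
    x: int
    y: int
    width: int
    height: int


def layout_mismatches(detected, expected, tolerance_px=2):
    """List every attribute that deviates from the wireframe by more than the tolerance."""
    mismatches = []
    for attr in ("x", "y", "width", "height"):
        got, want = getattr(detected, attr), getattr(expected, attr)
        if abs(got - want) > tolerance_px:
            mismatches.append(f"{attr}: expected {want}, got {got}")
    return mismatches


# Hypothetical values: the wireframe spec versus what the model extracted.
expected = Box(x=16, y=120, width=200, height=200)
detected = Box(x=17, y=120, width=200, height=212)

print(layout_mismatches(detected, expected))  # ['height: expected 200, got 212']
```

Unlike the single pass/fail score, this kind of check says which attribute is off and by how much, which is the granularity the paragraph above is after.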

Feedback-adapting models

As a next step to raise the accuracy of the system, one may train the model by integrating it with new or existing processes.

Bug reports can be reused to teach the model what is a real bug and what is not. Say an issue was closed with a “fixed” status: this may trigger a learning opportunity for our model, and the corresponding data can be added to its training set. In the case of an issue being closed as “not a bug,” the system could automatically learn what a misleading issue looks like.

When a classification score falls within an inconclusive range (say, 0.4 to 0.7, where it is not certain whether the quality criteria were met), human judgment should be applied and fed back into the training process.
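Both feedback paths can be sketched together. The 0.4 to 0.7 range comes from the example above; everything else, including the status strings, the label convention (0 = defect, 1 = meets the quality criteria), and the function names, is a hypothetical illustration rather than an existing integration.

```python
def triage(score, low=0.4, high=0.7):
    """Route a classification score: confident fail, confident pass, or human review."""
    if score < low:
        return "fail"
    if score > high:
        return "pass"
    return "needs_human_review"


def label_from_bug_status(status):
    """Turn a closed bug report's resolution into a training label.

    Assumed convention: 0 = real defect, 1 = meets the quality criteria.
    Returns None for resolutions that carry no clear signal.
    """
    return {"fixed": 0, "not a bug": 1}.get(status.strip().lower())


def collect_feedback(closed_bugs):
    """Build new (image_id, label) training pairs from closed bug reports."""
    pairs = []
    for image_id, status in closed_bugs:
        label = label_from_bug_status(status)
        if label is not None:
            pairs.append((image_id, label))
    return pairs


print(triage(0.55))  # needs_human_review
print(collect_feedback([("img-1", "Fixed"), ("img-2", "Not a bug"), ("img-3", "Duplicate")]))
```

Scores routed to human review would receive their labels from the reviewer's decision and join the same pool of new training pairs.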

Packaging ML modalities and creating an open-source service

This testing methodology is browser, resolution, and even language agnostic if we use certain OCR libraries. Therefore, we need not train our models against different browsers, resolutions, or localization constraints.

Further enhancements could be made by packaging Accessibility and Application Security libraries. Captions and tabbing can be checked against Accessibility requirements (WCAG 2.0, for example) as part of the ML modalities.

Enhancing our current system into a full-scale service solution would let the service take our training images into consideration along with other parameters, such as the module name and module parameters. In turn, this would allow us to create a full-fledged real-time solution for entire webpages, where each module would be a test case and a collection of modules would form a test suite. Test suites would be available for each webpage.
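One way to picture the module-as-test-case and page-as-test-suite structure is the small data model below. It is only a sketch of the idea: the class names, fields, and example scores are hypothetical, and no such service interface is defined in this article.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleTestCase:
    """One UX module on a page, with the model's score and its cutoff."""
    module_name: str
    score: float
    cutoff: float = 0.5  # assumed default; cutoffs vary per module

    @property
    def passed(self):
        return self.score >= self.cutoff


@dataclass
class PageTestSuite:
    """All module test cases that belong to one webpage."""
    page: str
    cases: list = field(default_factory=list)

    def failures(self):
        return [case.module_name for case in self.cases if not case.passed]


# Hypothetical scores for three home page modules.
suite = PageTestSuite(
    page="homepage",
    cases=[
        ModuleTestCase("popular_destination", score=0.92, cutoff=0.6),
        ModuleTestCase("daily_deals", score=0.31),
        ModuleTestCase("trending", score=0.77),
    ],
)

print(suite.failures())  # ['daily_deals']
```

A service built along these lines would populate the scores by running each module image through the trained model and report the failures per page.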

The work above could be open-sourced for the community’s benefit, with ideas and libraries shared to enhance and extend the ML modalities. Such a real-time service could become widely available for common use.

Enhancements to Software Design Verification

DL-powered GUI testing could be expanded into additional testing fields in software design verification techniques, where the machine develops an understanding about the relationships between businesses and people. Software design subjects, such as visual functionality, usability, accessibility, and others, could be captured by the same ML paradigm when enhanced with additional modalities.

Some software functionality issues may manifest graphically, as in cases of slow-loading ads or poor implementation of user-control widgets. Usability has become an increasingly important factor now that compliance with Apple’s Human Interface Guidelines or the Android User Interface Guidelines may determine whether an application is accepted or rejected by their stores. While User Experience Research (UER) is limited by time, resources, and the quality of human feedback, DL could compute usability scores and identify the highest-scoring designs, cutting down time and costs.

Multimedia Accessibility Testing (especially for complying with Web Content Accessibility Guidelines 2.0) might be done with recurrent neural networks (RNNs), which can process natural language and extract the relationship between UX components and the descriptions provided by their developers.


Manual and conventional automated GUI testing are gradually becoming less effective when contrasted with an ML-based solution. By following the recipe explained above, both stakeholders and engineering teams who deal with UX components would gain from one of the latest technology stacks available. By applying this DL-fueled, human-like expertise on top of prevalent alternatives, we can finally scale the existing labor-intensive and skill-expensive methodologies.