Failbot—Improving Visibility on End-to-end Tests

In eBay’s Global Shipping team, we use end-to-end tests to detect problems on eBay’s platform introduced by new developments. When those tests are failing, it is hard to see what is going on. From an intra-team effort to improve our visibility, Failbot was born.

In systems composed of several distributed applications (services), as in eBay, testing a service or a subset of services in isolation is not enough. Running tests against a full replica of the real system with the latest changes, in a global end-to-end scope, spots integration problems before those changes are deployed to the real environment. When these end-to-end tests are successful, confidence is very high and stress very low during deployments.

At eBay’s Global Shipping, our end-to-end tests consist of UI testing (simulating user interaction with a browser) and API testing (executing HTTP calls and validating their responses). The batch of end-to-end tests runs continuously on our continuous integration server, Jenkins. All test executions (builds) run by Jenkins are sent to a dashboard (powered by Project Monitor) that is highly visible on a TV screen. When an end-to-end test fails, we see it right away.

Figure 1. Global End-to-End build history

This dashboard only tells us that something is wrong and shows the overall results of the last 10 builds. We realized this was not enough. There was little to no visibility into what was going on. Diagnosing problems was very time consuming: we had to work out which tests failed, when the problems started, whether more tests had started failing, and what the root cause was. The problem could be anywhere. Keeping that little square green (all tests passing) is no easy task, so we decided to do something to make it easier.

Failbot to the rescue

Cucumber, the testing tool we use for automated acceptance tests on our end-to-end tests, produces powerful JSON reports from the test executions. We use those JSON reports to present the results differently:


Figure 2. Scenario result chronology

Each little rectangle represents the execution result of one test. Red means the test failed; green means it passed. It is easy to see which tests failed, when they started failing, and whether a failure is a symptom of something that stopped working or just a temporary glitch in the testing environment. Failing tests are drawn in different shades of red to represent how far the test executed before it failed. Tests that consistently fail for the same reason produce a distinct visual pattern.
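To make the idea concrete, here is a minimal sketch of how a per-test result and a "how far it got" value could be derived from a Cucumber JSON report. It is not Failbot's actual code. It assumes the usual layout of the Cucumber JSON formatter (a list of features, each with "elements" that hold "steps" with a "result" status); the function name scenario_progress and the idea of mapping the fraction to a shade of red are illustrative assumptions.

```python
import json


def scenario_progress(report_json):
    """Return one (scenario name, status, progress) tuple per scenario.

    progress is the fraction of steps that passed before the first
    non-passing step, so 1.0 means the scenario ran to completion.
    """
    rows = []
    for feature in json.loads(report_json):
        for scenario in feature.get("elements", []):
            # Assumption: backgrounds are reported as separate elements
            # and are not tests in their own right, so they are skipped.
            if scenario.get("type") == "background":
                continue
            steps = scenario.get("steps", [])
            passed = 0
            for step in steps:
                if step.get("result", {}).get("status") != "passed":
                    break
                passed += 1
            status = "passed" if passed == len(steps) else "failed"
            progress = passed / len(steps) if steps else 1.0
            rows.append((scenario.get("name", ""), status, progress))
    return rows
```

A dashboard could then pick a lighter or darker red depending on the progress value of each failed scenario.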

Figure 3. Knowing more about a failed test

A test fails when the observed behavior does not match the expected one. Cucumber reports provide details on each mismatch. By clicking a particular failed test execution in Failbot, it is possible to see the step where it failed and an excerpt of the failed expectation. The full explanation is available in the JSON report itself, which can be downloaded from Failbot.
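As an illustration only, the failing step and a short excerpt could be pulled out of a scenario entry of the report along these lines. The sketch assumes that a failed step carries an "error_message" in its "result", as the Cucumber JSON formatter normally emits; failure_summary and max_chars are hypothetical names, and taking the first line of the message as the excerpt is an assumption about what a useful excerpt looks like.

```python
def failure_summary(scenario, max_chars=120):
    """Return (step text, error excerpt) for the first failed step, or None."""
    for step in scenario.get("steps", []):
        result = step.get("result", {})
        if result.get("status") == "failed":
            message = result.get("error_message", "").strip()
            # Keep only the first line; the rest is usually a stack trace.
            first_line = message.splitlines()[0] if message else ""
            step_text = (step.get("keyword", "") + step.get("name", "")).strip()
            return step_text, first_line[:max_chars]
    return None
```

The full error message stays in the report, so nothing is lost by showing only the excerpt.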

Figure 4. What we can get from one click

The technology stack used for the end-to-end tests, more specifically Capybara and Selenium, enables us to take screenshots of the browser window while the UI tests are executing. We take screenshots when UI tests fail and include them in the Cucumber JSON report. Failbot picks up those screenshots and makes them available by clicking on the corresponding failed test.
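A rough sketch of the reading side, under stated assumptions: Cucumber's JSON formatter typically stores attachments as base64 "embeddings" with a "mime_type", either on a step or on an "after" hook entry. Where exactly our reports place the screenshots is not described here, so the sketch checks both places. The function name screenshots is illustrative.

```python
import base64


def screenshots(scenario):
    """Collect decoded PNG screenshots embedded in a scenario's steps and hooks."""
    images = []
    # Assumption: screenshots may hang off a step or off an "after" hook.
    holders = scenario.get("steps", []) + scenario.get("after", [])
    for holder in holders:
        for embedding in holder.get("embeddings", []):
            if embedding.get("mime_type") == "image/png":
                images.append(base64.b64decode(embedding["data"]))
    return images
```

Each returned item is the raw image content, ready to be written to a file or served next to the failed test.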

Speeding up investigation

We now know which tests failed, when they started failing, and where and why they failed. That is a considerable amount of information with which to start an investigation, but we still have to determine the exact cause. With end-to-end tests, as mentioned before, the problem can be anywhere.

To know exactly what is going wrong, we need to go through the logs produced by the services while the test was executing. Finding a particular log is time consuming. Sometimes it is easier to try to reproduce the problem manually instead. So we found a way to get the IDs of those logs, embed them in the Cucumber JSON report, and provide a direct link to them in Failbot. Diagnosis thus became a faster process.

Figure 5. Direct link to execution logs (red means HTTP response error)

Definition-Of-Done (DoD)

We have another tool developed in-house, ShipShip, which keeps track of all code changes and when they are deployed to the testing environment. Failbot picks up that information and matches the deployment of changes with end-to-end test executions. This overlap helps to relate a code change to the moment a certain test started failing. It is not extremely accurate, but it gives us a rough idea.

Figure 6. Deployment matched with tests execution timeline

On the other hand, the inverse is quite accurate: if all tests passed after a change was deployed, it means the change caused no breakage (no regression in functionality). This particular feature became an important part of our Definition of Done (DoD): before deploying a set of changes into the real environment, all changes are required to pass the end-to-end tests. This is done quite easily by looking at Failbot.
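The check behind this rule can be sketched roughly as follows. This is not how Failbot or ShipShip actually store their data: the tuple shapes, the change IDs, and the name verified_changes are assumptions made for illustration. A change counts as verified when at least one fully green build started after the change was deployed to the testing environment.

```python
from datetime import datetime


def verified_changes(deployments, builds):
    """Map each change ID to True if a fully green build started after its deployment.

    deployments: list of (change_id, deployed_at) tuples
    builds:      list of (started_at, all_passed) tuples
    """
    verified = {}
    for change_id, deployed_at in deployments:
        verified[change_id] = any(
            started_at >= deployed_at and all_passed
            for started_at, all_passed in builds
        )
    return verified


# Hypothetical data: two deployments and three end-to-end builds.
deployments = [
    ("change-1", datetime(2016, 1, 1, 10, 0)),
    ("change-2", datetime(2016, 1, 1, 12, 0)),
]
builds = [
    (datetime(2016, 1, 1, 9, 0), True),
    (datetime(2016, 1, 1, 11, 0), True),
    (datetime(2016, 1, 1, 13, 0), False),
]
print(verified_changes(deployments, builds))
```

In this example only the first change has a green build after its deployment, so only that change would meet the Definition of Done.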

Our developer lives now

End-to-end tests are fundamental to software development: every problem the end-to-end tests find is a problem that won’t reach our final users and won’t cause downtime, revenue losses, or any negative impact on the company’s image. On the other hand, they bring many challenges to us engineers, including being aware of when tests are failing and understanding why they are failing.

We improved alerting, visibility, and diagnosis by grabbing information that was already there and presenting it differently in the form of Failbot. Failbot features were implemented incrementally and by different people. Failbot is a team effort, not the work of a single person. The tool grew to solve problems we were facing and changed our lives for the better.

Figure 7. Failbot “TV” at the office

What feature will be implemented next? We do not know. We will let our needs tell us.