In systems composed of several distributed applications (services), as at eBay, testing a service or a subset of services in isolation is not enough. Running tests against a full replica of the real system with the latest changes, in a global end-to-end scope, spots integration problems before those changes are deployed to the real environment. When these end-to-end tests pass, confidence is very high and stress very low during deployments.
At eBay’s Global Shipping, our end-to-end tests consist of UI testing (simulating user interaction through a browser) and API testing (executing HTTP calls and validating their responses). The batch of end-to-end tests runs continuously from our continuous integration server, Jenkins. All test executions (builds) run by Jenkins are sent to a dashboard (powered by Project Monitor) displayed on a highly visible TV screen. When an end-to-end test fails, we see it right away.
This dashboard only tells us something is wrong and provides the overall results of the last 10 builds. We realized it was not enough: there was little to no visibility into what was going on. Diagnosing problems was very time consuming: figuring out which tests failed, when the problems started, whether more tests were starting to fail, and what the root cause was. The problem could be anywhere. Keeping that little square green (all tests passing) is no easy task, so we decided to do something to make it easier.
Failbot to the rescue
Cucumber, the tool we use to automate the acceptance tests in our end-to-end suite, produces powerful JSON reports from the test executions. We use those JSON reports to present the results differently:
Each little rectangle represents the execution result of one test. Red means the test failed; green, that it passed. It is easy to see which tests failed, when they started failing, and whether a failure is a symptom that something stopped working or just a temporary glitch in the testing environment. The appearance of failing tests was enhanced with different shades of red to represent how far each test executed before it failed. Tests failing consistently for the same reason produce a distinct visual pattern.
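As an illustration of how those reports can drive the grid, here is a minimal Ruby sketch (not Failbot's actual code) that walks a Cucumber JSON report, marks each scenario green or red, and computes how far a failing scenario got before it broke, which is what the red shades are based on. The report file name and the printed output are assumptions made for the example.

```ruby
require 'json'

# Minimal sketch: derive per-test status and "how far it got" from a
# Cucumber JSON report. File name and output format are illustrative.
report = JSON.parse(File.read('cucumber_report.json'))

report.each do |feature|
  (feature['elements'] || []).each do |scenario|
    steps     = scenario['steps'] || []
    statuses  = steps.map { |step| step.dig('result', 'status') }
    failed_at = statuses.index('failed')

    if failed_at.nil?
      puts "#{scenario['name']}: green (passed)"
    else
      # The fraction of steps executed before the failure drives the red shade.
      progress = failed_at.to_f / steps.size
      puts "#{scenario['name']}: red, failed at step #{failed_at + 1}/#{steps.size} " \
           "(#{(progress * 100).round}% executed)"
    end
  end
end
```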
A test fails when the observed behavior does not match the expected one. Cucumber reports provide details on these mismatches. By clicking a particular failed test execution in Failbot, it is possible to see the step where it failed and an excerpt of the failed expectation. The full explanation is available in the JSON report itself, which can be downloaded from Failbot.
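The relevant pieces live right inside the report: each step carries a `result` with a `status` and, when it fails, an `error_message`. A small sketch of how the failing step and an excerpt could be pulled out of one parsed scenario (the excerpt length is an arbitrary choice, and this is not Failbot's actual code):

```ruby
# Sketch: extract the failing step and a short excerpt of the failed
# expectation from one parsed Cucumber scenario.
def failure_details(scenario)
  failed_step = (scenario['steps'] || []).find do |step|
    step.dig('result', 'status') == 'failed'
  end
  return nil unless failed_step

  {
    step:    "#{failed_step['keyword']}#{failed_step['name']}",
    excerpt: failed_step.dig('result', 'error_message').to_s.lines.first(5).join
  }
end
```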
The technology stack used in our end-to-end tests, more specifically Capybara and Selenium, enables us to take screenshots of the browser window while UI tests are executing. We take screenshots when UI tests fail and include them in the Cucumber JSON report. Those screenshots are picked up by Failbot and made available by clicking on the corresponding failed test.
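The usual way to do this with Cucumber and Capybara is an `After` hook that grabs a screenshot from the current session and embeds it in the report. A sketch along those lines (the file path is arbitrary, and older Cucumber versions use `embed` instead of `attach`):

```ruby
# features/support/screenshots.rb
# Sketch: on UI test failure, save a browser screenshot through Capybara and
# embed it in the Cucumber JSON report, where Failbot can find it.
After do |scenario|
  next unless scenario.failed?

  session = Capybara.current_session
  next unless session.driver.respond_to?(:save_screenshot)  # skip non-browser drivers

  path = session.save_screenshot("tmp/#{scenario.name.gsub(/\W+/, '_')}.png")
  attach(File.binread(path), 'image/png')  # `embed` in older Cucumber versions
end
```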
Speeding up investigation
We now know which tests failed, when they started failing, and where and why they failed. That is a considerable amount of information to start an investigation, but we still have to determine the exact cause. With end-to-end tests, as mentioned before, the problem can be anywhere.
To know exactly what is going wrong, we need to go through the logs produced by the services while the test was executing. Finding a particular log entry is time consuming; sometimes it is easier to try to reproduce the problem manually instead. So we found a way to get the IDs of those logs, embed them in the Cucumber JSON report, and provide a direct link to them in Failbot. Diagnosis thus became much faster.
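For illustration, here is one way such IDs could be captured and surfaced, assuming the services return a correlation ID in a response header; the header name, endpoint, and log-search URL below are hypothetical, not the actual Global Shipping setup:

```ruby
require 'net/http'

# Hypothetical step definition: remember the correlation ID each service call
# returns (header name and endpoint invented for this sketch).
When('I request a shipping quote') do
  @response = Net::HTTP.get_response(URI('https://shipping.example.com/quotes'))
  (@log_ids ||= []) << @response['X-Correlation-Id']
end

# On failure, attach direct links to the matching log entries so they end up
# in the JSON report and Failbot can render them (log-search URL is invented).
After do |scenario|
  next unless scenario.failed?

  Array(@log_ids).compact.each do |id|
    attach("https://logs.example.com/search?traceId=#{id}", 'text/uri-list')
  end
end
```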
Definition-Of-Done (DoD)
We have another tool developed in-house, ShipShip, which keeps track of all code changes and when they are deployed to the testing environment. Failbot picks up that information and matches change deployments with end-to-end test executions. This overlap helps us relate a code change to the moment a certain test started failing. It is not extremely accurate, but it gives us a rough idea.
On the other hand, the inverse is quite accurate: if all tests pass after a change was deployed, the change caused no breakage (no regression in functionality). This particular feature became an important part of our Definition of Done (DoD): before deploying a set of changes to the real environment, all changes are required to pass the end-to-end tests. Checking this is as easy as looking at Failbot.
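A rough sketch of both ideas, relating deployed changes to builds by time and declaring a change verified once a passing build covers it; the data shapes are invented for the example, since ShipShip's actual interface is not described here:

```ruby
# Sketch: relate deployments to test builds by timestamp. Struct shapes are
# invented for illustration; ShipShip and Jenkins expose richer data.
Deployment = Struct.new(:change_id, :deployed_at)
Build      = Struct.new(:number, :started_at, :passed)

# Changes that were already deployed when a build started are "covered" by it.
def changes_covered_by(build, deployments)
  deployments.select { |d| d.deployed_at <= build.started_at }.map(&:change_id)
end

# DoD check: a change is verified if some passing build covers it.
def change_verified?(change_id, builds, deployments)
  builds.any? { |b| b.passed && changes_covered_by(b, deployments).include?(change_id) }
end
```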
Our developers' lives now
End-to-end tests are fundamental to software development: every problem the end-to-end tests find is a problem that won't reach our end users, won't cause downtime, revenue losses, or any negative impact on the company's image. On the other hand, they bring many challenges to us engineers, including being aware when tests are failing and understanding why they are failing.
We improved alerting, visibility, and diagnosis by taking information that was already there and presenting it differently in the form of Failbot. Failbot's features were implemented incrementally and by different people; Failbot is a team effort, not the work of a single person. The tool grew to solve problems we were facing and changed our lives for the better.
What feature will we implement next? We do not know. We will let our needs tell us.