Setting the stage
Maintaining zero p0/p1 production issues is of utmost importance, and it speaks volumes about the effort that goes into the development and testing of each release. There are also thousands of runtime configurations that can change how the code executes at runtime. The item page carries a lot of historical context as well, covering countries across the globe and all kinds of devices and networks. Any change to the item page can create a bad user experience.
Despite these challenges, our goal is to release to production every day. Given these conditions, the goal can only be achieved by covering all possible combinations of use cases under continuous automation and delivery pipelines, and no single person can know every combination of use cases that must be covered to reach full automation coverage.
Achieving it requires thinking in unconventional ways and taking an out-of-the-box approach.
Identifying the evil
There are three issues to consider.
- Each item page release needs an extensive cycle of QA and testing to prove feature functionality and reduce potential bugs in production. Each configuration change has to go through multiple layers of approval to make sure nothing unfortunate happens in production that could impact users. All of this is costly, time consuming, and slow paced, and even then not all production scenarios are covered.
- QA cycles happen before every release with a defined set of use cases, without randomization or stepping outside a defined set of rules. In production, however, there can be use cases that fall outside this defined environment. Today, there is no way to capture these unknown production use cases before releasing a new build.
- Testing happens in staging environments that might not have the correct data to cover all the corner cases, and the code is tested against a static dataset under static conditions, while the same code runs against a different dataset in production under dynamic conditions. As the software grows, the dependency graph expands, use cases multiply, and the combination of all these scenarios grows exponentially, adding uncertainty.
Taming the bull
To be able to release every day, we need to discover and identify unknown use cases, capturing them from a dynamic set of data in production. To do this, we built “Neo,” which helps us run new code against production traffic without any impact on production.
We took the request mirroring concept from networking and used it to build a continuous automation pipeline. We compare [request, response] pairs between the production environment and a mirrored environment [N vs. (N+1)] and generate hourly reports. This comparison can be done for the full lifecycle of a request, starting from the front-end pool all the way down to any nth-level downstream pool. The tool allows us to expand the envelope of QA and testing beyond the defined set of use cases.
How it works
To make this tool work, we need a production machine running the Nth build and a test machine running the (N+1)th build.
- Neo in action on the production machine, (N)th build:
  - When a request is received by a production machine running the (N)th build of the item page, Neo intercepts the incoming request and assigns it a “RequestMirrorId” that identifies the request uniquely.
  - Neo then mirrors each and every aspect of the incoming production request and sends this mirrored data to a test machine running the (N+1)th build, as sketched after these steps.
  - Neo stores a copy of the production request in a file (let's call this file “ProductionRequest”) in a central storage location.
  - When the production machine is ready to return the response, Neo also creates a copy of this response and stores it in a file (let's call it “ProductionResponse”) in central storage.
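To make the production-side steps concrete, here is a minimal sketch of what such an interceptor could look like, written as a Java servlet filter. The class name, the “X-Request-Mirror-Id” header, and the inline configuration values are illustrative assumptions, not Neo's actual implementation.

```java
// Hypothetical sketch of a request-mirroring filter; names, the header,
// and the inline configuration values are assumptions for illustration.
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

public class MirrorFilter implements Filter {

    private final HttpClient mirrorClient = HttpClient.newHttpClient();
    private final ExecutorService mirrorExecutor = Executors.newSingleThreadExecutor();

    // In the real tool these would come from the MIRROR_* properties.
    private final boolean mirrorEnabled = true;                   // MIRROR_ENABLED
    private final String targetHost = "test-machine.example.com"; // MIRROR_TARGET_HOSTNAME
    private final int targetPort = 8080;                          // MIRROR_TARGET_PORT
    private final double mirrorPercentage = 5.0;                  // MIRROR_PERCENTAGE

    @Override public void init(FilterConfig filterConfig) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;

        // Mirror only a sampled percentage of traffic.
        if (mirrorEnabled && ThreadLocalRandom.current().nextDouble(100.0) < mirrorPercentage) {
            // Tag the request so production and mirrored copies can be matched later.
            String requestMirrorId = UUID.randomUUID().toString();

            String query = httpReq.getQueryString() == null ? "" : "?" + httpReq.getQueryString();
            HttpRequest mirrored = HttpRequest.newBuilder(
                    URI.create("http://" + targetHost + ":" + targetPort
                            + httpReq.getRequestURI() + query))
                    .header("X-Request-Mirror-Id", requestMirrorId)
                    .GET() // a full implementation would also copy method, body, and headers
                    .build();

            // Fire and forget: mirroring must never block or fail the production request.
            mirrorExecutor.submit(() -> mirrorClient.sendAsync(
                    mirrored, java.net.http.HttpResponse.BodyHandlers.discarding()));
        }

        chain.doFilter(req, res); // production processing continues untouched
    }
}
```

The important property is that the mirrored call happens off the request thread, so production latency is unaffected even if the test machine is slow or down.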
- Neo in action on the test machine, (N+1)th build:
  - Neo intercepts the mirrored request on a test machine running the (N+1)th build and stores the request in a file (let's call it “MirroredRequest”) in central storage.
  - When the test machine is ready to return the response, Neo creates a copy of the response and stores it in a file (let's call it “MirroredResponse”) in central storage; one way to copy a response is sketched after these steps.
  - Once the data is collected, Neo can compare these files based on “RequestMirrorId” by running different comparators and can then generate a report of the delta, marking any use cases that are not covered by the comparators or automation.
  - This comparison can be done all the way down to the last dependent service in the call hierarchy of dependencies. The following diagram illustrates how each payload is stored and compared using “requestMirrorId.”
  - Neo then processes this delta against the acceptance criteria, and anything that is not acceptable or is unknown is marked as a potential issue.
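Capturing a response without disturbing the client is the same trick on both machines. One way to do it, sketched below under the assumption of a servlet stack, is a response wrapper that tees the body into a buffer; the class name is hypothetical.

```java
// Hypothetical sketch: tee the response body into a buffer so it can be
// stored as "ProductionResponse" or "MirroredResponse" after the request
// completes. A real implementation would also wrap getWriter().
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.servlet.ServletOutputStream;
import javax.servlet.WriteListener;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

public class CopyingResponseWrapper extends HttpServletResponseWrapper {

    private final ByteArrayOutputStream copy = new ByteArrayOutputStream();

    public CopyingResponseWrapper(HttpServletResponse response) {
        super(response);
    }

    @Override
    public ServletOutputStream getOutputStream() throws IOException {
        ServletOutputStream original = super.getOutputStream();
        // Tee every byte: one copy goes to the client, one into the buffer.
        return new ServletOutputStream() {
            @Override public void write(int b) throws IOException {
                original.write(b);
                copy.write(b);
            }
            @Override public boolean isReady() { return original.isReady(); }
            @Override public void setWriteListener(WriteListener listener) {
                original.setWriteListener(listener);
            }
        };
    }

    /** The buffered response body, ready to be dispatched to central storage. */
    public byte[] getCopy() {
        return copy.toByteArray();
    }
}
```

A mirroring filter would wrap the response with this class before invoking the rest of the chain, then ship getCopy() to central storage along with the “RequestMirrorId” once the chain returns.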
The following sample diagram illustrates the full approach.
More details on the pipeline
Neo has many components that required integrating different code bases and pipelines. Below are the components and the steps of the full pipeline.
- ItemPage code base: As part of the tool, HttpRequest and HttpResponse interceptors were introduced in the ItemPage backend code to intercept requests and responses so that the data can be mirrored and copied. We can choose and filter which requests to mirror based on header, query parameter, or request path criteria. These interceptors are customizable with the properties shown in the tables below.
- Picking machines from production: The next step is to pick a couple of healthy production machines (Nth build) at random from a production pool, configure them for mirroring, and point them at the test machines ((N+1)th build) to send the mirrored data. This step is automated to encourage randomization and to pick new healthy machines in case previously chosen machines are no longer part of a pool or are unavailable. The following configurations are set on a production machine to start mirroring.
| Property | Value |
| --- | --- |
| MIRROR_ENABLED | true |
| MIRROR_TARGET_HOSTNAME | “,”-separated list of target test machines |
| MIRROR_TARGET_PORT | target port |
| MIRROR_PERCENTAGE | % of traffic to be mirrored |
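For illustration, a production machine's mirroring configuration might look like the following; the hostnames and values are examples only.

```properties
MIRROR_ENABLED=true
MIRROR_TARGET_HOSTNAME=testhost1.example.com,testhost2.example.com
MIRROR_TARGET_PORT=8080
MIRROR_PERCENTAGE=5
```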
- Connection to central storage: The next step in the pipeline is to change the configuration on the production and staging machines so they connect to central storage and store data from the Nth and (N+1)th builds there. MIRROR_RESPONSE_DISPATCH_PATH is used to connect to central storage: the production machine connects to the dispatch path with /BASE, and the test machine connects to the dispatch path with /NEW.
| Property | Value |
| --- | --- |
| MIRROR_RESPONSE_DISPATCH_PATH | API path to collect data |
| MIRROR_RESPONSE_DISPATCH_ENABLED | true |
- Central storage: Central storage is a file-based system where production and mirrored requests, responses, and header data are stored. Data from each service pool is stored in its own file against the same “RequestMirrorId” that was assigned at mirroring time by the source production machine, e.g. “/production/ItemPage/{RequestMirrorId}” vs. “/test/ItemPage/{RequestMirrorId}.”
- Central storage wrapper: Each service pool makes a call to the “/BASE” and “/NEW” services to set up data storage in central storage, using the path provided in “MIRROR_RESPONSE_DISPATCH_PATH.” These APIs accept the Nth and (N+1)th responses and map them to the “/production” and “/test” folders, respectively. A sketch of such a dispatch call follows.
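Below is a minimal sketch of a client for that dispatch call, assuming a plain HTTP POST interface and an “X-Request-Mirror-Id” header; the real wrapper API is internal and may look different.

```java
// Hypothetical client for the central storage wrapper; the endpoint
// shape and header name are assumptions, not the real internal API.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CentralStorageClient {

    private final HttpClient client = HttpClient.newHttpClient();
    private final String dispatchPath; // value of MIRROR_RESPONSE_DISPATCH_PATH

    public CentralStorageClient(String dispatchPath) {
        this.dispatchPath = dispatchPath;
    }

    /**
     * Stores a captured payload. Production machines pass "/BASE" and test
     * machines pass "/NEW", so payloads land in the matching /production or
     * /test folder under the same RequestMirrorId.
     */
    public void store(String suffix, String requestMirrorId, byte[] payload) throws Exception {
        HttpRequest post = HttpRequest.newBuilder(URI.create(dispatchPath + suffix))
                .header("X-Request-Mirror-Id", requestMirrorId)
                .POST(HttpRequest.BodyPublishers.ofByteArray(payload))
                .build();
        client.send(post, HttpResponse.BodyHandlers.discarding());
    }
}
```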
- Comparators: These are a series of HTML, JSON, and header comparators integrated into a Jenkins pipeline that runs every hour, reads the data from central storage, compares it, and generates a temporary delta file. A simplified sketch of such a comparison pass follows.
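The sketch below shows the basic shape of one comparison pass over the stored files, assuming the /production vs. /test folder layout described above; a real comparator would normalize and diff HTML, JSON, and headers rather than compare raw strings.

```java
// Hypothetical comparator pass; the folder layout follows the
// /production vs. /test convention described above, and the delta
// format is illustrative.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ResponseComparator {

    /**
     * Compares each /production payload with its /test counterpart by
     * RequestMirrorId and collects a delta entry for every mismatch.
     */
    public List<String> compare(Path productionDir, Path testDir) throws IOException {
        List<String> delta = new ArrayList<>();
        try (var paths = Files.list(productionDir)) {
            for (Path baseFile : (Iterable<Path>) paths::iterator) {
                String mirrorId = baseFile.getFileName().toString();
                Path newFile = testDir.resolve(mirrorId);
                if (!Files.exists(newFile)) {
                    delta.add(mirrorId + ": no (N+1) payload captured");
                    continue;
                }
                String base = Files.readString(baseFile);
                String next = Files.readString(newFile);
                if (!base.equals(next)) {
                    delta.add(mirrorId + ": payload differs between N and N+1");
                }
            }
        }
        return delta;
    }
}
```

An hourly Jenkins stage could run a pass like this and hand the resulting delta to the report generator described next.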
- Report generator: This is a separate Jenkins job, kicked off at the end of the comparator pipeline, that picks up the temporary delta file and generates a readable HTML report from it. The report contains all the diffs in headers, requests, responses, and so on. This diff is important because headers can get encoded, request parameters can be added or dropped, and responses can change while moving from one pool to another or through downstream services.
Important nuances
- The item page's total production traffic is more than 350 million requests per day. It is not possible to mirror all of these requests, because mirrored requests add load on all downstream services and on storage as well. Typically we pick two to three machines at random from a production pool and mirror their traffic.
- Mirrored requests can fire tracking events just as production requests do. These tracking events are suppressed by passing a “nocache=true” value in the tracking header or by pointing test machines to test tracking pools.
- Important user-specific and sensitive information is masked out while storing HTML and JSON responses and creating reports.
Conclusion
This tool can run continuously in a production environment for long periods to cover all the possible unknown and random use cases without impacting production traffic or users. It provides continuous automation and can capture multiple issues that would otherwise have been missed due to changes in dependencies or other factors. We use similar approaches to test our other key pages, such as search, and to automate and verify other efforts such as platform migration.