The Design of A/B Tests in an Online Marketplace

By: Jason (Xiao) Wang

A/B testing is at the heart of data-driven decision making at eBay when launching product features to our site. However, the tests must be designed to carefully manage the interaction between the test and control groups.

Typically, when developing a test, a part of the live traffic is randomly selected to receive a new feature (test) while another part is given the product status quo (control). The randomization is based on a cookie ID or user ID (provided when the customer logs in). It is assumed that each customer’s behavior is independent of the others’ (not affected by other people’s treatment), so that the observed difference between test and control can be extrapolated to the effect of launching to all customers.
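The randomization step described above can be sketched as follows. This is a minimal illustration, assuming hash-based bucketing, which is a common way to make assignment deterministic; the article does not say how eBay actually assigns traffic, and the function name `assign_arm` and the experiment label are hypothetical.

```python
import hashlib


def assign_arm(unit_id: str, experiment: str, test_fraction: float = 0.5) -> str:
    """Deterministically map a cookie ID or user ID to 'test' or 'control'.

    Hashing the ID together with the experiment name (an assumption of this
    sketch) keeps a unit in the same arm on every visit while giving
    different experiments independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "control"


# Hypothetical usage: the same user always lands in the same arm.
arm = assign_arm("user-12345", "new-feature-experiment")
```

Because the arm depends only on the ID and the experiment name, no assignment table needs to be stored.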

The independence assumption, however, doesn’t always hold, given the complex dynamics inside our marketplace. In one scenario, the experiment is a new ranking algorithm for auction items that introduces factors like predicted item quality, in addition to the default of time-ending-soonest. Imagine that the algorithm does indeed do a better job and that people in the test engage more with the auction. However, a surprisingly large amount of credit will be claimed by the control group. The reason is that, by the time an auction is close to ending, it is easier for it to get prominent impressions in control, where items are still ranked by time ending soonest, and to convert there. The observed lift in auction conversions therefore doesn’t provide a good estimate of the actual impact of launching the algorithm to the whole site.

Another scenario where we face the challenge of test-control interaction is with customer-based randomization. The experiment is about offering price guidance to sellers so that they don’t list their items at an unrealistically high price, which would reduce the chance that the items sell. The hypothesis, in this case, is that sellers will take the advice, and eBay overall will sell enough additional items so that the total GMV (gross merchandise value) will be at least the same. A natural design is to randomize by seller ID, i.e. some sellers are placed in the test and receive price guidance while others are in the control and receive no guidance. The experiment can be run for weeks, and then the amount sold by the test group is compared to the amount sold by the control group. There are at least two possible interactions with this way of testing:

Buyers will shift their purchases from control to test to purchase at a lower price. The experiment will show test selling more than control, but there is no net sales gain for eBay (cannibalization).
Sellers in the control group who compete with sellers in the test group monitor their price changes and may adjust their own prices in response. The effect of price guidance will spill over from test to control.
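The cannibalization case can be made concrete with a little arithmetic. The sketch below assumes a fixed total demand that is split evenly between test and control sellers before the experiment, and a hypothetical `shift_share` of control purchases that moves to test sellers once they lower prices; none of these numbers come from a real experiment.

```python
def observed_vs_true_lift(total_demand: float, shift_share: float):
    """Compare the lift seen in a seller-randomized test with the true launch effect.

    Assumes (for illustration only) that buyers merely move between sellers,
    so total purchases stay the same.
    """
    baseline_each = total_demand / 2
    moved = shift_share * baseline_each          # purchases moving control -> test
    test_sales = baseline_each + moved
    control_sales = baseline_each - moved
    observed_lift = test_sales / control_sales - 1
    true_lift = (test_sales + control_sales) / total_demand - 1
    return observed_lift, true_lift


# 1,000 purchases, 20% of control purchases shift to test sellers:
# test sells 600, control sells 400, yet total sales are unchanged.
observed, true = observed_vs_true_lift(total_demand=1000, shift_share=0.2)
print(f"observed lift: {observed:.0%}, true lift: {true:.0%}")
```

Under these assumptions the experiment reports a large lift even though launching to all sellers would add no sales.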

Randomization by leaf category

Martin Saveski, et al.¹ discussed the violation of the independence assumption in social networks and proposed cluster-based randomization. We have a different context and strategy to mitigate test-control interaction. Buyers often shift purchases from seller to seller, but they are less likely to substitute one type of merchandise for another, e.g. to buy shoes instead of a cellphone. At the same time, cellphone sellers may be competing with each other, but they won’t care how shoe sellers manage their inventories. So instead of randomizing on the seller, the idea is to randomize by the type of merchandise so as to control both cannibalization and spillover. eBay has various ways to define merchandise types, and we chose the leaf category of listed items for randomization. Examples include Cases Covers & Skins, Men's Shoes:Athletic, and Gold:Coins. If a definition has too fine a granularity, it won't help much to control the interactions (e.g. people easily switch from buying an iPhone 8 to buying an iPhone 8 Plus). Conversely, too coarse a definition, e.g. Electronics, Fashion, and Collectibles, may shrink the sample size so severely that the experiment becomes too insensitive to be useful. The leaf category provides a reasonable balance.
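A minimal sketch of category-based randomization is shown below. It assumes a simple seeded shuffle with an even split; the article does not describe the actual assignment mechanism, and the helper names and listing fields (`seller_id`, `leaf_category`) are hypothetical.

```python
import random


def randomize_categories(categories, seed=0):
    """Split leaf categories evenly into test and control with a seeded shuffle."""
    rng = random.Random(seed)
    shuffled = sorted(categories)  # sort first so the result depends only on the seed
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {c: ("test" if i < half else "control") for i, c in enumerate(shuffled)}


def arm_for_listing(listing, assignment):
    """A listing inherits the arm of its leaf category, whoever the seller is."""
    return assignment[listing["leaf_category"]]


categories = ["Cases Covers & Skins", "Men's Shoes:Athletic", "Gold:Coins",
              "DVDs & Blu-ray Discs"]
assignment = randomize_categories(categories, seed=7)

# Two competing sellers in the same leaf category always share an arm,
# so buyers switching between them stay inside one group.
listing_a = {"seller_id": "s1", "leaf_category": "Gold:Coins"}
listing_b = {"seller_id": "s2", "leaf_category": "Gold:Coins"}
same_arm = arm_for_listing(listing_a, assignment) == arm_for_listing(listing_b, assignment)
```

The unit of analysis then becomes the category rather than the seller, which is why the number of categories determines the effective sample size.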

There are about 10,000 leaf categories in total on the eBay site. Some of them have very sparse sales, for example Collectibles:Rocks Fossils & Minerals:Fossils:Reproductions. Even with an even 50/50 split, the test and control groups will each have only a few thousand effective samples. Moreover, these samples are often incomparable with regard to GMV or BI (bought items). Take the category DVDs & Movies:DVDs & Blu-ray Discs as an example: it sells nearly twice the number of items as the next most sold category. If an experiment is to measure BI and the Discs category is assigned to test by chance, then there is a valid concern of selection bias. Mathematically, let $i$ be the index of a category and $x_i$ be the metric of interest during the experiment; the *treatment effect* is then measured as $$mean(\{x_i \thinspace : i \in test\})-mean(\{x_i \thinspace : i \in control\})$$

The concern is that it will largely reflect the inherent variation among selected categories rather than the actual price guidance effect.

Difference in differences is a commonly used remedy. Let $x.pre_i$ be the value of the metric for a period before the experiment; then $x_i-x.pre_i$ is the change over time for category $i$. If price guidance works, we expect a bigger change on average in test, so the treatment effect can be measured instead as $$mean(\{x_i-x.pre_i \thinspace : i \in test\})-mean(\{x_i-x.pre_i \thinspace : i \in control\})$$
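To see why the change over time helps, the sketch below computes both the simple difference in means and the difference in differences on four made-up categories. The category names and numbers are hypothetical; every category grows by the same 10 items, so there is no treatment effect by construction.

```python
def avg(values):
    values = list(values)
    return sum(values) / len(values)


def simple_difference(x, arm):
    """mean(x_i : i in test) - mean(x_i : i in control)."""
    return (avg(v for c, v in x.items() if arm[c] == "test")
            - avg(v for c, v in x.items() if arm[c] == "control"))


def difference_in_differences(x, x_pre, arm):
    """Same contrast, applied to the per-category change x_i - x.pre_i."""
    change = {c: x[c] - x_pre[c] for c in x}
    return simple_difference(change, arm)


# Hypothetical bought-item counts; one very large category lands in test.
x_pre = {"dvds": 1990, "coins": 890, "shoes": 990, "cases": 1090}
x = {c: v + 10 for c, v in x_pre.items()}
arm = {"dvds": "test", "coins": "test", "shoes": "control", "cases": "control"}

raw = simple_difference(x, arm)                  # 1450 - 1050 = 400.0
did = difference_in_differences(x, x_pre, arm)   # 10 - 10 = 0.0
```

The simple difference mostly reflects which categories happened to fall in test, while the difference in differences removes each category's baseline level.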

While the idea of using pre-experiment data for variance reduction is intuitive, a better way than difference in differences is to use post-stratification. Treating pre-experiment characteristics as covariates, Alex Deng and others² studied both building regression models and constructing strata, and established the connection between the two. In our implementation, we fit the data to a linear regression model $$x_i = a + b \cdot I_i + c \cdot x.pre_i$$
where $I_i$ is the indicator variable which equals $1$ if category $i$ falls in test and $0$ if it falls in control. Coefficient $b$ is then the treatment effect: when a category switches from control to test with $x.pre_i$ held fixed, the expectation of $x_i$ increases by $b$.
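The regression above can be fit by ordinary least squares. The sketch below is one way to do so using only the standard library, solving the 3x3 normal equations directly; it is an illustration under that assumption, not the production implementation, and the function names and toy data are hypothetical.

```python
def solve_linear_system(A, y):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (M[r][n] - tail) / M[r][r]
    return solution


def fit_treatment_effect(x, x_pre, indicator):
    """Least-squares fit of x_i = a + b*I_i + c*x.pre_i; returns (a, b, c)."""
    rows = [[1.0, float(i), float(p)] for i, p in zip(indicator, x_pre)]
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(3)] for j in range(3)]
    xty = [sum(r[j] * v for r, v in zip(rows, x)) for j in range(3)]
    a, b, c = solve_linear_system(xtx, xty)
    return a, b, c


# Toy data generated exactly from a=5, b=3, c=1.1, so the fit should recover b=3.
x_pre = [100, 200, 300, 400, 500, 600]
indicator = [1, 0, 1, 0, 1, 0]
x = [5 + 3 * i + 1.1 * p for i, p in zip(indicator, x_pre)]
a, b, c = fit_treatment_effect(x, x_pre, indicator)
```

Note that difference in differences corresponds to fixing $c = 1$, whereas the regression estimates $c$ from the data.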

Comparison of analysis methods

Recall that we put forward category-based randomization, but are worried that the small number of categories and their large variation will make the estimate of treatment effect noisy. We plot the daily BI of test categories vs. control categories below. Notice that the green line is consistently higher than the red, and it is exactly because categories like DVDs & Movies:DVDs & Blu-ray Discs are randomized into the test group. The experiment was started on Feb. 1, but due to a logging glitch, data is missing during Feb. 1-14. The missing data is not relevant to our study. The following graph shows the pre-experiment period Jan. 9-31 and the experiment period Feb. 15-March 22.

We are interested not just in the testing result, but also in comparing the different analysis methods. For that purpose, we run an A/A simulation in which categories are randomly divided into test and control with no regard to the actual treatment, and all three equations are computed. The simulation was run multiple times. Since there is no treatment effect, we expect, and do see, that the mean of the simulated lift (treatment effect divided by control mean) is close to 0. Its standard deviation, however, provides a measure of the noise level.
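The A/A procedure can be sketched as follows on synthetic data. Everything about the data is an assumption made for illustration (200 categories with log-normal sizes, a hypothetical 20% growth between periods, 5% noise), so the printed numbers say nothing about eBay's results. The regression estimate uses the closed form for a model with one binary indicator and one covariate: $c$ is the pooled within-arm slope and $b$ is the difference of the adjusted means.

```python
import math
import random
from statistics import pstdev


def avg(values):
    values = list(values)
    return sum(values) / len(values)


def simple_difference(x, pre, test, ctrl):
    return avg(x[i] for i in test) - avg(x[i] for i in ctrl)


def difference_in_differences(x, pre, test, ctrl):
    return avg(x[i] - pre[i] for i in test) - avg(x[i] - pre[i] for i in ctrl)


def regression_effect(x, pre, test, ctrl):
    """Coefficient b of x = a + b*I + c*x_pre, via the pooled within-arm slope c."""
    sxy = sxx = 0.0
    for group in (test, ctrl):
        mean_pre = avg(pre[i] for i in group)
        mean_x = avg(x[i] for i in group)
        sxy += sum((pre[i] - mean_pre) * (x[i] - mean_x) for i in group)
        sxx += sum((pre[i] - mean_pre) ** 2 for i in group)
    c = sxy / sxx
    return avg(x[i] - c * pre[i] for i in test) - avg(x[i] - c * pre[i] for i in ctrl)


ESTIMATORS = {
    "simple difference": simple_difference,
    "difference in differences": difference_in_differences,
    "regression": regression_effect,
}


def aa_simulation(x, pre, n_runs=500, seed=0):
    """Randomly split categories many times; return (mean, std) of lift per method."""
    rng = random.Random(seed)
    ids = list(range(len(x)))
    lifts = {name: [] for name in ESTIMATORS}
    for _ in range(n_runs):
        rng.shuffle(ids)
        half = len(ids) // 2
        test, ctrl = ids[:half], ids[half:]
        control_mean = avg(x[i] for i in ctrl)
        for name, estimator in ESTIMATORS.items():
            lifts[name].append(estimator(x, pre, test, ctrl) / control_mean)
    return {name: (avg(v), pstdev(v)) for name, v in lifts.items()}


# Synthetic categories: no treatment anywhere, so any measured lift is noise.
data_rng = random.Random(42)
pre = [math.exp(data_rng.gauss(5.0, 1.0)) for _ in range(200)]
x = [1.2 * p * (1.0 + data_rng.gauss(0.0, 0.05)) for p in pre]

results = aa_simulation(x, pre)
for name, (mean_lift, std_lift) in results.items():
    print(f"{name:>26}: mean lift {mean_lift:+.4f}, std {std_lift:.4f}")
```

On data like this, the two methods that use the pre-experiment period should show a much smaller standard deviation than the simple difference; how large the gap is depends entirely on the assumed data.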

As clearly shown in the above table, it is vastly better to leverage pre-experiment data than to compare test and control directly, yielding over a 10X reduction in the standard deviation. Post-stratification provides a further 12% reduction over difference in differences.

In summary, we discussed the issue of test-control interaction when conducting A/B tests in an online marketplace and proposed category-based randomization rather than the traditional user-based strategy. To cope with the selection bias, we advocate leveraging pre-experiment data as covariates in a regression model. Together, these give us a better design to detect the treatment effect with trust and sensitivity.

References

¹Martin Saveski, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Souvik Ghosh, Ya Xu and Edoardo M. Airoldi, 2017, Detecting Network Effects: Randomizing Over Randomized Experiments, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1027-1035
²Alex Deng, Ya Xu, Ron Kohavi and Toby Walker, 2013, Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data, in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 123-132

Leadership for the development of this method provided by Pauline Burke and Dave Bhoite.
