During a recent internal product conference, I had the honor of sitting on the Product Health & Opportunity Sizing panel. Together with four amazing product owners, we discussed the importance of opportunity sizing and shared tips, tools, and challenges when measuring the success and health of products. As the Experimentation Science team, we have the great pleasure of working with product teams all across eBay to understand the impact of their products. In this blog post, we would like to share some of our insider tips on how you can leverage experimentation to guide your product development journey.
Definition of success
Step #1 towards effectively measuring the success of your product? Have a clear definition of success.
Oftentimes, when product owners approach us to discuss the feasibility of an experiment, they already have a product or feature change in mind (or already in development) and at least a vague idea of what impact this change might have. The very first question to ask at this early stage is: what is your definition of a win, or what does success look like for this product or feature? This definition should cover both the potential business impact and how the change might alter user behavior. We currently put a lot of emphasis on the former and less on the latter, but the two actually go hand in hand. Try to frame your definition of success in terms of the engagement you would like to see from users, and then describe how that positive behavioral change may lead to the desired business outcome.
In sum, when building your business case, form a user story first. From a good user story comes a strong hypothesis, and from that hypothesis your set of success metrics naturally follows.
How to measure success
Success is rarely defined by a single goal. Success can mean increasing conversion and decreasing friction; it can encompass both short-term wins and long-term impact; it may also include keeping potential cannibalization of other parts of the site at an acceptable level. Each of these aspects of success corresponds to one or more metrics with their own expected movement in response to the feature under test, and together they help you see the whole picture of your product's impact. Brainstorming and doing your homework on all these components of success will prepare you for the upcoming test: you will know what might happen and what to look for.
When finalizing the design of your experiment, we work with you to translate your clear and thorough definition of success into metrics that serve various purposes: some measure the direct impact of the feature under test, some measure actions a few more steps removed from the change, and others provide additional insight into your product. Next we'll dive into the three main categories of metrics we focus on in an experimentation context: the primary success metric, guardrail metrics, and monitoring metrics.
Let's consider an example. Say you are testing different designs of the Search Bar to improve its prominence. By drawing out the User Flow Diagram, you define a desired user path: notice the more prominent design → perform a search → find the desired item (or search again until they do) → show purchase intent by clicking Add to Cart or Buy It Now → complete the purchase.
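As a sketch, you could write this funnel down explicitly and pair each step with the kind of metric that might measure it. The metric names below are illustrative, not Touchstone's actual definitions; we'll map them to the categories in the following sections.

```python
# The funnel from the User Flow Diagram, paired with the kind of metric that
# could measure each step. Metric names are illustrative assumptions.
funnel = [
    ("notice the more prominent Search Bar", "treated impressions"),
    ("perform a search",                     "searches per treated user (primary)"),
    ("click into an item",                   "SRP -> VI conversion (monitoring)"),
    ("show purchase intent",                 "BIN / Add to Cart clicks (monitoring)"),
    ("complete the purchase",                "Purchases, BI, GMB (guardrails)"),
]
for step, metric in funnel:
    print(f"{step:38s} -> {metric}")
```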
Primary success metric
This is the metric that measures the most direct impact of your product, and it is thus by definition a product metric, not a business one. It is also the metric that Touchstone, our experimentation platform, uses to determine the duration (or days remaining) of your test, and what you should focus on when interpreting the results.
In the Search Bar prominence example above, the expected immediate next action after users in the test group are exposed to the new design (the definition of "treated") is to perform a search, which translates into the number of searches. If the number of searches per treated user is higher in Test than in Control, then you've realized your definition of success: making the Search Bar more prominent drove more searches.
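To make this concrete, here is a minimal sketch of reading a primary success metric like this one. It uses simulated per-user search counts and a standard two-sample test rather than real eBay logs or Touchstone's actual methodology.

```python
# Compare searches per treated user between Test and Control.
# Data here is simulated: Poisson counts with a ~2% lift in the Test arm.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.poisson(lam=3.00, size=50_000)  # searches per treated user, Control
test = rng.poisson(lam=3.06, size=50_000)     # searches per treated user, Test

lift = test.mean() / control.mean() - 1
t_stat, p_value = stats.ttest_ind(test, control, equal_var=False)  # Welch's t-test
print(f"searches/user: control={control.mean():.3f}, test={test.mean():.3f}")
print(f"lift={lift:+.2%}, p={p_value:.4f}")
```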
Guardrail metrics
These are metrics that measure the overall impact on the site or the business, and thus also the metrics you don't want to break. They are generally business metrics that sit a couple of steps away from the immediate action your feature triggers, so they are noisier and take longer to reach statistical power. Business metrics such as Purchases, Bought Items (BI, the number of items purchased in a transaction), and GMB (Gross Merchandise Bought) fall into this category.
In our example (as in every experiment we run), the ultimate goal is to drive more purchases, have more purchasing users, and eventually make more revenue. But notice how many steps lie between performing a search and making a purchase. Chances are, even if the new design is working and we observed a statistically significant lift in the number of searches, that exciting little lift may have diminished before users reach the last step, completing a transaction. If you choose business metrics, especially BI or GMB, as your primary success metric, you will very likely be disappointed when the experiment ends after weeks in noise without a clear launch signal.
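Some back-of-the-envelope arithmetic shows why. The rates below are made-up, illustrative numbers, not eBay's actual funnel figures.

```python
# How a search lift shrinks in absolute terms by the time it reaches a
# guardrail metric. All rates are illustrative assumptions.
searches_per_user = 3.00
search_to_purchase = 0.015   # assumed share of searches that end in a purchase

extra_searches = searches_per_user * 0.02            # a 2% lift in searches
extra_purchases = extra_searches * search_to_purchase
print(f"extra purchases per treated user: {extra_purchases:.5f}")
# ~0.0009 extra purchases per user: a tiny absolute effect on a metric far
# noisier than search counts. In practice the relative lift attenuates too,
# since the extra searches don't all convert at the average rate.
```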
Secondary/monitoring metrics
These are metrics that measure the indirect impact of your feature and/or provide additional insight into how your feature is affecting user behavior. This category includes product metrics measuring events or user actions between the immediate next step (measured by your primary success metric) and end-of-funnel transactions (measured by your guardrail metrics), as well as cannibalization by your feature of related features on the site.
In our example, metrics like SRP (Search Results Page) to VI (View Item page) conversion and purchase-intent metrics such as Buy It Now or Add to Cart clicks make good secondary metrics. You may also benefit from tracking and comparing click shares on different components of the Global Header (where the Search Bar lives) to understand how the more prominent Search Bar is affecting these coexisting features.
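A click-share comparison like that is straightforward to sketch. Assume you have aggregated click counts per variant and per Global Header component; the component names and counts below are hypothetical.

```python
# Compare click share across Global Header components between variants,
# given a table of (variant, component, clicks). All counts are made up.
import pandas as pd

clicks = pd.DataFrame({
    "variant":   ["control"] * 3 + ["test"] * 3,
    "component": ["search_bar", "cart", "my_ebay"] * 2,
    "clicks":    [50_000, 20_000, 30_000,   56_000, 19_000, 28_000],
})
# Click share = a component's clicks divided by its variant's total clicks.
clicks["share"] = clicks.groupby("variant")["clicks"].transform(lambda c: c / c.sum())
print(clicks.pivot(index="component", columns="variant", values="share").round(3))
```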
Hopefully the above categorization of metrics provides some inspiration for how you want to measure the success of your experiment. Try identifying your primary, secondary/monitoring, and guardrail metrics for your next experiment, and if you're new to the process or have questions about it, reach out to your experimentation team for help.
Is your success really measurable?
Oftentimes we jump right into the discussion of how to measure the success of your product, but we also need to ask whether that success is measurable at all. Three pieces of information help answer that question: the baseline performance of your success metric, the expected lift from your product or feature change, and your treated percentage. In this section, let's take a deep dive into treated percentage.
Treated percentage represents the percentage of users that would actually experience the product change you have in mind. For example, if you’re testing some changes on the checkout success page, fewer than 10% of our users who come to the site might actually have the chance to complete a purchase and really see the change. If you’re testing something on the Home Page, the treated percentage would be much higher.
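One way to build intuition: if your metric is computed over all users in the experiment rather than only treated users, the untreated majority behaves identically in both arms and waters down whatever lift exists. A tiny sketch, with illustrative numbers:

```python
# Lift dilution: untreated users are identical in both arms, so a site-wide
# metric only moves by treated_pct * lift_among_treated. Numbers are made up.
treated_pct = 0.10           # 10% of users actually see the change
lift_among_treated = 0.02    # 2% lift for the users who see it

site_wide_lift = treated_pct * lift_among_treated
print(f"site-wide lift: {site_wide_lift:.2%}")   # 0.20%, much harder to detect
```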
Know when NOT to experiment
Having a decent treated percentage is key to making sure your definition of success is actually measurable. Here's an example: say you have one test group and a control, and you're planning to test on US Desktop, which usually has the fewest traffic concerns. With a 70% treated percentage, you could identify a 1% lift in Purchases and BI within the minimum test duration required by our Experimentation Platform. A 10% treated percentage would extend that time to months. A 1% treated percentage? Over a year! And in case you're wondering, it takes longer than five years to measure a 1% lift in GMB in this scenario. That's not to mention that we rarely see a 1% or larger lift in these business metrics, meaning these numbers are, in fact, underestimates of how long your test would actually take.
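For intuition, here is a rough sketch of the duration math behind numbers like these, using a textbook two-sample power calculation. The traffic volume and the metric's coefficient of variation are made-up assumptions, not eBay's actual figures, and this is not how Touchstone computes durations.

```python
# How test duration scales with treated percentage, via the standard
# normal-approximation sample-size formula. All inputs are illustrative.
from scipy.stats import norm

def users_needed_per_arm(rel_lift, cv, alpha=0.05, power=0.8):
    """Treated users per arm to detect a relative lift on a metric with the
    given coefficient of variation (std/mean), two-sided test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * cv / rel_lift) ** 2

daily_users_per_arm = 500_000   # hypothetical US Desktop traffic per arm
cv = 4.0                        # assumed CV for a noisy purchase-type metric

n = users_needed_per_arm(rel_lift=0.01, cv=cv)   # 1% lift among treated users
for treated_pct in (0.70, 0.10, 0.01):
    days = n / (daily_users_per_arm * treated_pct)
    print(f"treated={treated_pct:>4.0%}: ~{days:,.0f} days")
# A metric like GMB has an even larger CV, and duration grows with CV squared,
# which is how the 1%-treated case stretches past five years.
```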
If you already know you will be dealing with small treated percentages, here are a couple options you should consider:
- If your treated percentage is extremely small (say, under 5%), don't run an A/B test to measure business impact; use your best product judgment to launch. As you've seen above, it would take months, if not years, to measure a 1% lift that rarely happens. What usually happens instead is that the test runs for a month and ends with all metrics in the noise, underpowered.
- If you have a relatively small treated percentage (say, 10%), be our friend and do yourself a favor by choosing a product metric, especially one that measures direct engagement with the feature, as your primary success metric. Even then, an A/B test will only be efficient and helpful if you're testing a high-impact change: if you're expecting a 0.1% lift, even engagement metrics can take months to reach statistical power, as the sketch below shows.
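To see why the expected lift matters so much, here is the same power formula applied to a low-noise engagement metric, again with illustrative numbers: required sample size grows with one over the lift squared, so a 0.1% lift needs 100× the users of a 1% lift.

```python
# Duration vs. expected lift for an engagement metric, at 10% treated.
# CV and traffic are illustrative assumptions, not eBay's actuals.
from scipy.stats import norm

z = norm.ppf(0.975) + norm.ppf(0.8)   # alpha=0.05 two-sided, power=0.8
engagement_cv = 1.0                    # assumed CV for a direct-engagement metric
treated_per_day = 50_000               # hypothetical: 500k/arm/day at 10% treated

for lift in (0.01, 0.001):
    n = 2 * (z * engagement_cv / lift) ** 2   # treated users needed per arm
    print(f"lift={lift:.1%}: ~{n / treated_per_day:,.0f} days")
# Under these assumptions: a few days for a 1% lift, the better part of a
# year for a 0.1% lift, on the same metric and the same traffic.
```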
The myth of GMB
From a company's perspective, there is no doubt that revenue matters a lot. In eBay's business context, GMB has always been a key metric that finance teams track and monitor, as it should be. Whether it's the money or the users (buyers and sellers) that matters more for a company's health, growth, and success is a separate discussion, but in both experimentation and product development contexts, we would argue that users and their behaviors are far more measurable and provide far more actionable insights for your product development journey. GMB is the noisiest metric of all, and it measures purchase behavior, the very end of the funnel. Remember: the more steps between your actual product change and the event measured by your primary success metric, the more noise in the data, and often the longer the required duration. If you don't want to wait 6.5 years to get a read on your test, resist the temptation to define success as "boosting GMB by 1%."
A related case is driving conversion. One question to always ask is: what exactly is "conversion" in your case? Conversion is the general movement of users one step further down the shopping funnel, and thus can mean quite different things depending on context: conversion for a personalized module on the homepage might mean users showing interest in the module and clicking through to a VI page, while conversion on a VI page might mean a Buy It Now click or another action showing purchase intent. Note that conversion never means GMB or revenue. It is widely understood in the industry as "the percentage of users who perform a desired action." If that desired action is making a purchase, then the conversion rate can be the percentage of users who completed at least one purchase during the test period, but not the average GMB generated by a user or the average GMB per session. The good news is that, defined properly, a conversion rate makes a great candidate for a primary success metric, and as a binary metric it usually needs a shorter duration to reach statistical power.
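Here is a minimal sketch of why a binary conversion rate tends to power faster than a heavy-tailed dollar metric: a Bernoulli metric's standard deviation is capped at sqrt(p(1-p)), while per-user GMB has a much larger spread relative to its mean. The baseline rate and the GMB coefficient of variation below are illustrative assumptions.

```python
# Users needed to detect a 1% relative lift: binary conversion rate vs. a
# noisy continuous dollar metric. Inputs are illustrative, not eBay's actuals.
from math import sqrt
from scipy.stats import norm

z = norm.ppf(0.975) + norm.ppf(0.8)     # alpha=0.05 two-sided, power=0.8
p = 0.08                                # assumed baseline purchase-conversion rate
cv_conversion = sqrt(p * (1 - p)) / p   # CV of a Bernoulli metric (~3.4 here)
cv_gmb = 10.0                           # assumed CV for per-user GMB (heavy-tailed)

for name, cv in [("conversion rate", cv_conversion), ("GMB per user", cv_gmb)]:
    n = 2 * (z * cv / 0.01) ** 2        # users per arm for a 1% relative lift
    print(f"{name}: ~{n:,.0f} users per arm")
# Under these assumptions the dollar metric needs roughly 9x the sample size.
```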
Hopefully by now you have a much better sense of how to define the success of your product or feature, whether that version of success is easily measurable, and how to leverage different types of metrics to get the whole picture when measuring success with experimentation. Thank you for reading, and please feel free to leave comments, questions, or suggestions below! Have fun experimenting.