External Validity of Online Experiments: Can We Predict the Future?
http://approjects.co.za/?big=en-us/research/articles/external-validity-of-online-experiments-can-we-predict-the-future/
Wed, 20 Nov 2024

“It is difficult to make predictions, especially about the future”

– Yogi Berra (perhaps apocryphal)

How well can experiments be used to predict the future? At Microsoft’s Experimentation Platform (ExP), we pride ourselves on ensuring the trustworthiness of our experiments. We carefully check to make sure that all statistical tests are run correctly, and that the assumptions that underlie them are valid. But is this enough? All this goes to the internal validity of our experiments. What about their external validity [1]?

Suppose an experiment on Norwegian users shows that a change to your website increases revenue by 5%, with a tiny p-value and narrow confidence interval. How confident can you be that the treatment would increase revenue by 5% when shipped in Korea? Most data scientists would express reservations, urging that a separate experiment be run on Korean users. They would point out that Norwegian and Korean users likely have different preferences and user behaviors, and a change that users in one country love may be hated in another. In other words, they would question the external validity of the experiment and say that if you wanted to draw conclusions about the second population of users, you should run an experiment based on that population.

External Validity of the Future

However, at ExP we (along with every other online experimentation platform in the world) routinely assume the external validity of our results on a population that we never experimented on: users in the future. If we see a 5% revenue gain in an experiment one week, we assume that this means we will get revenue gains after we ship it, even though the future is different: user behavior may change over time, the type of users who use the product may change, other developers may ship features which interact with the first one, etc… How much should we worry about external validity here?

It’s a bit strong to say that we just “assume” this. We’re of course well aware both that the future is different, and that issues like “the winner’s curse” lead to systematically overestimating treatment effects [2,3]. We frequently validate our assumptions by running reverse experiments (where the treatment reverts a feature) or holdout flights (where a holdout group of users is never shipped one or more new features) [4,5]. Our focus here is not on these sorts of checks, but rather on how often we should expect to have problems with external validity for users in the future.

External Validity and Surprises

Suppose we’ve collected data for a week and calculated a treatment effect and 3σ confidence interval for a metric. We’re planning to collect a second week of data and combine it with the first week of data to get an even better estimate. How confident are you that the new estimate will lie within that original 3σ confidence interval? How often would you expect to be surprised? Would you expect the surprises to come from external validity problems? Before reading on, try to form a rough estimate for your expectation.

Next Day External Validity: What Will Tomorrow Bring?

We started by looking at one-day metric movements. Most ExP experiments generate a 7-day scorecard which calculates treatment effects and their standard errors for all metrics relevant to the feature being experimented on. The scorecard typically also calculates these for each of the 7 individual days. We collected all 7-day scorecards generated by ExP over the course of a week. For every metric, for all 6 pairs of adjacent days, we compared the treatment effect estimates for those two days by calculating

\( z = \frac{\Delta_{t+1} - \Delta_t}{\sqrt{\sigma_t^2 + \sigma_{t+1}^2}} \)

Here \( \Delta_t \) is the observed treatment effect on day \( t \) and \( \sigma_t \) is its standard error. This gave us several million treatment effect pairs, drawn from over a thousand online experiments.
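
To make the computation concrete, here is a minimal sketch, with invented column names and data, of how these day-over-day z-values can be derived from a table of daily treatment effects and standard errors; it is an illustration rather than ExP’s production code:

```python
import numpy as np
import pandas as pd

def adjacent_day_z(daily: pd.DataFrame) -> pd.Series:
    """Compute z for each pair of adjacent days.

    `daily` is assumed to have one row per day, sorted by date, with
    hypothetical columns `effect` (treatment effect) and `stderr` (its standard error).
    """
    delta = daily["effect"].to_numpy()
    sigma = daily["stderr"].to_numpy()
    # z = (delta_{t+1} - delta_t) / sqrt(sigma_t^2 + sigma_{t+1}^2)
    z = (delta[1:] - delta[:-1]) / np.sqrt(sigma[:-1] ** 2 + sigma[1:] ** 2)
    return pd.Series(z, index=daily.index[1:])

# A 7-day scorecard for one metric yields 6 adjacent-day z-values.
days = pd.DataFrame({
    "effect": [0.0009, 0.0011, 0.0008, 0.0012, 0.0015, 0.0013, 0.0010],
    "stderr": [0.0001] * 7,
})
print(adjacent_day_z(days))
```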

Next-Day Deviations: a First Look

If the treatment effects from the two adjacent days are drawn independently from the same distribution, we should expect z to follow a unit Gaussian distribution. In practice, we might expect some positive correlation between the two, which would shift the distribution to smaller values of |z|. Comparing the distributions:

Figure 1: Day-to-day differences of treatment effects follow the expected normal distribution for |z| < 3.

At first sight, it looks pretty good! There’s an extra spike in the observed distribution at z=0, which corresponds to a large number of metrics that have exactly 0 treatment effect on both days. Most of those come from poorly designed metrics that are almost always 0 in both control and treatment. But other than that, the fit looks quite close.

Next-Day Deviations: a Second Look

Before we declare success and close up shop, let’s switch to a cumulative distribution function, and plot on a log scale:

Figure 2: Day-to-day differences of treatment effects are much more common than the normal distribution would predict for |z| > 3.

Now we see that the match is pretty good for |z| < 3, but past that point, large |z| values occur much more often than the Gaussian distribution would predict. As mentioned above, if there were positive correlations, we would have fewer large values of |z| than predicted by the unit Gaussian. But we have more. To show a few sample values from the graph above:

z     | Observed CDF | Unit Gaussian CDF
1.96  | 3.4%         | 5.0%
3     | 0.25%        | 0.27%
4     | 0.09%        | 0.006%
5     | 0.07%        | 0.00006%
10    | 0.03%        | 2×10⁻²¹%
Table 1: Selected data points from the graph above, comparing the observed CDF with the CDF of a unit Gaussian distribution.

Observing differences with |z| > 10 should essentially be impossible. In practice, it’s not common, but it does happen much more often than it should: 3 out of 10,000 times isn’t a lot in absolute terms, but it’s a factor of roughly 1.5×10¹⁹ too high!
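
The arithmetic behind that factor is easy to reproduce; a quick check of the |z| > 10 excess under the unit Gaussian assumption (using scipy, which is not part of the original analysis) looks like this:

```python
from scipy.stats import norm

observed_rate = 3 / 10_000            # roughly 0.03% of adjacent-day pairs had |z| > 10
expected_rate = 2 * norm.sf(10)       # two-sided tail probability for a unit Gaussian

print(f"expected rate: {expected_rate:.1e}")                         # about 1.5e-23
print(f"observed / expected: {observed_rate / expected_rate:.1e}")   # on the order of 10^19
```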

Weekday and Weekend Effects

You might think that these large discrepancies come from comparing weekdays to weekends. For many business products, like Office or MS Teams, we would expect usage to be quite different on Friday and Saturday, for example. If we segment based on whether we’re comparing two weekdays, or two weekends, or a weekday to a weekend, we do find more large discrepancies when comparing a weekday to a weekend. But large discrepancies are found for all three categories:

Figure 3: Day-to-day differences of treatment effects, segmented by whether we are comparing two weekdays, two weekends, or a weekday to a weekend.

This tells us that there’s a problem with using the data on day n to draw conclusions about day (n+1). It’s not a huge external validity problem. The assumption that today’s treatment effect is a reasonable predictor of tomorrow’s holds most of the time — but the bad predictions are worse than we’d naively expect. The distributions from which our samples are drawn on day n and day (n+1) are not the same, and that leads to large differences more often than you would expect. Another way of saying this is that if you looked at the standard error on day n, you would be overconfident about what you might see on day (n+1).

Differences between treatment effects on adjacent days follow the expected Gaussian distribution for |z|<3, but larger values of z occur much more often than expected.

This means that the true underlying treatment effects that we’re trying to measure are changing from day to day. These changes might be due to external factors, or to learning effects such as novelty and primacy [6]. We’ll return to this later – and show that systematic time trends such as those from novelty and primacy aren’t the sole causes – but first let’s dig into the day-to-day movements.

Day-to-Day Volatility

This graph shows treatment effects on each of the first 7 days of an experiment at ExP (the metric and product have been anonymized):

Figure 4: Time series for an anonymized metric, illustrating multiple shifts from day to day larger than expected from the daily error bars.

We see that there’s a lot of day-to-day volatility. The treatment effects shift up and down from day to day more than we would expect from the daily error bars. For example, on day #1 we are quite sure that the treatment effect is 0.0009, with a small error bar. But on day #5, we are sure that it’s 0.0015 – almost twice as much – again with a very small error bar.

The final result in the full 7-day scorecard is shown as a purple line, with the 95% confidence interval as a band around it. At ExP we warn customers that they should usually run their experiments for at least a week, to avoid weekday vs. weekend effects, or other events isolated to a single day. We next look to see if this gets rid of the external validity problem.

What Will Next Week Bring? (Not Necessarily Smaller Error Bars)

We next looked at all experiments at ExP over the course of a month that had 7-day and 14-day scorecards, to see what impact the addition of a second week of data had. We restricted ourselves to cases where the 7-day results were statistically significant (p < 0.05), since those are the metric movements our customers are most interested in. Since this is a biased sample, we also checked that the results were qualitatively similar when looking over all metric movements, without restriction by p-value.

The first big surprise, before even looking at the treatment effect values, is that when we go from 7-day to 14-day scorecards, the standard error decreases 83% of the time, and increases 17% of the time. Under the assumption that all data is drawn from the same fixed distribution, increasing the amount of data should almost always decrease the error bars. We’ve excluded from this analysis metrics that are cumulative, like “# of active user days,” restricting ourselves to metrics that don’t intrinsically grow with time, like “% of days a user is active.”

This is a clear sign that the data in the second week is drawn from a substantially different population than in the first week. We’re often not getting a more precise estimate of a point value. We’re instead mixing in samples from a new distribution and increasing the error bars.

Adding in a second week of data increases error bars on treatment effects 17% of the time.

This shows that even after getting a full week of data, we still don’t have a stable treatment effect. Whether that’s because the treatment effects are changing due to factors external to the experiment, or due to learning effects within the experiment, is something we will address later, but the conclusion is the same: we often do not have a stable measurement of the treatment effect even with a full week of data.

How Confident Can We be in Confidence Intervals?

Next, we looked at how often the 14-day result fell outside the 7-day 95% and 3σ confidence intervals. We call these “14-day surprises.” If we could get the true treatment effect, instead of just a 14-day result, this should only happen 5% and 0.3% of the time, respectively. The 14-day result instead fell outside 9% and 4% of the time. As a concrete example, for the time series we saw earlier in Figure 4, the 14-day result falls outside the 95% confidence interval, showing that even analyzing a full week of data didn’t fully capture the volatility of this metric.

You might object that the 14-day result isn’t the true treatment effect (and the very concept of the “true” treatment effect is rather nebulous, given that we’ve established that the distribution is changing over time). In reality, the 14-day result is a separate measurement, correlated with the 7-day result, since it’s based on a superset of the 7-day data. It’s not possible to generally calculate the probability of falling outside the 7-day confidence interval, because this correlation can be very complicated. For example, metrics can be user-level, with user activity spanning both weeks, or percentile metrics, which depend holistically on all the data.

How Often Should We be Surprised?

However, if the metrics are average metrics, with N samples in the first week and N samples in the second week, then in the limit of large N, the probability of falling outside the 7-day confidence interval associated with \( z_\alpha \) is \( 2\,\Phi(-\sqrt{2}\, z_\alpha) \), where Φ is the unit Gaussian cumulative distribution function. This gives a probability of 0.56% for falling outside the 95% confidence interval, and 0.002% for the 3σ confidence interval. The probabilities are even smaller than when we earlier treated the 14-day result as the true treatment effect, because the errors of the 14-day result are directionally correlated with those of the 7-day result – a relationship that should generally hold even for more complicated metrics.
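
That closed-form value is easy to check with a quick simulation; here is a minimal sketch under the stated assumptions (equal-sized, independent weekly samples of a simple average metric with a known standard error), which reproduces the 0.56% and 0.002% figures:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
trials = 2_000_000

# Weekly mean estimates, expressed in units of their (known) weekly standard error.
week1 = rng.normal(size=trials)        # 7-day estimate
week2 = rng.normal(size=trials)
two_week = (week1 + week2) / 2         # 14-day estimate (both weeks pooled)

for z_alpha in (1.96, 3.0):
    surprise_rate = np.mean(np.abs(two_week - week1) > z_alpha)   # outside the 7-day CI
    closed_form = 2 * norm.sf(np.sqrt(2) * z_alpha)
    print(f"z_alpha={z_alpha}: simulated {surprise_rate:.4%}, closed form {closed_form:.4%}")
```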

In the end, our concern here is less the precise theoretical value, and more that a typical data analyst takes a confidence interval as a good sign that “the right answer is probably in here,” and expects collecting more data to get them closer to “the right answer.” How likely are they to be surprised when they get more data? A 4% surprise rate for 3σ confidence intervals means a lot of surprises: we can’t be as confident in our confidence intervals as we would have initially expected.

When we add a second week of data, the new point estimate falls outside the 1-week 3σ confidence interval 4% of the time.

External Validity and Novelty: Why do Things Change Over Time?

We’ve established that the measured treatment effect changes over time, so that even a 7-day scorecard often doesn’t produce a stable measurement of the treatment effect. Does this occur due to factors external to the experiment, or due to learning effects?

Learning Effects

Alert readers may have been wondering if these changes can be attributed to learning effects, such as  “novelty” or “primacy” [6]. We’re talking about the distribution changing with time, and it’s a well-known problem that users may take some time to adjust to online changes. Perhaps when a feature is introduced, users engage with it out of curiosity, making the feature look quite popular, but once the novelty has worn off, the gains fade away. Or perhaps the initial reaction appears bad, because users have difficulty adjusting to change, but once they get used to the new feature, the apparent regressions disappear or even reverse.

Given the following 7-day time series, most informed experimenters would warn that the treatment effect is systematically increasing with time, and that the 7-day confidence interval should be viewed with suspicion. They would say there’s a good chance that the 14-day result will lie above the 7-day confidence interval. (And in this case, they would be right.)

Figure 5: Time series of treatment effects showing a systematic time trend.

Is this the cause of the 14-day surprises that we saw earlier? When we visually scan through the time series for the 14-day surprises, we do see many graphs with systematic time trends, such as in figure 5, but we also see many graphs like that in figure 4, where the results move up and down randomly from day to day.

Kendall’s Tau Statistics

Learning effects should generally result in systematic trends, with treatment effects monotonically increasing or decreasing with time. To quantify our observation that the 14-day surprises are from a mixture of causes, we plot the distribution of Kendall’s τ statistic on the 7-day series [7]. This statistic is a measure of how “time-trendy” the data is: it’s +1 if the data is monotonically increasing, -1 if it’s monotonically decreasing, and otherwise takes on some intermediate value depending on how consistently the data is trending up or down.
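
For illustration, here is a small sketch (using scipy; the daily effects are invented) of how the observed τ for one 7-day series and its trend-free reference distribution can be computed:

```python
from itertools import permutations

import numpy as np
from scipy.stats import kendalltau

effects = [0.0009, 0.0011, 0.0008, 0.0012, 0.0015, 0.0013, 0.0010]
tau, _ = kendalltau(range(len(effects)), effects)
print(f"observed tau = {tau:.3f}")

# Trend-free reference: with no time trend, every ordering of the 7 daily effects is
# equally likely, so enumerate all 7! = 5040 permutations (series with ties excluded).
n = len(effects)
null_taus = [kendalltau(range(n), perm)[0] for perm in permutations(range(n))]
values, counts = np.unique(np.round(null_taus, 6), return_counts=True)
for v, c in zip(values, counts):
    print(f"tau = {v:+.3f}: probability {c / len(null_taus):.4f}")
```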

Figure 6 shows the distribution of Kendall’s τ across all 7-day series in blue. The theoretical distribution of Kendall’s τ that we should expect if there were no time trends, so that all 7! permutations of a time series are equally likely, is plotted in purple. (For simplicity we’ve removed a small number of time series with ties between treatment effects on different dates.) When looking over all time series, larger absolute values of τ are slightly preferred, showing that some time trends exist, but that the overall bias is relatively small.

Figure 6: Distributions of Kendall’s τ. The distribution over all time series looks similar to the theoretical trend-free distribution. The 14-day surprises show a shift towards larger values of |τ|, but still include a significant number of low values of |τ|, showing that a large number of surprises cannot be attributed to learning effects.

Learning Effects vs. External Validity

When we restrict to the cases where we have 14-day surprises (14-day results outside the 7-day 95% CI), the distribution shifts significantly to larger absolute values of τ, showing that learning effects are responsible for a significant number of the surprises. However, there is still a healthy portion of the distribution at low values of |τ| that would occur naturally even with a trend-free distribution, confirming our observation that many of the time series look like those in Figure 4, with no observable time trend, and thus no sign of learning effects. Novelty and primacy effects are significant causes of treatment effects changing over time, but they are not the sole causes.

Conclusions

This shouldn’t be a cause of great pessimism or despair about the value of statistical tests or online experimentation. The external validity issues that we’ve seen here are the exception, not the norm. Extrapolating from statistical tests on past data to the future works most of the time, and online experimentation is still the best way to measure the impact of your changes and estimate what will happen when you ship them.

But it should be a cause for at least some healthy skepticism about the future impact of your changes. If you really want to be sure about that impact in the future, you need to continue to measure it, with extended experiment durations, reverse flights or holdout experiments. Observing much higher day-to-day volatility in the treatment effects than you would expect from the single day standard errors provides a reasonable starting point for detecting cases where you need to do this.

Bibliography

[1] “The Importance of External Validity,” A. Steckler and K. R. McLeroy, Am. J. Public Health, 98: 9-10 (2008)

[2] “Selection Bias in Online Experimentation,” M. Shen, Medium.com

[3] “Winner’s Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments,” M. Lee and M. Shen, KDD ’18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, p. 491-499

[4] “Holdout testing: The key to validating product changes,” statsig.com blog

[5] “Holdouts: Measuring Experiment Impact Accurately,” geteppo.com

[6] “Novelty and Primacy: A Long-Term Estimator for Online Experiments,” S. Sadeghi et al., https://arxiv.org/abs/2102.12893

[7] “A New Measure of Rank Correlation,” M. Kendall, Biometrika, 30: 81-89 (1938)

Experimentation in Generative AI: C++ Team’s Practices for Continuous Improvement
http://approjects.co.za/?big=en-us/research/articles/experimentation-in-genai-c-teams-practices-for-continuous-improvement/
Wed, 13 Nov 2024

By Sinem Akinci, Microsoft Developer Division and Cindy Chiu, Microsoft Experimentation Platform

Generative AI [1] leverages deep learning models to identify underlying patterns and generate original content, such as text, images, and videos. This technology has been applied to various industries, including customer service, marketing, and software development. A popular example is GitHub Copilot, which generates code based on open-source data.

The generative AI space is undergoing rapid transformation, with new updates and changes emerging daily. Products leveraging generative AI must constantly make decisions on the right set of parameters, models, and prompts to find the best combination. Experimentation plays a crucial role in navigating this dynamic landscape, enabling data-driven decision-making and the continuous refinement of generative AI features. As a case study, let’s now explore how the Microsoft C++ team applies this technology in practice, using experimentation to develop and refine GitHub Copilot features.

In this blog post, we will first provide a general overview of best practices for experimenting and evaluating generative AI features. Then we will highlight some of these practices that the C++ team uses to develop GitHub Copilot features with experimentation. We will explain how these best practices benefit the product. Lastly, we will conclude with an example of a new feature we shipped leveraging these practices. 

Methods for making data-driven decisions for generative AI products 

What are qualitative methods?  

Qualitative methods [2] offer valuable insights into the user experience through various approaches such as usability studies, surveys, focus groups, interviews, and diary studies. These methods help uncover the nuances that are hard for quantitative methods to capture, providing an initial understanding of user interactions. However, since qualitative methods often come from smaller sample sizes, they may not provide a complete picture. Instead, these methods enable developers to identify gaps between features and user needs, particularly for generative AI features that involve both model content and user interface.  

What are quantitative methods? 

Quantitative methods for evaluating generative AI features can be divided into two categories: offline evaluation and online evaluation. 

Offline evaluation, which includes techniques like hyperparameter tuning and grid search, assesses model accuracy and feature performance before deployment. This approach works particularly well when there are known ground truth values and clean datasets. By using various datasets and predefined metrics, developers can compare models and benchmarks cost-effectively without exposing them to actual users. 
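
To make that concrete, here is a toy sketch of such an offline loop: it grid-searches over hypothetical model/temperature configurations and scores each against a small labeled evaluation set. The `generate` stub, model names, and exact-match metric are placeholders, not a real evaluation harness:

```python
from itertools import product

# Hypothetical evaluation set with known ground-truth answers.
EVAL_SET = [
    ("Which header declares std::vector?", "<vector>"),
    ("Which C++11 keyword marks a compile-time constant expression?", "constexpr"),
]

def generate(prompt: str, model: str, temperature: float) -> str:
    # Placeholder: replace with a real call to the model under evaluation.
    return "<vector>" if "vector" in prompt else "constexpr"

def exact_match_rate(model: str, temperature: float) -> float:
    """Fraction of evaluation prompts whose output exactly matches the ground truth."""
    hits = sum(generate(p, model, temperature).strip() == truth for p, truth in EVAL_SET)
    return hits / len(EVAL_SET)

# Grid search over candidate configurations, scored offline with a predefined metric.
scores = {
    (model, temp): exact_match_rate(model, temp)
    for model, temp in product(["model-a", "model-b"], [0.0, 0.2, 0.7])
}
best_config = max(scores, key=scores.get)
print(best_config, scores[best_config])
```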

Online evaluation, such as A/B testing, involves exposing the feature to actual customers. It verifies the results observed during offline testing in a real-world context, capturing true user interactions and ensuring the feature performs effectively in production. 

Incorporating all methods into your product lifecycle 

AI solution lifecycle for data science and ML engineering

The generative AI product lifecycle [3] is an iterative approach to preparing, deploying, and improving a generative AI feature over time. During the experimentation and evaluation stage, offline evaluation is used to assess whether the model performs better than other baselines. Although offline evaluation provides an understanding of model accuracy, it does not represent user interactions, making online testing essential. 

A/B testing helps validate the results by capturing real user interactions. Once the model is deployed, qualitative methods such as user studies can be used to collect user feedback, particularly for features designed for user interaction. This feedback is then incorporated to further refine and improve the feature, completing the product lifecycle. 

Using progressive rollout to test your generative AI feature 

What is progressive rollout? 

Progressive rollout starts by exposing a new feature to a small percentage of users and gradually rolling it out to more users based on its performance. Traffic as small as a few thousand samples is used to test whether the feature works as expected and observe any movement in user metrics, rather than to make a definitive decision on shipping. 

What’s the benefit of progressive rollout?    

Mitigating risk of errors or bias: Due to the non-deterministic nature of AI, generative AI features can sometimes produce unexpected or inappropriate content. By gradually rolling out the feature, developers can be assured that the work they have done to address unexpected output holds up broadly, safeguarding against potential harm. This approach also helps in detecting data quality issues, such as Sample Ratio Mismatch (SRM) or inappropriate data movement, ensuring a more reliable deployment. 

Learning and Improvement through performance management: Latency is a key component of performance, and it can significantly impact generative AI products. Users may abandon the feature if the response time is too long. Measuring performance and latency is essential to ensure that the user is getting the intended value in a timely manner. By identifying regressions in performance metrics, such as increased response times or higher crash rates, early on, these issues can be addressed promptly. Progressive rollout not only allows the product team to provide hotfixes while the feature is still exposed to a small percentage of users, but also helps predict capacity needs more accurately, ensuring the best user experience as capacity ramps up. 

Iterating experiments to optimize your feature 

Why run multiple iterations? What are the benefits? 

Developers frequently run multiple experiments on the same product. As highlighted in the generative AI product lifecycle, after collecting user feedback or analyzing experiment results, developers can iterate on the experiment to better incorporate the feedback and enhance the product. As generative AI models evolve, various models become available for production. One key question is: which model is best for the users? This varies by feature. For instance, AI-assisted renaming functions require quick response times. Renaming occurs during the natural developer flow, which requires a responsive interaction. If this responsiveness isn’t achieved, the feature’s benefit may decrease as developers might prefer to continue their work stream rather than be delayed by latency. Conversely, features like pull request reviewer benefit from models that are capable of more complex reasoning, where precision is more critical than speed. A/B testing different models helps developers determine whether users prefer faster responses or higher quality. 

When iterating over experiments, teams can refine hypotheses, modify treatments, and test new variations. For example, experiments can be conducted on different language models. This iterative method enables an experience in production that maximizes user engagement. Continuous iteration and refinement not only lead to more polished products but also ensure that the product evolves in alignment with user needs and preferences. 

Combining best practices to help C++ users: Copilot in Quick Info case study 

An example of a feature that the Microsoft C++ team developed using both qualitative and quantitative methods was C++ Copilot integrations in the Quick Info dialog in Visual Studio. 

Copilot in Quick Info is an AI-based feature that provides users with an AI-generated summary of a symbol that they are hovering over. Users need to select “Tell me more” on hover to invoke Copilot to provide a summary of that particular symbol. The goal of this feature was to give users quick, accurate information about a symbol that may lack documentation, without switching context. 

Example of the symbol that invokes the Copilot in Quick Info feature
An AI-generated summary of what the function does that appears in the Visual Studio Quick Info

Progressive rollout of initial design 

After initial development, the C++ team ran an A/B experiment to measure the feature’s impact on a series of metrics. They defined metrics to ensure that it would provide value to the customer while not introducing errors to the product. This first iteration of experimentation revealed that the functionality improved engagement with Copilot Chat for C++ users without regressing error metrics. 

Qualitative studies of initial design 

In tandem, they ran a user study to validate the design of the feature. The developers interviewed prioritized quick results and wanted an option to follow up on the response. This feedback was instrumental in shaping the subsequent iterative A/B experiment. 

Iterative experimentation on feature  

In response to this feedback, they ran two follow-up quantitative A/B experiments. First, to evaluate how quicker results affected user value, they ran an A/B experiment that swapped the model behind the feature for a lighter-weight, faster model. Second, to evaluate the follow-up prompt, they ran an A/B experiment with a new “Continue in chat window…” option added below results, to measure how this affected product value and ensure it did not introduce errors. 

Iterative A/B experimentation on AI models can lead to broader learnings about product behavior. For example, features that are frequently invoked and close to users’ workflows may benefit from models with faster response times, such as this Quick Info feature. On the other hand, response times may matter less for features that provide more in-depth information which users step out of their workflow to interpret, such as the Fix with Copilot feature. These types of features benefit more from models that provide more verbosity and accuracy in their responses. 

Putting things together 

Determining the effectiveness of our generative AI feature requires a blend of various evaluation methods. We begin by deciding whether to start with quantitative or qualitative approaches. These evaluations are integrated into our product lifecycle to continually enhance our generative AI product. Once our experiment is set up, we progressively roll out the feature to minimize unexpected behavior. We start by testing on a small group before expanding to a broader audience. After obtaining our experiment results, we use them to refine and improve the product through iterative experimentation. 

By combining these best practices, we achieve a comprehensive understanding of our generative AI feature’s impact and effectiveness. This holistic approach ensures that our generative AI feature is both user-centric and performance-driven, providing a better user experience and achieving our business goals. 

– Sinem Akinci (Microsoft Developer Division), Cindy Chiu (Microsoft Experimentation Platform) 

References

[1] Reddington, C. (2024, May 14). How companies are boosting productivity with generative AI. The GitHub Blog. https://github.blog/ai-and-ml/generative-ai/how-companies-are-boosting-productivity-with-generative-ai/#what-is-generative-ai

[2] Peckham, S., & Day, J. (2024, July 1). Generative AI. Microsoft Learn. https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/

[3] Stevenson, J., & Ostrowski, S. (2022, February 11). Measurably improve your product by combining qualitative and quantitative methods. Microsoft Research. http://approjects.co.za/?big=en-us/research/articles/measurably-improve-your-product-by-combining-qualitative-and-quantitative-methods/

A/B Testing Infrastructure Changes at Microsoft ExP
http://approjects.co.za/?big=en-us/research/articles/a-b-testing-infrastructure-changes-at-microsoft-exp/
Mon, 29 Jan 2024

Key takeaways:
  • A/B tests are a powerful tool for rolling out infrastructure and backend changes in a controlled way, uncovering surprises and important issues that other testing approaches may miss.
  • A little data can go a long way. With Variance Reduction enabled, A/B testing internal-facing services with modest traffic can still generate trustworthy stat-sig results and enable data-driven ship decisions.
  • Focusing on back-end specific metrics such as individual API call latency is insufficient. Product-wide guardrail metrics are needed to see the full impact of infra changes on the user experience.
  • Even mature product data pipelines and metric definitions can harbor bugs. Validate and improve the telemetry and metrics used for running your infrastructure A/B tests.
  • Plan for multiple iterations of Infra A/B tests, learn more about your product and users from every iteration, and use those learnings to refine your design.

Introduction

The Experimentation Platform at Microsoft (ExP) has evolved over the past sixteen-plus years and now runs thousands of online A/B tests across most major Microsoft products every month. Throughout this time, we have seen impactful A/B tests on a huge spectrum of changes, everything from large product overhauls to the smallest bug fixes. 

ExP infrastructure had to evolve and scale significantly over time to meet the needs of our users and the requirements of advanced methodologies that the team pioneered [1], [2]. In the most recent phase we’ve made ExP infrastructure more secure, resilient and available across multiple Azure geographies – features expected by product teams both inside and outside Microsoft.

Making major changes to infrastructure can be risky, but A/B experiments are an excellent tool to mitigate that risk and understand the causal effects of the changes being made. A frequent misconception is that A/B experiments are only suited for front-end or user-facing changes, but we will demonstrate how we leveraged our own A/B testing platform for rolling out extensive infrastructure changes in a controlled manner. We’ll share the lessons learned to help other teams use A/B experimentation for their own infra deployments!

The desired state

Our system started with a relatively simple n-tier architecture: a web UI that interacts with a set of backend API services that connect to a database. The following figure illustrates an example of this architecture:

Figure 1. Initial state.

As noted earlier, the end goal was to scale out the platform to multiple regions, and this also presented a great opportunity to harden the API surface and modernize some of the underlying infrastructure. In practice, that meant sending all API requests through Azure FrontDoor, which adds DDoS and firewall protections, and then isolating all backend resources from direct internet traffic. Our system includes a reverse-proxy layer which agglomerates the numerous backend services into one Experimentation Platform API that users interact with; this reverse-proxy layer was key for orchestrating A/B tests in the later stages of the process.

Figure 2. Desired state.

A/B Test Designs

We wanted to ensure that our architecture changes did not negatively affect the performance or reliability of the system. To do this, we used A/B testing to compare the old and new architectures and see how they influenced key metrics such as latency, throughput, availability, and error rate. We divided our work into two major A/B tests to roll out the changes gradually and measure the effects precisely:

  • The first test measured the impact of adding new routing and networking layers into the system; we anticipated some performance degradation in the treatment variant and needed to quantify it. Treatment variant traffic would pass through the additional layers while control variant traffic took the existing, simpler path.
  • The second test measured the difference between the existing “serverless” infrastructure (control variant) and equivalent services hosted on an App Service Environment inside a virtual network (treatment variant) – we did not expect this change to cause a significant difference.

From an engineering perspective it might seem easier to ship the whole set of changes as a single release; however, since we had two distinct hypotheses, each should be tested independently. This methodical approach paid off by providing us with better insight when things did not go as planned, as you will see in the results & learnings sections.

First A/B Test: Routing requests through Reverse Proxy

In the first A/B test, we evaluated the impact of introducing additional routing layers into the system – the requests in the treatment variant were now processed by Azure FrontDoor, the Web Application Firewall, and a reverse proxy before being processed by the application service. In addition to rewriting URLs to align them with URL patterns, the new reverse proxy enables us to route traffic based on variant assignments – which we did in the second A/B test. The hypothesis for this first A/B test was that we would observe an acceptably small increase in page load metrics and API latency due to the new functionality and features in the networking stack. Measuring the duration of individual API requests is simple, but that may not capture the true impact on users’ experience – websites typically make multiple calls to the back-end APIs, sometimes sequentially, and a small regression in API latency may cause a much more severe regression in overall user experience. Therefore, it was imperative to know how a change like this would impact the overall user experience.

Figure 3. First A/B Test: Routing requests through Reverse Proxy.

First A/B Test: Results & Learnings

We anticipated a regression of 30-40ms per API request (about 5-10% for most calls). Weighing the API client workloads, the security improvements, the added functionality, the improved observability, and the internal-facing nature of these APIs, the team deemed this to be acceptable.

In this first iteration Azure FrontDoor was communicating with our backend App Service via private links. This configuration is simple and elegant but, in some cases, results in cross-datacenter routing since Azure FrontDoor only supports private links in a limited set of regions [5]. We knew of this potential inefficiency but overall did not expect it to be significant. However, the initial A/B scorecard showed that on some frontend components the treatment group saw a severe degradation in “main functionality load” metrics (the time between the user landing on a page and seeing something useful on the screen) – in some cases it doubled! An in-depth investigation by the engineering team revealed that the minor API latency increase was amplified by (1) the UI code making certain requests sequentially, and (2) calls requiring CORS preflight requests [3], which could double the network overhead. Without an A/B scorecard the unanticipated impact could have been easily missed: some of the components receive relatively little traffic, and drastic changes in daily metric values can easily be dismissed as an outlier or high variance.

For the next iteration we replaced private links with App Gateways to optimize the network paths. After addressing the unexpected latencies, we obtained an acceptable result and we shipped the treatment variant to 100%.

With a clear understanding of the trade-off made by introducing an additional networking layer, we were now ready to confidently evaluate the next phase of our infrastructure changes.

Second A/B Test: Routing requests to different backend services

The second A/B test was conducted within the reverse proxy, with the goal of evaluating the impact of hosting backend services inside a private virtual network, which we did to align with best practices for system architecture and security [4]. We used the reverse-proxy component introduced in the first A/B test to split API traffic between the existing publicly accessible application service instance (control) and the new app service hosted inside the virtual network (treatment). The hypothesis was that there would be no significant difference between the two variants in terms of performance and product experience (e.g. no increase in error rates).

Figure 4. Second A/B Test: Routing requests to different backends

Second A/B Test: Results & Learnings

We ended up going through more than 10 iterations of this A/B test:

  • Several bugs had to be fixed, code redeployed, which meant running the A/B test again.
  • We leveraged feature flags to quickly rollback bad deployments, in some cases before an A/B scorecard was even generated.
  • A/B scorecards helped identify subtle performance differences in the new infrastructure.
  • Early iterations revealed new gaps in telemetry; some backend and frontend events did not correlate well, limiting the usefulness of early A/B scorecards.
  • Telemetry pipelines need to evolve in parallel with the rest of the system, sometimes that means re-running the A/B test to collect new data.

In one iteration we discovered a redundant API call in the treatment variant (the instance deployed inside the virtual network). The regression was introduced when we modified the service startup code to run in the new isolated environment. This change did not cause any correctness issues and sailed through the automated test suites. Applying feature flags to service startup code or runtime parameters is generally difficult and often impossible; to A/B test changes like these, the A/B split must happen earlier, and the reverse proxy plays that key role by routing traffic based on the assigned variant.
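
To illustrate the idea, here is a highly simplified sketch of variant-based routing at a reverse proxy; the hashing scheme, experiment name, and backend URLs are invented, and this is not the actual ExP proxy implementation:

```python
import hashlib

BACKENDS = {
    "control": "https://legacy-app-service.example.net",   # existing publicly reachable service
    "treatment": "https://vnet-app-service.example.net",   # new instance inside the virtual network
}

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to a variant by hashing user id + experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

def route(user_id: str, path: str) -> str:
    """Pick the backend for a request based on the caller's assigned variant."""
    variant = assign_variant(user_id, experiment="backend-vnet-migration")
    return f"{BACKENDS[variant]}{path}"

print(route("user-123", "/api/v1/scorecards"))
```

Because the split happens before the request reaches either backend, even differences in service startup code or runtime configuration are covered by the A/B comparison.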

In another iteration, we discovered a difference in the SQL connection configuration. The treatment instances connected to the Azure SQL database through a private link (so we could eventually disable public access to the SQL server), and the default network configuration caused SQL connections to regress from “redirect” to “proxy” mode, which incurs some latency and throughput penalty [6]. The initial phases of the rollout did not show stat-sig differences in performance metrics, but subsequent phases included endpoints which were much more sensitive to latency, and those results clearly showed a statistically significant difference in performance metrics. The result from the later scorecard came as a surprise because we thought this part of the infrastructure had already been vetted with the initial set of endpoints. Leveraging our A/B analysis allowed us to identify this surprise before it created problems in the migration.

In most iterations the A/B analysis proved essential for obtaining trustworthy results and making objective “ship” decisions; otherwise, several regressions could easily have gone undetected if we had relied solely on health counters and anomaly detection. There were also cases where the impact from the changes was severe enough that we did not need to wait for an A/B scorecard with fine-tuned metrics and p-values – it was apparent from basic alerts and health counters. Problems stemming from the authentication credentials and authorization used by production resources are especially difficult to vet ahead of time because credentials and authorization differ between environments. However, controlling traffic via feature flags (exposure control) enabled us to quickly and precisely mitigate the problem by stopping the A/B test and reverting all traffic back to the control variant.

Summary

Ultimately, by leveraging A/B tests for our infrastructure changes, we were able to quickly identify and iterate on several unexpected metric changes that surfaced in the A/B test results. Anytime we detected a regression, we were able to stop the exposure in seconds and start iterating on the variant. We still had to make trade-offs, but we were able to clearly articulate those trade-offs with quantifiable numbers. And in several iterations, we were able to investigate, adjust and verify to ensure that the metrics aligned with our expectations and requirements.

Serguei Michtchenko, Heng-Yi Liu, Caleb Hug, Aleksander Fabijan, Craig Boucher, Microsoft Experimentation Platform

Special thanks to Dhanushi Wijeyakulasuriya, Yanshu Zhu, Michael Cherkasov, Anshuman Vyas, Elvis Tran, Ramesh Boddu, Sashank Kolli, Wenli Li, Benjamin Arai, Jen Townsend, and the rest of the ExP team for their contributions and involvement in this work!

References

  1. http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/deep-dive-into-variance-reduction
  2. http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/alerting-in-microsofts-experimentation-platform-exp/
  3. https://developer.mozilla.org/en-US/docs/Glossary/Preflight_request
  4. https://learn.microsoft.com/en-us/azure/security/fundamentals/network-best-practices
  5. https://learn.microsoft.com/en-us/azure/frontdoor/private-link
  6. https://learn.microsoft.com/en-us/azure/azure-sql/database/private-endpoint-overview?view=azuresql#use-redirect-connection-policy-with-private-endpoints

How to Evaluate LLMs: A Complete Metric Framework
http://approjects.co.za/?big=en-us/research/articles/how-to-evaluate-llms-a-complete-metric-framework/
Thu, 28 Sep 2023

Over the past year, excitement around Large Language Models (LLMs) skyrocketed. With ChatGPT and BingChat, we saw LLMs approach human-level performance in everything from standardized exams to generative art. However, many of these LLM-based features are new and have a lot of unknowns, and hence require careful release to preserve privacy and social responsibility. While offline evaluation is suitable for early development of features, it cannot assess how model changes benefit or degrade the user experience in production. In fact, multiple explorations of GPT-4 capabilities suggest that “the machine learning community needs to move beyond classical benchmarking via structured datasets and tasks, and that the evaluation of the capabilities and cognitive abilities of those new models have become much closer in essence to the task of evaluating those of a human rather than those of a narrow AI model” [1]. Measuring LLM performance on user traffic in real product scenarios is essential to evaluate these human-like abilities and guarantee a safe and valuable experience for the end user. This is applicable not only when deploying a feature; in fact, continuous evaluation of features as they are being developed provides early insight into any regressions or negative user experience while also informing design decisions.

At Microsoft, the Experimentation Platform has worked closely with multiple teams to launch and evaluate LLM products over the past several months. We learned and developed best practices on how to design A/B tests and metrics to evaluate such features accurately and holistically. In this article, we share the standard set of metrics that are leveraged by the teams, focusing on estimating costs, assessing customer risk, and quantifying the added user value. These metrics can be directly computed for any feature that uses OpenAI models and logs their API responses.

GPU Utilization

To estimate the usage cost of an LLM, we measure the GPU Utilization of the LLM. The main unit we use for measurement is the token. Tokens are pieces of words used for natural language processing. For OpenAI models, 1 token is approximately 4 characters or 0.75 words in English text. Prompts passed to the LLM are tokenized (prompt tokens) and the LLM generates words that also get tokenized (completion tokens). LLMs output one token per iteration or forward pass, so the number of forward passes of an LLM required for a response is equal to the number of completion tokens.

We use the following primary utilization metrics – please check the appendix for a full list of metrics.

  1. Number of 429 responses received. A 429 error response is sent when the model and/or service is currently overloaded. We recommend tracking the 95th or 90th percentile of the number of 429 responses to measure behavior at peak load.
  2. Total number of tokens, computed as the sum of prompt tokens and completion tokens. This is the main utilization metric we recommend tracking for GPU Utilization, since OpenAI charges based on the total number of tokens used by the prompt and response (a minimal aggregation sketch follows this list).
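
As a concrete illustration of the total-token metric, usage (and an approximate cost) can be aggregated from the `usage` object that the OpenAI API returns with each response; the per-1K-token prices below are placeholders rather than real rates:

```python
PRICE_PER_1K_PROMPT_TOKENS = 0.01       # placeholder price, not a real rate
PRICE_PER_1K_COMPLETION_TOKENS = 0.03   # placeholder price, not a real rate

def summarize_usage(responses: list[dict]) -> dict:
    """Aggregate token usage across logged API responses (each carrying a `usage` object)."""
    prompt = sum(r["usage"]["prompt_tokens"] for r in responses)
    completion = sum(r["usage"]["completion_tokens"] for r in responses)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
        "approx_cost_usd": prompt / 1000 * PRICE_PER_1K_PROMPT_TOKENS
        + completion / 1000 * PRICE_PER_1K_COMPLETION_TOKENS,
    }

# Example with two logged responses.
logged = [
    {"usage": {"prompt_tokens": 420, "completion_tokens": 180}},
    {"usage": {"prompt_tokens": 95, "completion_tokens": 310}},
]
print(summarize_usage(logged))
```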

Responsible AI

As LLMs get used at large scale, it is critical to measure and detect any Responsible AI issues that arise. Azure OpenAI (AOAI) provides solutions to evaluate your LLM-based features and apps on multiple dimensions of quality, safety, and performance. Teams leverage those evaluation methods before, during and after deployment to minimize negative user experience and manage customer risk.

Moreover, the Azure OpenAI content filtering system captures and blocks some prompts and responses that have RAI issues. It also produces annotations and properties in the Azure OpenAI API that we use to compute the following metrics.

  1. % Prompts with HTTP 400 error. This is the percentage of prompts that are classified at a filtered category and severity level.
  2. % Responses with “finish_reason”: “content_filter”. This is the percentage of responses that didn’t return content due to content filtering.

The annotations can be further used to provide statistics for each filtering category (e.g., how often each type of filtering was triggered).

Performance Metrics

As with any feature, measuring performance and latency is essential to ensure that the user is getting the intended value in a timely and frictionless manner. LLM interactions have multiple layers, so tracking and measuring latency at each layer is critical. If there are any orchestrators or added components between the LLM and the final rendering of the content, we also measure the latency of each component in the full workflow.

We use the following metrics to measure performance:

  1. Time to first token render from submission of the user prompt, measured at multiple percentiles.
  2. Requests Per Second (RPS) for the LLM.
  3. Tokens rendered per second when streaming the LLM response.
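
Here is a small sketch of how the three metrics above might be computed from request-level telemetry; the event schema (field names, timestamps) is hypothetical:

```python
import numpy as np

# Hypothetical request-level telemetry: timestamps in seconds, token counts from the API response.
requests = [
    {"submit_ts": 0.00, "first_token_ts": 0.42, "last_token_ts": 3.10, "completion_tokens": 180},
    {"submit_ts": 1.20, "first_token_ts": 1.95, "last_token_ts": 4.80, "completion_tokens": 210},
    {"submit_ts": 2.50, "first_token_ts": 3.40, "last_token_ts": 5.90, "completion_tokens": 150},
]

ttft = np.array([r["first_token_ts"] - r["submit_ts"] for r in requests])
tokens_per_sec = np.array(
    [r["completion_tokens"] / (r["last_token_ts"] - r["first_token_ts"]) for r in requests]
)

print("time to first token, p50/p95 (s):", np.percentile(ttft, [50, 95]))
print("streaming throughput, p50 (tokens/s):", np.percentile(tokens_per_sec, 50))

# Requests per second over the observation window.
window_s = max(r["submit_ts"] for r in requests) - min(r["submit_ts"] for r in requests)
print("requests per second:", len(requests) / window_s)
```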

Utility Metrics

LLM features have the potential to significantly improve the user experience; however, they are expensive and can impact the performance of the product. Hence, it is critical to measure the user value they add to justify any added costs. While a product-level utility metric [2] functions as an Overall Evaluation Criteria (OEC) to evaluate any feature (LLM-based or otherwise), we also measure usage of and engagement with the LLM features directly to isolate their impact on user utility.

Below we share the categories of metrics we measure. For a full list of the metrics, check the appendix.

User Engagement and Satisfaction

In this category, we measure how often the user engages with the LLM features, the quality of those interactions and how likely they are to use it in the future.

Prompt and response funnel. We compute metrics at each stage to understand how the user interacts with the model. Some stages (e.g., editing the response) are not applicable to all scenarios (e.g., chat).
  1. Prompt and Response Funnel: As the user interacts with the LLM, prompts are sent in, and responses are sent back. We measure the usefulness of these responses and whether the user is in fact using them in their current task. The funnel tracks the interaction from the time the LLM is triggered until the user accepts or rejects the response.
  2. Prompt and Response Quality: Not all engagement with features provides value. To assess whether the user had a successful interaction with the LLM with minimal effort, we measure additional aspects that reflect the quality of engagement: the length of the prompt and response indicates whether they were meaningful, the average edit distance between prompts indicates the user reformulating the same intent (see the sketch after this list), and the number of responses with Thumbs Up/Thumbs Down provides explicit feedback from the user on the quality of the response. Check out the appendix for a detailed description of these metrics.
  3. Retention: These metrics measure how sticky the feature is and whether the user gets retained into the LLM feature. It is an important measure to detect any novelty effect where the usage drops after the initial engagement. Any retention metric that works for your product can be modified to focus on the LLM feature. Check the appendix for the ones we use.
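
For the edit-distance signal mentioned in item 2, a minimal sketch (a plain Levenshtein distance averaged over consecutive prompts in a session; the session below is invented) could look like this:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),    # substitute (or match)
            ))
        prev = curr
    return prev[-1]

def avg_adjacent_prompt_distance(prompts: list[str]) -> float:
    """Average edit distance between consecutive prompts in one user session."""
    pairs = list(zip(prompts, prompts[1:]))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

session = [
    "sort a vector of structs by a member",
    "sort a vector of structs by a member, descending",
    "sort a vector of structs by one member in descending order",
]
print(avg_adjacent_prompt_distance(session))
```

A low distance between consecutive prompts suggests the user is rephrasing the same intent rather than moving on to a new task.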

Increase in productivity for collaboration scenarios

For scenarios where content can be created with AI and then consumed by users, we also recommend measuring any increase or improvement in productivity, both on the creation and consumption side. Such metrics measure the value-add beyond an individual user when the AI-generated content is used in a collaboration setting.

Data Requirements

To compute the metrics, the product needs to collect the required properties from the OpenAI API response. Moreover, we recommend collecting the end-user Id from the product's telemetry and passing it to the API.

For an LLM feature that can modify a user’s text directly, we add telemetry to differentiate user edits from machine or LLM edits. Otherwise, it will be hard to measure reduction in user-added characters or text when the LLM auto-completes the content.

Running A/B tests

A/B testing is the gold standard for causally measuring the impact of any change to a product. As mentioned in the intro, this is even more critical for LLM features, both at launch time and for subsequent improvements. The metrics we share above are then used to evaluate the changes and trade off costs against user value.

As you embark on the journey of launching an LLM-powered feature and innovating further, we recommend running the following types of experiments at launch and post launch of the feature.

Launch an LLM Feature

Ensure that the feature at launch is performant and reliable, increases productivity, and makes the right cost vs. benefit tradeoffs.

  1. Dark mode experiment: When launching an LLM Feature, we want to ensure that the feature at launch is performant and reliable. Before exposing the feature to end customers, we recommend running a dark mode experiment where the components for the feature are loaded without showing anything to the end customer.
  2. 0-1 Experiment: 0-1 experiments are special as the treatment has the LLM-powered feature and the control variant does not. We recommend rolling out the feature in a controlled rollout to ensure that you have enough GPU capacity and the product OEC and guardrail metrics are not affected, while you see an increase in productivity metrics.

Post Launch

Continue to innovate and optimize the feature to quickly address new customer needs through prompt optimization, using newer models, and UX improvements.

  1. Shadow Experiment: Before exposing a change in the LLM feature that alters the response shown to the user, we run shadow experiments to measure the impact in a low-risk and safe manner. Shadow experiments allow you to compute both the treatment and the control response for the same user, while only showing them the control response. For example, when a user issues a query or prompt, the user's input is fed into both the control workflow and the treatment workflow at the same time. All users get the response from the control workflow, but because we now have both treatment and control responses on live traffic for the same users, metrics can be evaluated for both variants. These metrics are more sensitive than in regular A/B tests because the treatment and control samples contain exactly the same set of users, which reduces variance. We can get further sensitivity gains by using paired samples t-tests in the statistical analysis (see the sketch after this list). Metrics that can be measured in shadow experiments include GPU utilization, performance and latency, RAI metrics, and prompt metrics that do not rely on user engagement. However, metrics that need a user response cannot be evaluated in shadow experiments, as no user experiences the treatment response.
  2. 1-N Experiment: These are the regular A/B tests we run to evaluate any change introduced to the product, including LLM features. Refer to our earlier blog posts on pre-experiment, during-experiment, and post-experiment patterns of trustworthy experimentation for best practices in this space.
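
To illustrate the paired analysis mentioned in the shadow-experiment item above, here is a minimal sketch on simulated per-user latencies; the paired samples t-test exploits the fact that both variants are computed for the same users.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated per-user latency (ms) for the control and treatment workflows,
# both computed on the same users, as in a shadow experiment.
n_users = 1000
control_latency = rng.normal(800, 120, n_users)
treatment_latency = control_latency + rng.normal(-15, 60, n_users)  # treatment ~15 ms faster

# Paired t-test uses the within-user pairing for extra sensitivity.
t_stat, p_value = stats.ttest_rel(treatment_latency, control_latency)
delta = np.mean(treatment_latency - control_latency)
print(f"mean delta: {delta:.1f} ms, t = {t_stat:.2f}, p = {p_value:.3g}")
```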

Summary

LLMs can be a great tool for building features that add user value and increase satisfaction with the product. However, properly testing and evaluating them is critical to a safe release and to the value they add. In this blog post, we shared a complete metrics framework to evaluate all aspects of LLM-based features, from costs to performance to RAI aspects to user utility. These metrics are applicable to any LLM and can be built directly from the telemetry collected from AOAI models. We also described the various experimentation designs used at Microsoft to evaluate the features at release time and continuously through any subsequent change.

Acknowledgements

Many thanks to our colleagues in Azure OpenAI, particularly Sanjay Ramanujan, for all their input on the API responses, as well as to ExP's experimentation partners for testing and using the metrics.

Widad Machmouchi, Somit Gupta – Experimentation Platform

References

[1] S. Bubeck et al., “Sparks of Artificial General Intelligence: Early experiments with GPT-4”, https://doi.org/10.48550/arXiv.2303.12712.

[2]  W. Machmouchi, A. H. Awadallah, I. Zitouni, and G. Buscher, “Beyond success rate: Utility as a search quality metric for online experiments,” in International Conference on Information and Knowledge Management, Proceedings, 2017, vol. Part F1318, doi: 10.1145/3132847.3132850.

Appendix

GPU Utilization Metrics

  1. Number of 429 responses received. A 429 error response is sent when the model and/or service is currently overloaded. We recommend measuring the 95th or 90th percentile of the number of 429 responses to capture peak load.
  2. Total number of tokens, computed as the sum of prompt tokens and completion tokens. This is the main utilization metric we recommend tracking for GPU utilization. OpenAI charges based on the total number of tokens used by the prompt and response.
  3. Number of prompt tokens. The number of tokens resulting from tokenizing the prompt passed to the LLM. While OpenAI also charges for these tokens, they are much cheaper than completion tokens and can be optimized by the product team.
  4. Number of completion tokens. Completion tokens are the largest cost incurred when using OpenAI models. These can be controlled by changing the Max_Tokens parameter in the request.
  5. Wasted Utilization per LLM. Some responses from the LLM will not provide any value to the user. This is due to issues such as truncation (see below), errors, “not able to understand” responses or other unactionable responses that can be defined based on the user scenario. We recommend tracking the number of completion tokens associated with these non-actionable or unused responses to keep them to a minimum.
  6. Number of LLM calls with truncated responses. If the API response has “finish_reason”: “length”, it implies that the call reached the max_tokens limit set in the API request, so the response is likely truncated/incomplete.
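
Here is a minimal sketch of computing several of the utilization metrics above from logged calls. The usage fields mirror what the API returns (prompt_tokens, completion_tokens, total_tokens), but the log layout itself is an assumption for illustration.

```python
# Hypothetical log: one dict per LLM call, with usage and finish_reason captured
# from the API response, plus the HTTP status code observed by the client.
calls = [
    {"status": 200, "finish_reason": "stop",   "usage": {"prompt_tokens": 120, "completion_tokens": 80,  "total_tokens": 200}},
    {"status": 200, "finish_reason": "length", "usage": {"prompt_tokens": 300, "completion_tokens": 256, "total_tokens": 556}},
    {"status": 429, "finish_reason": None,     "usage": None},
]

ok = [c for c in calls if c["usage"] is not None]

num_429 = sum(1 for c in calls if c["status"] == 429)                 # throttled calls
total_tokens = sum(c["usage"]["total_tokens"] for c in ok)            # main utilization metric
prompt_tokens = sum(c["usage"]["prompt_tokens"] for c in ok)
completion_tokens = sum(c["usage"]["completion_tokens"] for c in ok)
truncated = sum(1 for c in ok if c["finish_reason"] == "length")      # likely truncated responses

print(num_429, total_tokens, prompt_tokens, completion_tokens, truncated)
```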

Utility Metrics

User Engagement and Satisfaction

  1. Prompt and Response Funnel
    1. Number of opportunities to suggest content: This captures all instances where the LLM was called, irrespective of whether the response was shown to the user. This is important in case there is an added layer or orchestrator between the LLM and the feature that determines whether the response is in fact shown to the user.
    2. Number and Rate of prompts made to LLM
    3. Number and Rate of response from LLM
    4. Number and Rate of responses seen by user: As mentioned earlier, it’s possible not all responses are shown to the user due to content moderation, relevance or performance.
    5. Number and Rate of accepts by user: How to identify accepts depends on the user scenario. In a text prediction or summarization scenario, the user accepts the responses by including it in the document or text they are writing. In a conversational context, an accept is when a user thumbs up a response, gets positive utility from a link provided or reengages with the bot for more information.
    6. Number and Rate of responses kept (retained) by user at end of time X: This metric is particularly relevant in the context of text prediction where the user keeps the content and uses it in the doc or text they are creating.
  2. Prompt and Response Quality
    • Average length of the prompts and responses
    • Average time between prompts and between responses
    • Time spent on writing prompts and on generating responses
    • Average edit distance between prompts: Edit distance has long been used in information retrieval as a measure of query reformulation, and hence of a user restating their intent. The more often a user reformulates a query or prompt, the more likely it is that the original prompt or query did not provide the information they were looking for. Note that since prompts can be changed or expanded by the product beyond what the user inputs, it's important to separate the user and product components of the prompt. Moreover, edit distance metrics require some data preparation for efficient computation (a computation sketch follows this list).
    • Average edit distance between the LLM response and retained content: This is applicable in text prediction or summarization scenarios where the user can accept a response and then edit it to fit their needs. For other scenarios and content types, you will need to tailor the definition of edit distance.
    • Number of responses with Thumbs Up/Thumbs Down feedback from the user: Such metrics are explicit feedback from the user on how well the LLM response answered their prompt. However, these metrics, like other user sentiment metrics, suffer from low sample size and selection bias, as users who provide such feedback are not representative of the whole population.
  3. Retention: The following metrics can be averaged across users, sessions, days or any other unit as needed by the product.
    • LLM conversation length and duration
    • Average number of LLM conversations
    • Average number of days an LLM feature was actively used.
    • Daily Active LLM users
    • Retention rate of new-to-LLM users
    • New users who use an LLM feature in their first session
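
As referenced in the prompt-quality metrics above, here is a minimal sketch of the (Levenshtein) edit distance between two prompts; in practice you would typically normalize by prompt length and precompute this in the metrics pipeline rather than per request.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("summarize this email", "summarize this email briefly"))  # 8
```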

Increase in productivity for collaboration scenarios

For scenarios where content can be created with AI and then consumed by users, we also recommend measuring any increase or improvement in productivity, both on the creation and consumption side.

Creator Productivity (better content created in less time)

As content creation becomes easier with LLMs, more creators will edit more documents faster and the quality of the content should improve.

  1. Reach of the content:
    • #users, #sessions creating content per document
    • #documents edited with the LLM
  2. Quality of the content – the length and richness of the prompts and responses created automatically and overall:
    • Total characters retained per user
    • Number and length of interactions with the LLM
    • Number of total and user edits
    • Number of artifacts used like images, charts
  3. Effort:
    • Average time spent by user in editing mode.

Consumer Productivity (better consumption of content in less time)

  1. Reach of the content
    • # users, #sessions consuming content per document
    • # documents read that were edited with the LLM
  2. Quality of the content
    • # consumption actions (e.g. sharing, commenting, reviewing) per AI-edited document
  3. Effort:
    • Average time spent in consumption mode per document per user

The post How to Evaluate LLMs: A Complete Metric Framework appeared first on Microsoft Research.

A/B Interactions: A Call to Relax http://approjects.co.za/?big=en-us/research/articles/a-b-interactions-a-call-to-relax/ Wed, 02 Aug 2023 21:01:03 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=957738

If you’re a regular reader of the Experimentation Platform blog, you know that we’re always warning our customers to be vigilant when running A/B tests. We warn them about the pitfalls of even tiny SRMs (sample ratio mismatches), small bits of lossiness in data joins, and other similar issues that can invalidate their A/B tests [2, 3, 4]. But today, we’re going to switch gears and tell you to relax a little. We’re going to show you why A/B interactions – the dreaded scenario where two or more tests interfere with each other – are not as common a problem as you might think. Don’t get us wrong, we’re not saying that you can completely let down your guard and ignore A/B interactions altogether. We’re just saying that they’re rare enough that you can usually run your tests without worrying about them.

A/B Interactions

But we’re getting ahead of ourselves. What are A/B interactions? In an A/B test, users are randomly separated into control and treatment groups, and after being exposed to different product experiences, metrics are compared for the two groups [1]. At Microsoft’s Experimentation Platform (ExP), we have hundreds of A/B tests running every day. In an ideal world, every A/B test would get its own separate set of users. However, splitting users across so many A/B tests would dramatically decrease the statistical power of each test. Instead, we typically allow each user to be in multiple A/B tests simultaneously.

A case where concurrent A/B tests are safe

For example, a search product might run one A/B test that changes the ranker (the order of web results) and another A/B test that changes the UX. Both A/B tests can run at the same time, with users assigned independently to the control or treatment of each A/B test, in four equally likely combinations:

         | Ranker #1         | Ranker #2
UX #1    | Control-control   | Control-treatment
UX #2    | Treatment-control | Treatment-treatment
Table 1: Two A/B tests for which independent control/treatment assignment is safe

In most cases, this is fine. Because the UX the user sees probably doesn't have much impact on how they respond to the ranking of the results, the differences reported in the ranker A/B scorecard results will look the same regardless of what the UX A/B test is doing, and vice versa.

A case with A/B interactions

On the other hand, some cases are more problematic. For example, if there are two A/B tests, one which changes the ad text color from black to red, and one which changes the ad background color from grey to red, whether the user is in the control or the treatment of one A/B test will greatly impact the treatment effect seen in the other A/B test. The user can no longer see the ad text when it’s red on red, so the red ad text treatment might be good for users in the control of the ad background A/B test, but bad for users in the treatment of the ad background A/B test.

                          | Ad text color: black | Ad text color: light red
Ad background color: grey | Buy flowers!         | Buy flowers!
Ad background color: red  | Buy flowers!         | Buy flowers!
Table 2: Two A/B tests for which independent control/treatment assignment is not safe

A/B interactions can be a real concern, and at ExP we have techniques to isolate A/B tests like these from each other when we suspect they will interact, so that users aren’t assigned independently for each test. However, as already mentioned, doing this decreases the number of users available for each A/B test, thus decreasing their statistical power.

Looking for A/B Interactions at Microsoft

Our previous experience with A/B tests at Microsoft had found that A/B interactions were extremely rare. Similarly, researchers at Meta found that A/B interactions were not a serious problem for their tests [5].

We recently carried out a new investigation of A/B interactions in a major Microsoft product group. In this product group, A/B tests are not isolated from each other, and each control-treatment assignment takes place independently.

The data analysis

Within this product group, we looked at four major products, each of which runs hundreds of A/B tests per day on millions of users. For each product, we picked a single day, and looked at every pair of A/B tests that were running on that same day. For each pair, we calculated every metric for that product for every possible control or treatment assignment combination for the two tests in the pair. The results for metric Y are shown here for a case where each test has one control and one treatment.

               | A/B test #2: C | A/B test #2: T | Treatment effect
A/B test #1: c | Y_{C,c}        | Y_{T,c}        | Δ_c = Y_{T,c} - Y_{C,c}
A/B test #1: t | Y_{C,t}        | Y_{T,t}        | Δ_t = Y_{T,t} - Y_{C,t}
Table 3: Treatment effects for one A/B test, segmented by user control/treatment assignment in a different A/B test

A chi-square test was performed to check if there was any difference between the two treatment effects. Because there were hundreds of thousands of A/B test pairs and metric combinations, hundreds of thousands of p-values were calculated. Under the null hypothesis of no A/B interactions, the p-values should be drawn from a uniform distribution, with 5% of the p-values satisfying p<0.05, 0.1% of the p-values satisfying p<0.001, etc. [6]. Accordingly, some were bound to be small, just by chance.
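
To make the test concrete, the sketch below is one simple way to implement such an interaction check: it compares the two treatment effects relative to their combined standard error, and the squared z-statistic is a one-degree-of-freedom chi-square. It is an illustration, not the exact production implementation.

```python
import numpy as np
from scipy import stats

def interaction_pvalue(y_Cc, y_Tc, y_Ct, y_Tt):
    """p-value for H0: the treatment effect of test #2 (T vs C) is the same
    in the control (c) and treatment (t) groups of test #1."""
    def delta_and_var(y_c, y_t):
        d = np.mean(y_t) - np.mean(y_c)
        v = np.var(y_t, ddof=1) / len(y_t) + np.var(y_c, ddof=1) / len(y_c)
        return d, v

    d_c, v_c = delta_and_var(y_Cc, y_Tc)   # effect of test #2 within test #1 control
    d_t, v_t = delta_and_var(y_Ct, y_Tt)   # effect of test #2 within test #1 treatment
    chi2 = (d_t - d_c) ** 2 / (v_t + v_c)  # squared z-statistic
    return stats.chi2.sf(chi2, df=1)

rng = np.random.default_rng(0)
groups = [rng.normal(10, 2, 5000) for _ in range(4)]  # simulated data with no true interaction
print(interaction_pvalue(*groups))
```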

The results: few or no interactions

Therefore, to check whether there were A/B interactions, we looked at the distribution function of p-values, shown here for a single day for a specific product:

Figure 1: Cumulative distribution of p-values for A/B interaction tests

The graphs for all four products look similar; all are very close to a uniform distribution. We then looked for deviations from a uniform distribution by checking if there were any abnormally small p-values, using a Benjamini-Hochberg false positive rate correction test. For three of the products, we found none, showing that all results were consistent with no A/B interactions. For one product, we did find a tiny number of abnormally small p-values, corresponding to 0.002%, or 1 in 50,000 A/B test pair metrics. The detected interactions were checked manually, and there were no cases where the two treatment effects in Table 3 were both statistically significant but moving in opposite directions. In all cases either the two treatment effects were in the same direction but different in magnitude, or one of them was not statistically significant.
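
For reference, here is a minimal sketch of the Benjamini-Hochberg step on a vector of p-values using statsmodels; with uniform p-values (i.e., no true interactions), essentially nothing should be flagged.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
p_values = rng.uniform(size=100_000)  # stand-in for the A/B-pair x metric p-values

# Benjamini-Hochberg false discovery rate control at 5%.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"flagged {reject.sum()} of {len(p_values)} tests")
```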

Discussion

It’s possible that there were other A/B interactions that we just didn’t have the statistical power to detect. If the cross-A/B test treatment effects were, for example, 10% and 11% for two different cross-A/B test assignments, we might not have detected that difference, either because the chi-square test returned a high p-value, or because it returned a low p-value that got “lost” in the sea of other low p-values that occurred by chance when doing hundreds of thousands of statistical tests.

This is possible, but it raises the question of when we should worry about interaction effects. For most A/B tests at Microsoft, the purpose of the A/B test is to produce a binary decision: whether to ship a feature or not. There are some cases where we’re interested in knowing if a treatment effect is 10% or 11%, but those cases are the minority. Usually, we just want to know if key metrics are improving, degrading, or remaining flat. From that perspective, the scenario with small cross-A/B test treatment effects is interesting in an academic sense, but not typically a problem for decision-making.

Conclusion

While there are cases where A/B interaction effects are real and important, in our experience this is a rare issue, and people often worry about it more than they need to. Overall, the vast majority of A/B tests either don't interact or have only relatively weak interactions. Of course, the results depend on the product and the A/B tests, so don't just take our word for it: try running your A/B tests concurrently and perform your own meta-analysis of interaction effects!

– Monwhea Jeng, Microsoft Experimentation Platform

References

[1] Kohavi R., Tang D., & Xu Y. (2020). Trustworthy Online Controlled Experiments: A practical Guide to A/B Testing. Cambridge: Cambridge University Press. doi:10.1017/9781108653985

[2] Fabijan A., Blanarik T., Caughron M., Chen K., Zhang R., Gustafson A., Budumuri V.K. & Hunt S. Diagnosing Sample Ratio Mismatch in A/B Testing, http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/diagnosing-sample-ratio-mismatch-in-a-b-testing/.

[3] Liu P., Qin W., Ai H. & Jing J. Data Quality: Fundamental Building Blocks for Trustworthy A/B testing Analysis, http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/data-quality-fundamental-building-blocks-for-trustworthy-a-b-testing-analysis/.

[4] Machmouchi  W. and Gupta S. Patterns of Trustworthy Experimentation: Post-Experiment Stage, http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/patterns-of-trustworthy-experimentation-post-experiment-stage/.

[5] Chan T. Embrace Overlapping A/B Tests and Avoid the Dangers of Isolating Experiments, https://blog.statsig.com/embracing-overlapping-a-b-tests-and-the-danger-of-isolating-experiments-cb0a69e09d3.

[6] Mitchell C., Drake A., Litz J., & Vaz G. p-Values for Your p-Values: Validating Metric Trustworthiness by Simulated A/A Tests. http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/p-values-for-your-p-values-validating-metric-trustworthiness-by-simulated-a-a-tests/.

The post A/B Interactions: A Call to Relax appeared first on Microsoft Research.

Deep Dive Into Variance Reduction http://approjects.co.za/?big=en-us/research/articles/deep-dive-into-variance-reduction/ Tue, 15 Nov 2022 15:22:30 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=868917

Variance Reduction (VR) is a popular topic that is frequently discussed in the context of A/B testing. However, it requires a deeper understanding to maximize its value in an A/B test.  In this blog post, we will answer questions including: What does the “variance” in VR refer to?  Will VR make A/B tests more trustworthy?  How will VR impact the ability to detect true change in A/B metrics? 

This blog post provides an overview of ExP’s implementation of VR, a technique called CUPED (Controlled experiment Using Pre-Experiment Data). Other authors have contributed excellent explainers of CUPED’s performance and its ubiquity as an industry-standard variance reduction technique [1][2]. We have covered in previous blog posts how ExP uses CUPED in the experiment lifecycle [3].

In this post, we share the foundations of VR in statistical theory and how it amplifies the power of an A/B testing program without increasing the likelihood of making a wrong decision. [a][4]


[a] Many of the elements covered quickly in this blog are covered in excellent detail in Causal Inference and Its Applications in Online Industry [4].

Variance is a Statistical Property of Estimators

To understand where variance reduction fits in, let’s start with a more fundamental question: What’s our ideal case for analyzing an A/B test? We want to estimate the difference in two potential outcomes for a user: the outcome in a world where the treatment was applied, and the outcome in a world where the treatment was not applied – the counterfactual. 

The fundamental challenge of causal inference is that we cannot observe those two worlds simultaneously, and so we must come up with a process for estimating the counterfactual difference. In A/B testing, that process relies on applying treatments to different users. Different users are never perfect substitutes for one another because their outcomes are not only functions of the treatment assignment, but also impacted by many other factors that influence user behavior.

Causal inference is a set of scientific methods to estimate the counterfactual difference in potential outcomes between our two imagined worlds. Any process of estimating this counterfactual difference introduces uncertainty. 

Statistical inference is the process of proposing and refining estimators of an average counterfactual difference to improve the estimators’ core statistical properties: 

  • asymptotic bias, or consistency;
  • rate of convergence to this asymptotic bias; and 
  • variance.

In fact, that’s what the “variance” in variance reduction refers to: the property of the estimator of the average treatment effect. Variance reduction (as in CUPED-VR) is not a reduction in variance of underlying data such as when sample data is modified through outlier removal, capping, or log-transformation.  Instead, variance reduction refers to a change in the estimator which produces estimates of the treatment effect with lower standard error. 

The procedure of inference. We want to estimate the parameter \( \beta \), so we gather data, evaluate it with the estimator and end up with an estimate \( \hat{\beta} \). In A/B testing, \( \beta \) is commonly the average treatment effect. Image courtesy of Dr. Laura Hatfield and diff.healthpolicydatascience.org.

The Difference-in-Means Estimator Provides Consistency in A/B tests

Random assignment ensures that the difference between treatment and control populations is an unbiased estimator. However, we need to consider how much uncertainty our estimation process has introduced.

To do so, we use the known rate of convergence to the true population difference – called consistency – to estimate the true variance of the average treatment effect using our sample. With the delta estimate from difference-in-means (\( \delta_{DiM}\)) and the sample variance estimate, we report an interval of estimates that is likely to contain the true population difference, called a confidence interval:

\( \begin{aligned} Var(\delta_{DiM}) &=\frac{ \sigma_{Y^T}^2}{{n^T}} + \frac{ \sigma_{Y^C}^2}{n^C} \\ CI_{lb,ub}&= \delta_{DiM} \pm z_{\alpha/2}\sqrt{Var(\delta_{DiM})} \\ \end{aligned} \) [b]

The difference-in-means estimator for the average treatment effect is unbiased, and the variance of the estimator shrinks at a known rate as the sample size grows. When we propose VR estimators, we’ll need to describe their relationship to the bias, variance, and the consistent variance estimate of the difference-in-means estimator to understand if we’re improving.

[b] \( z_{\alpha/2} \) is the standard normal quantile at your acceptable \( \alpha \), or false positive rate. For example, a 95% confidence interval uses 1.96 for \( z_{0.05/2} \).
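
Below is a minimal sketch of the difference-in-means estimate, its variance, and the confidence interval defined above, evaluated on two simulated samples.

```python
import numpy as np
from scipy import stats

def diff_in_means_ci(y_treat, y_ctrl, alpha=0.05):
    """Difference-in-means estimate with its (1 - alpha) confidence interval."""
    delta = np.mean(y_treat) - np.mean(y_ctrl)
    var = np.var(y_treat, ddof=1) / len(y_treat) + np.var(y_ctrl, ddof=1) / len(y_ctrl)
    z = stats.norm.ppf(1 - alpha / 2)          # 1.96 for a 95% CI
    half_width = z * np.sqrt(var)
    return delta, (delta - half_width, delta + half_width)

rng = np.random.default_rng(7)
y_c = rng.normal(10.0, 3.0, 20_000)            # simulated control outcomes
y_t = rng.normal(10.1, 3.0, 20_000)            # simulated treatment outcomes
print(diff_in_means_ci(y_t, y_c))
```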

CUPED-VR Outperforms the Difference-in-Means Estimator 

Statistical tests that use variance reduction rely on an additional strategy to reduce the variance of an estimator of average treatment effect, which has a similar power benefit to increasing the A/B test sample size.

This is rooted in the insight that even if we have a single-user treatment and single-user control, if the users are good substitutes for one another, we’ll expect to obtain a treatment effect estimate that’s closer to the true treatment effect than if the users are very different from one another.  The assignment procedure can be modified to try to ensure “balanced” treatment and control assignments. Re-randomization of assignments with checks to ensure baseline balance uses this idea [5].

In many online A/B tests, we don’t modify our assignment procedure. Instead, we perform a correction in the analysis phase with VR estimators. VR combines large-sample asymptotic properties of A/B tests with the optimization of comparing similar users through statistical adjustment. Similarity is modeled through use of characteristics known to be independent of the assignment of A or B test feature to the user.

CUPED-VR Procedure

CUPED is one method of VR, with the following steps:

  • Linear models \( \vec Y_i \sim \vec \theta \vec X_i \) are estimated separately for treatment and control (or with an assignment group indicator).
  • The product of \( \hat \theta\) and the overall mean \( \overline X_i \) is subtracted from \( Y_i \), giving adjusted metric values \( Y_{CUPED,T} \) and \( Y_{CUPED,C} \). In each group, users’ adjusted metrics are shifted as a function of their similar prior characteristics.
  • The difference in the average adjusted metric values gives a still-consistent and lower-variance estimate of the average treatment effect estimand.
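
Here is a minimal sketch of the CUPED adjustment on simulated data. For simplicity it uses a single theta estimated from the pooled data rather than the separate treatment and control fits described above, but it illustrates the same idea.

```python
import numpy as np

def cuped_delta(y_t, x_t, y_c, x_c):
    """Simplified CUPED: one theta from pooled data, metrics adjusted by
    theta * (X - mean(X)), then difference in adjusted means."""
    y = np.concatenate([y_t, y_c])
    x = np.concatenate([x_t, x_c])
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

    y_t_adj = y_t - theta * (x_t - np.mean(x))   # adjusted metrics, treatment
    y_c_adj = y_c - theta * (x_c - np.mean(x))   # adjusted metrics, control
    return np.mean(y_t_adj) - np.mean(y_c_adj)

rng = np.random.default_rng(3)
n = 50_000
x_c, x_t = rng.normal(10, 4, n), rng.normal(10, 4, n)    # pre-experiment metric
y_c = 0.8 * x_c + rng.normal(0, 2, n)                    # outcome, control
y_t = 0.8 * x_t + rng.normal(0, 2, n) + 0.1              # outcome, treatment (+0.1 true effect)
print(cuped_delta(y_t, x_t, y_c, x_c))
```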

From simulating CUPED-VR’s performance versus difference-in-means on repeated samples of the same data, we can observe the extent of variance reduction for the estimator (plot below). In this plot of estimates, the set of estimates that are closer to the true effect of 2.5 compared to the difference-in-means estimates on the same trial are shifted because, in those trials, CUPED-adjusted metrics accounted for chance imbalance in the pre-A/B test period.


When the estimated coefficients are weighted by assignment probability, the CUPED-VR estimator is equivalent to another popular regression adjustment estimator for A/B tests: ANCOVA2, or Lin’s estimator [6][7] [Table 1].  

CUPED adjusts metrics by the predicted value from a regression of Y on X. The treatment effect estimate has lower standard error. Estimated confidence intervals are narrower as a consequence, and the power of tests is increased.

Measuring CUPED-VR Performance with Effective Traffic Multiplier

The CUPED-VR estimator has known analytic results [7] of how its variance compares to the variance of the difference-in-means estimator:

\(\begin{aligned} Var(\delta_{VR}) &=(\frac{ \sigma_{Y^T}^2}{n^T} + \frac{ \sigma_{Y^C}^2}{n^C}) (1 - R^2) \\ Var(\delta_{DiM}) &=\frac{ \sigma_{Y^T}^2}{n^T} + \frac{ \sigma_{Y^C}^2}{n^C} \\ \end{aligned} \)

The variance is reduced in proportion to the amount of variance explained by the linear model in treatment and control, i.e., the total \( R^2 \). And, importantly, the estimator is still consistent: we don't sacrifice consistency in favor of lower variance. This means that when we estimate the variance of \( \delta_{VR} \), we can build narrower confidence intervals, with endpoints closer to \( \delta_{VR} \) but reflecting the same level of confidence about the range. It also means that if the true treatment effect is non-zero, we are more likely to detect a statistically significant effect. Indeed, the ratio of raw variance to VR variance, \( \frac{1}{1-R^2} \), is the factor by which traffic would have to grow for the simple difference-in-means estimator to reach the same variance as VR.

Decision-makers understand that having more traffic in an A/B test for a given time period helps decrease time-to-decision or increase confidence in a decision if evaluating at a fixed time. And at ExP, we have found this to be an easy-to-interpret representation of VR’s efficacy for Microsoft experimenters. We surface it for each variance-reduced metric and refer to it as the “effective traffic multiplier”.

From a simulated total \( R^2\) of 0.4, the median effective traffic multiplier is 1.66 in simulations. This translates to a power gain of 22%.
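
The effective traffic multiplier follows directly from the total \( R^2 \); a quick sketch:

```python
def effective_traffic_multiplier(r_squared: float) -> float:
    """Traffic (relative to actual) that difference-in-means would need
    to match the variance of the CUPED-VR estimator: 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared)

for r2 in (0.1, 0.4, 0.7):
    print(r2, round(effective_traffic_multiplier(r2), 2))
# 0.1 -> 1.11x, 0.4 -> 1.67x, 0.7 -> 3.33x
```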

The effectiveness of CUPED-VR is influenced by various attributes of the product, telemetry, experiment, and metric. At Microsoft, we see substantial difference in efficacy across different product surfaces and metric types.

Based on a recent 12-week sample of week-long experiments, groups of VR metrics from two different surfaces for the same product have very different average performance. In one Microsoft product surface, VR is not effective for most metrics: a majority of metrics (>68%) have effective traffic multiplier <=1.05x. In contrast, another product surface sees substantial gain from VR methods: a majority of metrics (>55%) have effective traffic multiplier >1.2x.

Summary

Variance reduction is the use of alternative estimators, like CUPED, to improve difference-in-means and effectively multiply observed traffic in an A/B test. Its variance-reducing properties are rooted in the foundations of design-based statistical inference, which makes it a trustworthy estimator at scale.

– Laura Cosgrove, Jen Townsend, and Jonathan Litz, Microsoft Experimentation Platform


CUPED-VR and ANCOVA2 Comparison Table

Estimator | Procedure
ANCOVA2 [6][7] | \(\begin{aligned} Y_i &= \beta_0 + \delta T_i + \beta ( X_i - \overline X) + \gamma ( X_i - \overline X) T_i + \epsilon_i \\ \delta &= \hat \delta \end{aligned} \)
CUPED-VR | \( \begin{aligned} Y_i^T &= \beta_0^T + \theta^T X_i^T + \epsilon_i^T \\ Y_i^C &= \beta_0^C + \theta^C X_i^C + \epsilon_i^C \\ Y_i^{CUPED, T} &= Y_i^T - (p\hat {\theta^C} + (1-p) \hat {\theta^T}) X_i^T \\ Y_i^{CUPED, C} &= Y_i^C - (p\hat {\theta^C} + (1-p)\hat {\theta^T}) X_i^C \\ \delta &= \overline Y^{CUPED, T} - \overline Y^{CUPED, C} \end{aligned} \)
The CUPED procedure is statistically equivalent to ANCOVA2

References

[1] Berk, M. (2021) How to Double A/B Testing Speed with Cuped, Towards Data Science. Available at: https://towardsdatascience.com/how-to-double-a-b-testing-speed-with-cuped-f80460825a90 (Accessed: November 1, 2022).

[2] Craig (2022) Cuped on Statsig, Medium. Available at: https://blog.statsig.com/cuped-on-statsig-d57f23122d0e (Accessed: November 1, 2022).

[3] Machmouchi, W., et al. (2021) Patterns of Trustworthy Experimentation: Pre-Experiment Stage, Microsoft Research. Available at: http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/patterns-of-trustworthy-experimentation-pre-experiment-stage/ (Accessed: November 1, 2022). 

[4] Deng, A., 2021. Causal Inference and Its Applications in Online Industry. [online] Alexdeng.github.io. Available at: <https://alexdeng.github.io/causal/index.html> [Accessed 5 July 2022].

[5] Zhao, A. and Ding, P. (2021) No star is good news: A unified look at rerandomization based on $p$-values from covariate balance tests, arXiv.org. Available at: https://arxiv.org/abs/2112.10545 (Accessed: November 1, 2022).

[6] Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Annals of Applied Statistics, 7, 295–318.

[7] Deng, A., 2021. Chapter 10: Improving Metric Sensitivity. Causal Inference and Its Applications in Online Industry. [online] alexdeng.github.io. Available at: <https://alexdeng.github.io/causal/index.html> [Accessed 5 July 2022].

The post Deep Dive Into Variance Reduction appeared first on Microsoft Research.

For Event-based A/B tests: why they are special http://approjects.co.za/?big=en-us/research/articles/for-event-based-a-b-tests-why-they-are-special/ Tue, 27 Sep 2022 00:07:56 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=880938

An “event-based” A/B test is a method used to test two or more variables during a limited duration. We can use what we learn to increase user engagement, satisfaction, or retention of a product, while also applying our insights to future event and product scenarios. We often use A/B testing when there is a launch of a new feature. This allows the product team to try different messaging to determine which content maximizes user engagement.

Unlike classic A/B testing, where a feature is developed, incrementally tested, gradually rolled out, and then becomes a permanent part of the product, an event-based feature has limited time for experimentation. The period can be as little as a day or a handful of days: for example, Olympics-related headlines shown on a news app throughout the duration of the Tokyo Olympic Games. In this blog post, we will explore some of the challenges of running event-based A/B tests and share some insights on the set-up and analysis of such experiments.

What are the challenges of running event-based A/B tests?

Feature Testing

As a best practice, product teams perform manual or unit testing before exposing features to end users. This method helps detect and fix bugs to minimize user harm. However, not every bug can be detected at this stage. Feature teams often run A/B tests to discover things that may have been overlooked or cannot be tested in manual/unit testing. Teams expose the feature to a small traffic percentage, measure user engagement, identify issues, remedy them, and then do another round of experimentation to verify improvement [1]. Yet in the case of event-based A/B tests it’s almost impossible to test and iterate on the feature during the experiment given time constraints.

Rotating traffic

For global events like International Women's Day, which span multiple regions, we may want to run an A/B test at the same local time in each region. Depending on the experimentation system's capability, this could mean setting up multiple A/B tests, each targeting a specific region and starting at a different UTC time. If many regions need to be covered, the experiment set-up requires quite a bit of effort. It might be tempting to use a single A/B test for all regions and start it at exactly the same time. However, if there is an issue specific to a region, the feature team cannot do anything about it but stop the entire experiment. In contrast, having one experiment per region allows us to turn off the feature for the affected region alone. This approach provides a way to manage risk while adding only a small overhead to experiment management.

Analysis latency

Metric results help us understand the impact of a feature. Data becomes available once the:

  1. Telemetry for the experiment is collected;
  2. Data gets transformed into an analyzable format;
  3. Analysis job is submitted;
  4. Job is queued; and
  5. Job is completed. 

Depending on the product scenario, it can take some time to complete #1 and #2. If an A/B test targets a broad audience, or very large data sets need to be consumed, then it could take hours for results to be ready. In one example that we observed recently, experimenters could not assess the feature's performance until roughly the 8th hour after the experiment had started, using the first 4 hours of data. If there is an issue with the feature and metrics are the only way to know about it, then a significant amount of time will have elapsed by the time the feature team discovers the problem.

Experiment debugging

It is possible to encounter issues during an A/B test. For example, Sample Ratio Mismatch (a.k.a. SRM) has been found to happen relatively frequently in A/B tests. It is a symptom for a variety of experiment quality issues ranging from assignment, execution, log processing, analysis, and interference between variants and telemetry [2]. Debugging such issues takes time. For a classic A/B test with SRM, it could take days, weeks, or even months to identify the root cause. Given the time limit and data latency, it may not be feasible for the feature team to identify the root cause before the event ends.

Experiment learning

Displaying an event-based feature often comes at the cost of not showing another feature. Let’s return for a moment to our Olympics Games carousel. The spaces used for Olympics headlines could also be used for other types of information. A noticeable increase in user engagement in the carousel could mean a missed ad engagement opportunity. How should we make a tradeoff between different types of engagement? How do we quantify and understand the impact in both the short and long term?

Treatment – Olympics carousel: 


Control – No Olympics carousel (Shopping carousel is displayed by default):


Figure 1. Event-based feature comes at a cost of not showing another feature

There are multiple variables in event-based A/B tests. In the Olympics carousel example, its format, content, and where it appears could be all the things that we want to test out. We may also want to test it for different geolocations. Moreover, event-based experiments introduce a unique dimension of variability – the event itself. What would you say if the result shows a stat-sig decrease in content interaction for users located in the US? Does that mean that the carousel is bad? What if the result shows a stat-sig increase for users in Asian countries? What if you see opposite results on a similar carousel but for a different event i.e., Super Bowl?

Feature for A/B testing           | Region                    | Metric movement | Immediate reaction                                                                                   | Actual reason
A carousel for the Olympics event | US                        | -0.8%           | Carousel does not work well in US -> may need to change the format or source of content in carousel | CJK users are more interested in the Olympics event than US users
A carousel for the Olympics event | CJK (China, Japan, Korea) | +1%             | Carousel works well in CJK                                                                           | CJK users are more interested in the Olympics event than US users
Figure 2. User engagement metric moves in different directions for users in different regions

In classic A/B testing, we can have two different UX treatments for a feature. Through testing, we can see which one works better and use what we learn in other experiments. In event-based experiments, however, insights from one event may not be transferrable to others. For instance, the Olympics is different from International Women’s Day, which is quite different from the Super Bowl. Thus, it can be difficult to draw conclusions from one experiment and define the right execution for the next event.

Recommendation on experimentation infrastructure and analysis

We recommend the following to address the challenges of event-based A/B tests.

Experiment tooling

Provide the option to schedule A/B tests in batch and automatically run experiment and metric analysis. This is especially helpful when traffic rotation needs to happen, and multiple A/B tests need to be created for the same feature targeting different regions. The feature team should be able to schedule A/B tests for different time zones and have the tests start automatically at preset times. The short-period metric analysis should be kicked off as soon as data becomes available so that experimenters can see results early before the event ends.

Ideally, this option should be an integral part of the experimentation system. Depending on how often event-based A/B tests are projected to run and the ROI (return on investment) of the engineering investment, it might be good enough to have a plug-in or tooling that leverages an existing system’s API (application programming interfaces) to set up, run, and analyze event-based A/B tests automatically.

Near-real-time monitoring

Establish a near-real-time pipeline to monitor, detect issues in, and debug A/B tests. Feature teams need to react quickly when something is off. Waiting hours for metrics to be calculated puts users and teams at risk of adverse impacts. With a near-real-time pipeline, experiment data is aggregated every couple of minutes and key guardrail metrics are computed. These metrics help detect egregious effects, especially in the first few hours after the experiment starts. Although they may be a subset of all the metrics the feature team cares about, they allow the team to closely monitor event-based A/B tests, debug issues quickly, and take action to shut down experiments when needed. At Microsoft, we have established a near-real-time pipeline for live site monitoring for online services (details to be shared in a future blog post). It allows us to quickly detect a number of experiments that have bad outcomes. Note that having near-real-time data can motivate experimenters to check scorecards more frequently and act before reaching the fixed time horizon. This practice, known as "p-hacking", can inflate the Type I error rate and cause experimenters to see more false positives. A traditional "fixed-horizon" statistic no longer works in this setting, and sequential testing is better suited for continuous monitoring of the experiment. To develop the sequential probability ratio test, it is advisable to understand the metric distributions beforehand; you can then verify that the independence assumption holds for the test to be applicable [3].
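
To make the sequential-testing idea concrete, below is a minimal sketch of Wald's sequential probability ratio test (SPRT) for a conversion-style (0/1) metric. It is one classical option and assumes independent observations; it is not necessarily the exact test used in the near-real-time pipeline.

```python
import math

def sprt_decision(outcomes, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for H0: p = p0 vs H1: p = p1 on a stream of 0/1 outcomes.
    Returns 'accept H1', 'accept H0', or 'continue'."""
    upper = math.log((1 - beta) / alpha)   # crossing above -> accept H1
    lower = math.log(beta / (1 - alpha))   # crossing below -> accept H0
    llr = 0.0
    for x in outcomes:
        llr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1"
        if llr <= lower:
            return "accept H0"
    return "continue"

# Example: monitor an error-rate metric expected at 1% (H0) vs a regression to 2% (H1).
stream = [0] * 980 + [1] * 20  # observed outcomes in arrival order (illustrative)
print(sprt_decision(stream, p0=0.01, p1=0.02))
```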

Triggered analysis

Use triggered analysis to increase sensitivity of metrics. When an event-based feature is displayed, it is possible that not every product user sees it. For example, a component may require that the user scroll the page to be seen. If the user does not scroll, then the component will not be discovered. Sometimes, the feature might be enabled only when certain conditions are met. For instance, we might show sports-events-related features only if the user has previously visited sports articles or websites. Using the full user population for analysis can dilute the results. It would be valuable to do a triggered analysis, such as analyzing only those users who see the experience [4]. From our observations on A/B tests run at Microsoft, the more targeted the audience and metrics for analysis, the more likely we are to get results with stat-sig metric movements.

Post-experiment analysis

Conduct post-experiment analyses to understand impact that is not reflected in the A/B test results. These analyses help establish a more complete picture of the experiment and the event itself. For example, an event-based carousel may cause a drop in revenue due to fewer ads being displayed (as shown in Figure 1). However, if users like the carousel, there might be a lingering effect that makes them revisit the app more frequently. Conducting a post-experiment retention analysis helps quantify impact that is not observed during the A/B test itself. By comparing the retention of the treatment and control cohorts after the experiment, we may find that the feature leads to an increase in user retention over the long term.

We can also dig deeper to uncover other insights. For instance, if the overall difference in retention is small, could it be prominent for some subset of users? Could there be a shift from product “active users” to “enthusiasts”, or “random users” to “active users” for those seeing treatment experience? Could there be a more observable difference if we look at cohorts that have been exposed to multiple event-based features on a cumulative basis? 

As an event itself is a variable, doing cross-experiment analysis helps shed light on the differences between events. This requires keeping a repository of historical A/B tests and metrics data. By comparing the metric movements between different events, or applying machine learning techniques, we can find out how event, region, feature format, content, and other variables play a role in the metric movement. The value of such analysis is dependent on the data accumulated over time. By testing more event-based features and collecting more data, we can derive more value out of the cross-experiment analysis.

Summary

Event-based experiments are a unique type of A/B test due to their time sensitivity and limited duration. Likewise, these tests face unique challenges throughout the experimentation lifecycle, including feature testing, experiment set-up, analysis, debugging, and experiment understanding and learning. Event-based testing requires specific tooling, monitoring, and analysis approaches to address these and other challenges. As we embrace diversity and inclusion across the world, we expect to see more event-based A/B tests across various products. In this blog post, we shared our thoughts and recommendations on this type of experiment and hope that it is helpful if you are running, or considering running, event-based A/B tests in the future.

– Li Jiang (Microsoft ExP),

– Ben Goldfein, Erik Johnston, John Henrikson, Liping Chen (Microsoft Start)

References

[1]  T. Xia, S. Bhardwaj, P. Dmitriev, and A. Fabijan, “Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2019, pp. 11–20.

[2]  A. Fabijan et al., “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining – KDD ’19, 2019, pp. 2156–2164.

[3]  A. Tartakovsky, I. Nikiforov, M. Basseville, Sequential Analysis: Hypothesis Testing and Changepoint Detection. Chapman and Hall/CRC Press.

[4]  R. Kohavi, D. Tang, and Y. Xu, Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.

The post For Event-based A/B tests: why they are special appeared first on Microsoft Research.

STEDII Properties of a Good Metric http://approjects.co.za/?big=en-us/research/articles/stedii-properties-of-a-good-metric/ Wed, 06 Apr 2022 16:30:31 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=832888 Good metrics enable good decisions. What makes a metric good? In this blog post we introduce the STEDII (Sensitivity, Trustworthiness, Efficiency, Debuggability, Interpretability, and Inclusivity) framework to define and evaluate the good properties of a metric and of an A/B test analysis in general. Each of these properties are essential; and together, they reinforce each other to ensure a good set of metrics for a proper analysis of an A/B test, which will yield valuable insights and enable good product decisions.

When a product adopts an experimentation-driven culture, software development tends to shift from being a top-down decision to more of a democratized approach. Instead of arguing about what should be built, product leaders define goals for metrics to improve the product, and they empower their teams to invest in changes that will ultimately achieve those goals. This allows the organization’s culture to innovate faster by testing multiple ideas for improvement, fail fast, and iterate.

One of the best ways to experiment with a software product is to run A/B tests. For successful A/B tests, it is very important to have the right metrics. But what makes a good metric? Is the company’s stock price a good metric for a product team? Probably not. It is not sensitive to small changes in the product, and we cannot observe the counterfactual – that is, the stock price in the universe where the treatment is not present. Perhaps the company could conduct an extensive user survey for each change, and then measure the degree of satisfaction that their users have for the change. However, such a survey for each product change would annoy users, it would be very costly to scale, and it would not be reflective of the overall user population because many users won’t respond to the survey. These examples demonstrate just how challenging it is to define a good A/B metric.

So how do we define a good metric? After running hundreds of thousands of A/B tests at Microsoft, we have identified six key properties of a good A/B metric:

  1. Sensitivity
  2. Trustworthiness
  3. Efficiency
  4. Debuggability
  5. Interpretability and Actionability
  6. Inclusivity and Fairness

In this blog post, we will examine each of these properties more closely to understand what makes a good metric, and we will provide checks to test a metric against each property. We would like to emphasize, however, that these are general properties of a good experimentation metric. They are necessary properties for most experiment metrics, but they may not be sufficient for use in every single case. In later blog posts, for instance, we will discuss Overall Evaluation Criteria (OEC) metrics that should have additional properties, such as being a proxy for the overall product, user, and business health.

STEDII Checklist for Creating Good Metrics

Sensitivity

In a previous blog post, we discussed the details of measuring and checking the sensitivity of the metric. We will briefly summarize those details here. A sensitive metric has a high chance of detecting an effect when there is one. Conversely, when a sensitive metric that is well powered has no stat-sig movement, we have high confidence that there is no treatment effect. Let \( H_1\) be the alternative hypothesis – i.e., there is real treatment effect of a certain size. Then,

\(Prob\)[Detecting the treatment effect on the metric] = \(Prob(H_1)Prob(\text{p-value}<0.05|H_1)\).

To measure the sensitivity of a metric, we can use a labeled corpus, consisting of tests in which there is high confidence that a treatment effect exists. That confidence is built by examining many metric movements to check if they align with the hypothesis of the test, deep dives, and offline analyses that add more evidence to the correctness of the hypotheses of the A/B tests. A sensitive metric will have a high proportion of stat-sig changes when there is an empirically known impact from the corpus. In cases where there is no labeled corpus, a more sensitive metric would have a higher proportion of overall tests where it is stat-sig. This will include no more than 5% false positive stat-sig tests, in the event that all tests in the corpus have no impact on the metric.

There are two components of metric sensitivity [1]:

  1. Movement Probability \([Prob(H_1)]\): How often is the alternative hypothesis true?
  2. Statistical Power \([Prob(\text{p-value}<0.05|H_1)]\): Given that an alternative hypothesis is true, how likely is it that we are able to detect the effect?

Movement Probability

If the movement probability of a metric is low, the metric will rarely have a statistically significant movement, even though it may have very high statistical power. In cases of well-optimized products, high-level metrics (such as “days active” and “sessions per user”) can be very difficult to improve in a short period of time, but they can regress very easily. So those metrics are poor indicators of success, but they are good guardrails. Proxy metrics with higher positive metric movement probability are better success metrics that we can aim to improve in an experiment. In other cases, we can improve metric design to measure more sensitive transformations of a metric, such as the “capped average of log of time spent in Teams,” instead of a simple average.

Statistical Power

Statistical power is the probability of detecting a stat-sig change if the alternative hypothesis is true. We refer to the smallest magnitude of metric change detectable with high probability (usually 80%) as the Minimum Detectable Effect (MDE). The smaller the MDE, the higher the statistical power. If we assume that Treatment and Control have the same size and the same population variance, the relative change (expressed as a fraction of the control mean) that we can detect (using the t-test) with 80% power is approximately \(\frac{4 \times CV}{\sqrt{\text{sample size}}}\)  [4]. Here, \(CV\) is the coefficient of variation, defined as the ratio of the standard deviation (\(\sigma\)) of a metric to its mean (\(\mu\)), \(CV = \frac{\sigma}{\mu}\). When designing a metric, we want to aim for a low \(CV\) in order to get more statistical power.
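
A minimal sketch of the power approximation above; treating "sample size" as the per-group count (treatment and control of equal size) is an assumption noted in the code.

```python
import numpy as np

def relative_mde(metric_values, n_per_group):
    """Approximate relative change detectable with 80% power (alpha = 0.05),
    using MDE ~= 4 * CV / sqrt(n), where CV = sigma / mu and n is taken to be
    the per-group sample size (an assumption about the formula's convention)."""
    cv = np.std(metric_values, ddof=1) / np.mean(metric_values)
    return 4 * cv / np.sqrt(n_per_group)

rng = np.random.default_rng(5)
sample = rng.exponential(scale=3.0, size=10_000)   # a skewed metric with CV close to 1
print(f"relative MDE at n=100k per group: {relative_mde(sample, 100_000):.2%}")
```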

Checklist for Creating Sensitive Metrics:

  1. Is the coefficient of variation of the key metrics small?
  2. Is there a high percentage of tests in the labeled corpus with a stat-sig change in the metric?
  3. Is there a reasonably high percentage of A/B tests with a stat-sig change in the metric?

Trustworthiness

Metrics are not only used to make informed product decisions; they also shape the incentives and future investments of feature teams. An untrustworthy metric can send a product down the wrong path, away from its intended goal. While accurate estimation of the variance of a metric should be handled by the experimentation platform, metric authors should focus on data quality, alignment with the goal and user experience, and generalization.

Data Quality

Our previous blog post on data quality provides an in-depth guide on the topic; for completeness, we summarize some of its key points here. When creating a metric, we should check for the following aspects related to data quality: missing data, invalid values, low join rates with other data sources, duplicate data, and delayed data. As mentioned in another past blog post on validating the trustworthiness of a metric, we must check that the p-value distribution of the metric under multiple simulated AA tests is uniform. Moreover, we recommend regularly monitoring these aspects of key metrics through dashboards, anomaly detection methods, and AA tests in order to detect and resolve any regressions.
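As one concrete way to run the AA-test check, the sketch below simulates a batch of AA tests on synthetic data and tests the resulting p-values for uniformity with a Kolmogorov-Smirnov test; the data and parameters are made up for illustration and are not how ExP implements this check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_aa_tests, users_per_variant = 500, 2_000

p_values = []
for _ in range(n_aa_tests):
    # In an AA test both variants draw from the same distribution,
    # so the metric's p-values should be uniform on [0, 1].
    control = rng.exponential(scale=5.0, size=users_per_variant)
    treatment = rng.exponential(scale=5.0, size=users_per_variant)
    p_values.append(stats.ttest_ind(treatment, control).pvalue)

ks = stats.kstest(p_values, "uniform")
print(f"KS test against uniformity: p = {ks.pvalue:.3f}")  # a tiny p-value flags a problem
```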

Alignment with the Goal and User Experience

When a metric is aligned with a key product goal, it aggregates data from all observation units into a single number that is most pertinent to that goal. For example, a common goal for an end-to-end performance measure such as Page Load Time (PLT) is that it should be satisfactory for the large majority of page loads. The distribution of PLT is usually very skewed, with a long tail. Average PLT is a poor metric for tracking this goal; the 95th or 99th percentile of PLT is more suitable. If percentiles are hard to compute, another option is a metric that estimates the proportion of page loads where PLT exceeds a threshold. Further, PLT is only useful if it measures the latency of the user's experience when loading a page. For instance, some webpages load most of their content after the standard page load event. In such cases, a good metric would measure the latency to the point where the page becomes functional for the end user.
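To show why the average is a poor fit for this goal, here is a minimal sketch on synthetic, hypothetical latency data comparing the average, a high percentile, and a threshold-based proportion:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, long-tailed page load times in milliseconds.
plt_ms = rng.lognormal(mean=6.5, sigma=0.6, size=100_000)

print(f"Average PLT:          {plt_ms.mean():.0f} ms")        # dominated by the tail
print(f"95th percentile PLT:  {np.percentile(plt_ms, 95):.0f} ms")
print(f"Loads above 2000 ms:  {(plt_ms > 2000).mean():.1%}")   # threshold-based alternative
```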

Often, the goal will be clear (e.g., “increase user satisfaction”), but it will be hard to determine a metric that actually aligns with that goal. In those cases it is important to test those metrics on a corpus of past A/B tests that we can confidently label as good or bad in terms of the desired goal [2]. In an upcoming blog post, we will discuss how to develop a trustworthy OEC that reflects user satisfaction.

Generalization

A trustworthy metric provides an unbiased estimate for the entire population for which a product decision is being made. The most important factor to look for in this case is selection bias in data generation, collection, transformation, and metric definition and interpretation.

An example of bias in data generation is "app ratings" in the App Store. If an app only directs users who are likely to be positive to the App Store, then the data generated from user reviews will be biased in a positive direction. A data collection bias can occur if we are unable to collect data from users who abandoned the product too quickly, or if the data collection is affected by the feature being tested. Data transformation can introduce bias if we incorrectly identify spurious users (such as bots or test machines) and either remove legitimate users or include spurious ones [5]. Metric definition and interpretation can also introduce bias if the metric analyzes only a subset of the intent-to-treat users or if it unintentionally puts more weight on certain users or actions.

We recommend being vigilant about selection bias end-to-end, and validating that a metric's value and movement are consistent with other tracked measurements and with our expectations for its movement in an A/B test. For metrics that are particularly important, it may be a good idea to run A/B tests where we are certain about the outcome, and then verify whether or not the metric movement aligns with it.

Checklist for Creating Trustworthy Metrics:

  1. Does the metric have missing data, invalid values, low join rates, duplicate data, or delayed data?
  2. Does the metric align with its goal?
  3. Is the metric value and its movement consistent with other signals for the goal?
  4. Are there selection bias issues during data generation, collection, transformation, or definition?

Efficiency

As the Experimentation Flywheel turns, more and more A/B tests are run regularly, and we will need to compute A/B metrics for a large number of A/B tests. Therefore, the time, complexity, and cost to compute the metrics should be manageable. We will examine these factors related to efficiency in more detail below.

  1. Time: Agile software development needs reasonably quick decision making. For an A/B metric to become an integral part of that decision making, we should be able to compute that metric quickly. Metrics like “proportion of monthly active users,” which are common key product indicators, do not offer agility for scalable experimentation; we would have to run an A/B test for multiple months before we could begin to compute it. Better alternatives are metrics that act as proxies and surrogates of change in Key Performance Indicators – such as “days active per user” or “sessions per user,” as proxies for “proportion of monthly active users.”
  2. Complexity and Failure Rate: Complex metrics, such as those that need to be computed in a lab (e.g., cognitive engagement metrics based on user interviews), won't scale to properly represent the user base. We should also avoid complex metrics that may have a high failure rate due to dependence on multiple data sources or on large data sets that cannot be parallelized.
  3. Cost: We need to maintain a satisfactory Return On Investment (ROI) for each metric and the insights that we gather at scale from it. Getting labels (such as “is an email spam” or “is an ad relevant”) from human judges at scale is possible, but it will be expensive. Also, metrics that depend upon costly and complex statistical or machine learning models will have a high cost. It is better to find simpler alternatives for efficiency and interpretability reasons (discussed later in this blog post).

Checklist for Creating Efficient Metrics:

  1. Can this metric computation scale as the organization's experimentation needs scale?
  2. Do the underlying data sources get generated quickly and have a Service Level Agreement?
  3. Do the insights from the metric provide a good return on the cost of the metric computation?

Debuggability

Debuggability is essential for experimenters to understand why a metric is moving in an A/B test. If the metric regresses, the metric debugging must help the experimenter narrow down the cause of the regression so that a bug can be fixed or the design of a feature can be changed. Equally important, if the metric is improving, the metric debugging should help the experimenter understand why it improved. This will prevent them from falling prey to confirmation bias, and it will also help guide future investments. Let's discuss two common methods for making a metric debuggable: debug metrics and tools.

Debug Metrics

Debug metrics capture additional changes or treatment effects that help us better understand the movement of a more complex metric. They typically zoom in on a specific property of the complex metric in order to shed more light on the treatment impact. The reduced scope of the debug metrics usually makes them more sensitive [4].

There are three common ways to construct debug metrics:

  1. Breakdown: Breakdowns allow us to separate the different types of events that contribute to the overall metric. For example, consider a metric like "app crash rate." We can break it down by key factors like crash codes, and then create a family of debug metrics – such as "app crash rate with code A" – for each code we encounter. Note that to ensure we capture all crashes, the breakdown debug metric values should add up to the main metric, "app crash rate." Usually, a treatment will increase errors of a particular type; therefore, such a breakdown can quickly identify the cause of the main metric movement.
  2. Segment: We can segment a main metric, such as “clicks,” by factors like “browser” in order to create click metrics based on data from a given browser or a given date. Again, this helps narrow down problems that may be specific to a particular factor, such as issues with a browser or outage on a particular day. Such segments are usually defined for a whole set of metrics so that we can obtain the entire set of metrics for a particular segment at a lower cost with batch processing.
  3. Decompose: We can decompose complex metrics, such as "Click Through Rate (CTR)," into component metrics like the "numerator" (clicks) and "denominator" (impressions) so that we can determine which component is the major contributor to the metric movement. An increase in a CTR metric may generally be considered good; but if it is caused by a decrease in impressions, it may indicate a regression, as the sketch below illustrates.
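Here is a minimal sketch of the decomposition idea from item 3, using made-up aggregate counts; it is illustrative only, not how ExP computes these metrics.

```python
# Hypothetical aggregate counts from one A/B test.
control = {"clicks": 52_000, "impressions": 1_000_000}
treatment = {"clicks": 53_000, "impressions": 980_000}

ctr_c = control["clicks"] / control["impressions"]
ctr_t = treatment["clicks"] / treatment["impressions"]

print(f"CTR delta:         {ctr_t / ctr_c - 1:+.2%}")                                      # +4.0%
print(f"Clicks delta:      {treatment['clicks'] / control['clicks'] - 1:+.2%}")            # +1.9%
print(f"Impressions delta: {treatment['impressions'] / control['impressions'] - 1:+.2%}")  # -2.0%
# Roughly half of the CTR gain here comes from losing impressions, which may be a regression.
```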

Tools

Key guardrail metrics, such as "performance" and "reliability," benefit from diagnostic and raw-data debugging tools that identify the cause of a regression. Diagnostic tools can help reproduce and diagnose an issue on a developer's machine that is running the treatment. For instance, by inspecting the data that a device is sending, Fiddler can help pinpoint regressions in performance or telemetry loss. We recommend that teams develop tools that can mine A/B test data to find instances of regressions in the raw data, such as stack traces for crashes that are caused more often by treatment than by control.

Checklist for Creating Debuggable Metrics:

  1. Are there metrics breaking down, segmenting, and decomposing the main metric to debug a metric movement?
  2. Are there tools to diagnose and reproduce a regression in a key metric?
  3. Are there tools to mine additional important information needed to debug a metric?

Interpretability and Actionability

In order to enable the best informed product decisions, a good metric must be easy to understand and easy to act upon by all team members – not just experts. Interpretability reinforces the trustworthiness and debuggability of the metric, and it also provides the right information for the proper usage of a metric in a product decision. In general, there are two key aspects of interpretability that we should be mindful of – clarity about the goal and direction of the metric, and caveats about its usage.

Clarity About the Goal and Direction of the Metric

For a metric to be interpretable and actionable, it is essential for all team members to understand what the goal of the metric is, and why it is important. This context is provided by the name and description of the metric, as well as by the presentation of its results in relationship to other metrics. We recommend establishing a review process to ensure that at least one person who is not involved with creating a metric can understand its goal and importance.

It should also be easy for all team members to understand when a metric movement is good or bad. We should try to design the metric in such a way that we can assign a “good-for-the-user” or “bad-for-the-user” label to its movement in a specific A/B test. At ExP, we use color coding to indicate the difference between a good or a bad movement. This is usually the first level of information that A/B test owners consume in order to make a decision about the test.

If a movement in a metric can be interpreted as either good or bad, depending on the treatment tested, then it introduces an extra level of subjectivity in the process. It is best to try to avoid that subjectivity through a better metric design. An example would be the tracking of the distribution of page views in a product across users, and understanding how a treatment impacts heavy users (those with a large number of page views) and light users (those with a small number of page views). This metric distribution could potentially be represented in a histogram that tracks the proportion of users falling in buckets labeled as 1, [2,4), [4,8), and so on. However, this would be challenging to properly interpret. A loss of users in the [2,4) bucket could be good if those users moved to the [4,8) bucket, but it would be bad if they moved to the 1 bucket. It would be better to represent the cumulative distribution with buckets labeled as 1, [2,∞), [4,∞), [8,∞), and so on. In this representation, loss in each bucket would have an unambiguous interpretation; a decrease in bucket 1 would always be good, while a decrease in any other bucket would always be bad. This property is also referred to as directionality.
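The sketch below contrasts the two representations on synthetic page-view counts; the bucket edges and data are hypothetical and chosen only to mirror the example above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic page views per user, skewed toward light users.
page_views = rng.geometric(p=0.3, size=10_000)

edges = [1, 2, 4, 8, 16]

# Histogram buckets: a shift out of [2,4) is ambiguous without more context.
for lo, hi in zip(edges, edges[1:] + [np.inf]):
    print(f"[{lo},{hi}):   {((page_views >= lo) & (page_views < hi)).mean():.1%}")

# Cumulative buckets: a drop in any [k,inf) bucket is unambiguously bad.
for lo in edges[1:]:
    print(f"[{lo},inf): {(page_views >= lo).mean():.1%}")
```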

Caveats About the Usage of a Metric

Almost every metric has a blind spot, because it is aggregating a large number of measurements from all observation units into a single number. It is important to communicate the limitations of the metric, and most importantly, the cases when a metric should not be used. Not all metrics will be able to test every kind of change. It is essential to know exactly what kind of changes will be tested in an A/B test, so that the metric can be properly designed to measure their effect. A good example would be a revenue estimate metric that is computed using the historical averages of revenue made per action type. This metric works well for testing changes where the revenue made per action type has not changed; but otherwise, it will give a wrong estimate. In websites, the time to load a page is usually measured as the time difference between the first request sent by the client to load the page and the actual page load event. This estimates the amount of time that a user must wait before they see the page content. But if a treatment is still loading content after the page load event, or if a treatment is issuing the request to load a page in advance, then this metric breaks and is no longer valid.

Checklist for Creating Interpretable and Actionable Metrics:

  1. Is it clear why a change in the metric is important for broader product goals?
  2. Is it clear to all team members when a metric movement is good or bad?
  3. Is it clear what kind of treatments can be tested with the metric?

Inclusivity and Fairness

Although there may be a small set of OEC metrics that indicate the success of an experiment, experimenters should rely on a holistic set of metrics to make sure that a product decision is inclusive and fair [5]. Each metric in a metric set provides one data point that is used in conjunction with other data points in order to make a product decision. For inclusive and fair decision making, it is important to make sure that there is no unintended bias in our metrics. This is ensured by looking at three major factors: missing values, weights, and heterogeneity.

Missing Values

We have already discussed selection bias issues in the Trustworthiness section, under the subheading labeled “Generalization.” Selection bias does not exclude observation units randomly between treatment and control. Rather, it leads to the exclusion of observation units that share a common set of characteristics. For example, devices with low network bandwidth may not be able to load a page fast enough to send data before a user gets frustrated and abandons the product, or users who are neither very happy nor very unhappy with a product tend to respond less to surveys about user satisfaction. Metrics that overlook missing value issues in certain segments of observation units are blind to the regression in product experience for that segment.

Whenever possible, we should design metrics in a way that either avoids missing values or allows them to be imputed. For instance, a "clicks per user" metric can impute a value of 0 for users who were assigned to treatment or control but who did not show up in the product logs. For performance metrics and ratio metrics, however, we cannot impute 0s for missing values. We should have data-quality metrics that alert us if the proportion of missing values changes due to the treatment. For survey-based metrics with large numbers of missing values that cannot easily be imputed, we should have alternative proxy metrics that are based on data observable from most observation units.
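For illustration, here is a minimal sketch of imputing zeros for assigned users who never appear in the logs; the tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical assignment log: every randomized user and their variant.
assignment = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "variant": ["T", "T", "T", "C", "C", "C"],
})
# Hypothetical product log: only users who actually showed up.
clicks = pd.DataFrame({"user_id": [1, 4, 5], "clicks": [3, 1, 2]})

# Join on the assignment so absent users are counted with 0 clicks, not dropped.
per_user = assignment.merge(clicks, on="user_id", how="left").fillna({"clicks": 0})
print(per_user.groupby("variant")["clicks"].mean())
```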

Weights

A metric can typically aggregate data from observation units in three different ways:

  1. Giving equal weight to all units (e.g., proportion of users who have at least one click)
  2. Giving equal weight to every activity of a unit (e.g., proportion of sessions with a click, or proportion of impressions with a click)
  3. Simply counting the average number of events (e.g., clicks per unit)

Even if one of these metrics is the main metric that aligns with the goal of the A/B test, it is best to have multiple metrics that place different weights on observation units and activities, so that experimenters can get more insight into the distribution of a metric movement along key factors. This makes us more confident in making a good product decision. For a product with a mix of heavy and light users, a "clicks per user" metric generally increases with an increase in overall engagement with the product; but it could also increase due to an increase in engagement from heavy users, even when there is a drop in engagement from light users. Similarly, a "clicks per impression" metric generally increases when there is an overall increase in engagement with impressions; but it might also increase when there is more engagement with a popular page of the product, even when there is a decline in engagement with less popular pages. Lastly, a "proportion of users with a click" metric increases when non-engaged users become more engaged; but when already-engaged users engage even more with the product, it may not show an increase [1, 4].
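As a sketch, all three weightings can be computed from the same raw data; the impression-level log and its columns below are made up for illustration.

```python
import pandas as pd

# Hypothetical impression-level log: one row per impression shown to a user.
log = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 3],
    "clicked": [1, 1, 1, 0, 0, 1, 0],
})

per_user = log.groupby("user_id")["clicked"].agg(["sum", "max"])

print("Proportion of users with a click:", per_user["max"].mean())  # equal weight per user
print("Clicks per impression:           ", log["clicked"].mean())   # equal weight per activity
print("Clicks per user:                 ", per_user["sum"].mean())  # average event count
```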

In cases where we want to ensure an improvement in performance for the bottom 5% or 10% of the observation units, we should compute 95th and 90th percentile metrics or threshold metrics (as we discussed in the Trustworthiness section, under the subheading labeled “Alignment With the Goal and User Experience”).

Heterogeneity

A/B metrics estimate the average treatment effect over all observation units, and they can sometimes be blind to the different impact a treatment has on a specific segment of users. Therefore, it is important to have good segments that allow a metric to be viewed for a subpopulation. A good segment should have the following properties:

  1. Interpretable: Any team member should be able to understand the segment information.
  2. Low cardinality: A segment should have a small number of groups (ideally fewer than 20); this saves computation time and makes it easier to review segment-level information (e.g., divide all the countries of the world into regions, rather than having 180+ countries).
  3. Well-powered: A good segment definition should lead to good and even statistical power across all segment values, in order to be able to detect heterogeneous treatment effects.
  4. Correlation with attributes likely to impact experience: Segment definitions should be guided by product and user understanding to help identify the impact on the most vulnerable sections of users – e.g., new users, users with low-end devices, or users identified by external modeling to be "at risk of churning."

Common segments include market, country, pre-A/B test activity level, device and platform, day of the week, and product-specific user personas. For more details, read our earlier blog post on Patterns of Trustworthy Experimentation: During Experiment Stage.

For some metrics, where a more uniform distribution of the metric across units is a favorable outcome, we can also create metrics that track that goal directly.

Checklist for Creating Inclusive and Fair Metrics:

  1. Are you aware of the blind spots of the metric? Do you have other metrics or means to cover those blind spots?
  2. Do you have metrics to detect cases where a treatment has different impacts on different observation units or activities?
  3. Do you have a set of good segments to detect harm to the most important and the most vulnerable segments?

Summary

In this blog post, we introduced the STEDII (Sensitivity, Trustworthiness, Efficiency, Debuggability, Interpretability, and Inclusivity) framework to define and evaluate the properties of a good metric, and of an A/B test analysis in general. Each of these properties is essential; together, they reinforce one another to ensure a good set of metrics for a proper analysis of an A/B test, which will yield valuable insights and enable good product decisions. Many metric authors at Microsoft have successfully used this framework, and we hope that all our readers find it equally valuable!

– Somit Gupta and Widad Machmouchi, Microsoft Experimentation Platform

References

[1]        Deng, A. and Shi, X. 2016. Data-Driven Metric Development for Online Controlled Experiments. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD ’16 (2016), 77–86.

[2]        Dmitriev, P. and Wu, X. 2016. Measuring Metrics. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management – CIKM ’16 (2016), 429–437.

[3]        Kohavi, R. et al. 2009. Controlled experiments on the web: survey and practical guide. Data Min Knowl Disc. 18, (2009), 140–181. DOI:https://doi.org/10.1007/s10618-008-0114-1.

[4]        Machmouchi, W. and Buscher, G. 2016. Principles for the Design of Online A/B Metrics. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval – SIGIR ’16 (New York, New York, USA, 2016), 589–590.

[5]        Dmitriev, P. et al. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. Proceedings of the 23rd ACM SIGKDD international conference on Knowledge discovery and data mining – KDD ’17 (Halifax, Nova Scotia, Canada, 2017).

 

The post STEDII Properties of a Good Metric appeared first on Microsoft Research.

]]>
Measurably improve your product by combining qualitative and quantitative methods http://approjects.co.za/?big=en-us/research/articles/measurably-improve-your-product-by-combining-qualitative-and-quantitative-methods/ Sat, 12 Feb 2022 00:55:13 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=819331 Imagine that you have developed a new hypothesis for how to improve the user experience of your product. Now you need to test it. There are many ways that you could approach this. For instance, running an A/B test, engaging directly with users, or sending out a survey. Each of these methods fall into one […]

The post Measurably improve your product by combining qualitative and quantitative methods appeared first on Microsoft Research.

]]>
Imagine that you have developed a new hypothesis for how to improve the user experience of your product. Now you need to test it. There are many ways that you could approach this. For instance, running an A/B test, engaging directly with users, or sending out a survey.

Each of these methods falls into one of two categories. The first is quantitative: the process of collecting and analyzing numerical data. The second is qualitative: the process of collecting and analyzing data in the form of text or observations to understand concepts, opinions, and/or experiences. While both kinds of methods can provide key insights into the user experience, they each excel at answering different types of questions. Thus, effective measurement and analysis should encompass both categories. This is the approach that Microsoft's Developer Division (the team that brings you Visual Studio and Visual Studio Code) employs.

In this blog post, we highlight some of the qualitative and quantitative methods that we use to help us develop our tools. We will explain how we think about choosing a qualitative method alongside quantitative analysis and conclude with an example of a product pain point we solved using both methodologies.

Methods for making data-driven product decisions

What are qualitative methods?

Qualitative research provides insights into the user experience with techniques such as usability studies [1], surveys [2], focus groups [3], customer interviews [4], and diary studies [5]. In Microsoft's Developer Division, we use a combination of these methods.

Now, you might be wondering, “How do I know which method to choose?” Great question! The right method largely depends on the research question that you have in mind. By identifying 1-to-3 open-ended research questions, you can determine whether your focus is generative or evaluative research [6]. This can help you to identify which method(s) to use. You may feel that you want to conduct more generative research if your research questions are aimed at helping you build a mental model of a problem space or identify new opportunities for your product (e.g., “How do users typically use my product?”). Conversely, evaluative research takes center stage if you want to determine how to alleviate a pain point or improve usability (e.g., “Will this prototype meet my users’ needs?”).

While there isn’t a hard and fast rule about which methods best tease out which type of information, some methods are better suited to answering generative research questions (e.g., customer interviews) while others are better suited for answering evaluative research questions (e.g., usability studies).

At first you might be hesitant to try out new qualitative methods. Yet it is important to approach qualitative research with an open mind and with the goal of learning more about your users. This may even include gleaning insights that were completely different than those that you anticipated.

What are quantitative methods?

Quantitative research provides insights into the user experience at scale. In Microsoft’s Developer Division, quantitative research usually falls into one of two buckets: telemetry monitoring and A/B testing. Like qualitative research, the method that you choose largely depends on the question that you want to answer.

When we have questions about the current experience of our product, we use telemetry monitoring. This involves things like setting up dashboards or tracking user interactions (e.g., hitting an error). These methods can reveal product pain points and help us prioritize where to make improvements. When we want to evaluate a specific hypothesis for how to improve a product, or to ensure that a new change won't negatively impact users, we use A/B testing. A/B tests allow us to determine the effect that a new experience has on our full population of users. This makes it possible to measure the impact of individual changes against our larger business goals and metrics.

While both quantitative methods can reveal broad patterns of user behavior, they provide different insights. So, it is important to use both in product development to match the objectives of your research.

It’s not either-or: Combining quantitative and qualitative methods

When it comes to using quantitative methods or qualitative methods, it’s not a question of one or the other. Rather, ask, “What is my research question?” Or “What problem am I trying to solve?” Moreover, you’ll likely need both types of methods to come away with a decision that you feel confident about.

Quantitative data can help us establish a baseline and get some ground truth data. What’s more, it can help us get that information at scale and over a diverse population. We can develop hypotheses and rigorously evaluate them by running A/B tests. Often, quantitative methods are superheroes at answering “what” questions. For instance, what is really happening here? What is the impact?

Conversely, qualitative methods allow us to collect supporting data that help us to make sense of what is happening, evaluate potential solutions before implementing them, and build empathy for our users. Since we don’t have to implement a full solution, it can oftentimes be easier to receive user feedback using qualitative methods rather than large-scale A/B tests. One set of qualitative research alone isn’t going to be enough to validate or invalidate a design or a hypothesis (because chances are your sample of users is not representative of your entire population). Qualitative methods allow us to address “why” questions. For example, why do users want to do that? Why will or won’t this solution work?

To build a robust mental model of a problem space, we usually cycle through “what” and “why” questions to iterate on our understanding of the area. So, to answer the question of when to use which set of methods, the answer is “it depends…,” and “…it’s probably both!”

Using both quantitative and qualitative methods to improve a product: Pylance case study

What is Pylance?

At a high level, Pylance [7] and other language servers provide all the bells and whistles of a good developer experience for writing Python code in tools such as Visual Studio Code, Visual Studio, and Azure Notebooks. It provides features like auto-completion, code navigation, function signature help, and refactoring code actions. Pylance also gives us diagnostics on problematic code (e.g., errors and warnings for code that will not run). Python users in Visual Studio Code now expect diagnostics, a feature that inherently helps build trust in the product and overall development experience.

The problem: False-positive diagnostics

A recent example where the Pylance language server team was able to leverage both quantitative and qualitative methods to solve a major pain point in the product was in lessening the number of false positives that occurred during import diagnostics. Diagnostics in VS Code appear as squiggles under problematic or incorrect/invalid lines of code.

In this case, a false positive diagnostic on an unresolved import meant that Pylance was flagging incorrect lines of code as problematic because it detected that the corresponding imported modules could not be found in the project (either in user-defined code or in third-party modules).

Example Python code in VS Code with an unresolved import.

Approaching the problem space

Making the language server smarter, carefully

Looking at the telemetry, we quickly discovered this problem was more widespread than we previously thought. Ideally, we don’t want users to have to deal with implementation or language-specific details. So, the first plan of attack to improve the experience was to make Pylance smarter. We wanted to reduce the unresolved imports and avoid manual interaction with Pylance settings before tackling potential UI changes that increased the discoverability of the setting.

As a team, we designed some new logic that could potentially improve the way that Pylance resolves imports. However, our team had concerns about adding this to the product as permanent logic: while the heuristic would almost certainly decrease the number of unresolved imports in user code and improve completion coverage, it was also possible that other user experience and performance metrics would degrade.

With that in mind, the best course of action for us was to roll this out via an A/B test so that we could measure the impact of this change at scale. The A/B test also provided us with the ability to shut the new experience down quickly if we saw the user experience start to degrade. The hypothesis that we tested was that the heuristic would improve metrics measuring resolving imports without degrading key guardrail metrics related to product performance and successful engagement with Pylance.

Doubling down on our approach via qualitative methods

The heuristic only addressed one case that caused unresolved imports. So, we also wanted to explore options for improving the discoverability of the feature’s related setting. While an A/B test can tell us that users are using a setting more often, it does not easily tell us why users prefer one UI over another. These questions are best suited for evaluative qualitative methods. So, while the A/B test was running, we started concept-value testing new user interface options.

Concept-value testing provides insight into the understanding and perception your users have around a particular idea. It does this by soliciting feedback on the usefulness of a proposed solution. For this study, we recruited several Visual Studio Code users and showed them mockups of different user interface options aimed at addressing our goal of increasing the discoverability of a setting. We asked probing questions about the UI to participants first broadly (e.g., “What do you think this is supposed to do?”) and then more targeted (e.g., “What if we told you that this button is supposed to do x/y/z?”). This allowed us to capture both their expectations for how the UI would work and how they felt about the intended design.

Coming out of the concept-value testing we discovered that the new UI was considerably more actionable and educational than the existing experience and associated documentation. As such, we opted to implement the mockup that was most well-received by our concept-value testing participants.

Example mockup seen by participants in concept-value testing.

Synthesizing qualitative and quantitative outcomes

After finishing the concept-value testing, we also had results from our A/B test. The results confirmed that there was a statistically significant improvement in the way Pylance resolves imports without degrading the product’s performance. This confirmed our hypothesis that this heuristic would benefit users. Given the confirming evidence for both changes, we shipped both to the product. Our work, however, was not complete. These two data points have opened new questions on how to improve Pylance that will be explored further through both qualitative and quantitative methods.

Putting things together

To answer a given research question, it’s important to first identify what methodology would be the best place to start answering your question – quantitative or qualitative. Are you looking to get baseline data, or does your question need to be measured at scale? If so, starting with quantitative methods (e.g., A/B testing, measurement via telemetry) is best. Do you want to better understand why your users behave in a certain way or understand what their perceptions are of your feature? If the answer is yes, then start with qualitative methods (e.g., customer interviews, usability testing).

Although you might start with one type of research, you should remember that it’s advisable to use both qualitative and quantitative methods in tandem. This will help you to tease out meaningful insights and make data-driven decisions for your product. When combined thoughtfully, the sum of quantitative and qualitative methods yield more value than either does independently.

–Savannah Ostrowski (Microsoft Developer Division), Julie Stevenson (Microsoft Experimentation Platform)

References

[1] K. Moran, “Usability Testing 101.” https://www.nngroup.com/articles/usability-testing-101/

[2] S. Farrell, “28 Tips for Creating Great Qualitative Surveys.” https://www.nngroup.com/articles/qualitative-surveys/

[3] User Interviews, “Focus Groups.” https://www.userinterviews.com/ux-research-field-guide-chapter/focus-groups

[4] E. Dopson, “A comprehensive guide to in-depth interviews (IDIs).” https://www.userzoom.com/interviews/comprehensive-guide-to-in-depth-interviews-idis/

[5] M. Hasley and E. Tibias, “How to Conduct a Diary Study: A Start-to-Finish Guide.” https://dscout.com/people-nerds/diary-study-guide

[6] J. Estes, “Generative vs. evaluation research: what’s the difference and why do we need each?” https://www.usertesting.com/blog/generative-vs-evaluation-research

[7] “Pylance.” https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance

 

The post Measurably improve your product by combining qualitative and quantitative methods appeared first on Microsoft Research.

]]>
Microsoft’s Experimentation Platform: How We Build a World Class Product http://approjects.co.za/?big=en-us/research/articles/microsofts-experimentation-platform-how-we-build-a-world-class-product/ Fri, 28 Jan 2022 18:53:32 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=812380 Microsoft’s Experimentation Platform (ExP) provides a platform used by product teams across Microsoft to run 1,000s of A/B tests every month. From a product perspective this means that we have a big responsibility both as a steward for data-driven decision-making and as an innovative center of excellence. As a result ExP’s product team must prioritize […]

The post Microsoft’s Experimentation Platform: How We Build a World Class Product appeared first on Microsoft Research.

]]>
Microsoft’s Experimentation Platform (ExP) is used by product teams across Microsoft to run thousands of A/B tests every month. From a product perspective, this means that we have a big responsibility, both as a steward of data-driven decision-making and as an innovative center of excellence. As a result, ExP’s product team must prioritize effectively to maximize the impact of investments, balancing long-term goals with ongoing customer feedback. In this post, we describe some of the strategies and processes that we use to build a world class, scalable experimentation platform with a core focus on trustworthy A/B testing.

Diagram of our process for building a world class product.

Set your sights on a long-term vision and measure your progress

ExP’s mission is to accelerate innovation through trustworthy analysis and A/B testing. Trustworthiness is at the core of our platform, in terms of both giving users the ability to run successful A/B tests and empowering them to make high-quality decisions. To align our investments with long-term priorities for our customers, including product teams throughout Microsoft such as Office, Bing, Xbox, and Azure, we revisit our product vision and strategy on an annual basis. This includes using the OKR (objectives and key results) framework to set new goals that measure our impact, including adoption, engagement, and experiment quality. As a product team, we also test our own product: we run our own A/B tests to protect the customer experience by assessing the impact of new features on end users.

Keep a pulse on your customers

A key component of our product development process is to stay in touch with our customers to track their success, collect targeted feedback, and triage new asks. This requires a multi-faceted approach as outlined here:

  1. Regularly engage to accelerate the A/B testing flywheel: As a platform team we have several ways to engage with our customers to accelerate their A/B testing flywheel journey [1]. To ensure that our customers are successful at ramping up A/B testing for their products, we support their continuous and iterative development process by organizing periodic joint initiative reviews that include participants from product, data science, and engineering. To prepare for each initiative review, we collaborate with our customers to write a document that summarizes highlights and lowlights, checks in on progress, and captures opportunities for improvement and new feature requests [2]. These documents are reviewed in a meeting with dedicated reading and commenting time during the first half, followed by discussion of any feedback, comments, and decisions in the second half, to encourage a high-quality conversation.
  2. Create opportunities to collect continuous feedback: In addition to holding regular business reviews, our product team also organizes a monthly customer council meeting in which we present prototypes and get product feedback on new or existing features. The customer council meetings help us get feedback to quickly iterate and serve to supplement other avenues of feedback including focus groups, cognitive design walkthroughs, and one-on-one usability studies. These additional feedback mechanisms are scheduled on an as-needed basis to get more focused feedback that goes beyond in-app feedback surveys and quantitative results from A/B tests. 
  3. Prioritize and triage customer asks: To ensure that customer feature requests are prioritized appropriately we hold a weekly triage meeting in which feature asks are discussed and the team decides on a priority relative to our existing roadmap and customer commitments. Regardless of the outcome, a key component of this process is to close the loop with the customer to ensure alignment and maintain trust. 

Create a culture that fosters cross-team alignment and collaboration

In addition to having a fantastic team, one of the keys to successfully building a world class product is to have the right process. For our product team, this means carefully balancing roadmap prioritization by considering the return on investment of each project before we make a commitment to our customers. To manage our backlog, ExP Product Managers work within a trio that includes Data Science and Engineering representation. In collaboration with their trios, ExP Product Managers spend approximately 50% of their time on project management and execution, and the remaining 50% on forward-looking planning, which includes shaping the product vision for their product area, maintaining a long-term backlog, and developing new design prototypes. This gives the team the right balance and ensures that everyone stays aligned.

From a planning and execution perspective, we’ve created a continuous planning process in partnership with our engineering and data science teams that starts with backlog grooming within each product area and culminates in a cross-team backlog grooming exercise to set priorities and determine alignment across all teams. By revisiting prioritization on a regular basis using an agile Kanban-like process that considers return on investment, we can react quickly to shift resources to address changes in scope and priorities and stay focused on the highest-impact deliverables for our customers.

To manage specific projects, we use Azure DevOps to track ideas and feature requests from their inception until launch. We follow a lightweight process with variable detail depending on the size of the project, which includes the following steps: (1) product spec with job stories and UX mocks, (2) dev design with work item breakdown, (3) detailed UX design, (4) implementation, and (5) A/B testing and customer feedback. Following this sequencing enables us to collect customer feedback and ensures that everyone is aligned and can review each other’s work. Before a feature is ready to be launched, we schedule a bug bash and work directly with pilot customers to get feedback with an initial Beta release before rolling out the feature more broadly with an A/B test using our own platform. To monitor progress across the team and learn whether we are on track, we use Power BI to visualize the status of work. Looking beyond the launch of specific features, we also use Power BI to report on feature usage and product engagement, leverage the customer feedback process outlined above, and come up with our own innovative ideas to help us determine where to invest next.

Continuously invest in a world class product that powers trustworthy A/B testing

If you are a frequent reader of our blog, you have probably noticed some of our posts highlighting investments in trustworthy A/B testing, such as diagnosing sample ratio mismatch [3] and alerting [4]. We take pride in our product development process, we continuously invest in a world class product, and we will continue to share new and interesting A/B testing features on this blog as we advance our mission to accelerate innovation through trustworthy analysis and A/B testing.

– Sebastian Kohlmeier, Microsoft Experimentation Platform

References 

[1] A. Fabijan, B. Arai, P. Dmitriev, and L. Vermeer, “It takes a Flywheel to Fly: Kickstarting and Keeping the A/B testing Momentum.” http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/it-takes-a-flywheel-to-fly-kickstarting-and-keeping-the-a-b-testing-momentum/. 

[2] B. Porter, “The Beauty of Amazon’s 6-Pager.” https://www.linkedin.com/pulse/beauty-amazons-6-pager-brad-porter/.  

[3] A. Fabijan, T. Blanarik, M. Caughron, K. Chen, R. Zhang, A. Gustafson, V. Kavitha Budumuri, and S. Hunt, “Diagnosing Sample Ratio Mismatch in A/B Testing.” http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/diagnosing-sample-ratio-mismatch-in-a-b-testing/. 

[4] A. Agrawal, and J. Lu, “Alerting in Microsoft’s Experimentation Platform (ExP).” http://approjects.co.za/?big=en-us/research/group/experimentation-platform-exp/articles/alerting-in-microsofts-experimentation-platform-exp/. 

The post Microsoft’s Experimentation Platform: How We Build a World Class Product appeared first on Microsoft Research.

]]>