{"id":880938,"date":"2022-09-26T17:07:56","date_gmt":"2022-09-27T00:07:56","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=880938"},"modified":"2022-10-05T10:36:41","modified_gmt":"2022-10-05T17:36:41","slug":"for-event-based-a-b-tests-why-they-are-special","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/for-event-based-a-b-tests-why-they-are-special\/","title":{"rendered":"For Event-based A\/B tests: why they are special"},"content":{"rendered":"\n
An "event-based" A/B test is a method for testing two or more variants during a limited, event-driven window. We can use what we learn to increase user engagement, satisfaction, or retention for a product, while also applying the insights to future events and product scenarios. We often use A/B testing when launching a new feature, because it allows the product team to try different messaging and determine which content maximizes user engagement.
Unlike classic A/B testing, where a feature is developed, incrementally tested, gradually rolled out, and then becomes a permanent part of the product, an event-based feature has limited time for experimentation. The window can be as short as a day or a handful of days: for example, Olympics-related headlines shown on a news app only for the duration of the Tokyo Olympic Games. In this blog post, we explore some of the challenges of running event-based A/B tests and share insights on the set-up and analysis of such experiments.
As a best practice, product teams perform manual or unit testing before exposing features to end users. This helps detect and fix bugs to minimize user harm. However, not every bug can be caught at this stage. Feature teams often run A/B tests to discover issues that were overlooked or that cannot be covered by manual or unit testing: they expose the feature to a small percentage of traffic, measure user engagement, identify issues, remedy them, and then run another round of experimentation to verify the improvement [1]. For event-based A/B tests, however, the time constraints make it almost impossible to test and iterate on the feature during the experiment.
For global events like International Women's Day, which span multiple regions, we may want to run an A/B test at the same local time in each region. Depending on the experimentation system's capability, this could mean setting up multiple A/B tests, each targeting a specific region and starting at a different UTC time. If many regions need to be covered, the experiment set-up requires quite a bit of effort. It might be tempting to use a single A/B test for all regions that starts at exactly the same time. However, if an issue arises that is specific to one region, the feature team can do nothing about it other than stop the entire experiment. In contrast, having one experiment per region allows us to turn off the feature for the affected region alone. This approach provides a way to manage risk, at the cost of a small overhead in experiment management.
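As a rough sketch of how this set-up could be automated, the snippet below uses Python's standard `zoneinfo` module and a hypothetical region-to-timezone mapping to compute the UTC start time of each regional A/B test so that all of them begin at the same local wall-clock time. The region list and the midnight start time are illustrative assumptions, not part of any particular experimentation system.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical regions and representative time zones; a real set-up would use
# whatever region definitions the experimentation system supports.
REGION_TIMEZONES = {
    "US": "America/Los_Angeles",
    "UK": "Europe/London",
    "JP": "Asia/Tokyo",
}

def regional_utc_starts(event_date, local_hour=0):
    """Return the UTC start time for each region so that every regional
    A/B test begins at the same local wall-clock time."""
    starts = {}
    for region, tz in REGION_TIMEZONES.items():
        local_start = datetime(event_date.year, event_date.month, event_date.day,
                               local_hour, tzinfo=ZoneInfo(tz))
        starts[region] = local_start.astimezone(ZoneInfo("UTC"))
    return starts

# Example: schedule all regional tests to start at midnight local time on 2022-03-08.
for region, utc_start in regional_utc_starts(datetime(2022, 3, 8)).items():
    print(region, utc_start.isoformat())
```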
Metric results help us understand the impact of a feature. Data becomes available once:

1. the raw experiment data (telemetry) has been collected and processed, and
2. the metrics have been computed from that data.
Depending on the product scenario, it can take some time to complete steps #1 and #2. If an A/B test targets a broad audience, or very large data sets need to be consumed, it could take hours for results to be ready. In one example we observed recently, experimenters could not tell how the feature was performing until roughly the 8th hour after the experiment started, when the first scorecard, computed from the first 4 hours of data, became available. If there is an issue with the feature and metrics are the only way to learn about it, a significant amount of time will have elapsed before the feature team discovers the problem.
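To make the latency concrete, here is a minimal back-of-the-envelope sketch using the illustrative numbers from the example above: a 4-hour aggregation window plus roughly 4 hours of data processing and metric computation before each scorecard becomes readable.

```python
def scorecard_ready_hours(aggregation_window_hours=4, processing_delay_hours=4, num_scorecards=3):
    """Estimate, in hours after the experiment starts, when each scorecard is ready.
    The k-th scorecard covers data up to k * aggregation_window_hours and then
    needs processing_delay_hours before results can be read."""
    return [k * aggregation_window_hours + processing_delay_hours
            for k in range(1, num_scorecards + 1)]

# With the illustrative numbers above, the first results appear ~8 hours in: [8, 12, 16]
print(scorecard_ready_hours())
```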
It is possible to encounter issues during an A/B test. For example, Sample Ratio Mismatch (SRM) has been found to happen relatively frequently in A/B tests. It is a symptom of a variety of experiment quality issues spanning assignment, execution, log processing, analysis, interference between variants, and telemetry [2]. Debugging such issues takes time. For a classic A/B test with an SRM, it can take days, weeks, or even months to identify the root cause. Given the time limit and data latency, it may not be feasible for the feature team to identify the root cause before the event ends.
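One common way to detect an SRM is a chi-squared goodness-of-fit test of the observed user counts per variant against the configured traffic split. The sketch below, with made-up counts and a conventional threshold, illustrates the check; it is not the specific tooling or threshold used by any particular platform.

```python
from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios, alpha=0.001):
    """Flag a Sample Ratio Mismatch: test observed user counts per variant
    against the configured traffic split with a chi-squared test."""
    total = sum(observed_counts)
    expected = [total * r for r in expected_ratios]
    stat, p_value = chisquare(observed_counts, f_exp=expected)
    return p_value, p_value < alpha  # True means a likely SRM

# Made-up example: a 50/50 split that actually delivered 50,800 vs 49,200 users.
p, is_srm = srm_check([50_800, 49_200], [0.5, 0.5])
print(f"p-value={p:.2g}, SRM suspected: {is_srm}")
```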
Displaying an event-based feature often comes at the cost of not showing another feature. Let's return for a moment to our Olympic Games example: a carousel of Olympics headlines. The space used for Olympics headlines could also be used for other types of content. A noticeable increase in user engagement with the carousel could therefore mean a missed ad engagement opportunity. How should we make a tradeoff between different types of engagement? How do we quantify and understand the impact in both the short and the long term?
*Treatment – Olympics carousel*

*Control – No Olympics carousel (shopping carousel is displayed by default)*

*Figure 1. An event-based feature comes at the cost of not showing another feature.*

There are multiple variables in event-based A/B tests. In the Olympics carousel example, its format, its content, and where it appears could all be things we want to test. We may also want to test it in different geolocations. Moreover, event-based experiments introduce a unique dimension of variability: the event itself. What would you say if the result shows a stat-sig decrease in content interaction for users located in the US? Does that mean the carousel is bad? What if the result shows a stat-sig increase for users in Asian countries? What if you see the opposite result for a similar carousel but a different event, e.g., the Super Bowl? The table below shows how the immediate reaction to a regional result can differ from the actual reason behind it.

| Feature for A/B testing | Region | Metric movement | Immediate reaction | Actual reason |
| --- | --- | --- | --- | --- |
| A carousel for the Olympics event | US | -0.8% | Carousel does not work well in the US -> may need to change the format or source of content in the carousel | CJK users are more interested in the Olympics event than US users |
| A carousel for the Olympics event | CJK (China, Japan, Korea) | +1% | Carousel works well in CJK | CJK users are more interested in the Olympics event than US users |

In classic A/B testing, we can have two different UX treatments for a feature. Through testing, we can see which one works better and apply what we learn in other experiments. In event-based experiments, however, insights from one event may not be transferrable to others. For instance, the Olympics is different from International Women's Day, which is quite different from the Super Bowl. Thus, it can be difficult to draw conclusions from one experiment and define the right execution for the next event.

## Recommendations on experimentation infrastructure and analysis

We recommend the following to address the challenges of event-based A/B tests.

### Experiment tooling

**Provide the option to schedule A/B tests in batch and automatically run experiment and metric analysis.** This is especially helpful when traffic rotation needs to happen and multiple A/B tests need to be created for the same feature targeting different regions. The feature team should be able to schedule A/B tests for different time zones and have the tests start automatically at preset times. The short-period metric analysis should be kicked off as soon as data becomes available, so that experimenters can see results early, before the event ends.

Ideally, this option should be an integral part of the experimentation system. Depending on how often event-based A/B tests are projected to run and the ROI (return on investment) of the engineering work, it might be good enough to have a plug-in or tooling that leverages an existing system's API (application programming interface) to set up, run, and analyze event-based A/B tests automatically.

### Near-real-time monitoring

**Establish a near-real-time pipeline to monitor, detect, and debug A/B tests.** Feature teams need to react quickly when something is off, and waiting hours for metrics to be calculated puts users and teams at risk of adverse impact. With a near-real-time pipeline, experiment data is aggregated every couple of minutes and key guardrail metrics are computed. These metrics help detect egregious effects, especially in the first few hours after the experiment starts. Although they may be only a subset of all the metrics the feature team cares about, they allow the team to closely monitor event-based A/B tests, debug issues quickly, and shut down experiments when needed. At Microsoft, we have established a near-real-time pipeline for live site monitoring of online services (details to be shared in a future blog post). It allows us to quickly detect experiments that have bad outcomes.
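As a simplified illustration of such a near-real-time check (not the actual Microsoft pipeline), the sketch below compares a hypothetical guardrail metric, an error rate, between treatment and control over a short aggregation window and flags egregious regressions with a two-proportion z-test. The window size, metric, and threshold are illustrative assumptions.

```python
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """Two-proportion z-test statistic for a guardrail rate (e.g., error rate)."""
    p_a, p_b = success_a / total_a, success_b / total_b
    p_pool = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / se if se > 0 else 0.0

def check_guardrail(window_counts, z_threshold=4.0):
    """window_counts: dict like {"control": (errors, requests), "treatment": (errors, requests)}
    aggregated over the last few minutes. Returns True if the treatment's error
    rate looks egregiously worse than control's, suggesting the experiment
    should be investigated or shut down."""
    err_c, n_c = window_counts["control"]
    err_t, n_t = window_counts["treatment"]
    z = two_proportion_z(err_c, n_c, err_t, n_t)
    return z > z_threshold  # large positive z: treatment error rate is higher

# Hypothetical 5-minute window: 120 errors out of 100k requests vs 450 out of 100k.
print(check_guardrail({"control": (120, 100_000), "treatment": (450, 100_000)}))
```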
Note that having near-real-time data can motivate experimenters to check scorecards more frequently, before the fixed time horizon is reached. This is called "p-hacking": it inflates the Type I error rate and causes experimenters to see more false positives. A traditional, "fixed-horizon" A/B test statistic no longer works here, and sequential testing is better suited for continuous monitoring of the experiment. To develop the sequential probability ratio test, it is advisable to understand the metric distributions beforehand and to verify that the independence assumption holds for the test to be applicable [3].
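One widely used sequential method is the mixture sequential probability ratio test (mSPRT), which produces an "always-valid" p-value that can be checked after every data refresh without inflating the Type I error rate. The sketch below is a minimal illustration for a metric whose per-user treatment-minus-control difference is approximately normal with a variance assumed known from historical data; the variance, the mixture parameter `tau_sq`, and the checkpoint numbers are made-up choices, and this is not the specific implementation of any particular experimentation platform.

```python
import math

def msprt_always_valid_p(mean_diff, var_per_obs, n, tau_sq):
    """Mixture SPRT likelihood ratio and the corresponding always-valid p-value
    after n paired observations, for H0: no treatment effect.

    mean_diff   : observed mean treatment-minus-control difference
    var_per_obs : variance of a single difference (assumed known, e.g. from history)
    n           : number of observations so far
    tau_sq      : variance of the normal mixture over the effect size
    """
    v = var_per_obs
    lam = math.sqrt(v / (v + n * tau_sq)) * math.exp(
        (n ** 2) * tau_sq * (mean_diff ** 2) / (2 * v * (v + n * tau_sq))
    )
    return min(1.0, 1.0 / lam)

# Monitoring loop sketch: keep the running minimum p-value and stop early if it
# drops below alpha. The checkpoints and observed differences are made up.
alpha, p_running = 0.05, 1.0
for n, mean_diff in [(1_000, 0.01), (5_000, 0.03), (20_000, 0.05)]:
    p_running = min(p_running, msprt_always_valid_p(mean_diff, var_per_obs=1.0, n=n, tau_sq=0.01))
    if p_running < alpha:
        print(f"Stop at n={n}: always-valid p={p_running:.2g}")
        break
```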
### Triggered analysis

**Use triggered analysis to increase the sensitivity of metrics.** When an event-based feature is displayed, it is possible that not every product user sees it. For example, a component may require the user to scroll the page before it becomes visible; if the user does not scroll, the component is never discovered. Sometimes the feature might be enabled only when certain conditions are met. For instance, we might show sports-event-related features only if the user has previously visited sports articles or websites. Using the full user population for analysis can dilute the results. It is valuable to do a triggered analysis, i.e., to analyze only those users who actually see the experience [4]. From our observations of A/B tests run at Microsoft, the more targeted the audience and the metrics for analysis, the more likely we are to get results with stat-sig metric movements.

### Post-experiment analysis

**Conduct post-experiment analysis to understand impact not reflected in the A/B test results.** These analyses help establish a more complete picture of the experiment and the event itself. For example, an event-based carousel may cause a drop in revenue because fewer ads are displayed (as shown in Figure 1). However, if users like the carousel, there may be a lingering effect that makes them revisit the app more frequently. A post-experiment retention analysis helps quantify impact that is not observed during the A/B test. By comparing the retention of the treatment and control cohorts after the experiment, we may find that the feature leads to an increase in user retention over the long term.

We can also dig deeper to uncover other insights. For instance, if the overall difference in retention is small, could it be prominent for some subset of users? Could there be a shift from product "active users" to "enthusiasts", or from "random users" to "active users", among those who saw the treatment experience? Could there be a more observable difference if we look at cohorts that have been exposed to multiple event-based features on a cumulative basis?

As the event itself is a variable, doing *cross-experiment analysis* helps shed light on the differences between events. This requires keeping a repository of historical A/B tests and metrics data. By comparing the metric movements between different events, or by applying machine learning techniques, we can find out how the event, region, feature format, content, and other variables play a role in the metric movement. The value of such analysis depends on the data accumulated over time: by testing more event-based features and collecting more data, we can derive more value from the cross-experiment analysis.

## Summary

Event-based experiments are a unique type of A/B testing due to their time sensitivity and limited duration. These experiments face unique challenges throughout the experimentation lifecycle, including feature testing, experiment set-up, analysis, debugging, experiment understanding, and learning. Event-based testing requires specific tooling, monitoring, and analysis approaches to address these and other challenges. As we embrace diversity and inclusion across the world, we expect to see more event-based A/B tests across various products. In this blog post, we shared our thoughts and recommendations on this type of experiment, and we hope it is helpful if you are running, or considering running, event-based A/B tests in the future.

*– Li Jiang (Microsoft ExP)*

*– Ben Goldfein, Erik Johnston, John Henrikson, Liping Chen (Microsoft Start)*

**References**

[1] T. Xia, S. Bhardwaj, P. Dmitriev, and A. Fabijan, "Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout," in *2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)*, 2019, pp. 11–20.