{"id":957738,"date":"2023-08-02T14:01:03","date_gmt":"2023-08-02T21:01:03","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=957738"},"modified":"2023-08-02T14:01:08","modified_gmt":"2023-08-02T21:01:08","slug":"a-b-interactions-a-call-to-relax","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/a-b-interactions-a-call-to-relax\/","title":{"rendered":"A\/B Interactions: A Call to Relax"},"content":{"rendered":"<p>If you\u2019re a regular reader of the Experimentation Platform blog, you know that we\u2019re always warning our customers to be vigilant when running A\/B tests. We warn them about the pitfalls of even tiny SRMs (sample ratio mismatches), small bits of lossiness in data joins, and other similar issues that can invalidate their A\/B tests [<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/diagnosing-sample-ratio-mismatch-in-a-b-testing\/\">2<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/data-quality-fundamental-building-blocks-for-trustworthy-a-b-testing-analysis\/\">3<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/patterns-of-trustworthy-experimentation-post-experiment-stage\/\">4<\/a>]. But today, we\u2019re going to switch gears and tell you to relax a little. We\u2019re going to show you why A\/B interactions \u2013 the dreaded scenario where two or more tests interfere with each other \u2013 are not as common a problem as you might think. Don\u2019t get us wrong, we\u2019re not saying that you can completely let down your guard and ignore A\/B interactions altogether. 
We\u2019re just saying that they\u2019re rare enough that you can usually run your tests without worrying about them.<\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"a-b-interactions\">A\/B Interactions<\/h2>\n\n\n\n<p>But we\u2019re getting ahead of ourselves. What are A\/B interactions? In an A\/B test, users are randomly separated into control and treatment groups, and after being exposed to different product experiences, metrics are compared for the two groups [1]. At Microsoft\u2019s Experimentation Platform (ExP), we have hundreds of A\/B tests running every day. In an ideal world, every A\/B test would get its own separate set of users. However, splitting users across so many A\/B tests would dramatically decrease the statistical power of each test. Instead, we typically allow each user to be in multiple A\/B tests simultaneously. <\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"a-case-where-concurrent-a-b-tests-are-safe\">A case where concurrent A\/B tests are safe<\/h4>\n\n\n\n<p>For example, a ranker might have one A\/B test that changes the order of web results, and another A\/B test that changes the UX. 
Both A\/B tests can run at the same time, with users assigned independently to the control or treatment of each A\/B test, in four equally likely combinations:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-regular; table-header\"><table class=\"has-black-color has-text-color\"><tbody><tr><td bgcolor=\"steelblue\" style=\"color: white\"><\/td><td bgcolor=\"steelblue\" style=\"color: white\"><strong>Ranker #1<\/strong><\/td><td class=\"has-text-align-left\" data-align=\"left\" bgcolor=\"steelblue\" style=\"color: white\"><strong>Ranker #2<\/strong><\/td><\/tr><tr><td bgcolor=\"steelblue\" style=\"color: white\"><strong>UX #1<\/strong><\/td><td>Control-control<\/td><td class=\"has-text-align-left\" data-align=\"left\">Control-treatment<\/td><\/tr><tr><td bgcolor=\"steelblue\" style=\"color: white\"><strong>UX #2<\/strong><\/td><td>Treatment-control<\/td><td class=\"has-text-align-left\" data-align=\"left\">Treatment-treatment<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">Table 1: Two A\/B tests for which independent control\/treatment assignment is safe<\/figcaption><\/figure>\n\n\n\n<p>In most cases, this is fine. Because the UX a user sees probably doesn\u2019t have much impact on how they respond to the ranking of the results, the treatment effects reported in the ranker A\/B scorecard will look the same regardless of what the UX A\/B test is doing, and vice versa.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"a-case-with-a-b-interactions\">A case with A\/B interactions<\/h4>\n\n\n\n<p>On the other hand, some cases are more problematic. For example, if there are two A\/B tests, one that changes the ad text color from black to red, and one that changes the ad background color from grey to red, whether the user is in the control or the treatment of one A\/B test will greatly impact the treatment effect seen in the other A\/B test. 
The user can no longer see the ad text when it\u2019s red on red, so the red ad text treatment might be good for users in the control of the ad background A\/B test, but bad for users in the treatment of the ad background A\/B test.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-regular; table-header\"><table class=\"has-black-color has-text-color\"><tbody><tr><td bgcolor=\"steelblue\" style=\"color: white\"><\/td><td bgcolor=\"steelblue\" style=\"color: white\"><strong>Ad text color: black<\/strong><\/td><td bgcolor=\"steelblue\" style=\"color: white\"><strong>Ad text color: light red<\/strong><\/td><\/tr><tr><td bgcolor=\"steelblue\" style=\"color: white\"><strong>Ad background color: grey<\/strong><\/td><td bgcolor=\"lightgray\" style=\"color: black\">Buy flowers!<\/td><td bgcolor=\"lightgray\" style=\"color: orangered\">Buy flowers!<\/td><\/tr><tr><td bgcolor=\"steelblue\" style=\"color: white\"><strong>Ad background color: red<\/strong><\/td><td bgcolor=\"red\" style=\"color: black\">Buy flowers!<\/td><td bgcolor=\"red\" style=\"color: orangered\">Buy flowers!<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">Table 2: Two A\/B tests for which independent control\/treatment assignment is not safe<\/figcaption><\/figure>\n\n\n\n<p>A\/B interactions can be a real concern, and at ExP we have techniques to isolate A\/B tests like these from each other when we suspect they will interact, so that users aren\u2019t assigned independently for each test. However, as already mentioned, doing this decreases the number of users available for each A\/B test, thus decreasing their statistical power.<\/p>\n\n\n\n<p> <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"looking-for-a-b-interactions-at-microsoft\">Looking for A\/B Interactions at Microsoft<\/h2>\n\n\n\n<p>Our previous experience with A\/B tests at Microsoft had found that A\/B interactions were extremely rare. 
Similarly, researchers at Meta found that A\/B interactions were not a serious problem for their tests [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/blog.statsig.com\/embracing-overlapping-a-b-tests-and-the-danger-of-isolating-experiments-cb0a69e09d3\">5<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>].<\/p>\n\n\n\n<p>We recently carried out a new investigation of A\/B interactions in a major Microsoft product group. In this product group, A\/B tests are not isolated from each other, and each control-treatment assignment takes place independently.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"the-data-analysis\">The data analysis<\/h4>\n\n\n\n<p>Within this product group, we looked at four major products, each of which runs hundreds of A\/B tests per day on millions of users. For each product, we picked a single day, and looked at every pair of A\/B tests that were running on that same day. For each pair, we calculated every metric for that product for every possible control or treatment assignment combination for the two tests in the pair. 
The results for metric Y are shown here for a case where each test has one control and one treatment.<\/p>\n\n\n\n<figure class=\"wp-block-table\">\n<table>\n<tbody>\n<tr>\n    <td bgcolor=\"steelblue\"><\/td>\n    <td bgcolor=\"steelblue\"><strong>A\/B Test #2: <font color=\"mediumspringgreen\">C<\/font><\/strong><\/td>\n    <td bgcolor=\"steelblue\"><strong>A\/B test #2: <font color=\"mediumspringgreen\">T<\/font><\/strong><\/td>\n    <td bgcolor=\"steelblue\"><strong>Treatment effect<\/strong><\/td>\n<\/tr>\n<tr>\n    <td bgcolor=\"steelblue\">A\/B test #1: <font color=\"cyan\">c<\/font><\/td>\n    <td bgcolor=\"lightgray\">Y<strong><sub><font color=\"mediumspringgreen\">C<\/font>,<font color=\"cyan\">c<\/font><\/sub><\/strong><\/td>\n    <td bgcolor=\"lightgray\">Y<strong><sub><font color=\"mediumspringgreen\">T<\/font>,<font color=\"cyan\">c<\/font><\/sub><\/strong><\/td>\n    <td bgcolor=\"lightgray\">\u0394<strong><sub><font color=\"cyan\">c<\/font><\/sub><\/strong>=Y<strong><sub><font color=\"mediumspringgreen\">T<\/font>,<font color=\"cyan\">c<\/font><\/sub><\/strong>&#8211; Y<strong><sub><font color=\"mediumspringgreen\">C<\/font>,<font color=\"cyan\">c<\/font><\/sub><\/strong><\/td>\n<\/tr>\n<tr>\n    <td bgcolor=\"steelblue\">A\/B test #1: <font color=\"cyan\">t<\/font><\/td>\n    <td bgcolor=\"lightgray\">Y<strong><sub><font color=\"mediumspringgreen\">C<\/font>,<font color=\"cyan\">t<\/font><\/sub><\/strong><\/td>\n    <td bgcolor=\"lightgray\">Y<strong><sub><font color=\"mediumspringgreen\">T<\/font>,<font color=\"cyan\">t<\/font><\/sub><\/strong><\/td>\n    <td bgcolor=\"lightgray\">\u0394<strong><sub><font color=\"cyan\">t<\/font><\/sub><\/strong>=Y<strong><sub><font color=\"mediumspringgreen\">T,<\/font><font color=\"cyan\">t<\/font><\/sub><\/strong>&#8211; Y<strong><sub><font color=\"mediumspringgreen\">C<\/font>,<font color=\"cyan\">t<\/font><\/sub><\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<figcaption class=\"wp-element-caption\">Table 3: 
Treatment effects for one A\/B test, segmented by user control\/treatment assignment in a different A\/B test<\/figcaption>\n<\/figure>\n\n\n\n<p>We performed a chi-square test to check whether the two treatment effects in Table 3 differed. Because there were hundreds of thousands of A\/B test pair and metric combinations, hundreds of thousands of p-values were calculated. Under the null hypothesis of no A\/B interactions, the p-values should be drawn from a uniform distribution, with 5% of the p-values satisfying p&lt;0.05, 0.1% of the p-values satisfying p&lt;0.001, etc. [<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/p-values-for-your-p-values-validating-metric-trustworthiness-by-simulated-a-a-tests\/\">6<\/a>]. Accordingly, some were bound to be small, just by chance.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"the-results-few-or-no-interactions\">The results: few or no interactions<\/h4>\n\n\n\n<p>Therefore, to check whether there were A\/B interactions, we looked at the cumulative distribution function of p-values, shown here for a single day for a specific product:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure-1024x1015.png\" alt=\"Cumulative distribution of A\/B test interaction p-values\" class=\"wp-image-957750\" width=\"386\" height=\"383\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure-1024x1015.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure-300x297.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure-150x150.png 150w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure-768x761.png 768w, 
https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure-1536x1522.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure-180x180.png 180w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure-182x180.png 182w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/07\/modifiedCdfFigure.png 1852w\" sizes=\"(max-width: 386px) 100vw, 386px\" \/><figcaption class=\"wp-element-caption\">Figure 1: Cumulative distribution of p-values for A\/B interaction tests<\/figcaption><\/figure>\n\n\n\n<p>The graphs for all four products look similar; all are very close to a uniform distribution. We then looked for deviations from a uniform distribution by checking if there were any abnormally small p-values, using the Benjamini-Hochberg procedure, which controls the false discovery rate. For three of the products, we found none, so all results were consistent with no A\/B interactions. For one product, we did find a tiny number of abnormally small p-values, corresponding to 0.002%, or 1 in 50,000 A\/B test pair metrics. The detected interactions were checked manually, and there were no cases where the two treatment effects in Table 3 were both statistically significant but moving in opposite directions. In all cases, either the two treatment effects were in the same direction but different in magnitude, or one of them was not statistically significant.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"discussion\">Discussion<\/h4>\n\n\n\n<p>It\u2019s possible that there were other A\/B interactions that we just didn\u2019t have the statistical power to detect. 
If the cross-A\/B test treatment effects were, for example, 10% and 11% for two different cross-A\/B test assignments, we might not have detected that difference, either because the chi-square test returned a high p-value, or because it returned a low p-value that got \u201clost\u201d in the sea of other low p-values that occurred by chance when doing hundreds of thousands of statistical tests.<\/p>\n\n\n\n<p>This is possible, but it raises the question of when we should worry about interaction effects. For most A\/B tests at Microsoft, the purpose of the A\/B test is to produce a binary decision: whether to ship a feature or not. There are some cases where we\u2019re interested in knowing if a treatment effect is 10% or 11%, but those cases are the minority. Usually, we just want to know if key metrics are improving, degrading, or remaining flat. <strong>From that perspective, the scenario with small differences between cross-A\/B test treatment effects is interesting in an academic sense, but not typically a problem for decision-making.<\/strong><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusion\">Conclusion<\/h2>\n\n\n\n<p>While there are cases where A\/B interaction effects are real and important, in our experience, this is a rare issue that people often worry about <strong>more<\/strong> than they need to. Overall, the vast majority of A\/B tests either don\u2019t interact or have only relatively weak interactions. Of course, the results depend on the product and the A\/B tests, so don\u2019t just take our word for it: try running your A\/B tests concurrently and perform your own meta-analysis of interaction effects!<\/p>\n\n\n\n<p><em>\u2013 Monwhea Jeng, Microsoft Experimentation Platform<\/em><\/p>\n\n\n\n<p> <\/p>\n\n\n\n<p> <\/p>\n\n\n\n<p> <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"references\">References<\/h2>\n\n\n\n<p>[1] Kohavi R., Tang D., & Xu Y. 
(2020).&nbsp;<em>Trustworthy Online Controlled Experiments: A Practical Guide to A\/B Testing.<\/em>&nbsp;Cambridge: Cambridge University Press. doi:10.1017\/9781108653985<\/p>\n\n\n\n<p>[2] Fabijan A., Blanarik T., Caughron M., Chen K., Zhang R., Gustafson A., Budumuri V.K. & Hunt S. <em>Diagnosing Sample Ratio Mismatch in A\/B Testing<\/em>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/diagnosing-sample-ratio-mismatch-in-a-b-testing\/\">https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/diagnosing-sample-ratio-mismatch-in-a-b-testing\/<\/a>.<\/p>\n\n\n\n<p>[3] Liu P., Qin W., Ai H. & Jing J. <em>Data Quality: Fundamental Building Blocks for Trustworthy A\/B Testing Analysis<\/em>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/data-quality-fundamental-building-blocks-for-trustworthy-a-b-testing-analysis\/\">https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/data-quality-fundamental-building-blocks-for-trustworthy-a-b-testing-analysis\/<\/a>.<\/p>\n\n\n\n<p>[4] Machmouchi W. & Gupta S. <em>Patterns of Trustworthy Experimentation: Post-Experiment Stage<\/em>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/patterns-of-trustworthy-experimentation-post-experiment-stage\/\">https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/patterns-of-trustworthy-experimentation-post-experiment-stage\/<\/a>.<\/p>\n\n\n\n<p>[5] Chan T. 
<em>Embrace Overlapping A\/B Tests and Avoid the Dangers of Isolating Experiments<\/em>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/blog.statsig.com\/embracing-overlapping-a-b-tests-and-the-danger-of-isolating-experiments-cb0a69e09d3\">https:\/\/blog.statsig.com\/embracing-overlapping-a-b-tests-and-the-danger-of-isolating-experiments-cb0a69e09d3<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p>[6] Mitchell C., Drake A., Litz J., & Vaz G. <em>p-Values for Your p-Values: Validating Metric Trustworthiness by Simulated A\/A Tests<\/em>. <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/p-values-for-your-p-values-validating-metric-trustworthiness-by-simulated-a-a-tests\/\">https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/p-values-for-your-p-values-validating-metric-trustworthiness-by-simulated-a-a-tests\/<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you\u2019re a regular reader of the Experimentation Platform blog, you know that we\u2019re always warning our customers to be vigilant when running A\/B tests. 
We warn them about the pitfalls of even tiny SRMs (sample ratio mismatches), small bits of lossiness in data joins, and other similar issues that can invalidate their A\/B tests [&hellip;]<\/p>\n","protected":false},"author":41033,"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":651963,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-957738","msr-blog-post","type-msr-blog-post","status-publish","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":651963,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/957738"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/41033"}],"version-history":[{"count":57,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/957738\/revisions"}],"predecessor-version":[{"id":958431,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/957738\/revisions\/958431"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=957738"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=957738"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=957738"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=957738"}],"curies":[{"name":"wp","h
ref":"https:\/\/api.w.org\/{rel}","templated":true}]}}