{"id":770860,"date":"2021-09-08T11:21:21","date_gmt":"2021-09-08T18:21:21","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=770860"},"modified":"2021-09-08T11:21:21","modified_gmt":"2021-09-08T18:21:21","slug":"a-a-b-testing-evaluating-microsoft-teams-across-build-releases","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/a-a-b-testing-evaluating-microsoft-teams-across-build-releases\/","title":{"rendered":"A\/A\u2019\/B Testing: Evaluating Microsoft Teams across Build Releases"},"content":{"rendered":"
Microsoft Teams<\/a> is a communication platform [1] that brings meetings, chat, calls, and collaboration together in one place. The application updates multiple times a month<\/a> [2], adding new features and iterating on existing ones. To ensure a high-quality user experience across frequent updates, the team needs to actively monitor the quality of each new build release.<\/p>\n A\/B testing is the gold standard for comparing product variants<\/a> [3]. As the Microsoft Teams Experimentation team, we have run hundreds of A\/B tests. The best practice we always follow is to test one feature or a combination of interactive features at a time<\/a> [4]. In that sense, A\/B testing works like a ‘unit-testing’ tool. In practice, it is rarely used to compare whole builds, because each build integrates multiple feature changes and it is hard to figure out which features cause regressions, if any. However, we can attempt to use A\/B testing as an integration-testing tool for builds comparison.<\/p>\n In this scenario, each user is randomly presented with either the current or the next build release. We evaluate whether the variants produce statistically significant differences in key metrics. During our analysis, we identified two factors that introduce bias, making the comparison invalid and its results uninformative. In this blog post, we explain why the issue exists. We also introduce an<\/strong> A\/A\u2019\/B testing framework which successfully enables valid builds comparison in Microsoft Teams.<\/strong><\/p>\n Figure 1. Builds comparison<\/p><\/div>\n We started by running an A\/A test<\/a> [5] via the current A\/B testing framework. The test is a sanity check to determine whether testing between builds would provide useful insights. The users in the control variant continue using the current build. 
The users in the treatment variant receive a request to update to the next build, which is identical to the current build except for the build version number. With no treatment effect introduced, we expected to see no differences between the results of the two variants. But we observed many statistically significant metric movements. Why did those false positives show up? After investigation, we identified two factors which can introduce bias: penetration difference <\/strong>and update effect<\/strong> (reinstall-and-restart effect).<\/p>\n It takes time for the next build to penetrate across the treatment users. While an A\/B test is running, the overall traffic volumes of the two variants are close, but their compositions are quite different. Let’s take a look at an example in Figure 2. Assume v<\/em> is the current build version for A (the control variant), and v+1<\/em> is the next build released to B (the treatment variant). On day 0, one hundred percent of users in A and B are using build v<\/em>. Starting from day 1, users in B consist of two groups: those using build v<\/em> and those using build v+1<\/em>. The share of the latter group increases as the test runs longer and eventually approaches 100%. But the time to reach that point depends on how quickly build v+1<\/em> penetrates across users in B. If most users are daily users, a high enough share may be reached quickly. Otherwise, it can take weeks or even months.<\/p>\n We have two options to perform the comparison: filtered analysis and standard analysis. Filtered analysis<\/strong> restricts the comparison to user activities on the target builds. In the figure, those are the activities covered by the blue boxes in A and the grey boxes in B. Standard analysis<\/strong> includes all traffic in both variants: it compares the activities covered by the blue boxes in A with those covered by the grey and blue boxes in B.<\/p>\n Figure 2. The change of traffic composition across time. 
v is the current build version used in A (the control variant), and v+1 is the next build released to B (the treatment variant).<\/p><\/div>\n Filtered analysis is a direct and intuitive way to compare builds. But selection bias<\/a> [6] exists between current-build users in A and next-build users in B, so we should not directly compare those user groups. For example, daily users are very likely to update within 24 hours, while weekly users may take up to a week to upgrade. This means that on day 1, the average next-build user will be more active than the average current-build user. Instead of measuring the outcome differences between builds, the comparison can be dominated by the characteristic differences between more engaged and less engaged users.<\/p>\n To resolve the issue, we can use standard analysis instead. Because we don\u2019t filter out any users, random assignment keeps the user populations of the two variants comparable, and we no longer need to worry about selection bias. But because the analysis also covers users in B who have not yet received the next build, the treatment effect is diluted.<\/p>\n To wrap up, penetration difference can introduce bias to the filtered analysis but not standard analysis<\/strong>. However, standard analysis still does not work, because the update effect is another key factor introducing bias.<\/p>\n Users must reinstall and restart the Microsoft Teams application to update to a new build version. Reinstallation and restart reset the application\u2019s memory usage and performance profile. For users in the control variant, memory usage accumulates from the time the application was launched, so the memory consumed by the application can keep growing. In contrast, memory usage is significantly reduced after reinstallation and restart in treatment. That difference can in turn cause secondary effects on application performance and user engagement. 
Therefore, the builds comparison measures not only the differences between the results of the two builds, but also the impact of reinstallation and restart<\/strong>.<\/p>\n In the A\/A test mentioned earlier, we observed statistically significant metric movements even when performing standard analysis. This indicates that the update effect<\/strong> was the main cause of the gap between builds.<\/p>\n We proposed several methods to mitigate the impact of penetration difference and update effect. The key point is to include in the analysis only users who have experienced an application restart or update<\/strong>.<\/p>\nWhy is comparing builds through A\/B testing insufficient?<\/h1>\n
Penetration difference<\/h2>\n
Impact on analysis<\/h3>\n
Update effect<\/h2>\n
Methods considered<\/h1>\n
Triggered analysis<\/h2>\n