Figure 4: Traffic composition change over time. The build version is v for A (Standard Control), v' for A' (Custom Control), and v+1 for B (Treatment).
We selected A/A'/B testing
We selected the A/A'/B testing proposal because it is simple to implement and analyze.
Let's revisit the A/A test mentioned at the beginning, in which we observed statistically significant differences during analysis. We ran the test again using the proposed framework. In the A versus A' comparison, we performed standard analysis to remove selection bias. About 30% of metrics showed highly statistically significant movements at a significance level of 0.001 (much lower than the commonly used 0.05), so those movements were likely true positives. Such a large gap was mainly caused by the update effect. In the A' versus B comparison, the proportion of moved metrics was close to the false positive rate (the significance level). This A/A test validated that the framework works for comparing builds.
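To make the "proportion of moved metrics" check concrete, here is a minimal, hypothetical sketch in Python. The per-metric Welch's t-tests, metric names, sample sizes, and the simulated shift (standing in for an update-effect-like bias) are illustrative assumptions, not the team's actual metric pipeline.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ALPHA = 0.001  # significance level used in the A vs. A' comparison above

def share_of_moved_metrics(control, treatment, alpha=ALPHA):
    """Fraction of metrics with a statistically significant difference
    between two variants (Welch's t-test per metric)."""
    moved = 0
    for metric in control:
        _, p_value = stats.ttest_ind(control[metric], treatment[metric],
                                     equal_var=False)
        if p_value < alpha:
            moved += 1
    return moved / len(control)

# Simulated example: 100 metrics, 10,000 users per variant.
n_users = 10_000
metric_names = [f"metric_{i}" for i in range(100)]
variant_a = {m: rng.normal(0.0, 1.0, n_users) for m in metric_names}
# A' and B share the same (shifted) distribution, mimicking an update effect
# relative to A but no real difference between A' and B.
variant_a_prime = {m: rng.normal(0.05, 1.0, n_users) for m in metric_names}
variant_b = {m: rng.normal(0.05, 1.0, n_users) for m in metric_names}

print("A vs. A' (update-effect-like shift):", share_of_moved_metrics(variant_a, variant_a_prime))
print("A' vs. B (identical builds):        ", share_of_moved_metrics(variant_a_prime, variant_b))
```

With these assumed numbers, the A versus A' comparison flags a large share of metrics while the A' versus B comparison stays near the significance level, mirroring the pattern described above.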
How did we deploy it?
We have adopted the framework in a scalable manner and use it to compare builds regularly. When we deployed it in production, we made one change: we keep only the A' and B variants. The reason is that the A versus A' comparison offers limited benefit. If we treat the difference between A and A' as the baseline, we can only detect an issue in A' when the metric movements are far from that baseline. Instead, we implemented an automatic process that creates a duplicated, identical build with a new version number whenever a new build is released. Whenever we start an A'/B test for a new build, we send that duplicated build to variant A'. This process ensures that we do not introduce any issues into A'. Another benefit of not keeping variant A is that we can maximize the traffic allocated to variants A' and B, which increases metric sensitivity as much as possible (see the sketch below).

The framework has helped the team release builds safely. In a recent A'/B test for a real build release, we detected a number of statistically significant regressions, which led the team to halt the release and investigate before moving forward.
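As a rough illustration of the sensitivity argument, the sketch below compares the minimum detectable effect (MDE) for a three-way A/A'/B split versus a two-way A'/B split of the same traffic. The total traffic, variance, significance level, and power values are assumptions chosen for illustration, not figures from the actual deployment.

```python
from scipy import stats

def minimum_detectable_effect(n_per_variant, sigma=1.0, alpha=0.05, power=0.8):
    """Two-sided, two-sample z-approximation of the smallest true difference
    detectable with the given power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return (z_alpha + z_beta) * (2 * sigma**2 / n_per_variant) ** 0.5

total_traffic = 300_000  # assumed total users available for the test
# Three-way split (A / A' / B) vs. two-way split (A' / B) of the same traffic.
print("MDE with A/A'/B split:", minimum_detectable_effect(total_traffic // 3))
print("MDE with A'/B split:  ", minimum_detectable_effect(total_traffic // 2))
```

Allocating all traffic to just two variants lowers the MDE, i.e., smaller regressions become detectable at the same significance level and power.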
Summary
We set out to use A/B testing to compare build releases for Microsoft Teams. We identified that penetration differences and the update effect can introduce bias into A/B analysis. To mitigate this, we introduced an A/A'/B testing framework. The framework enables us to compare product builds regularly in a trustworthy way and serves as a gate for the safe release of each new build.
Acknowledgements
Special thanks to the Microsoft Teams Experimentation team, the Microsoft Experimentation Platform team, the Microsoft Teams Client Release team, Paola Mejia Minaya, Ketan Lamba, Eduardo Giordano, Peter Wang, Pedro DeRose, Seena Menon, and Ulf Knoblich.
– Robert Kyle, Punit Kishor, Microsoft Teams Experimentation Team

– Wen Qin, Experimentation Platform

References
[1] "Microsoft Teams." https://www.microsoft.com/en-us/microsoft-teams/group-chat-software
[2] "Teams update process." https://docs.microsoft.com/en-us/microsoftteams/teams-client-update
[3] R. Kohavi and S. Thomke, "The Surprising Power of Online Experiments." https://hbr.org/2017/09/the-surprising-power-of-online-experiments
[4] R. Kohavi, R. M. Henne, and D. Sommerfield, "Practical guide to controlled experiments on the web: listen to your customers not to the hippo," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), San Jose, California, USA, 2007, p. 959. doi: 10.1145/1281192.1281295.

[5] T. Crook, B. Frasca, R. Kohavi, and R. Longbotham, "Seven pitfalls to avoid when running controlled experiments on the web," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), Paris, France, 2009, p. 1105. doi: 10.1145/1557019.1557139.

[6] "Selection bias." https://en.wikipedia.org/wiki/Selection_bias
[7] N. Chen, M. Liu, and Y. Xu, "How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments," 2019.
[8] W. Machmouchi, S. Gupta, R. Zhang, and A. Fabijan, "Patterns of Trustworthy Experimentation: Pre-Experiment Stage." https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/patterns-of-trustworthy-experimentation-pre-experiment-stage/