{"id":713491,"date":"2020-12-18T16:50:11","date_gmt":"2020-12-19T00:50:11","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=713491"},"modified":"2020-12-18T16:50:11","modified_gmt":"2020-12-19T00:50:11","slug":"metric-computation-for-multiple-backends","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/metric-computation-for-multiple-backends\/","title":{"rendered":"Metric computation for multiple backends"},"content":{"rendered":"
If I have data from an A/B test and a favorite metric, it is easy to run a statistical test to see if the metric has changed significantly. If I need this metric in the future, it's also easy to re-compute: I can save my code and re-use it later. But what if I need 100 metrics, with variance estimation, confidence intervals, and other stats? Well, this is getting challenging to manage. Still, with some dedication and manual work, I believe I can do this.

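For a single metric, the analysis really is only a few lines of code. The sketch below is purely illustrative and not ExP's tooling: the column names (`variant`, `revenue_per_user`), the helper name `metric_delta`, and the choice of a Welch's t-test are all assumptions made for the example.

```python
# Minimal sketch, not ExP's implementation: compare one metric between the
# two variants of an A/B test. Column names ("variant", "revenue_per_user")
# are hypothetical.
import pandas as pd
from scipy import stats

def metric_delta(df: pd.DataFrame, metric: str = "revenue_per_user") -> dict:
    """Return the treatment-vs-control difference and a Welch's t-test p-value."""
    treatment = df.loc[df["variant"] == "treatment", metric]
    control = df.loc[df["variant"] == "control", metric]
    t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    return {
        "delta": treatment.mean() - control.mean(),
        "t_stat": t_stat,
        "p_value": p_value,
    }
```
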
But what if I need to compare these metrics over data slices like day of week, device type, and who knows what else? What if I need to update these slices regularly? What if people with different backgrounds (data scientists, product managers, developers) want to contribute to metrics development? What if I need to maintain the same level of trustworthiness with every change? What if I work with several products owned by different teams, with data living in different fabrics like Azure Data Lake, Azure Blob Storage, or a SQL database? And on top of it all, what if these teams migrate their data from one fabric to another every now and again?

At Microsoft, this is exactly the challenge that the Experimentation Platform (ExP) is facing. ExP works with dozens of products across Microsoft (Bing, Office, Xbox, MS Teams, VS Code, and Microsoft Support, to name just a few). There are hundreds of practitioners implementing and maintaining metrics. Some teams have thousands of metrics, their data lives in different fabrics, and these teams run hundreds of A/B tests a month, all of which require detailed and trustworthy statistical analyses.

The questions above are fundamentally hard, and there are many approaches an A/B testing team can take to address them. In this blog post, we will describe one of the **key engineering components** that we use to address these questions. We have spent fourteen years developing and iterating on ExP's A/B testing tools, and we will share three key learnings from those years of iteration. We hope the reader will find them useful, too.

## Overview of our approach

If we had to distill our *engineering* answer to the challenges described above into just two words, it would be "code generation." What does this mean? Let us unpack.

Over the years, the ExP team has developed a Domain Specific Language (DSL) for defining metrics (see also [1]). We will give some examples below, but for now let's just say that this language allows users to define metrics regardless of where their data lives or which compute fabric they use. In other words, the language is **fabric-agnostic**. Treating the DSL as a "black box" for now, the overall workflow is as follows.

First, platform users create a metric set (really a git repo) with the DSL code defining the metrics. These metrics are reused across A/B tests, so eventually users don't have to change them often. Behind the scenes, the code is compiled into artifacts ("**metrics plans**") and published to a "**Metric Set Service**".

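To make the code-generation idea concrete before we get to the real examples, here is a deliberately simplified sketch. Everything in it (the `MetricDefinition` class, the `compile_metric` helper, the fabric labels, and the generated query text) is hypothetical and is not ExP's DSL or metrics plan format; it only illustrates how a single fabric-agnostic definition can be compiled into different backend-specific artifacts.

```python
# Toy illustration of code generation, not ExP's DSL or metrics plan format.
# One fabric-agnostic metric definition is "compiled" into different
# backend-specific query text depending on where the data lives.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str         # metric name, e.g. "clicks_per_user"
    aggregation: str  # column to aggregate, fabric-agnostic
    unit: str         # analysis unit, e.g. "user_id"

def compile_metric(metric: MetricDefinition, fabric: str, table: str) -> str:
    """Generate query text for the target fabric from one abstract definition."""
    if fabric == "sql":      # data sitting in a SQL database
        return (
            f"SELECT {metric.unit}, SUM({metric.aggregation}) AS {metric.name} "
            f"FROM {table} GROUP BY {metric.unit};"
        )
    if fabric == "spark":    # data in a data lake, read through Spark
        # In the generated snippet, F stands for pyspark.sql.functions.
        return (
            f"spark.table('{table}').groupBy('{metric.unit}')"
            f".agg(F.sum('{metric.aggregation}').alias('{metric.name}'))"
        )
    raise ValueError(f"Unsupported fabric: {fabric}")

# One definition, two generated artifacts:
clicks = MetricDefinition("clicks_per_user", "clicks", "user_id")
print(compile_metric(clicks, "sql", "ab_test_logs"))
print(compile_metric(clicks, "spark", "ab_test_logs"))
```

In a scheme like this, migrating a team's data to a new fabric only requires adding a backend to the compiler; the metric definitions themselves stay unchanged, which is exactly the property a fabric-agnostic DSL is after.
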
### Compute pipeline overview