{"id":713491,"date":"2020-12-18T16:50:11","date_gmt":"2020-12-19T00:50:11","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=713491"},"modified":"2020-12-18T16:50:11","modified_gmt":"2020-12-19T00:50:11","slug":"metric-computation-for-multiple-backends","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/metric-computation-for-multiple-backends\/","title":{"rendered":"Metric computation for multiple backends"},"content":{"rendered":"

If I have data from an A/B test and a favorite metric, it is easy to run a statistical test to see if the metric has changed significantly. If I need this metric in the future, it's also easy to re-compute: I can save my code and re-use it later. But what if I need 100 metrics, with variance estimation, confidence intervals, and other stats? Well, this is getting challenging to manage. Still, with some dedication and manual work, I believe I can do this.

But what if I need to compare these metrics over data slices like day of week, device type, and who knows what else? What if I need to update these slices regularly? What if people with different backgrounds (data scientists, product managers, developers) want to contribute to metrics development? What if I need to maintain the same level of trustworthiness with every change? What if I work with several products owned by different teams, with data living in different fabrics like Azure Data Lake, Azure Blob Storage, or a SQL database? And on top of it all, what if these teams migrate their data from one fabric to another every now and again?

At Microsoft, this is exactly the challenge that the Experimentation Platform (ExP) is facing. ExP works with dozens of products across Microsoft (Bing, Office, Xbox, MS Teams, VS Code, and Microsoft Support, to name just a few). There are hundreds of practitioners implementing and maintaining metrics. Some teams have thousands of metrics, their data lives in different fabrics, and these teams run hundreds of A/B tests a month that all require detailed and trustworthy statistical analyses.

The questions above are fundamentally hard. There are many approaches an A/B testing team can take to address them. In this blog post, we will describe one of the **key engineering components** that we use to address these questions. ExP has spent fourteen years developing and iterating on its A/B testing tools, and we will share three key learnings obtained during these years and iterations. We hope the reader will find them useful, too.

## Overview of our approach

If we had to distill our *engineering* answer to the challenges described above into just two words, it would be "code generation." What does this mean? Let us unpack.

### Compute pipeline overview

Over the years, the ExP team has developed a Domain Specific Language (DSL) for defining metrics (see also [1]). We will give some examples below, but for now let's just say that this language allows users to define metrics regardless of where their data lives, or what compute fabric they are using. In other words, we can say this language is **fabric-agnostic**. Treating the DSL as a "black box" for now, the overall workflow is as follows.

First, platform users create a metric set (really a git repo) with the DSL code defining the metrics. These metrics are reused between A/B tests, so eventually users don't have to change them often. Behind the scenes, the code is compiled into artifacts ("**metrics plans**") and published to a "**Metric Set Service**".

\"Overview

Overview of ExP compute pipeline<\/p><\/div>\n

What happens when a user requests an analysis? Each request describes which fabric should be used, which metrics to compute, the time range for the analysis, which data slices should be computed, etc. This request is sent to the compute pipeline, which then fetches the correct metrics plan from the Metric Set Service and sends it, together with the request information, to the Code Gen. From the request, the Code Gen knows which fabric to produce a script for, among other things listed above. Remember that metrics plans are fabric-agnostic, so the same metrics plan can be used for any data fabric; it all depends on the request.
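
To make this more concrete, here is a minimal sketch of what such an analysis request might contain. The field names and values are hypothetical, chosen only to illustrate the kind of information the compute pipeline and the Code Gen consume; they are not ExP's actual request schema.

```python
# Hypothetical analysis request ("config"). The exact schema is not shown in
# this post, so every field name below is illustrative only.
analysis_request = {
    "metric_set": "website-metrics",                 # which metrics plan to fetch
    "metrics": ["ClicksPerUser", "LatencyPerUser"],  # which metrics to compute
    "fabric": "AzureDataExplorer",                   # target compute fabric
    "time_range": {"start": "2020-12-01", "end": "2020-12-07"},
    "data_slices": ["DayOfWeek", "DeviceType"],      # slices to compute metrics over
}
```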

Once the script is produced, the pipeline runs the corresponding job, fetches the result, and displays it to the user.

### Key learnings

Now that we know how the system fits together, let's explore the learnings we mentioned in the introduction:

1. **Use a DSL for metric definition.** This makes it easier to implement and maintain even a large number of metrics. It also simplifies and democratizes the metric definition process: you don't need to be a data scientist or a developer to create metrics.
2. **Automate the code generation system.** Since the DSL is designed specifically for metric definitions, it is possible to automate statistical analysis, data slicing, etc. Automation means that it is easy to re-configure analyses: changing data slices, metrics, or filters is just a few clicks away. Even more importantly, the logic can be robustly tested to ensure trustworthiness. As a result, even people with no stats background will reliably get trustworthy metrics.
3. **Design components to be fabric-agnostic.** This separates the concern of **what to compute** from the concern of exactly **how to compute it**. This helps with products whose data is in different compute fabrics, or when a data migration is needed. Indeed, you can just re-use your metrics "for free" instead of re-writing them all in a new language.

In the following sections we will discuss these three learnings in a bit more depth (see also [2] for a deeper discussion of the motivation).

## Metrics definition language

In our DSL, metrics are defined at **levels**. These levels are just columns in the data whose values can be used as units of analysis. Let's take an example of a website, and let's say we have three aggregation levels: Page, Session, and User, represented by columns in the data. Each user can have several sessions on the site, during each session the user can view several pages, and on each of the pages the user can perform several events (e.g., clicks, scrolls, mouse hovers). However, each page view belongs to a single session, and each session belongs to a single user. In our DSL, metrics could be "per page", "per session", or "per user" in this case.
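
As a toy illustration (not real ExP data), a few event rows with these level columns might look like the sketch below; each Page value belongs to exactly one Session, and each Session to exactly one User.

```python
# Toy event rows illustrating aggregation levels; the column names match the
# website example above, but the values are made up.
rows = [
    {"User": "u1", "Session": "s1", "Page": "p1", "Event": "Click"},
    {"User": "u1", "Session": "s1", "Page": "p1", "Event": "Scroll"},
    {"User": "u1", "Session": "s2", "Page": "p2", "Event": "Click"},
    {"User": "u2", "Session": "s3", "Page": "p3", "Event": "Hover"},
]
# A "per user" metric aggregates these rows down to one value per distinct
# User, a "per session" metric down to one value per Session, and so on.
```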

Let us consider a simple example: average latency per user. There are better ways to measure latency (e.g., as percentiles across all latency values, not per user); we chose this metric just to illustrate the DSL. In the DSL it could be written as

```
LatencyPerUser = Avg(Sum(Latency))
```

Here both User and Latency are columns in the data, and User has been marked as one of the aggregation levels. In the metric, we first sum all the Latency values from all the data rows for each user (the inner Sum, aggregated at the User level). Then, we compute the average value across all users. Whenever no aggregation level is specified for an aggregation, we assume that the aggregation is done across all values. Equivalent SQL code could look something like this:


```sql
Events =
    SELECT Latency, User
    FROM Data;

UserLevel =
    SELECT User,
           SUM(Latency) AS UserLatency
    FROM Events
    GROUP BY User;

OutputLevel =
    SELECT AVG(UserLatency) AS LatencyPerUser
    FROM UserLevel;
```

What are some key takeaways from this? First, having a dedicated language saves time; it's more concise than SQL, for example. Second, it makes metrics implementation simpler: it's (hopefully) easier to pick up and less error prone. But perhaps most importantly, it allows us to automate a lot of the analysis. The rest of the blog post will describe exactly that.

## Object model for the language

Metrics written in the DSL get parsed into syntax trees. Once again, we will give an overview and consider an example to illustrate the main points.

### Metrics as trees

So how are metrics represented as trees? There are two kinds of objects: expressions and tables. Both have a collection of "parents" (expressions or tables, respectively), and each expression belongs to a table. Roughly speaking, expressions correspond to the expressions in the metric computation (e.g., arithmetic operations, aggregations, string operations, etc.) while tables determine the overall flow of the computation (think of a sequence of SELECT statements with GROUP BYs, JOINs, UNIONs, etc.).
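
ExP's actual classes are not shown in this post, but a minimal object model along these lines might look like the sketch below; the class and field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Expression:
    """One node of the metric computation, e.g. a column reference, constant,
    binary operation, or aggregation. Each expression belongs to a table."""
    kind: str                       # e.g. "Column", "Constant", "BinaryOp", "Aggregation"
    name: str                       # e.g. "Event", "==", "Sum"
    parents: List["Expression"] = field(default_factory=list)
    table: Optional["Table"] = None

    def __post_init__(self):
        # Register the expression with its owning table, if one was given.
        if self.table is not None:
            self.table.expressions.append(self)

@dataclass
class Table:
    """One step of the computation flow, roughly one SELECT statement
    (possibly with GROUP BY / JOIN / UNION) in the generated code."""
    name: str                       # e.g. "Base", "UserLevel", "Output"
    expressions: List[Expression] = field(default_factory=list)
    parents: List["Table"] = field(default_factory=list)
```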

Let's consider another example. A metric computing "average number of clicks per user" on a website could look something like this in our DSL:

```
ClicksPerUser = Avg(Sum(Event == "Click" ? 1 : 0))
```

with `Event` and `User` being columns in the data. Let's walk through this metric. First, for each data row we assign a binary value based on whether the Event in that row was a click or not (via `Event == "Click" ? 1 : 0`). Then, we sum those values across all rows for each user. This would count the rows where `Event` was a click. Finally, we average that count across all users (via the outer `Avg` operation).

Suppose we also have a column in the data called `Variant` containing the values "Treatment" and "Control", and we want to use it as a data slice to compute the value of the metric for T and C separately. The parsed metrics plan containing the `ClicksPerUser` metric and the `Variant` data slice would look as follows:

\"Diagram

Example of a metrics plan<\/p><\/div>\n

There will be four tables: first data extraction, then a table creating a new column that computes `Event == "Click" ? 1 : 0` (we called that table "Base"), then another table with a single value per user (their click count), and finally the Output table containing the average count across the users. Every data slice and every part of the metric definition is a kind of expression. For example, `Event` and `Variant` are data-source column expressions, `==` is a binary operation expression, `"Click"`, `1` and `0` are constants, and `Sum` and `Avg` are aggregation expressions.
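
Using the illustrative `Expression`/`Table` classes from the earlier sketch, the four tables of this plan could be wired up roughly as follows. This is a hedged approximation of the tree in the figure, not ExP's actual representation.

```python
# Build the four tables for ClicksPerUser with a Variant data slice,
# using the illustrative classes from the earlier sketch.
data = Table(name="DataExtraction")
event_col = Expression(kind="Column", name="Event", table=data)
variant_col = Expression(kind="Column", name="Variant", table=data)

base = Table(name="Base", parents=[data])
is_click = Expression(kind="BinaryOp", name="==", table=base,
                      parents=[event_col,
                               Expression(kind="Constant", name='"Click"', table=base)])
click_flag = Expression(kind="Conditional", name="? 1 : 0",
                        parents=[is_click], table=base)

user_level = Table(name="UserLevel", parents=[base])
user_clicks = Expression(kind="Aggregation", name="Sum",
                         parents=[click_flag], table=user_level)

output = Table(name="Output", parents=[user_level])
clicks_per_user = Expression(kind="Aggregation", name="Avg",
                             parents=[user_clicks], table=output)

# The Variant column would be carried through each table and end up in the
# GROUP BY of the generated code, producing per-Treatment/Control values.
```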

How would these objects convert into, say, SQL code? Tables roughly correspond to SELECT statements, and expressions are just lines in these statements. The data slice `Variant` would be part of the GROUP BY clause. We hope the example below is self-explanatory.

\"Illustrate

Generated code for ClicksPerUser metric<\/p><\/div>\n

### How is this useful?

What's the point of it all? Well, the key point is that these trees describe a general computation without any specifics of how this computation will be carried out!

Since the metric computation is just a tree now, we can programmatically manipulate the tree to modify the computation. This might seem trivial, but it's actually very powerful! For example, note that the metric definition in the DSL didn't have any stats, e.g., variance calculation. This is deliberate: since we can change the trees representing the computations, we can **automatically** add all the necessary expressions to compute the **metric's variance**! This means that a person implementing the metrics doesn't need to think about (or even know about!) variance estimation. This also means that the stats logic is all in one place: inside the code gen. It can thus be thoroughly tested, and the tests are universal for all the customers of ExP. Similar tree manipulations provide automatic data slicing for computing metrics over only a subset of users, or days, or web pages, etc.
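
To illustrate what such a tree manipulation could look like, here is a hedged sketch of a transformer that augments each average with the extra aggregations a variance estimate needs (an average of squared values and a count). It is written against the illustrative object model above, not ExP's real code.

```python
def add_variance_expressions(tables):
    """Illustrative transformer: for every Avg over a per-user value, also
    emit an Avg of the squared value and a row count, so that downstream
    stats code can estimate the variance of the mean,
    roughly (E[X^2] - E[X]^2) / N."""
    for table in tables:
        # Iterate over a snapshot, since new expressions register themselves
        # with the table as they are created.
        for expr in list(table.expressions):
            if expr.kind == "Aggregation" and expr.name == "Avg":
                per_user_value = expr.parents[0]
                squared = Expression(kind="BinaryOp", name="*", table=table,
                                     parents=[per_user_value, per_user_value])
                Expression(kind="Aggregation", name="Avg",
                           parents=[squared], table=table)
                Expression(kind="Aggregation", name="Count",
                           parents=[], table=table)
    return tables
```

In generated-SQL terms, this roughly corresponds to adding `AVG(UserClicks * UserClicks)` and `COUNT(*)` alongside the existing `AVG(UserClicks)` in the output-level SELECT.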

## Generating code for many fabrics

As we established above, the code gen generates scripts given two pieces of information: the **metrics plan** and the **analysis request** (also called the **config**). Metrics plans consist of syntax trees describing all possible metrics, while configs describe the specifics of the analysis: the time range of the analysis, the target compute fabric, which of the metrics to compute, which data slices to use, etc. So, what's happening inside of the code gen?

### Key components of Code Gen

There are three key structural components (a minimal sketch of their interfaces follows the list):

1. **Transformers.** They take a metrics plan and return a new, modified metrics plan.
2. **Factories.** They take the config and return a list of transformers based on what they see in the config.
3. **Emitter.** It takes a metrics plan and generates code from it.
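
Hypothetical interfaces for these three components might look like the sketch below; the signatures are for illustration only and are not ExP's actual API.

```python
from typing import List, Protocol

class Transformer(Protocol):
    """Takes a metrics plan (here, a list of Tables) and returns a modified plan."""
    def transform(self, plan: List["Table"]) -> List["Table"]: ...

class Factory(Protocol):
    """Inspects the config and decides which transformers apply for its
    logical step (general clean-up, stats, data slicing, fabric fixes, ...)."""
    def create(self, config: dict) -> List[Transformer]: ...

class Emitter(Protocol):
    """Walks a metrics plan and writes out a script for one specific fabric."""
    def emit(self, plan: List["Table"]) -> str: ...
```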

The general flow is as follows. At the core of the flow is the "master coordinator". It knows the logical transformations that should be applied to the metrics plan before emitting the code, but it doesn't know exactly which transformers should be used at each step. Logically, the "master coordinator" has three main parts:

1. General transformations, e.g., keeping only the requested metrics and removing unused data sources.
2. Smarts, e.g., adding stats and data slicing.
3. Fabric-specific adjustments, e.g., in U-SQL one can take the max of a bool, but in Azure Data Explorer one needs to convert the bool to an int first.

Each part can have many transformers, and the exact set of transformers can differ depending on the target fabric and other information in the config. To separate the concern of knowing the logical flow from the concern of knowing the exact transformers, the "master coordinator" really describes a list of factories. Each factory is responsible for a single logical step. Given the config, each factory creates the correct chain of transformers to be applied to the metrics plan for this analysis. The last factory is responsible for producing the correct code emitter.
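
Putting the pieces together, the coordinator's flow might look roughly like the sketch below. The `create_emitter` method and the factory ordering are hypothetical, mirroring the three parts listed above plus emitter selection.

```python
def generate_script(metrics_plan, config, factories):
    """Illustrative 'master coordinator': each factory owns one logical step
    (general transformations, smarts, fabric-specific fixes), and the last
    factory in the list is responsible for producing the code emitter."""
    *transformer_factories, emitter_factory = factories

    plan = metrics_plan
    for factory in transformer_factories:
        # Given the config, each factory returns the concrete transformers for
        # its step, e.g. a transformer that casts bool to int before MAX when
        # the target fabric is Azure Data Explorer.
        for transformer in factory.create(config):
            plan = transformer.transform(plan)

    emitter = emitter_factory.create_emitter(config)   # hypothetical method name
    return emitter.emit(plan)
```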

\"Overview

Overview of the Code Generation service<\/p><\/div>\n

The code emitter is always the last step. Emitters are deliberately naïve, only describing how to write operations in the given fabric. For example, the ToString operation in U-SQL is `column.ToString()`, but in Azure Data Explorer it's `tostring(column)`.
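
Here is a hedged sketch of how deliberately naïve emitters can encode such per-fabric differences, using the ToString example from the post; the class names and the `UserId` column are hypothetical.

```python
class USqlEmitter:
    """Knows only how to write operations in U-SQL syntax."""
    def to_string(self, column: str) -> str:
        return f"{column}.ToString()"

class AzureDataExplorerEmitter:
    """Knows only how to write operations in Azure Data Explorer (Kusto) syntax."""
    def to_string(self, column: str) -> str:
        return f"tostring({column})"

# The same fabric-agnostic expression renders differently per fabric:
assert USqlEmitter().to_string("UserId") == "UserId.ToString()"
assert AzureDataExplorerEmitter().to_string("UserId") == "tostring(UserId)"
```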

Key observations about the design:

1. Only factories and emitters are fabric-aware; everything else is generic.
2. The master coordinator describes the logical flow with factories, separating what to do from how to do it.
3. Emitters are naïve, which makes it easy to add support for new fabrics if needed.

## Summary

It took the ExP team many years to arrive at the current state of the system. Not every team is ready to go full "automatic code gen for many fabrics", or even needs to. Yet we believe that some of the key learnings from ExP would be useful for other teams, too. To emphasize the three key learnings again:

1. Separate metric writing from metric computation, e.g., via a DSL, interfaces, etc. This helps democratize metric creation for all employees.
2. Automatically enrich the computations with stats, data slicing, and other "smarts". This simplifies and democratizes the process of defining metrics, and makes it more testable and less error prone.
3. When creating such automated systems for metric computation, try to make them as fabric-agnostic as possible, and separate fabric-specific components from generic components. This makes the systems flexible and easier to re-use.

– Craig Boucher, Ulf Knoblich, Dan Miller, Sasha Patotski, Amin Saied, Microsoft Experimentation Platform

## References

[1] S. Gupta, L. Ulanova, S. Bhardwaj, P. Dmitriev, P. Raff, and A. Fabijan, "The Anatomy of a Large-Scale Experimentation Platform," in 2018 IEEE International Conference on Software Architecture (ICSA), Apr. 2018, pp. 1–109, doi: 10.1109/ICSA.2018.00009.
[2] C. Boucher, U. Knoblich, D. Miller, S. Patotski, A. Saied, and V. Venkateshaiah, "Automated metrics calculation in a dynamic heterogeneous environment," extended abstract for 2019 MIT CODE, Nov. 2019, arXiv: https://arxiv.org/abs/1912.00913
