Characterizing Experimentation in Continuous Deployment: A Case Study on Bing
- Katja Kevic
- Brendan Murphy
- Laurie Williams
- Jennifer Beckmann
2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP)
Published by IEEE
The practice of continuous deployment enables product teams to release content to end users within hours or days, rather than months or years. These faster deployment cycles, along with rich product instrumentation, allow product teams to capture and analyze feature usage measurements. Product teams define a hypothesis and a set of metrics to assess how a code or feature change will impact the user. Supported by an experimentation framework, a team can deploy that change to subsets of users, enabling randomized controlled experiments. Based on the measured impact of the change, the product team may decide to modify the change, deploy it to all users, or abandon it. This experimentation process enables product teams to deploy only the changes that positively impact the user experience. The goal of this research is to aid product teams in improving their deployment process by providing an empirical characterization of an experimentation process applied to a large-scale, mature service. Through an analysis of 21,220 experiments run on Bing since 2014, we observed the complexity of the experimentation process and characterized the full deployment cycle, from code change to deployment to all users. The analysis identified that the experimentation process takes an average of 42 days, including multiple iterations of one- or two-week experiment runs. Such iterations typically indicate that problems were found that could have hurt users or the business had the feature simply been launched; hence the experiments provided real value to the organization. Further, we discovered that code changes for experiments are four times larger than other code changes. We identify that the code associated with 33.4% of the experiments is eventually shipped to all users. These fully deployed code changes are significantly larger than the code changes for the other experiments, in terms of files (35.7%), changesets (80.4%), and contributors (20.0%).
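To make the described process concrete, the following is a minimal, illustrative Python sketch of the kind of randomized controlled experiment (A/B test) the abstract refers to: users are deterministically bucketed into treatment or control, and a binary success metric (e.g., click-through) is compared with a two-proportion z-test. The function names, parameters, and numbers are hypothetical and do not represent Bing's actual experimentation framework or its metrics.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment_id: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically assign a user to 'treatment' or 'control' by hashing (illustrative only)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    return "treatment" if bucket < treatment_fraction else "control"


def two_proportion_z(successes_t: int, n_t: int, successes_c: int, n_c: int):
    """Two-proportion z-test on a binary success metric; returns (observed lift, z statistic)."""
    p_t, p_c = successes_t / n_t, successes_c / n_c
    p_pool = (successes_t + successes_c) / (n_t + n_c)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    return p_t - p_c, (p_t - p_c) / se


# Hypothetical example: compare click-through between variants after a one-week run.
delta, z = two_proportion_z(successes_t=5230, n_t=100_000, successes_c=5050, n_c=100_000)
print(f"lift = {delta:.4%}, z = {z:.2f}")  # |z| > 1.96 roughly corresponds to significance at the 5% level
```

In practice, the decision to modify, ship, or abandon a change would rest on many such metrics evaluated by the experimentation platform, not on a single test as sketched here.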