{"id":793088,"date":"2021-11-09T08:35:01","date_gmt":"2021-11-09T16:35:01","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=793088"},"modified":"2021-11-09T14:32:28","modified_gmt":"2021-11-09T22:32:28","slug":"privacy-preserving-machine-learning-maintaining-confidentiality-and-preserving-trust","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/privacy-preserving-machine-learning-maintaining-confidentiality-and-preserving-trust\/","title":{"rendered":"Privacy Preserving Machine Learning: Maintaining confidentiality and preserving trust"},"content":{"rendered":"\n
\"A<\/figure>\n\n\n\n

Machine learning (ML) offers tremendous opportunities to increase productivity. However, ML systems are only as good as the quality of the data that informs the training of ML models. And training ML models requires a significant amount of data, more than a single individual or organization can contribute. By sharing data to collaboratively train ML models, we can unlock value and develop powerful language models that are applicable to a wide variety of scenarios, such as text prediction and email reply suggestions. At the same time, we recognize the need to preserve the confidentiality and privacy of individuals and earn and maintain the trust of the people who use our products. Protecting the confidentiality of our customers’ data is core to our mission. This is why we’re excited to share the work we’re doing as part of the Privacy Preserving Machine Learning (PPML) initiative.

The PPML initiative was started as a partnership between Microsoft Research and Microsoft product teams with the objective of protecting the confidentiality and privacy of customer data when training large-capacity language models. The goal of the PPML initiative is to improve existing techniques, and develop new ones, for protecting sensitive information in ways that work for both individuals and enterprises. This helps ensure that our use of data protects people’s privacy and that data is used safely, avoiding leakage of confidential and private information.

This blog post discusses emerging research on combining techniques to ensure privacy and confidentiality when sensitive data is used to train ML models. We illustrate how employing PPML helps our ML pipelines meet stringent privacy requirements and gives our researchers and engineers the tools they need to meet those requirements. We also discuss how applying best practices in PPML enables us to be transparent about how customer data is used.

A holistic approach to PPML

Recent research has shown that deploying ML models can, in some cases, implicate privacy in unexpected ways. For example, pretrained public language models that are fine-tuned on private data can be misused to recover private information, and very large language models have been shown to memorize training examples, potentially encoding personally identifying information (PII). Finally, inferring that a specific user was part of the training data can also impact privacy. Therefore, we believe it’s critical to apply multiple techniques to achieve privacy and confidentiality; no single method can address all aspects alone. This is why we take a three-pronged approach to PPML: understanding the risks and requirements around privacy and confidentiality, measuring the risks, and mitigating the potential for breaches of privacy. We explain the details of this multi-faceted approach below.
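As a concrete illustration of the memorization risk described above, the canary-extraction literature uses an "exposure" metric: an artificially inserted secret (the canary) is ranked against a space of similar candidate strings by model likelihood, and a rank near the top suggests the model has memorized it. The Python sketch below is a minimal, illustrative version only; the likelihood scorer is a random stand-in rather than a real model, and the canary format is hypothetical.

```python
import math
import random
from typing import Callable, List

def exposure(canary: str, candidates: List[str],
             log_likelihood: Callable[[str], float]) -> float:
    """Exposure = log2(#candidates) - log2(rank of the canary).
    Rank 1 means the model prefers the inserted canary over every
    alternative, which is a sign that it may have been memorized."""
    ranked = sorted(candidates, key=log_likelihood, reverse=True)
    rank = ranked.index(canary) + 1
    return math.log2(len(candidates)) - math.log2(rank)

# Stand-in scorer: random log-likelihoods keyed by string.
# In practice this would be the fine-tuned language model's score.
random.seed(0)
scores = {f"My PIN is {i:04d}": random.gauss(-50.0, 5.0) for i in range(10_000)}
canary = "My PIN is 0042"
scores[canary] += 25.0  # pretend the model strongly prefers the canary

print(f"exposure of canary: {exposure(canary, list(scores), scores.get):.2f} bits")
```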

Understand: We work to understand the risk of customer data leakage and potential privacy attacks in a way that helps determine the confidentiality properties of ML pipelines. In addition, we believe it’s critical to proactively align with policy makers. We take into account local and international laws and guidance regulating data privacy, such as the General Data Protection Regulation (GDPR) and the EU’s policy on trustworthy AI. We then map these legal principles, our contractual obligations, and responsible AI principles to our technical requirements and develop tools to communicate with policy makers how we meet these requirements.

Measure: Once we understand the risks to privacy and the requirements we must adhere to, we define metrics that can quantify the identified risks and track success towards mitigating them.
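One way to make such a metric concrete is the AUC of a simple loss-based membership-inference attack: if an adversary who sees only per-example losses cannot distinguish held-in from held-out records better than chance, the model reveals little about who was in the training set. The sketch below uses synthetic loss values as stand-ins for values that would be computed with the actual model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic per-example losses: members (training data) tend to have
# slightly lower loss than non-members when a model overfits.
member_losses = rng.normal(loc=0.9, scale=0.3, size=1_000)
nonmember_losses = rng.normal(loc=1.1, scale=0.3, size=1_000)

# The attacker guesses "member" when the loss is low, so score = -loss.
scores = np.concatenate([-member_losses, -nonmember_losses])
labels = np.concatenate([np.ones(1_000), np.zeros(1_000)])

auc = roc_auc_score(labels, scores)
print(f"membership-inference AUC: {auc:.3f} (0.5 means no leakage signal)")
```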

Mitigate: We then develop and apply mitigation strategies, such as differential privacy (DP), described in more detail later in this blog post. After we apply mitigation strategies, we measure their success and use our findings to refine our PPML approach.

\"A
PPML is informed by a three-pronged approach: 1) understanding the risk and regulatory requirements, 2) measuring the vulnerability and success of attacks, and 3) mitigating the risk.<\/figcaption><\/figure><\/div>\n\n\n\n

PPML in practice

Several different technologies contribute to PPML, and we implement them for a number of different use cases, including threat modeling and preventing the leakage of training data. For example, in the following text-prediction scenario, we took a holistic approach to preserving data privacy and collaborated across Microsoft Research and product teams, layering multiple PPML techniques and developing quantitative metrics for risk assessment.

We recently developed a personalized assistant for composing messages and documents by using the latest natural language generation models, developed by Project Turing. Its transformer-based architecture uses attention mechanisms to predict the end of a sentence based on the current text and other features, such as the recipient and subject. Using large transformer models is risky in that individual training examples can be memorized and reproduced when making predictions, and these examples can contain sensitive data. As such, we developed a strategy to both identify and remove potentially sensitive information from the training data, and we took steps to mitigate memorization tendencies in the training process. We combined careful sampling of data, PII scrubbing, and DP model training (discussed in more detail below), as sketched in the example that follows.
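A minimal sketch of what such a preprocessing step might look like is shown below. The regex patterns, placeholder tokens, and per-user sampling cap are hypothetical illustrations of the general idea (bounding each user's contribution and masking obvious PII), not the production scrubber or sampling strategy used in this work.

```python
import random
import re
from collections import defaultdict

# Hypothetical PII patterns; a production scrubber would be far broader.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace matched PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def sample_per_user(records, max_per_user=50, seed=0):
    """Cap each user's contribution so no individual dominates training."""
    by_user = defaultdict(list)
    for user_id, text in records:
        by_user[user_id].append(text)
    rng = random.Random(seed)
    sampled = []
    for texts in by_user.values():
        rng.shuffle(texts)
        sampled.extend(scrub(t) for t in texts[:max_per_user])
    return sampled

print(sample_per_user([("u1", "Call me at 425-555-0100 or mail alice@example.com")]))
```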

Mitigating leakage of private information

We use security best practices to help protect customer data, including strict eyes-off handling by data scientists and ML engineers. Still, such mitigations cannot prevent subtler forms of privacy leakage, such as memorization of training data in a model that could subsequently be extracted and linked to a user. That is why we employ the state-of-the-art privacy protections provided by DP and continue to contribute to cutting-edge research in this field. For privacy-impacting use cases, our policies require a security review, a privacy review, and a compliance review, each including domain-specific quantitative risk assessments and the application of appropriate mitigations.

Differential privacy

Microsoft pioneered DP research back in 2006, and DP has since been established as the de facto privacy standard, with a vast body of academic literature and a growing number of large-scale deployments across industry (for example, DP in Windows telemetry and DP in Microsoft Viva Insights) and government. In ML scenarios, DP works by adding small amounts of statistical noise during training, the purpose of which is to conceal the contributions of individual parties whose data is being used. When DP is employed, a mathematical proof guarantees that the final ML model learns only general trends in the data without acquiring information unique to any specific party. Differentially private computations entail the notion of a privacy budget, ϵ, which imposes a strict upper bound on the information that might leak from the process. This guarantees that no matter what auxiliary information an external adversary may possess, their ability to learn, from the model, something new about any individual party whose data was used in training is severely limited.
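To make the mechanism concrete, the sketch below shows the core of DP-SGD-style training on a toy logistic regression in numpy: each example's gradient is clipped to a fixed norm, Gaussian noise calibrated to that norm is added, and only the noisy aggregate updates the model. The dataset and hyperparameters are illustrative, and a real deployment would also run a privacy accountant to track the resulting (ϵ, δ) budget.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 20
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.1 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
clip_norm, noise_multiplier, lr, batch_size = 1.0, 1.1, 0.5, 64

for step in range(200):
    idx = rng.choice(n, size=batch_size, replace=False)
    preds = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    per_example_grads = (preds - y[idx])[:, None] * X[idx]  # logistic-loss gradients

    # Clip each example's gradient so any one record's influence is bounded.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)

    # Add Gaussian noise calibrated to the clipping norm, then average.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=d)
    noisy_grad = (clipped.sum(axis=0) + noise) / batch_size

    w -= lr * noisy_grad

print("trained weight norm:", np.linalg.norm(w))
```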

In recent years, we have been pushing the boundaries in DP research with the overarching goal of providing Microsoft customers with the best possible productivity experiences through improved ML models for natural language processing (NLP) while providing highly robust privacy protections.