{"id":90903,"date":"2020-04-16T09:00:04","date_gmt":"2020-04-16T16:00:04","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/security\/blog\/\/?p=90903"},"modified":"2023-05-15T23:28:18","modified_gmt":"2023-05-16T06:28:18","slug":"secure-software-development-lifecycle-machine-learning","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2020\/04\/16\/secure-software-development-lifecycle-machine-learning\/","title":{"rendered":"Secure the software development lifecycle with machine learning"},"content":{"rendered":"

Every day, software developers stare down a long list of features and bugs that need to be addressed. Security professionals try to help by using automated tools to prioritize security bugs, but too often, engineers waste time on false positives or miss a critical security vulnerability that has been misclassified. To tackle this problem, data science and security teams came together to explore how machine learning could help. We discovered that by pairing machine learning models with security experts, we can significantly improve the identification and classification of security bugs.<\/p>\n

At Microsoft, 47,000 developers generate nearly 30,000 bugs a month. These items are stored across more than 100 Azure DevOps and GitHub repositories. To better label and prioritize bugs at that scale, we couldn\u2019t just apply more people to the problem. However, large volumes of semi-curated data are perfect for machine learning. Since 2001, Microsoft has collected 13 million work items and bugs. We used that data to develop a process and machine learning model that correctly distinguishes between security and non-security bugs 99 percent of the time and accurately identifies the critical, high-priority security bugs 97 percent of the time. This is an overview of how we did it.<\/p>\n

Qualifying data for supervised learning<\/h3>\n

Our goal was to build a machine learning system that classifies bugs as security\/non-security and critical\/non-critical with a level of accuracy that is as close as possible to that of a security expert. To accomplish this, we needed a high volume of good data. In supervised learning, machine learning models learn how to classify data from pre-labeled data. We planned to feed our model lots of bugs that are labeled security and others that aren\u2019t labeled security. Once the model was trained, it would be able to use what it learned to label data that was not pre-classified. To confirm that we had the right data to effectively train the model, we answered four questions:<\/p>\n