{"id":5385,"date":"2020-05-18T09:08:06","date_gmt":"2020-05-18T16:08:06","guid":{"rendered":"https:\/\/www.microsoft.com\/insidetrack\/blog\/?p=5385"},"modified":"2023-06-11T13:28:03","modified_gmt":"2023-06-11T20:28:03","slug":"injecting-intelligence-into-microsofts-internal-network-with-microsoft-azure","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/insidetrack\/blog\/injecting-intelligence-into-microsofts-internal-network-with-microsoft-azure\/","title":{"rendered":"Injecting intelligence into Microsoft\u2019s internal network with Microsoft Azure"},"content":{"rendered":"
This content has been archived, and while it was correct at time of publication, it may no longer be accurate or reflect the current situation at Microsoft.<\/p>\n<\/div>\n<\/div>\n
Microsoft is on its way to a self-healing global network.<\/p>\n
Several teams within Microsoft Digital, the company\u2019s IT and Operations division, have joined forces to use a \u201cnetwork as code\u201d approach to modernize Microsoft\u2019s vast global network of 17,000 devices. Network as code is a subset of \u201cinfrastructure as code.\u201d It applies the tenets of DevOps\u2014version control, continuous monitoring, and other DevOps practices\u2014to network infrastructure, such as the Microsoft network.<\/p>\n
In effect, they are connecting to and visualizing every device on Microsoft\u2019s massive internal global network, so they can \u201csee\u201d everything that happens on it.<\/p>\n A four-phase approach underpins the transformation of Microsoft\u2019s corporate network.<\/a> The first phase was automating the network. Once the team automated certain processes, like changing credentials for network devices, it moved to phase two. Phase two was event response, or capturing events emitted by network devices. Phase three, which is currently underway, is fact collection. The goal of fact collection is to gather, store, and map network device data. Together, the first three phases enable the fourth and final phase: a desired state configuration.<\/p>\n \u201cWhat we\u2019re building is the foundation of a modern approach to network management. It\u2019s moving from manual operational processes to automating the changes and configurations to the network,\u201d says Virginia Melandri, program manager for the Platform Engineering Team in Microsoft Digital.<\/p>\n [<\/em>Join the team at Microsoft Build to learn more about their work<\/em>.<\/a> Learn about how Microsoft Digital is implementing a cloud-centric architecture.<\/em><\/a> Read about Microsoft Digital\u2019s implementation of Azure networking and zero trust.<\/em><\/a>]<\/em><\/p>\n Retooling network fact collection<\/strong><\/p>\n Melandri\u2019s team is currently gathering data from network devices as part of phase three. \u201cThis ultimately puts the network in a more deterministic state,\u201d says Dipanjan Nag, a Microsoft software engineer. \u201cWe\u2019ll be able to control it exactly how we want to because we\u2019ll know exactly what\u2019s going on with the network.\u201d<\/p>\n \u201cFact collection lays the groundwork for proactive issue detection and healing capabilities,\u201d Melandri says. 
\u201cWe\u2019re creating a network that can identify issues and self-heal before they become a widespread problem. We need data to do that.\u201d<\/p>\n Taehyun Kwon, a software engineer on the Platform Engineering Team, says the time had come for a smarter approach to network management.<\/p>\n \u201cWe had a legacy version of fact collection, but it was slow and didn\u2019t give us granular control,\u201d he says. \u201cWith the new platform, engineers can collaborate like developers. They can write a standardized data model of what the facts should look like. Then they can just schedule a request or make an ad hoc request to collect facts. No more waiting for a legacy program that takes hours and hours.\u201d<\/p>\n The new fact collection platform also means the team can collect information from the network at scale. They can gather different types of information and expand their network reads and fact collection to include new commands, according to Melandri. \u201cIt means less manual work, faster validation of changes, and the ability to detect any deviations from our desired state,\u201d she says.<\/p>\n Built on Microsoft Azure, the fact collection platform collects essential facts from the largest set of network devices (Cisco devices) and structures the output in a graph database<\/a>. This graph database visualizes the interconnections of the network devices and their data. It\u2019s also capable of capturing configuration backups to undo any undesired changes.<\/p>\n The legacy application that handled fact collection wasn\u2019t designed with scalability in mind. When the team accelerated their fact collection efforts, they hit a bottleneck. Switching to a Python-based application sped things up.<\/p>\n \u201cThe next question was where to host the Python,\u201d Kwon says. \u201cWe decided on Kubernetes. 
Once we did, we were in business.\u201d Because Kubernetes is compatible with the Microsoft Azure microservices architecture, it scales up and down easily.<\/p>\n \u201cIf we need to scale up one particular component, we can do that just for that component without scaling up everything else. We wouldn\u2019t be able to do that with a monolithic application,\u201d says Naveenkumar Iyappan, a software engineer on the Microsoft Digital team.<\/p>\n The changes had the desired impact.<\/p>\n \u201cIt used to take 18 to 20 hours to complete all fact collection,\u201d Iyappan says. \u201cNow it\u2019s down to an hour, and it\u2019ll get faster as we reduce the latency of individual components.\u201d<\/p>\n The fact collection platform consists of a scheduler (a Golang application), a collector (a Python application), and various Microsoft Azure components. Within the platform, two separate flows collect device information and facts about those devices, respectively. That data is then combined to graph devices on the network and their relationship to each other.<\/p>\n To collect device information, the scheduler fetches authentication credentials from Microsoft Azure Key Vault<\/a> using an Azure Active Directory (AAD) service principal to connect to a datastore via an internal API. After some computations and validations, the scheduler sends the device data and information on the facts to be collected to the collector via a Microsoft Azure Event Hub<\/a>. Note that the scheduler only sends information about the facts, not the facts themselves.<\/p>\n The scheduler then pushes device data to Graph Builder, a Microsoft Azure Function application<\/a> (Figure 2). 
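The split between "information about the facts" and the facts themselves can be pictured as a small message schema. Here is a minimal sketch in Python; the `FactRequest` structure and its field names are hypothetical illustrations, not the team's actual schema:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class FactRequest:
    """Hypothetical shape of a scheduler message: it names the device and
    which facts to collect, but carries no fact data itself."""
    device_id: str
    device_os: str                                  # e.g., "cisco-ios"
    facts: List[str] = field(default_factory=list)  # facts the collector should gather

def to_event_payload(request: FactRequest) -> bytes:
    # Event Hub events carry opaque bytes; JSON keeps the schema readable.
    return json.dumps(asdict(request)).encode("utf-8")

request = FactRequest(
    device_id="router-001",
    device_os="cisco-ios",
    facts=["interfaces", "neighbors", "running-config"],
)
payload = to_event_payload(request)
print(json.loads(payload)["facts"])  # ['interfaces', 'neighbors', 'running-config']
```

In production such a payload would be published to the Event Hub (for example with the `azure-eventhub` SDK) and picked up by the collector, which runs the named reads against the device and emits the resulting facts downstream.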
This data is the base vertex Graph Builder will use to document the relationships between devices (base vertices attach to other vertices to create relationships).<\/p>\n Having created the base vertex (device data) necessary for the network model, the system collects facts to attach to that base vertex to create the graph. To accomplish this, the scheduler sends scheduled requests to the Microsoft Azure Event Hub, which the collector continually monitors for new events. The collector logs the device and its information and processes the data as facts (Figure 2). The data is then sent to the Microsoft Azure Event Hub for more processing. The Graph Builder Azure Function then ingests the JSON facts and attaches them to the device data created in the device information flow to create the relationships between devices. The Graph Builder instance uses the Microsoft Azure Cosmos DB bulk executor<\/a> to achieve 150k RU\/s throughput.<\/p>\n Separately, the platform creates configuration backups via the collector, which pushes the backups to Microsoft Azure File Share as a fallback.<\/p>\n By graphing network devices and the relationships of those devices, the team can see the entire network (Figure 3). \u201cWhen we perform an analysis, we can see the potential impact of a device going down,\u201d Iyappan says. \u201cIf we can\u2019t graph those relationships, we can\u2019t visualize them and can\u2019t make predictions like that.\u201d<\/p>\n Creating a self-healing global network<\/strong><\/p>\n In its current iteration, the platform can run scheduled jobs. Triggered collection is in the pipeline. That function will make diagnostics easier and eventually give the network self-healing capabilities.<\/p>\n \u201cWe\u2019re adding the ability to trigger fact collection any time a change is made to the network, which should make troubleshooting a lot easier,\u201d Melandri says. 
\u201cWe also plan to expand coverage to all network devices in the fall.\u201d<\/p>\n As part of that expansion, the team plans to take advantage of Microsoft Azure\u2019s global footprint by building out clusters and replicating resources in multiple locations, which will further speed up fact collection by reducing the physical distance between the platform and the network devices.<\/p>\n The team also plans to build an API.<\/p>\n \u201cWith the API, customers will be able to access network data, and we\u2019ll be able to work with them directly to develop models that suit their particular needs,\u201d Melandri says. \u201cWe\u2019re also working on how to detect differences between the current state of the network and the desired state. Writing back to the network with those desired configurations or triggering the right playbook to be able to remediate, that\u2019s the vision.\u201d<\/p>\n The platform, and the data it collects, is also foundational to future artificial intelligence and machine learning plans. Those plans are an extension of the Microsoft Digital data lake strategy<\/a>, which makes previously siloed data available and accessible in data lakes.<\/p>\n \u201cThere are infinite possibilities,\u201d Kwon says, \u201cbut like any machine learning project, you have to collect the data first. This project gets us the data we need.\u201d<\/p>\n In the meantime, network engineers will reap the benefits of the new platform.<\/p>\n \u201cIt\u2019s simple for network engineers to add a job now or to collect different facts,\u201d Kwon says. \u201cIf they want to add a new fact in a local model and collect that data, they can just work with us to create another scheduler config in a GitHub repo we manage. They don\u2019t have to figure out how to build a scheduler or what platform to deploy to. 
It\u2019s much more scalable.\u201d<\/p>\n Join the team at Microsoft Build to learn more about their work.<\/a><\/p>\n Learn about how Microsoft Digital is implementing a cloud-centric architecture.<\/a><\/p>\n
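The kind of impact analysis Iyappan describes falls naturally out of the graph model. Below is a minimal sketch in Python of an undirected device graph with a "what breaks if this device goes down" query; the device names, class, and in-memory structure are illustrative stand-ins, not the production Cosmos DB graph schema:

```python
from collections import defaultdict, deque

class DeviceGraph:
    """Tiny in-memory stand-in for the network device graph."""

    def __init__(self):
        self.links = defaultdict(set)  # device -> directly connected devices

    def connect(self, a: str, b: str) -> None:
        self.links[a].add(b)
        self.links[b].add(a)

    def reachable_from(self, start: str, down: str = None) -> set:
        """Devices reachable from `start` by BFS, skipping a failed device."""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in self.links[node]:
                if neighbor != down and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def outage_impact(self, core: str, down: str) -> set:
        """Devices cut off from `core` if `down` goes offline."""
        alive = self.reachable_from(core, down=down)
        return set(self.links) - alive - {down}

# A toy topology: one core, two distribution switches, three access switches.
g = DeviceGraph()
g.connect("core-1", "dist-1")
g.connect("dist-1", "access-1")
g.connect("dist-1", "access-2")
g.connect("core-1", "dist-2")
g.connect("dist-2", "access-3")

print(sorted(g.outage_impact("core-1", down="dist-1")))  # ['access-1', 'access-2']
```

With the real platform, the vertices and edges would come from the collected facts in the graph database rather than hard-coded `connect` calls, but the traversal idea is the same: once relationships are graphed, predicting the blast radius of a failing device is a reachability query.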