Injecting intelligence into Microsoft’s internal network with Microsoft Azure

May 18, 2020

Microsoft is on its way to a self-healing global network.

Several teams within Microsoft Digital, the company’s IT and Operations division, have joined forces to use a “network as code” approach to modernize Microsoft’s vast global network of 17,000 devices. Network as code is a subset of “infrastructure as code”: it adapts the tenets of DevOps, such as version control and continuous monitoring, and applies them to network infrastructure like the Microsoft network.

In effect, they are connecting to and visualizing every device on Microsoft’s massive internal global network, so they can “see” everything that happens on it.

Figure 1. The four phases of Microsoft’s approach to a modern global network: automation, event response, fact collection, and desired state.

A four-phase approach underpins the transformation of Microsoft’s corporate network. The first phase was automating the network. Once the team had automated processes like changing credentials for network devices, it moved to phase two: event response, or capturing events emitted by network devices. Phase three, which is currently underway, is fact collection: gathering, storing, and mapping network device data. Together, these phases enable the fourth and final phase, a desired state configuration.

“What we’re building is the foundation of a modern approach to network management. It’s moving from manual operational processes to automating the changes and configurations to the network,” says Virginia Melandri, program manager for the Platform Engineering Team in Microsoft Digital.

[Join the team at Microsoft Build to learn more about their work. Learn about how Microsoft Digital is implementing a cloud-centric architecture. Read about Microsoft Digital’s implementation of Azure networking and zero trust.]

Retooling network fact collection

Melandri’s team is currently gathering data from network devices as part of phase three. “This ultimately puts the network in a more deterministic state,” says Dipanjan Nag, a Microsoft software engineer. “We’ll be able to control it exactly how we want to because we’ll know exactly what’s going on with the network.”

“Fact collection lays the groundwork for proactive issue detection and healing capabilities,” Melandri says. “We’re creating a network that can identify issues and self-heal before they become a widespread problem. We need data to do that.”

Taehyun Kwon, a software engineer on the Platform Engineering Team, says the time had come for a smarter approach to network management.

“We had a legacy version of fact collection, but it was slow and didn’t give us granular control,” he says. “With the new platform, engineers can collaborate like developers. They can write a standardized data model of what the facts should look like. Then they can just schedule a request or make an ad hoc request to collect facts. No more waiting for a legacy program that takes hours and hours.”
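The article doesn’t show the team’s actual data model, but a minimal sketch of what a standardized fact definition might look like in Python is below. All names and fields are hypothetical; the point is that one model can describe both a scheduled request and an ad hoc one.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FactDefinition:
    """One fact to collect from a class of network devices (hypothetical schema)."""
    name: str                            # e.g. "interface_status"
    command: str                         # device command whose output yields the fact
    device_platform: str                 # e.g. "cisco_ios"
    schedule_cron: Optional[str] = None  # set for scheduled runs; None means ad hoc

# A scheduled request and an ad hoc request built from the same model.
scheduled = FactDefinition(
    name="interface_status",
    command="show ip interface brief",
    device_platform="cisco_ios",
    schedule_cron="0 */4 * * *",         # every four hours
)
ad_hoc = FactDefinition(
    name="bgp_neighbors",
    command="show ip bgp summary",
    device_platform="cisco_ios",         # no schedule: collected on demand
)
```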

The new fact collection platform also means the team can collect information from the network at scale. They can gather different types of information and expand their network reads and fact collection to include new commands, according to Melandri. “It means less manual work, faster validation of changes, and the ability to detect any deviations from our desired state,” she says.

Built on Microsoft Azure, the fact collection platform collects essential facts from the largest set of network devices (Cisco devices) and structures the output in a graph database, which visualizes the interconnections of the network devices and their data. The platform can also capture configuration backups to undo any undesired changes.

The legacy application that previously handled fact collection wasn’t designed with scalability in mind. When the team accelerated their fact collection efforts, they hit a bottleneck. Switching to a Python-based application sped things up.

“The next question was where to host the Python,” Kwon says. “We decided on Kubernetes. Once we did, we were in business.” Kubernetes fits the platform’s Microsoft Azure microservices architecture and scales up and down easily.

“If we need to scale up one particular component, we can do that just for that component without scaling up everything else. We wouldn’t be able to do that with a monolithic application,” says Naveenkumar Iyappan, a software engineer on the Microsoft Digital team.
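As a hedged illustration, here is how scaling a single component might look with the official Kubernetes Python client. The deployment name “collector” and the namespace are assumptions, not the team’s actual resource names.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config()
# when running inside the cluster).
config.load_kube_config()
apps = client.AppsV1Api()

# Scale only the collector deployment; every other component keeps its
# current replica count. Name and namespace are placeholders.
apps.patch_namespaced_deployment_scale(
    name="collector",
    namespace="fact-collection",
    body={"spec": {"replicas": 10}},
)
```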

The changes had the desired impact.

“It used to take 18 to 20 hours to complete all fact collection,” Iyappan says. “Now it’s down to an hour, and it’ll get faster as we reduce the latency of individual components.”

The fact collection platform consists of a scheduler (a Golang application), a collector (a Python application), and various Microsoft Azure components. Within the platform, two separate flows collect device information and facts about those devices, respectively. That data is then combined to graph the devices on the network and their relationships to each other.

To collect device information, the scheduler uses an Azure Active Directory (AAD) service principal to fetch authentication credentials from Microsoft Azure Key Vault, then connects to a datastore via an internal API. After some computations and validations, the scheduler sends the device data and information on the facts to be collected to the collector via a Microsoft Azure Event Hub. Note that the scheduler only sends information about the facts, not the facts themselves.
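A minimal sketch of that credential fetch, using the Azure SDK for Python, might look like the following. The vault URL, secret name, and service principal IDs are placeholders.

```python
from azure.identity import ClientSecretCredential
from azure.keyvault.secrets import SecretClient

# Authenticate as an AAD service principal; all IDs here are placeholders.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<service-principal-app-id>",
    client_secret="<service-principal-secret>",
)

vault = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",
    credential=credential,
)

# The secret name holding device credentials is hypothetical.
device_password = vault.get_secret("network-device-password").value
```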

Figure 2. The fact collection architecture, in four stages: source (the network devices); scheduling and collection (the scheduler works with the internal API, then initiates the collector, which sends facts as JSON); ingestion and storage (an Event Hub receives facts from the collector and passes them to a function app, the Graph Builder, which pushes them into Azure Cosmos DB via the Gremlin API, while config backups are stored in Azure File Share); and serve (a REST API exposes the data to tools such as Ansible and Azure Data Lake).

The scheduler then pushes device data to Graph Builder, a Microsoft Azure Function application (Figure 2). This data is the base vertex Graph Builder will use to document the relationships between devices (base vertices attach to other vertices to create relationships).
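A sketch of how Graph Builder might write a base device vertex into Azure Cosmos DB through the Gremlin API is below, using the gremlinpython driver. The account, database, graph, key, property names, and partition key are all placeholders, not the team’s actual schema.

```python
from gremlin_python.driver import client, serializer

# Connect to the Cosmos DB Gremlin endpoint; all identifiers are placeholders.
gremlin = client.Client(
    "wss://<cosmos-account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<cosmos-primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Create the base vertex for one device; fact vertices attach to it later.
gremlin.submit(
    "g.addV('device')"
    ".property('id', 'rtr-sea-01')"
    ".property('partitionKey', 'rtr-sea-01')"
    ".property('platform', 'cisco_ios')"
).all().result()
```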

Having created the base vertex (device data) necessary for the network model, the system collects facts to attach to that vertex. To accomplish this, the scheduler sends scheduled requests to the Microsoft Azure Event Hub, which the collector continually monitors for new events. The collector logs each device and its information and processes the data as facts (Figure 2). The facts are then sent to the Microsoft Azure Event Hub for further processing. A Microsoft Azure Function ingests the JSON and attaches it to the data created in the device information flow, building the relationships between devices. The Graph Builder instance uses the Microsoft Azure Cosmos DB bulk executor to achieve a throughput of 150,000 request units per second (RU/s).
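The collector’s Event Hub loop might look something like this sketch, built on the azure-eventhub SDK. The connection string, hub name, payload fields, and the process_request stub are assumptions.

```python
import json
from azure.eventhub import EventHubConsumerClient

def process_request(request: dict) -> None:
    # Hypothetical stand-in for the real collection logic, which would
    # connect to the device and run the requested commands.
    print(f"collecting {request.get('fact')} from {request.get('device')}")

def on_event(partition_context, event):
    # Each event carries device data and fact metadata, not the facts themselves.
    process_request(json.loads(event.body_as_str()))

consumer = EventHubConsumerClient.from_connection_string(
    conn_str="<event-hub-connection-string>",
    consumer_group="$Default",
    eventhub_name="scheduled-requests",  # hub name is a placeholder
)
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")  # read from earliest
```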

Separately, the platform creates configuration backups via the collector, which pushes the backups to Microsoft Azure File Share as a fallback.
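A backup push to Azure File Share could be as simple as the sketch below, using the azure-storage-file-share SDK. The connection string, share name, path convention, and sample config text are placeholders.

```python
from datetime import datetime, timezone
from azure.storage.fileshare import ShareFileClient

# Hypothetical config text; in practice this comes from the collector.
config_text = "hostname rtr-sea-01\n!\ninterface GigabitEthernet0/0\n..."
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

# One timestamped file per device per backup; names are placeholders.
backup = ShareFileClient.from_connection_string(
    conn_str="<storage-connection-string>",
    share_name="config-backups",
    file_path=f"rtr-sea-01/{stamp}.cfg",
)
backup.upload_file(config_text.encode("utf-8"))
```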

By graphing network devices and the relationships of those devices, the team can see the entire network (Figure 3). “When we perform an analysis, we can see the potential impact of a device going down,” Iyappan says. “If we can’t graph those relationships, we can’t visualize them and can’t make predictions like that.”
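Reusing the hypothetical Gremlin client from the Graph Builder sketch above, an impact query might walk the links of a device vertex to list its neighbors. The edge label 'connected_to' is an assumption about the graph’s schema.

```python
# Walk the graph from one device to every directly connected device;
# these are the neighbors affected if rtr-sea-01 goes down.
downstream = gremlin.submit(
    "g.V('rtr-sea-01').both('connected_to').values('id')"
).all().result()
print(f"devices affected if rtr-sea-01 fails: {downstream}")
```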

Figure 3. A graph display of network device connections, shown as a logical hub-and-spoke map of many devices.

Creating a self-healing global network

In its current iteration, the platform can run scheduled jobs. Triggered collection is in the pipeline. That function will make diagnostics easier and eventually give the network self-healing capabilities.

“We’re adding the ability to trigger fact collection any time a change is made to the network, which should make troubleshooting a lot easier,” Melandri says. “We also plan to expand coverage to all network devices in the fall.”
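One plausible shape for triggered collection, given the architecture above, is publishing an ad hoc request to the same Event Hub the collector already watches whenever a change event fires. The hub name and payload fields below are assumptions, not the team’s design.

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hub-connection-string>",
    eventhub_name="scheduled-requests",  # the hub the collector already watches
)

# An ad hoc request fired by a network change; fields are hypothetical.
request = {"device": "rtr-sea-01", "fact": "interface_status", "trigger": "config-change"}

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(request)))
    producer.send_batch(batch)
```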

As part of that expansion, the team plans to take advantage of Microsoft Azure’s global footprint by building out clusters and replicating resources in multiple locations, which will further speed up fact collection by reducing the physical distance between the platform and the network devices.

The team also plans to build an API.

“With the API, customers will be able to access network data, and we’ll be able to work with them directly to develop models that suit their particular needs,” Melandri says. “We’re also working on how to detect differences between the current state of the network and the desired state. Writing back to the network with those desired configurations or triggering the right playbook to be able to remediate, that’s the vision.”

The platform, and the data it collects, is also foundational to future artificial intelligence and machine learning plans. Those plans are an extension of the Microsoft Digital data lake strategy, which makes previously siloed data available and accessible in data lakes.

“There are infinite possibilities,” Kwon says, “but like any machine learning project, you have to collect the data first. This project gets us the data we need.”

In the meantime, network engineers will reap the benefits of the new platform.

“It’s simple for network engineers to add a job now or to collect different facts,” Kwon says. “If they want to add a new fact in a local model and collect that data, they can just work with us to create another scheduler config in a GitHub repo we manage. They don’t have to figure out how to build a scheduler or what platform to deploy to. It’s much more scalable.”
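A scheduler config entry might look something like the sketch below, with every field name hypothetical. Checking an entry like this into the team’s repo would be all an engineer needs to do; the platform handles scheduling and deployment from there.

```python
# A new fact collection job, expressed as a config entry. Field names are
# hypothetical; the real config lives in a GitHub repo the team manages.
new_job = {
    "fact": "lldp_neighbors",
    "command": "show lldp neighbors detail",
    "device_platform": "cisco_ios",
    "schedule": "30 2 * * *",  # daily at 02:30 UTC
    "owner": "network-engineering",
}
```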
