Azure networking Archives - Inside Track Blog
http://approjects.co.za/?big=insidetrack/blog/tag/azure-networking/
How Microsoft does IT
Mon, 02 Feb 2026 21:37:26 +0000

Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft
http://approjects.co.za/?big=insidetrack/blog/vuln-ai-our-ai-powered-leap-into-vulnerability-management-at-microsoft/
Thu, 16 Oct 2025 16:05:00 +0000

In today’s hyperconnected enterprise landscape, vulnerability management is no longer a back-office function—it’s a frontline defense. With thousands of devices from a multitude of vendors, and a relentless stream of Common Vulnerabilities and Exposures (CVEs), here at Microsoft we faced a challenge familiar to every IT decision maker: how to scale vulnerability response without scaling cost, complexity, or risk.

A photo of Fielder.

“While AI enables amazing capabilities for knowledge workers, it also increases the threat landscape, since bad actors using AI are constantly probing for vulnerabilities. Vuln.AI helps keep Microsoft safe by identifying and accelerating the mitigation of vulnerabilities in our environment.”

Brian Fielder, vice president, Microsoft Digital 

Enter Vuln.AI, an intelligent agentic system developed by our team in Microsoft Digital—the company’s IT organization—to transform how we identify, prioritize, and resolve vulnerabilities across our enterprise network.

Manual methods can’t keep up

As a company, we detect over 600 million cybersecurity threats every day, according to our latest Digital Defense Report. Some of those signals are bad actors probing our internal network and infrastructure looking for unpatched vulnerabilities. Our infrastructure supports over 300,000 employees and vendors, 25,000 network devices, and over 560 buildings across 102 countries. This scale means we face a constant stream of vulnerabilities—each requiring triage, impact analysis, and remediation.

“While AI enables amazing capabilities for knowledge workers, it also increases the threat landscape, since bad actors using AI are constantly probing for vulnerabilities. Vuln.AI helps keep Microsoft safe by identifying and accelerating the mitigation of vulnerabilities in our environment,” says Brian Fielder, a vice president within Microsoft Digital. 

Historically, our Infrastructure, Networking, and Tenant team here in Microsoft Digital relied on manual assessments to determine which network devices were impacted by new vulnerabilities. Traditional vulnerability scanning tools generate a lot of false positives and false negatives, and a significant amount of analysis still falls to security engineers, requiring manual validation before any vulnerability impact can be communicated to device owners. These manual methods were time-consuming, error-prone, and reactive—our security engineers were spending hours on each vulnerability, at times missing critical threats or sinking too much time into false alarms.

A photo of Bansal.

“AI’s true power lies in the problem it’s applied to. Start by identifying the most time-consuming or painful task in your organization, then explore how AI can augment or improve it. Begin with a small, targeted enhancement and iterate continuously.”

Ankit Bansal, senior product manager, Microsoft Digital

With the vast number of vulnerabilities coming in every day, security engineers needed a scalable way to quickly analyze, prioritize, and respond.

The solution: Vuln.AI

We already achieved dramatic impact with our AI Ops and Network Infrastructure Copilot, which is on track to save us over 11,000 hours of network service management time per year. We built Vuln.AI on top of that investment:

  1. The Research Agent analyzes vulnerability feeds and network metadata from our Infrastructure Data Lakehouse (IDL) built on top of Azure Data Explorer, which regularly ingests data from our device vendors and other sources. Once new vulnerabilities are detected, it automates the identification of impacted devices and integrates with other internal tooling for validation and reporting.
  2. The Interactive Agent acts as a gateway for engineers and device owners to ask follow-up questions and initiate remediation. Through agent-to-agent interaction, it leverages our Network Infrastructure Copilot to query the research agent’s findings. This agentic interface enables real-time decision-making and contextual insights.

Together, these agents are significantly improving our network security operations. The results we’re seeing so far are compelling:

  • A 70% reduction in time to vulnerability insights, enabling faster prioritization and mitigation, minimizing exposure windows.
  • Lower risk of compromise through increased accuracy, quicker detection, and containment of threats.
  • A stronger compliance posture that supports adherence to financial, legal, and regulatory requirements.
  • Higher accuracy in identifying vulnerable devices, reducing false positives and missed threats.
  • Engineering hours saved and reduced fatigue, significantly improving productivity.

Our gains translate to lower operational risk, faster response times, and more resilient infrastructure—critical outcomes for any enterprise navigating today’s threat landscape.

“AI’s true power lies in the problem it’s applied to,” says Ankit Bansal, a senior product manager within Microsoft Digital. “Start by identifying the most time-consuming or painful task in your organization, then explore how AI can augment or improve it. Begin with a small, targeted enhancement and iterate continuously.”

How Vuln.AI works

The system continuously ingests CVE data from our device suppliers’ API feeds and a publicly available database of known cybersecurity vulnerabilities. It correlates that data with device attributes such as hardware model and operating system to identify the potential impact on the network and surface actionable insights.
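At its core, this correlation step matches each advisory’s affected hardware models and OS versions against the device inventory. Here is a minimal sketch of the idea; the field names and data shapes are illustrative assumptions, not Vuln.AI’s actual schema:

```python
# Illustrative sketch: correlate CVE advisories with a device inventory.
# Field names and structures are hypothetical, not Vuln.AI's real schema.

def find_impacted_devices(advisories, inventory):
    """Return {cve_id: [device names]} for devices whose model and
    OS version fall inside an advisory's affected set."""
    impacted = {}
    for adv in advisories:
        hits = [
            d["name"]
            for d in inventory
            if d["model"] in adv["affected_models"]
            and d["os_version"] in adv["affected_versions"]
        ]
        if hits:
            impacted[adv["cve_id"]] = hits
    return impacted

advisories = [{
    "cve_id": "CVE-2025-0001",
    "affected_models": {"switch-9300"},
    "affected_versions": {"17.3.1", "17.3.2"},
}]
inventory = [
    {"name": "sea-sw-01", "model": "switch-9300", "os_version": "17.3.1"},
    {"name": "sea-sw-02", "model": "switch-9300", "os_version": "17.6.0"},
]

print(find_impacted_devices(advisories, inventory))
# → {'CVE-2025-0001': ['sea-sw-01']}
```

In the real system, the same matching runs at scale over vendor feeds and Infrastructure Data Lakehouse metadata rather than in-memory lists.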

Engineers interact with the system via Copilot, Teams, or custom tooling, which allows seamless integration with our network security teams’ daily workflows.

“We built a hybrid approach in Vuln.AI to guide LLMs through complex security advisories,” says Blaze Kotsenburg, a software engineer in Microsoft Digital. “By combining structured function calls, templated prompts, and data validation, we keep the model focused on producing reliable, actionable insights for vulnerability mitigation.”

A photo of Lollis.

“We chose Durable Functions for Vuln.AI because it allowed us to confidently orchestrate complex, stateful research. The reliability and simplicity of the framework meant we could shift our focus to engineering the intelligence behind the agent, especially the prompting strategies used in Vuln.AI’s backend processing.”

Mike Lollis, senior software engineer, Microsoft Digital

When it came to building Vuln.AI, we relied heavily on our own technology platforms, including: 

  • Azure AI Foundry for model development and deployment
  • Azure Data Explorer to store device metadata and CVEs
  • Agent-to-agent interaction with our Network Infrastructure Copilot to query our database for device and inventory knowledge
  • Azure OpenAI models for natural language processing and classification
  • Azure Durable Functions for fine-grained orchestration and custom LLM workflows
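Durable Functions gives the team a fan-out/fan-in pattern: spawn one analysis activity per candidate device, then gather the results. The actual implementation uses Azure Durable Functions orchestrations; this plain-Python sketch only illustrates the shape of the pattern, and the step names are hypothetical:

```python
# Plain-Python sketch of the fan-out/fan-in orchestration pattern that
# Durable Functions provides. The real Vuln.AI workflow runs these steps
# as Azure Durable Functions activities; names here are illustrative.

def analyze_device(device, cve_id):
    # Stand-in for one activity: an LLM-assisted impact check per device.
    return {"device": device, "cve": cve_id, "impacted": device.endswith("01")}

def orchestrate(cve_id, devices):
    # Fan out one analysis task per device, then fan in the results.
    results = [analyze_device(d, cve_id) for d in devices]
    return [r["device"] for r in results if r["impacted"]]

print(orchestrate("CVE-2025-0001", ["sea-sw-01", "sea-sw-02"]))
# → ['sea-sw-01']
```

The value of the durable framework is that each fan-out step is checkpointed, so a long-running research job survives restarts without re-doing completed work.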

“We chose Durable Functions for Vuln.AI because it allowed us to confidently orchestrate complex, stateful research,” says Mike Lollis, a senior software engineer in Microsoft Digital. “The reliability and simplicity of the framework meant we could shift our focus to engineering the intelligence behind the agent, especially the prompting strategies used in Vuln.AI’s backend processing.”

Vuln.AI in action

Consider a common scenario: a new CVE that affects a network switch has just been published. Vuln.AI’s research agent immediately flags the vulnerability, maps it to potentially affected devices in our network inventory, and pushes the findings to an internal database.

A photo of Lee.

“AI is only as good as the data you provide. Much of the success with Vuln.AI came from our dedicated efforts to source comprehensive vulnerability data and device attributes. For effective AI-powered solutions, you really need to invest in a strong data foundation and a strategy for how to integrate into the rest of your infrastructure.”

Linda Lee, product manager II, Microsoft Digital

This data then becomes immediately accessible in our internal tools, where it is validated and approved by security engineers. Following this, network engineers are provided with precise information about their vulnerable devices.

Engineers can prompt Vuln.AI’s interactive agent to instantly retrieve the following information:

“12 devices impacted by CVE-2025-XXXX. Would you like me to suggest some next steps for mitigation or remediation?”

With Vuln.AI, network engineers can now begin vulnerability response operations much more quickly—no spreadsheet wrangling and no delays.
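A reply like the one quoted above is a thin formatting layer over the research agent’s validated findings. The sketch below is hypothetical; the real agent answers through Copilot and agent-to-agent calls, so the function name and wording are illustrative only:

```python
# Hypothetical sketch of how the interactive agent might turn research
# findings into an engineer-facing reply. Illustrative only; the real
# agent surfaces this through Copilot, Teams, or custom tooling.

def summarize_findings(cve_id, impacted_devices):
    count = len(impacted_devices)
    return (
        f"{count} device{'s' if count != 1 else ''} impacted by {cve_id}. "
        "Would you like me to suggest some next steps for mitigation "
        "or remediation?"
    )

print(summarize_findings("CVE-2025-0001", ["sea-sw-01"] * 12))
```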

“AI is only as good as the data you provide,” says Linda Lee, a product manager II within Microsoft Digital. “Much of the success with Vuln.AI came from our dedicated efforts to source comprehensive vulnerability data and device attributes. For effective AI-powered solutions, you really need to invest in a strong data foundation and a strategy for how to integrate into the rest of your infrastructure.”

At its heart, Vuln.AI is about automating manual workflows and research.

“Vuln.AI has reduced our triage time by over 50%,” says Vincent Bersagol, a principal security engineer in Microsoft Digital.

This is allowing our engineers to focus on deeper analysis.

“The synergy between security and AI engineering has unlocked a new level of precision in vulnerability insights,” Bersagol says. “This is just the beginning.”

The journey ahead

Our journey with AI-powered vulnerability management has only just begun. Looking ahead, our roadmap for Vuln.AI includes:

  • Extending data coverage to include more hardware suppliers
  • Integrating more detailed device profiles for more targeted vulnerability response
  • Supporting autonomous workflows to streamline network engineers’ remediation efforts
  • Incorporating other AI agents to support more security use cases

These enhancements will further reduce risk, accelerate response times, and empower engineers to focus on more strategic initiatives.

“Trust is the foundation of everything we do in Microsoft Digital,” Bansal says. “Securing our network is essential to upholding that trust. Intelligent solutions like Vuln.AI not only help us stay ahead of emerging threats—they also establish the blueprint for integrating AI more deeply into our security operations.”

For IT leaders, Vuln.AI offers a blueprint for modern vulnerability management:

  • Scalable: Handles thousands of devices and vulnerabilities with ease
  • Accurate: Reduces false positives and missed threats
  • Efficient: Saves time, money, and resources
  • Secure: Built on Microsoft’s trusted AI and security frameworks

In a world where every second counts and any threat can be costly, Vuln.AI transforms vulnerability management from a bottleneck into a competitive advantage for Microsoft.

Key takeaways

As your organization looks for ways to improve security and threat response in a fast-changing landscape, consider the following insights on how AI is reshaping vulnerability management at Microsoft:

  • Fight fire with fire: The threat landscape has broadened dramatically due to bad actors using AI. Supplementing your own efforts with AI can help you manage your risk more effectively than traditional vulnerability management.
  • Agility is key: Effective vulnerability response hinges on acting fast. An AI-powered solution like Vuln.AI can cut the time needed to analyze and mitigate vulnerabilities by over 50%, enabling organizations to enhance security operations at scale.
  • The future is now: Looking ahead, Microsoft Digital will integrate agentic workflows into more security operations, boosting efficiency in risk prevention, threat detection and response, thereby enabling security practitioners and developers to focus on more strategic projects.

Keeping our in-house optical network safe with a Zero Trust mentality
http://approjects.co.za/?big=insidetrack/blog/keeping-our-in-house-optical-network-safe-with-a-zero-trust-mentality/
Thu, 16 Oct 2025 16:00:00 +0000

When it comes to corporate connectivity at Microsoft, a minute of lost connection can lead to catastrophic disruptions for our product teams, sleepless nights for our network engineers, and millions of dollars of lost value for the company.

That’s why we built our own optical network at our headquarters in Washington state, and that’s why we’re building similar networks at other regional campuses around the United States and the rest of the world.

With so much on the line, we need to make sure these in-house networks never go down.

But how are we doing that?

We’re applying the same robust Zero Trust approach we take to security and identity. While our optical networks are extremely reliable, any complex system can be knocked offline. In alignment with the Zero Trust mentality we have as a company, we couldn’t simply trust the integrity of what we’d built; we needed a resilient backup system that went beyond redundancy to provide true resilience.

Driven by this goal, we created a Zero Trust Optical Business Continuity Disaster Recovery (BCDR) network that combines two fully independent optical systems designed to sustain uninterrupted services, even during systemic failures. The result is more confidence for our employees and vendors, less pressure on our network engineers, and comprehensive network resilience that will protect us against a major outage.

The urgency of resilience

In 2021, our team in Microsoft Digital, the company’s IT organization, deployed our first next-generation optical network to serve the exclusive network needs of our Puget Sound metro campuses. It offers more bandwidth on less fiber for a lower operational cost than leasing from traditional carriers.

“Puget Sound is a highly concentrated developer network where we need to provide very high throughput,” says Patrick Alverio, principal group software engineering manager for Infrastructure and Engineering Services within Microsoft Digital. “Our optical system is the backbone of all that traffic.”

Our state-of-the-art optical network fulfills our need for fast and reliable connectivity at up to 400 Gbps between core sites, labs, data centers, and the internet edge. We built this network on the Reconfigurable Optical Add/Drop Multiplexer (ROADM) technology, delivering dynamic reconfiguration, colorless, directionless, contentionless (CDC) capabilities, flexible grid support, remote provisioning, and automation. It also features a full-mesh topology that provides a layer of redundancy.

But what if the entire ROADM-based system fails?

There are plenty of operational risks that can derail even the most robust network. Anything from misconfigured automation scripts to policy changes to misaligned software versioning to simple human error can cause outages.

A photo of Elangovan

“We don’t want even a second of downtime. We needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.”

Vinoth Elangovan, senior network engineer, Hybrid Core Network Services, Microsoft Digital

To some degree, those kinds of minor disruptions are inevitable. But catastrophic events like fiber cuts, failures in the ROADM operating system, or even natural disasters have the potential for even more wide-ranging disruption.

During a catastrophic outage, thousands of engineers, developers, researchers, and other technical employees who need access to crucial lab environments and data centers could lose connectivity. That can sabotage feature delivery, disrupt product patches, interrupt updates, and halt all kinds of core product functions.

On top of normal software development operations, new AI tools demand massive bandwidth and consistent uptime. Finally, our hybrid networks feature paths integrated with Microsoft Azure that consume on-premises resources, so they also stand to benefit from increased resilience.

A catastrophic network outage can cause incredible damage to all of these business functions. In fact, we experienced exactly that in 2022.

A fiber cut combined with a ROADM system hardware reboot caused a five-minute outage at our Puget Sound metro region. In this environment, every minute of lost connectivity can result in significant financial impact, making network resilience absolutely essential.

“We don’t want even a second of downtime,” says Vinoth Elangovan, senior network engineer, who designed and implemented the Zero Trust Optical BCDR network for Microsoft. “We needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.”

Delivering greater network resilience

To ensure we could deliver uninterrupted network connectivity even in the midst of a catastrophic outage, we needed to consider the technical demands of a truly resilient system. Five design pillars helped us assemble our architectural criteria:

  1. Independent optical systems: To provide true resilience, our primary and BCDR platforms needed to operate autonomously.
  2. Physically independent paths: Circuits should avoid shared conduits, fibers, and splices to operate completely independently.
  3. Separate control software: The primary and backup networks should operate through dedicated network management systems (NMSs), automation, and provisioning domains.
  4. Unified client interface: Both systems needed to terminate into the same interface to unify service for clients and applications.
  5. Survivability by design: We couldn’t assume that any system would be immune to failure. Instead, we built for the best possible outcomes.

The result was the Zero Trust Optical BCDR architecture, a layered approach to optical networking. It consists of our primary, ROADM-based transport layer and a secondary, MUX-based transport layer, both terminating into a single logical port channel.

“Our core responsibility is the employee experience, so our main design thrust was making sure service is seamless and uninterrupted—even during an outage.”

Vinoth Elangovan, senior network engineer, Hybrid Core Network Services, Microsoft Digital

Both systems are live and active, which means they deliver production services through their own independent fibers, power supplies, and software stacks. By layering fully independent optical domains and logically unifying them at the Ethernet edge, the network can sustain a complete failure of one system and maintain continuity.

That physical and operational independence is the difference between simple redundancy and robust resilience.

“Our core responsibility is the employee experience, so our main design thrust was making sure it’s seamless and uninterrupted—even during an outage,” Elangovan says.

Optical network backed by a BCDR network

A schematic of an optical network running between different nodes and backed up by a BCDR network.
The optical network in our Puget Sound region connects core sites to labs, datacenters, and the internet edge, while the BCDR network provides backup connections to deliver resilience in case of a catastrophic network failure.

A typical ROADM optical network connects campus and data center sites to the internet edge. Our design features three interconnected optical rings, with two internet edges as multi-directional nodes, while other sites operate as dual-degree nodes with bidirectional redundancy. Meanwhile, our campuses and datacenters are designated as critical sites and equipped with optical BCDR links for enhanced resiliency. In the event of a complete optical ROADM line failure, these critical sites retain connectivity.

In the event of an outage on the primary network, the port channel handles forward continuity automatically, shifting WAN traffic between optical paths in real time.

The transition occurs seamlessly and transparently, with no noticeable impact to clients.
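One way to picture this behavior: the port channel hashes each flow across its healthy member links, so when one optical domain drops out, its flows simply redistribute to the survivor. The model below is an illustration of that idea, not how the switch hardware is actually implemented:

```python
# Illustrative model of port-channel failover between two optical domains:
# flows are hashed across healthy member links, so a failed member's
# traffic redistributes automatically. Real port-channel/LACP behavior
# lives in the switch; this sketch only demonstrates the concept.

import zlib

def pick_link(flow_id, links):
    healthy = [l for l in links if l["up"]]
    if not healthy:
        raise RuntimeError("no healthy member links")
    idx = zlib.crc32(flow_id.encode()) % len(healthy)
    return healthy[idx]["name"]

links = [{"name": "roadm-path", "up": True}, {"name": "bcdr-path", "up": True}]
flow = "10.0.0.1->10.0.0.2:443"

before = pick_link(flow, links)
links[0]["up"] = False            # simulate a complete ROADM-side failure
after = pick_link(flow, links)
print(before, "->", after)        # the flow lands on the surviving path
```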

A photo of Martin

“Our initial goal was to provide high-throughput connectivity for major labs, with less than six minutes of downtime per year. That represents a service level of 99.999% network continuity, and we’re aiming for even better moving forward.”

Blaine Martin, principal engineering manager, Hybrid Core Network Services, Microsoft Digital

Coupling at the Ethernet layer provides clients and applications with one logical interface, automatic load balancing and traffic distribution, and seamless failover, regardless of which optical domain is providing service.

“Our initial goal was to provide high-throughput connectivity for major labs, with less than six minutes of downtime per year,” says Blaine Martin, principal engineering manager for Hybrid Core Network Services in Microsoft Digital. “That represents a service level of 99.999% network continuity, and we’re aiming for even better moving forward.”
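The “five nines” target maps directly to that downtime budget. A quick check, assuming a 365.25-day year:

```python
# Quick check of the "five nines" downtime budget quoted above,
# assuming a 365.25-day year.

availability = 0.99999
minutes_per_year = 365.25 * 24 * 60
allowed_downtime = (1 - availability) * minutes_per_year
print(f"{allowed_downtime:.2f} minutes/year")  # ≈ 5.26 minutes
```

That works out to roughly 5.26 minutes of downtime per year, comfortably under the six-minute goal.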

A new era of confidence for network engineers

For the network engineers who keep Microsoft employees and resources connected, the Zero Trust Optical BCDR network relieves much of the pressure that comes from resolving outages.

“Before, we were dependent on a single system, even with redundancies, so the human experience was like firefighting. Now, if the primary optical network is having a problem, I don’t even see it.”

Kevin Bullard, principal cloud network engineering manager, Microsoft Digital

When a network goes down, engineers have an enormous set of responsibilities to manage: processing the incident report, assigning severity, performing checks, notifying internal teams, providing updates, and engaging with physical support teams—all with a profound urgency to restore productivity.

Dialing those pressures back has been a huge benefit.

“Before, we were dependent on a single system, even with redundancies, so the human experience was like firefighting,” says Kevin Bullard, Microsoft Digital principal cloud network engineering manager responsible for maintaining WAN interconnectivity between labs. “Now, if the primary optical network is having a problem, I don’t even see it.”

There will always be pressure on network engineers to restore connectivity during an outage, but they can breathe easier knowing it won’t cost the company millions of dollars as the time to resolve ticks away. And in non-emergency situations like core site migrations, the BCDR network provides a much easier way to shunt services while the main network is offline.

“Our internal users have become more confident that they can stay connected, no matter what,” says Chakri Thammineni, principal cloud network engineer for Infrastructure and Engineering Services in Microsoft Digital. “That gives the people responsible for maintaining our enterprise networks incredible peace of mind.”

Fortunately, there hasn’t been a substantial network outage in the Puget Sound metro area since 2022. But our network engineering teams know that if and when it happens, the BCDR network will be ready to maintain service continuity.

A photo of Alverio.

“We’re always looking ahead into industry trends to stay at the bleeding edge, whether that’s in the technology we provide for our customers or the networks we use to do our own work.”

Patrick Alverio, principal group software engineering manager, Infrastructure and Engineering Services, Microsoft Digital

With our Puget Sound network protected, we have plans in place to extend this model to other metro areas. Naturally, we have to balance population, criticality, and the knowledge that elevated reliability and availability come with a cost.

Our selection criteria for new BCDR networks have largely centered around two factors: expansions of AI-critical infrastructure and concentrations of secure access workspaces (SAWs) for technical employees. With these criteria in mind, we’re planning new BCDR networks first in the Bay Area and Dublin, then in Virginia, Atlanta, and London.

Zero Trust optical BCDR architecture represents a paradigm shift in enterprise network resilience, and we’re committed to expanding the model to benefit both conventional workloads and the expanding infrastructure demands of AI.

“We’re always looking ahead into industry trends to stay at the bleeding edge, whether that’s in the technology we provide for our customers or the networks we use to do our own work,” Alverio says. “We refuse to accept the status quo, and we’re elevating the experience for employees across Puget Sound and Microsoft as a whole.”

Driving AI innovation in optical network resilience

Our journey towards an AI-driven optical network is gaining momentum.

As part of our Secure Future Initiative, we’ve automated our Optical Management Platform credential rotation and are actively developing intelligent incident management ticket enrichment, auto-remediation, link provisioning, deployment validation, and capacity planning.

AI plays a central role in this transformation.

With Microsoft 365 Copilot and GitHub Copilot integrated into our engineering workflows, we’re accelerating development cycles, improving code accuracy, and uncovering optimization opportunities that would otherwise take hours of manual effort.

These Copilots are also helping our engineers analyze network patterns, simulate outcomes, and validate deployment logic before execution, reducing human error and strengthening our Zero Trust posture. Over time, we’re evolving toward a system where AI not only assists but proactively predicts potential disruptions, recommends remediations, and continuously learns from operational telemetry.

These advancements are paving the way for a future where our optical infrastructure can anticipate issues, recover faster, and operate with the agility and assurance expected in a Zero Trust environment.

Key takeaways

If you’re considering implementing your own optical and BCDR networks, consider these tips:

  • Understand the technical components of resilience: Independent optical systems, physically independent paths, separate control software, a unified client interface, and survivability by design are the key technical components of true resilience.
  • Plan from a preparedness and value perspective: Evaluate the critical points in your infrastructure and determine where you can get the most value out of resilient connectivity.
  • Ensure your teams have the right skillset: Carefully consider the right workforce to run those systems and be accountable for their operation.

Unleashing API-powered agents at Microsoft: Our internal learnings and a step-by-step guide
http://approjects.co.za/?big=insidetrack/blog/unleashing-api-powered-agents-at-microsoft-our-internal-learnings-and-a-step-by-step-guide/
Thu, 02 Oct 2025 16:05:00 +0000

Agentic AI is the frontier of the AI landscape. These tools show enormous promise, but harnessing their power isn’t always as straightforward as prompting a model or accessing data from Microsoft 365 apps. To reach their full potential in the enterprise, agents sometimes need access to data beyond Microsoft Graph. But giving them access to that data relies on an extra layer of extensibility.

To meet these demands, many of our teams within Microsoft Digital, the company’s IT organization, have been experimenting with API-based agents. This approach combines the best of two worlds: accessing diverse apps and data repositories and eliminating the need to build an agent from the ground up.

We want to empower every organization to unlock the full power of agents through APIs. The lessons we’ve learned on our journey can help you get there.

The need for API-based agents

The vision for Microsoft 365 Copilot is to serve as the enterprise UX. Within that framework, agents serve as the background applications that streamline workflows and save our employees time.

For many users, the out-of-the-box access Copilot provides to Microsoft Graph is enough to support their work. It surfaces the data and content they need while providing a foundational orchestration layer with built-in capabilities around compliance, responsible AI, and more.

But there are plenty of scenarios that require access to other data sources.

“Copilot provides you with data that’s fairly static as it stands in Microsoft Graph,” says Shadab Beg, principal software engineering manager on our International Sovereign Cloud Expansion team. “If you need to query from a data store or want to make changes to the data, you’ll need an API layer.”

By using APIs to extend agents built on the Copilot orchestration layer, organizations can apply its reasoning capabilities to new data without the need to fine-tune their models or create new ones from scratch. The possibilities these capabilities unlock are driving a boom in API-based agents for key functions and processes.

A photo of Nasir.

“Cost is one of the most critical dimensions in how we design, deploy, and scale our solutions. Declarative API-driven agents in Microsoft 365 Copilot offer a path to unify agentic experiences while leveraging shared AI compute and infrastructure.”

Faisal Nasir, AI Center of Excellence and Data Council lead, Microsoft Employee Experience

In many ways, IT organizations like ours are the ideal places to implement API-based agents. Our teams are adept at creating and deploying internal solutions to solve technical challenges, and IT work is often about enablement and efficiency—exactly what agents do best.

“Cost is one of the most critical dimensions in how we design, deploy, and scale our solutions,” says Faisal Nasir, AI Center of Excellence and Data Council lead in Microsoft Employee Experience. “Declarative API-driven agents in Microsoft 365 Copilot offer a path to unify agentic experiences while leveraging shared AI compute and infrastructure. By aligning with core architectural principles such as efficiency, scalability, and sustainability, we can ensure these agents not only drive intelligent outcomes but also maximize value across service areas with minimal overhead.”

{Learn more about our vision and strategy around deploying agents internally at Microsoft.}

The Azure FinOps Budget Agent

Our Azure FinOps Budget Agent is a perfect example of a scenario for API-based agents.

The team responsible for managing our Microsoft Azure budget for IT services was looking for ways to reduce costs by 10–20 percent. To do that effectively, service and finance managers needed the ability to track their spending quickly, accurately, and easily.

The conventional approach to solving this problem would be creating a dashboard with access to the relevant data. The problem with a UI-based approach is that it tends to cater to more specific personas by providing data only they need while oversaturating others with information that’s irrelevant to their work.

“Azure spend is basically the lifeline for our services,” says Faris Mango, principal software engineering manager for infrastructure and engineering services within Microsoft Digital. “Getting the information you need in a concise format that provides a nice, holistic view can be challenging.”

With the advent of generative AI and Microsoft 365 Copilot, the team knew that a natural language interface would be much more intuitive. The result was the Azure FinOps Budget Agent.

The team created the agent and the necessary APIs using Microsoft Visual Studio Code. Its tables and functions run on Azure Data Explorer, allowing the APIs and their consumers to access data almost instantaneously, thanks to its low latency and rapid read speeds.

The tool retrieves data by running Azure Data Factory pipelines that pull and transform data from three sources:

  • Our SQL Server for service budget and forecast data
  • Azure Spend for the actual spending amounts
  • Projected spending, a separate service stored in other Azure Data Explorer tables

Processing the information relies on our business logic’s join operations, followed by aggregations by fiscal year and service tree levels. These summarize the data per service, team group, service group, and organization.
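The join-and-aggregate step can be sketched in plain Python. This is an illustration only; the production logic runs as Kusto functions in Azure Data Explorer, and the field and service names here are hypothetical:

```python
from collections import defaultdict

# Hypothetical rows from two of the three sources after the pipeline pulls them.
budgets = [  # from SQL Server: budget and forecast per service
    {"service": "NetPortal", "group": "Infra", "fy": "FY25", "budget": 120.0},
    {"service": "EngPortal", "group": "Infra", "fy": "FY25", "budget": 80.0},
]
actuals = [  # from Azure Spend: actual spending amounts
    {"service": "NetPortal", "fy": "FY25", "actual": 95.0},
    {"service": "EngPortal", "fy": "FY25", "actual": 88.0},
]

def summarize(budgets, actuals):
    """Join budgets with actuals per service, then roll up by fiscal
    year and service-tree level (service and service group here)."""
    actual_by_key = {(a["service"], a["fy"]): a["actual"] for a in actuals}
    per_service = []
    rollup = defaultdict(lambda: {"budget": 0.0, "actual": 0.0})
    for b in budgets:
        actual = actual_by_key.get((b["service"], b["fy"]), 0.0)
        per_service.append({**b, "actual": actual, "variance": actual - b["budget"]})
        agg = rollup[(b["group"], b["fy"])]
        agg["budget"] += b["budget"]
        agg["actual"] += actual
    return per_service, dict(rollup)

per_service, by_group = summarize(budgets, actuals)
print(per_service[1]["variance"])   # EngPortal overspends by 8.0
print(by_group[("Infra", "FY25")])  # group-level rollup
```

The key idea is that the aggregation happens once, at ingestion time, so the agent never performs heavy joins while a user waits.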

After the back end processes the day’s data, it ingests the information into our Azure Data Explorer tables, which the agent accesses by calling Kusto functions (Kusto is the query language for Azure Data Explorer). The outcome is very low latency: typically, the agent returns results in under 500 milliseconds.

For users, the tool is stunningly simple. They just access Copilot and navigate to the Azure FinOps Budget Agent.

The agent provides three core prompts at the very top of the interface: “My budgets,” “Service budget information,” and “Service group budget information.” Clicking on one of these pre-loaded prompts returns role-specific information around budget, forecasts, actuals, projections, and variance, all at a single glance. The interface even includes graphs to help people track spending visually.

If users are looking for more specific information, they can input their own queries. For example:

  • “Get me the monthly breakdown of service Azure Optimization Assessment analytics.”
  • “Find me the service in this tree with the highest budget.”
  • “Show me the Azure budget for our facilities reporting portal.”
  • “Which service deviates most from its budget forecasts?”

The Azure FinOps Budget Agent primarily serves two groups: service managers who directly oversee spend for Azure-based services and FinOps managers responsible for larger budget portfolios.

Mango is responsible for the internal UI that helps employees access parts of the Microsoft network. With 18,000–20,000 users per month, budgeting and forecasting are highly dynamic due to traffic fluctuations and the resourcing that supports them. He also oversees the internal portal that helps service engineers manage our networks. That tool is growing rapidly as we onboard more teams, so forecasting is anything but linear.

For both of these services, keeping close track of spending is essential. Mango finds himself checking the Azure FinOps Budget Agent about twice a month to gauge how his services are trending.

“It’s taking me less time to do analysis and come up with accurate numbers. And the enhanced user experience just feels more natural, like you’re asking questions conversationally rather than engaging with a dashboard.”

A photo of Mango.
Faris Mango, principal software engineering manager for infrastructure and engineering services, Microsoft Digital

For FinOps managers, the value is more high-level. They are responsible for overseeing tens of services featuring vast volumes of Azure usage across storage and compute while managing strict budgets. That requires constant vigilance.

Switching context from one dashboard to another to track different Azure management groups was a constant hassle for them. Now, they use the Azure FinOps Budget Agent to get an up-to-date view of the overall spend picture. It gives them a place to start. From there, they can drill down if they see any abnormalities.

“It’s taking me less time to do analysis and come up with accurate numbers,” Mango says. “And the enhanced user experience just feels more natural, like you’re asking questions conversationally rather than engaging with a dashboard.”

The arrival of the Azure FinOps Budget Agent is just one example of how agents take your context and get your people the answers they care about faster at less cost.

Benefits like these are spreading across teams throughout Microsoft. Overall, we’ve been able to save 10–12 percent of our overall Azure cost footprint for Microsoft Digital, and individual users are thrilled at the amount of time and effort they’re saving.

“Now the info is at people’s fingertips. The advantage of an agent is that users don’t have to understand a complex UI, so they can get quick answers and get back to work.”

A photo of Beg.
Shadab Beg, principal software engineering manager, International Sovereign Cloud Expansion

Five key strategies for building an API-based agent

After seeing what we’ve accomplished with API-based agents, you might be wondering how to put them into action at your organization. This step-by-step guide can help you get there.

An API-based agent needs to fulfill multiple requirements. It has to expose APIs, align with real user needs, integrate seamlessly with Microsoft 365 Copilot, and work reliably, efficiently, and scalably. Achieving those outcomes depends on five key strategies.

Start with user intent, not the API

Start by asking a simple but powerful question: What will users actually ask your agent? Instead of designing the API first, flip the process:

  • Gather real user queries to understand actual use cases.
  • Refine the queries using prompt engineering techniques to align them with expected AI behavior.
  • Design the API to provide structured responses to those refined queries.

By starting with user intent, you ensure your agent answers real user questions directly, avoids over-engineering unnecessary endpoints, and delivers meaningful results without excessive back-end processing.

“Now the info is at people’s fingertips,” Beg says. “The advantage of an agent is that users don’t have to understand a complex UI, so they can get quick answers and get back to work.”

Key learning: An API that doesn’t align with user intent won’t be effective—even if you design it well.

Design APIs for Microsoft 365 Copilot integration

It’s important to build an API schema that returns precise, structured data that’s easy for Copilot to consume and that directly answers user queries. Copilot expects responses in under three seconds, so focus on optimizing API responses for low latency.

Once you have your list of key questions, design your API schema to return the exact data you need to answer those questions. Your goal should be to ensure every API response has a structure that makes it easy for Copilot to understand.
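As an illustration, a response for a “monthly breakdown” query might be shaped so that every field maps directly to something Copilot can say back to the user. The field names below are hypothetical, not the actual schema:

```python
def monthly_breakdown_response(service, rows):
    """Return a structure that answers the user's question directly:
    labeled, pre-aggregated values rather than raw table rows."""
    total_budget = sum(r["budget"] for r in rows)
    total_actual = sum(r["actual"] for r in rows)
    return {
        "service": service,
        "months": [
            {
                "month": r["month"],
                "budget": r["budget"],
                "actual": r["actual"],
                "variancePct": round(100 * (r["actual"] - r["budget"]) / r["budget"], 1),
            }
            for r in rows
        ],
        # Summary fields let Copilot answer "how are we trending?" without doing math.
        "totals": {"budget": total_budget, "actual": total_actual},
        "status": "over" if total_actual > total_budget else "within",
    }

resp = monthly_breakdown_response(
    "NetPortal",
    [{"month": "Jul", "budget": 10.0, "actual": 9.0},
     {"month": "Aug", "budget": 10.0, "actual": 12.0}],
)
print(resp["status"])  # "over": 21.0 actual against 20.0 budget
```

Pre-computed fields like `variancePct` and `status` spare the model from arithmetic it might get wrong and keep responses consistent across queries.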

Teach Microsoft 365 Copilot to call your API

Copilot needs to know how to call your API. Manifests and OpenAPI descriptions provide that training.

Create detailed OpenAPI documentation and plugin manifests so Copilot knows what your API does, how to invoke it, and what responses to expect. You’ll likely need to adjust these files through a process of trial and error.
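A minimal sketch of what that looks like, expressed here as a Python dictionary for readability: the `description` fields are what Copilot reads when deciding whether and how to call the endpoint, so they deserve as much iteration as the code. The path and operation names are hypothetical:

```python
import json

# Hypothetical OpenAPI fragment for a budget-lookup endpoint. The
# description fields do the "teaching": Copilot matches user intent
# against them when deciding whether to invoke the API.
openapi_fragment = {
    "openapi": "3.0.1",
    "info": {"title": "FinOps Budget API", "version": "1.0"},
    "paths": {
        "/budget/{serviceName}": {
            "get": {
                "operationId": "getServiceBudget",
                "description": (
                    "Returns budget, actual spend, and variance for a "
                    "named service in the current fiscal year. Use when "
                    "the user asks about a service's Azure budget or spend."
                ),
                "parameters": [{
                    "name": "serviceName",
                    "in": "path",
                    "required": True,
                    "schema": {"type": "string"},
                    "description": "Exact service name from the service tree.",
                }],
                "responses": {"200": {"description": "Budget summary."}},
            }
        }
    },
}

# Serialize to the JSON document you would ship in the plugin package.
print(json.dumps(openapi_fragment, indent=2)[:60])
```

Vague descriptions are the most common reason an agent fails to invoke the right endpoint, which is why tuning these files is iterative.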

Scale APIs for performance and reliability

Once you have your schema and integration in place, it’s time to move on to the primary engineering challenge: making your API scalable, efficient, and reliable.

Prioritize the following goals:

  • Fast response times: Copilot expects quick answers.
  • High scalability: This ensures seamless performance at scale.
  • Reliable uptime: The system needs to remain robust.

We recommend setting a very strict latency limit while implementing your API to retrieve data, since Copilot needs time to generate its response. Existing API endpoints often involve complex data joins rather than simply returning rows from data tables. This complexity can lead to longer processing times, particularly with intricate queries that involve multiple data stores.

To address these potential delays, pre-cache results to significantly enhance performance. This can help you meet the latency requirements Copilot imposes.
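One way to pre-cache, sketched with only the standard library: refresh the expensive joined result out of band so the API call itself is a dictionary lookup. The refresh function and TTL here are assumptions, not the production design:

```python
import time

class PrecomputedCache:
    """Serve API reads from an in-memory snapshot; refresh it out of
    band so no user request ever waits on the heavy joins."""

    def __init__(self, compute, ttl_seconds=3600):
        self._compute = compute          # the expensive join/aggregate step
        self._ttl = ttl_seconds
        self._snapshot = compute()
        self._refreshed_at = time.monotonic()

    def get(self, key):
        # Reads never trigger recomputation: stale-but-fast by design,
        # which fits a daily-ingest pipeline like the one described above.
        return self._snapshot.get(key)

    def maybe_refresh(self):
        # Called by a background timer or job, not by request handlers.
        if time.monotonic() - self._refreshed_at >= self._ttl:
            self._snapshot = self._compute()
            self._refreshed_at = time.monotonic()

def expensive_join():
    # Stand-in for the multi-store join; returns pre-aggregated results.
    return {"NetPortal": {"budget": 120.0, "actual": 95.0}}

cache = PrecomputedCache(expensive_join, ttl_seconds=3600)
print(cache.get("NetPortal"))  # instant lookup, no joins on the hot path
```

The trade-off is freshness: a snapshot refreshed hourly is a good fit for budget data that's ingested daily, but not for data that changes per request.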

At this point, you’ll see why starting with user intent and iteratively refining API design is important. By grounding your work in user behaviors, you’ll align with the following best practices:

  • Structure your response to directly address user queries.
    Instead of just returning raw data, the API should provide meaningful insights Copilot can interpret. Prompt engineering marries user intent with the most understandable API schema.
  • Keep your API flexible enough to adapt to evolving business needs.
    Real-world workflows change over time, and an API should be able to support those changes without massive refactoring.
  • Avoid performance bottlenecks caused by unnecessary complexity.
    Understanding the exact data requirements up front prevents heavy joins, excessive filtering, and inefficient data retrieval logic.
  • Optimize for Copilot’s real-time response constraints.
    With a strict limit on latency, consider pre-optimization techniques like pre-caching results and simplifying query logic from the very beginning of your API implementation.

If you attempt to build a scalable, reliable API without first understanding how users will interact with your agent, you’ll spend months reworking the schema, debugging inefficiencies, and struggling with integration challenges.

Key learning: A fast, scalable, and reliable API isn’t just about technical optimization. It starts with a deep understanding of the questions it needs to answer and how to structure responses so Copilot can interpret them correctly.

Consider compliance and responsible AI

Unlike custom agents or OpenAI API integrations, knowledge-only agents require far less effort to meet Microsoft’s Responsible AI Standard. Microsoft tools’ built-in compliance capabilities handle much of the complexity. As a result, you can focus on efficiency and optimization rather than regulatory hurdles.

“Agent-based automation must balance speed with responsibility,” Nasir says. “We embed compliance, cost control, and telemetry from the start, so our systems don’t just scale, they mature.”

Key learning: It’s helpful to revisit your existing compliance, governance, and responsible AI processes and policies before implementing AI solutions. Copilot adheres to protective structures within your Microsoft technology ecosystem, so this process will ensure you’re starting from the most secure position.

APIs and the agentic future

Building API-based agents is more than just an integration exercise. It’s about creating scalable, intelligent, and compliant AI-driven workflows. By aligning your API design with user intent, you set Microsoft 365 Copilot free to retrieve and interpret information accurately. That leads to a seamless AI experience for your employees.

Thanks to Copilot’s built-in security and compliance features, API-based Copilot agents are some of the most efficient, compliant, and enterprise-ready ways to deploy AI solutions. They represent another step into an AI-first future tailored to your employees’ and organization’s needs.

Tools like API-based agents democratize the information we all need to do our jobs better, because we’re all getting the same data from the same place. This is why an AI-first mindset is actually human-first.

Key takeaways

Here are some things to keep in mind when designing agent-powered experiences that are fast, reliable, and aligned with user expectations.

  • Response time is key. Favor single, low-latency API calls that satisfy both Copilot’s technical requirements and users’ expectations.
  • Consider the source. Data has to be high-quality on the backend. It’s worth reviewing your data and ensuring the hygiene is good.
  • Agents and APIs need to align. Design with task-centric, well-structured agents. Determine your high-level goals, then use standards like OpenAPI or graph schemas to describe task endpoints. Define each API’s capability, input schema, and expected outcome very clearly.
  • Plan ahead to avoid surprises. Design your APIs to minimize potential side effects, especially through enabling natural-language-to-API mapping, because that’s the biggest change in methodology.
  • Design for visibility. Agents need to be observable and explainable, so implement metrics-driven monitoring. Having API-level telemetry in addition to Copilot-level telemetry enables continuous improvement.
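API-level telemetry of the kind the last point describes can start as small as a timing wrapper around each endpoint handler. This is a sketch, not a monitoring framework; in production the records would flow to your telemetry pipeline:

```python
import functools
import time

METRICS = []  # stand-in for a real telemetry sink

def instrument(endpoint_name):
    """Record latency and success/failure per call so agent behavior
    can be correlated with API behavior."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            ok = True
            try:
                return fn(*args, **kwargs)
            except Exception:
                ok = False
                raise
            finally:
                METRICS.append({
                    "endpoint": endpoint_name,
                    "ms": (time.perf_counter() - start) * 1000,
                    "ok": ok,
                })
        return wrapper
    return decorator

@instrument("getServiceBudget")
def get_service_budget(name):
    # Hypothetical handler; returns the structured response Copilot consumes.
    return {"service": name, "budget": 120.0}

get_service_budget("NetPortal")
print(METRICS[0]["endpoint"], METRICS[0]["ok"])
```

Pairing records like these with Copilot-level telemetry lets you tell whether a slow or wrong answer came from the model or from the API behind it.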

The post Unleashing API-powered agents at Microsoft: Our internal learnings and a step-by-step guide appeared first on Inside Track Blog.

]]>
Transforming our VPN with Global Secure Access at Microsoft http://approjects.co.za/?big=insidetrack/blog/transforming-our-vpn-with-global-secure-access-at-microsoft/ Thu, 25 Sep 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=20360 Ensuring safe and secure access to resources in the enterprise has always been a delicate balance. Protecting corporate assets from intrusions and misuse is paramount. But a system that neglects usability for employees creates frustration and inefficiencies. At Microsoft, we’re in the midst of a major transformation in how we manage access to our corporate […]

The post Transforming our VPN with Global Secure Access at Microsoft appeared first on Inside Track Blog.

]]>
Ensuring safe and secure access to resources in the enterprise has always been a delicate balance. Protecting corporate assets from intrusions and misuse is paramount. But a system that neglects usability for employees creates frustration and inefficiencies.

At Microsoft, we’re in the midst of a major transformation in how we manage access to our corporate resources. The cornerstone of this change is Microsoft Global Secure Access (GSA), a security service edge (SSE) solution that replaces traditional VPNs with a modern, identity-centric model. GSA provides three core services integrated into a unified framework: Microsoft 365 Access, Internet Access, and Private Access. This approach not only strengthens our enterprise security posture but also simplifies connectivity for both users and administrators.

A photo of Apple.

“Years ago, the concept of a VPN was simple: a single virtual private network gave employees access to the company’s entire internal network. Today, this model presents serious risks.”

Pete Apple, principal cloud network engineer, Microsoft Digital

Over 158,000 of our employees are already using the GSA client and Microsoft 365, with full rollout of private and internet access planned in the coming months. Here’s how we’re building a more secure, seamless, and future-ready access experience across Microsoft’s ecosystems.

Beyond VPNs: the future of secure access

The idea that an internal network is inherently safer than the open internet has always been risky, and modern threats make that assumption dangerous. This is why we’ve embraced the Zero Trust model, shifting away from blanket access and moving toward least-privilege access—ensuring users only get what they need, when they need it, and nothing more.

Adopting a Zero Trust approach across the enterprise makes moving beyond traditional VPNs imperative. For years, we’ve relied on Microsoft VPN and Azure VPN to access internal resources. While effective, these traditional models operate on an “all-or-nothing” basis: once connected, employees gain broad access, regardless of role or security context.

“Years ago, the concept of a VPN was simple: a single virtual private network gave employees access to the company’s entire internal network,” says Pete Apple, a principal cloud network engineer in Microsoft Digital, the company’s IT organization. “Today, this model presents serious risks. If a user’s identity or device is compromised—or if a man-in-the-middle attack occurs—the attacker can connect through the VPN and gain broad access to sensitive data, soft targets, and critical systems.”

A photo of Triv.

“One of the primary reasons for this shift to GSA is that we get more granularity within this identity-based security solution, so we can control access on a very fine level.”

Gary Triv, principal network engineer, Microsoft Digital

This creates challenges for organizations like ours—and yours.

That’s where GSA can help.

It shifts the paradigm by introducing fine-grained, identity-based controls. Through deep integration with Microsoft Entra, administrators can enforce policies that adapt in real time, ensuring that sensitive resources are accessible only to the right users, on the right devices, under the right conditions.

“One of the primary reasons for this shift to GSA is that we get more granularity within this identity-based security solution, so we can control access on a very fine level,” says Gary Triv, a principal network engineer in Microsoft Digital.

The four pillars of GSA security

Our focus on security is built into everything we do.

“Conditional access, identity-centric controls, and other core elements of Zero Trust are built directly into the solution,” says Lalitha Mahajan, global technical program manager for Global Secure Access.

At the heart of GSA are four foundational security features:

  1. Conditional Access (CA): Unlike VPNs, which provide blanket access, CA enforces contextual rules to ensure role-appropriate access at all times. For example, an engineer may be allowed access to a security portal, while another user may only see Power BI dashboards.
  2. Continuous Access Evaluation (CAE): Access control doesn’t stop at login. CAE evaluates user context in real time. If an employee’s role changes, their credentials are revoked, or they leave the company, their active sessions are immediately terminated.
  3. Network Filtering: GSA allows administrators to define exactly where users can go on the internet or within corporate networks. This ensures employees have access only to approved destinations, reducing exposure to threats.
  4. Compliant Network (CN): Access is tied to the source network. For instance, a device in Redmond may be allowed, but the same device in an untrusted region could be blocked automatically.

Together, these pillars make GSA a secure and adaptive solution, fully aligned with the principles of Zero Trust.
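As a conceptual illustration only (GSA policies are configured in Microsoft Entra, not hand-coded), the contextual evaluation the first and fourth pillars describe reduces to a deny-by-default check over user, device, and network context. The policy table and names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class AccessContext:
    role: str
    device_compliant: bool
    network: str          # e.g. "corp-redmond", "untrusted-region"

# Hypothetical policy table: resource -> allowed roles and trusted networks.
POLICIES = {
    "security-portal": {"roles": {"engineer"}, "networks": {"corp-redmond"}},
    "powerbi-dashboards": {"roles": {"engineer", "analyst"},
                           "networks": {"corp-redmond", "corp-dublin"}},
}

def evaluate(resource, ctx):
    """Deny by default; grant only when role, device health, and
    source network all satisfy the resource's policy."""
    policy = POLICIES.get(resource)
    if policy is None or not ctx.device_compliant:
        return False
    return ctx.role in policy["roles"] and ctx.network in policy["networks"]

analyst = AccessContext(role="analyst", device_compliant=True, network="corp-redmond")
print(evaluate("powerbi-dashboards", analyst))  # True
print(evaluate("security-portal", analyst))     # False: role not allowed
```

Continuous Access Evaluation extends this idea in time: the same check is re-run whenever the context changes, not just at login.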

“With the Zero Trust model, our goal is to enforce least-privilege access. That means locking down internal resources, improving segmentation, and using firewalls and other controls so users can’t reach everything by default,” Apple says. “Instead of relying on a blanket VPN network, we’re moving to the Entra Global Secure Access model, which combines network and identity. Instead of granting broad visibility into the entire internal network, access can now be scoped to a user’s identity—so employees only connect to the resources defined for them.”

A photo of Mahajan.

“Unlike traditional VPNs, GSA delivers both client-side and server-side insights, all of which we own. This gives us deeper visibility and allows us to make the data more actionable for our use cases.”

Lalitha Mahajan, program manager, Microsoft Digital

A perfect example is a Microsoft developer—one of our most common employee roles.

Our developers may need access to specific source code, certain labs, and designated file shares. With GSA, we can grant access only to those resources—and nothing else. This shift from a blanket “once connected, you can see everything” approach to a tightly defined, identity-based model is a major security improvement and one of the most exciting reasons we’re moving forward with this product.

A key differentiator and critical Zero Trust enabler is GSA’s rich telemetry, which provides real-time visibility into user activity, device health, and network traffic. This continuous stream of data enables early detection of threats, anomaly detection, and precise policy enforcement—strengthening Zero Trust in practice.

“Unlike traditional VPNs, GSA delivers both client-side and server-side insights, all of which we own,” Mahajan says. “This gives us deeper visibility and allows us to make the data more actionable for our use cases.”

The key components of GSA

Private Access is just one of three offerings that make up GSA. Together, these offerings are unified under a single client that creates three dedicated tunnels—one for each service—while administrators centrally define routing and policy rules. GSA consists of:

  • Microsoft 365 Access: Optimized, policy-controlled connectivity for Office apps and services.
  • Internet Access: Secure browsing with TLS inspection, URL filtering, and content controls.
  • Private Access: A modern replacement for legacy VPNs that enables granular access to internal resources.

For Internet Access, GSA supports two deployment models: branch connectivity, where IPSec tunnels secure traffic from devices without a client (like printers), and client connectivity, where the GSA client routes laptop or desktop traffic directly to the GSA Edge. Both approaches enforce consistent policies, differing only in how traffic reaches the framework.

Advanced features and monitoring

Unlike fragmented VPN and firewall logs, GSA provides consistent visibility through unified logging, which consolidates session data—including user identity, device, source, destination, and applied policies—into a single view. We can now easily validate whether security features are working as intended and forward logs to Microsoft Sentinel for extended monitoring.

This holistic view provides us with a major advantage against cyber threats, enabling faster investigations and clearer correlations between user behavior and network activity.

Our rollout of GSA is well underway internally at Microsoft. With more than 158,000 GSA client and Microsoft 365 users already onboard, the next phase will expand private access company-wide, followed by broader adoption of internet access. Early pilots have demonstrated strong results, with positive feedback on both usability and the ability to solve unique access challenges.

By delivering a complete, identity-based secure access solution—spanning Microsoft 365, internet, and private connectivity—Microsoft is redefining enterprise access for the cloud-first era. The result is a future where connectivity is not only seamless but also secure, adaptive, and tightly aligned with user identity and context.

Key takeaways

Our experience transitioning to GSA Private Access has left us with several key insights that other enterprises can apply to their own efforts to modernize remote access:

  • Adopt least-privilege access: Move away from blanket network access to ensure employees only reach the resources they need.
  • Reduce risk from compromised accounts: Limit the blast radius of identity or device breaches by segmenting and scoping access.
  • Continuously evaluate trust: Treat access as dynamic, adapting in real time to changes in user roles, device health, or network conditions.
  • Improve visibility through telemetry: Use detailed activity and traffic data to spot anomalies early and strengthen security decisions.
  • Unify security and connectivity: Align access with identity and context, creating a balance between strong protection and seamless user experience.

The post Transforming our VPN with Global Secure Access at Microsoft appeared first on Inside Track Blog.

]]>
Modernizing IT infrastructure at Microsoft: A cloud-native journey with Azure http://approjects.co.za/?big=insidetrack/blog/modernizing-it-infrastructure-at-microsoft-a-cloud-native-journey-with-azure/ Thu, 04 Sep 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=20125 Engage with our experts! Customers or Microsoft account team representatives from Fortune 500 companies are welcome to request a virtual engagement on this topic with experts from our Microsoft Digital team. At Microsoft, we are proudly a cloud-first organization: Today, 98% of our IT infrastructure—which serves more than 200,000 employees and incorporates over 750,000 managed […]

The post Modernizing IT infrastructure at Microsoft: A cloud-native journey with Azure appeared first on Inside Track Blog.

]]>

Engage with our experts!

Customers or Microsoft account team representatives from Fortune 500 companies are welcome to request a virtual engagement on this topic with experts from our Microsoft Digital team.

At Microsoft, we are proudly a cloud-first organization: Today, 98% of our IT infrastructure—which serves more than 200,000 employees and incorporates over 750,000 managed devices—runs on the Microsoft Azure cloud.

The company’s massive transition from traditional datacenters to a cloud-native infrastructure on Azure has fundamentally reshaped our IT operations. By adopting a cloud-first, DevOps-driven model, we’ve realized significant gains in agility, scalability, reliability, operational efficiency, and cost savings.

“We’ve created a customer-focused, self-serve management environment centered around Azure DevOps and modern engineering principles,” says Pete Apple, a technical program manager and cloud architect in Microsoft Digital, the company’s IT organization. “It has really transformed how we do IT at Microsoft.”

“Our service teams don’t have to worry about the operating system. They just go to a website, fill in their info, add their data, and away they go. That’s a big advantage in terms of flexibility.”

Apple is shown in a portrait photo.
Pete Apple, technical program manager and cloud architect, Microsoft Digital

What it means to move from the datacenter to the cloud

Historically, our IT environment was anchored in centralized, on-premises datacenters. The initial phase of our cloud transition involved a lift-and-shift approach, migrating workloads to Azure’s infrastructure as a service (IaaS) offerings. Over time, the company evolved toward more of a decentralized, platform as a service (PaaS) DevOps model.

“In the last six or seven years we’ve seen a lot more focus on PaaS and serverless offerings,” says Faisal Nasir, a principal architect in Microsoft Digital. “The evolution is also marked by extensibility—the ability to create enterprise-grade applications in the cloud—and how we can design well-architected end-to-end services.”

Because we’ve moved nearly all our systems to the cloud, we have a very high level of visibility into our network operations, according to Nasir. We can now leverage Azure’s native observability platforms, extending them to enable end-to-end monitoring, debugging, and data collection on service usage and performance. This capability supports high-quality operations and continuous improvement of cloud services.

“Observability means having complete oversight in terms of monitoring, assessments, compliance, and actionability,” Nasir says. “It’s about being able to see across all aspects of our systems and our environments, and even from a customer lens.”

Decentralizing our IT services with Azure

As Microsoft was becoming a cloud-first organization, the nature of the cloud and how we use it changed. As Microsoft Azure matured and more of our infrastructure and services moved to the cloud, we began to move away from IT-owned applications and services.

The strength of Azure’s self-service and management features means that individual business groups can handle many of the duties that Microsoft Digital formerly offered as an IT service provider—which enables each group to build agile solutions to match their specific needs.

“Our goal with our modern cloud infrastructure continues to be a solution that transforms IT tasks into self-service, native cloud solutions for monitoring, management, backup, and security across our entire environment,” Apple says. “This way, our business groups and service lines have reliable, standardized management tools, and we can still maintain control over and visibility into security and compliance for our entire organization.”

The benefits to our businesses of this decentralized model of IT services include:

  • Empowered, flexible DevOps teams
  • A native cloud experience: subscription owners can use features as soon as they’re available
  • Freedom to choose from marketplace solutions
  • Minimal subscription limit issues
  • Greater control over groups and permissions
  • Better insights into Microsoft Azure provisioning and subscriptions
  • Business group ownership of billing and capacity management

“With the PaaS model, and SaaS (software as a service), it’s more DIY,” Apple says. “Our service teams don’t have to worry about the operating system. They just go to a website, fill in their info, add their data, and away they go. That’s a big advantage in terms of flexibility.”

“The idea of centralized monitoring is gone. The new approach is that service teams monitor their own applications, and they know best how to do that.”

Delamarter is shown in a portrait photo.
Cory Delamarter, principal software engineering manager, Microsoft Digital

Leveraging the power of Azure Monitor

Microsoft Azure Monitor is a comprehensive monitoring solution for collecting, analyzing, and responding to monitoring data from cloud and on-premises environments. Across Microsoft, we use Azure Monitor to ensure the highest level of reliability for our services and applications.

Specifically, we rely on Azure Monitor to:

Create visibility. There’s instant access to fundamental metrics, alerts, and notifications across core Azure services for all business units. Azure Monitor also covers production and non-production environments as well as native monitoring support across Microsoft Azure DevOps.

Provide insight. Business groups and service lines can view rich analytics and diagnostics across applications and their compute, storage, and network resources, including anomaly detection and proactive alerting.

Enable optimization. Monitoring results help our business groups and service lines understand how users are engaging with their applications, identify sticking points, develop cohorts, and optimize the business impact of their solutions.

Deliver extensibility. Azure Monitor is designed for extensibility, with support for custom event ingestion and broader analytics scenarios.

Because we’ve moved to a decentralized IT model, much of the monitoring work has moved to the service team level as well.

“The idea of centralized monitoring is gone,” says Cory Delamarter, a principal software engineering manager in Microsoft Digital. “The new approach is that service teams monitor their own applications, and they know best how to do that.”

Patching and updating, simplified

Moving our operations to the cloud also means a simpler and more automated approach to patching and updating. The shift to PaaS and serverless networking has allowed us to manage infrastructure patching centrally, which is much more scalable and efficient. The extensibility of our cloud platforms reduces integration complexity and accelerates deployment.

“It depends on the model you’re using,” Nasir says. “With the PaaS and serverless networks, the service teams don’t need to worry about patching. With hybrid infrastructure systems, being in the cloud helps with automation of patching and updating. There’s a lot of reusable automation layers that help us build end-to-end patching processes in a faster and more reliable manner.”

Apple stresses the flexibility this offers a large organization, allowing teams to choose how they do their patching and updating.

“In the datacenter days, we ran our own centralized patching service, and we picked the patching windows for the entire company,” Apple says. “By moving to more automated self-service, we provide the tools and the teams can pick their own patching windows. That also allowed us to have better conversations, asking the teams if they want to keep doing the patching or if they want to move up the stack and hand it off to us. So, we continue to empower the service teams to do more and give them that flexibility.”
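The self-service model Apple describes can be sketched as a simple registry where each team picks its own window. The team names and schedules below are hypothetical; a real implementation would live in an automation platform, not a script.

```python
from datetime import datetime, timedelta

# Hypothetical self-service registry: each team registers its own window.
PATCH_WINDOWS = {
    "payments": {"weekday": 5, "hour": 2},   # Saturday 02:00
    "identity": {"weekday": 6, "hour": 23},  # Sunday 23:00
}

def next_patch_window(team, now):
    """Return the start of the team's next scheduled patching window."""
    w = PATCH_WINDOWS[team]
    candidate = now.replace(hour=w["hour"], minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(w["weekday"] - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)  # Window already passed this week.
    return candidate
```

The point of the design is that the platform owns the mechanism while each service team owns the schedule.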

Securing our infrastructure in a cloud-first environment

As security has become an absolute priority for Microsoft, it’s also been a foundational element of our cloud strategy.

Being a cloud-first company has made it easier to be a security-first organization as well.

“The cloud enables us to embed security by design into everything we build,” Nasir says. “At enterprise scale, adopting Zero Trust and strong governance becomes seamless, with controls engineered in from the start, not retrofitted later. That same foundation also prepares us for an AI-first future, where resilience, compliance, and automation are built into every system.”

Cloud-native security features combined with integrated observability allow for better compliance and risk management. Delamarter agrees that the cloud has had huge benefits when it comes to enhancing network security.

“Our code lives in repositories now, and so there’s a tremendous amount of security governance that we’ve shifted upstream, which is huge,” Delamarter says. “There are studies that show that the earlier you can find defects and address them, the less expensive they are to deal with. We’re able to catch security issues much earlier than before.”

“There are less and less manual actions required, and we’re automating a lot of business processes. It basically gives us a huge scale of automation on top of the cloud.”

Nasir is shown in a portrait photo.
Faisal Nasir, principal architect, Microsoft Digital

We use Azure Policy, which helps enforce organizational standards and assess compliance at scale using dashboards and other monitoring tools.

“Azure Policy was a key part of our security approach, because it essentially offers guardrails—a set of rules that says, ‘Here’s the defaults you must use,’” Apple says. “You have to use a strong password, for example, and it has to be tied to an Azure Active Directory ID. We can dictate really strong standards for everything and mandate that all our service teams follow these rules.”
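The guardrail idea Apple describes can be sketched as a rule evaluated against each resource. Real Azure Policy definitions are JSON documents evaluated by the platform; the fields and rule below are hypothetical simplifications of that model.

```python
# Illustrative guardrail: accounts must use a strong password tied to a
# directory identity. The resource fields are hypothetical.
DENY_WEAK_AUTH = {
    "if": lambda r: not r.get("directory_linked") or r.get("password_strength") != "strong",
    "effect": "deny",
}

def evaluate(resource, policy):
    """Return the policy effect for a resource, or 'allow' if compliant."""
    return policy["effect"] if policy["if"](resource) else "allow"

compliant = {"directory_linked": True, "password_strength": "strong"}
drifted = {"directory_linked": False, "password_strength": "strong"}
```

Because every service team’s resources pass through the same evaluation, the defaults become mandatory rather than advisory.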

AI-driven operations in the cloud

Just like its impact on the rest of the technology world, AI is in the process of transforming infrastructure management at Microsoft. Tasks that used to be manual and laborious are being automated in many areas of the company, including network operations.

“AI is creating a new interface of agents that allow users to interact with large ecosystems of applications, and there’s much easier and more scalable integration,” says Nasir. “There are less and less manual actions required, and we’re automating a lot of business processes. Microsoft 365 Copilot, Security Copilot, and other AI tools are giving us shared compute and extensibility to produce different agents. It basically gives us a huge scale of automation on top of the cloud.”

Apple notes that powerful AI tools can be combined with the incredible amount of data that the Microsoft IT infrastructure generates to gain insights that simply weren’t possible before.

“We can integrate AI with our infrastructure data lakes and use tools like Network Copilot to query the data using natural language,” Apple says. “I can ask questions like, ‘How many of our virtual machines need to be patched?’ and get an answer. It’s early, and we’re still experimenting, but the potential to interact with this data in a more automated fashion is exciting.”
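The kind of question Apple mentions can be sketched as a mapping from natural language to a query over an inventory. The keyword routing below is a toy stand-in for what an LLM-backed tool like Network Copilot would do; the inventory data is hypothetical.

```python
# Hypothetical VM inventory pulled from an infrastructure data lake.
INVENTORY = [
    {"name": "vm-01", "needs_patch": True},
    {"name": "vm-02", "needs_patch": False},
    {"name": "vm-03", "needs_patch": True},
]

def answer(question, inventory):
    """Toy natural-language router: a real system would translate the
    question into a query language via an LLM."""
    if "patch" in question.lower():
        count = sum(1 for vm in inventory if vm["needs_patch"])
        return f"{count} virtual machines need to be patched."
    return "Sorry, I can't answer that yet."
```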

Ultimately, Microsoft has become a cloud-first company, and that has allowed us to work toward an AI-first mentality in everything we do.

“Having a complete observability strategy across our infrastructure modernization helps us to make sure that whatever changes we’re making, we have a design-first approach and a cloud-first mindset,” Nasir says. “And now that focus is shifting towards an AI-first mindset as well.”

Key takeaways

Here are some of the benefits we’ve accrued by becoming a cloud-first IT organization at Microsoft:

  • Transformed operations: By moving from our legacy on-premises datacenters, through Azure’s infrastructure as a service (IaaS) offerings, and eventually to a platform as a service (PaaS) DevOps model, we’ve reaped great gains in reliability, efficiency, scalability, and cost savings.
  • A clear view: With 98% of our organization’s IT infrastructure running in the Azure cloud, we have a huge level of observability into our systems—complete oversight into network assessment, monitoring, compliance, patching/updating, and many other aspects of operations.
  • Empowered teams: Operating a cloud-first environment allows us to have a more decentralized approach to IT infrastructure. This means we can offer our business groups and service lines more self-service, cloud-native solutions for monitoring, management, patching, and backup while still maintaining control over and visibility into security and compliance for our entire organization.
  • Seamless updates: The shift to PaaS and serverless networking has enabled a more planned and automated approach to patching and updating our infrastructure, which produces greater efficiency, integration, and speed of deployment.
  • Dependable security: Our cloud environment has allowed us to implement security by design, including tighter control over code repositories and the use of standard security policies across the organization with Azure Policy.
  • Future-proof infrastructure: As we shift to an AI-first mindset across Microsoft, we’re using AI-driven tools to enhance and maintain our native cloud infrastructure and adopt new workflows that will continue to reap dividends for our employees and our organization.  

The post Modernizing IT infrastructure at Microsoft: A cloud-native journey with Azure appeared first on Inside Track Blog.

]]>
20125
Smarter labs, faster fixes: How we’re using AI to provision our virtual labs more effectively http://approjects.co.za/?big=insidetrack/blog/smarter-labs-faster-fixes-how-were-using-ai-to-provision-our-virtual-labs-more-effectively/ Thu, 24 Jul 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=19628 Microsoft Digital stories Providing technical support at an enterprise of our size here at Microsoft is a constant balancing act between speed, quality, and scalability. Systems grow more complex, user expectations continue to rise, and traditional support models often struggle to keep up. Beyond the usual apps and systems everyone uses, many of our employees […]

The post Smarter labs, faster fixes: How we’re using AI to provision our virtual labs more effectively appeared first on Inside Track Blog.

]]>

Microsoft Digital stories

Providing technical support at an enterprise of our size here at Microsoft is a constant balancing act between speed, quality, and scalability. Systems grow more complex, user expectations continue to rise, and traditional support models often struggle to keep up. Beyond the usual apps and systems everyone uses, many of our employees require virtual provisioning for diverse tasks in many of our businesses. Supporting these virtualized environments is a special challenge.

To meet the growing demand for virtual lab usage across the organization, we turned to AI, not just to automate support responses but to fundamentally rethink how user issues are identified and resolved. This vision came to life through the MyWorkspace platform, where we in Microsoft Digital, the company’s IT organization, introduced a domain-specific AI assistant to streamline how we empower our employees to deploy new virtual labs.

The results have been dramatic: what was once a slow, manual process is now fast, efficient, and frictionless.

But the benefits extend beyond faster resolution times. This transformation represents a new approach to enterprise support—one that uses AI not just as a tool for efficiency, but as a strategic enabler. By embedding intelligence into the support experience, we’re turning complexity into a competitive advantage.

Scaling support in a high-demand environment

MyWorkspace is our internal platform for provisioning virtual labs. Originally developed to support internal testing, diagnostics, and development environments, it has since grown into a critical resource used by thousands of engineers and support personnel across the company.

Scaling the platform infrastructure was straightforward—adding capacity for tens of thousands of virtual labs was a technical challenge we could solve with ease, thanks to our Microsoft Azure backbone. As usage grew, the real strain didn’t show up in CPU load or storage limits, but rather in the support queue. Every few months, a new wave of users was onboarded to MyWorkspace: partner teams, internal engineers, and external vendors. These new users, often with minimal experience of the platform, needed fast access and guidance from support.

The questions, though simple, piled up quickly.

Several Tier 1 support engineers repeatedly encountered the same questions from users, such as how to start a lab, what an error meant, and which lab to use for a particular test. These weren’t complex technical issues—they were basic, repetitive onboarding questions that represented a huge opportunity to introduce automation.

“We also found that there were a lot of users who found more niche issues, and those issues had been solved either by our community or by ourselves. In fact, we had a dedicated Teams channel specific to customer issues, and we found that there was a lot of repetition and that other customers were facing similar issues, and we did have a bit of a knowledge base in terms of how to solve these issues.”

A photo of Deans.
Joshua Deans, software engineer, Microsoft Digital

Unblocking a bottleneck with AI

Our support team set out to tackle a familiar but costly challenge: high volumes of low-complexity tickets that consumed valuable time without delivering meaningful impact. Instead of treating this as an unavoidable burden, we saw an opportunity to turn it into a self-scaling solution. If the same questions were being asked repeatedly—and the answers already existed in documentation, internal threads, or institutional knowledge—then an AI system should be able to surface those answers instantly, without human intervention.

“We also found that there were a lot of users who found more niche issues, and those issues had been solved either by our community or by ourselves,” says Joshua Deans, a software engineer within Microsoft Digital. “In fact, we had a dedicated Teams channel specific to customer issues, and we found that there was a lot of repetition and that other customers were facing similar issues, and we did have a bit of a knowledge base in terms of how to solve these issues.”

That insight led the MyWorkspace team to begin building what would become a transformative AI assistant: an automated support layer purpose-built for the MyWorkspace platform. Unlike traditional chatbots that rely on scripted responses or rigid decision trees, this assistant would leverage generative AI trained on a rich dataset of real-world support conversations, internal FAQs, and official documentation.

“So that’s where we found this opportunity to turn this scaling challenge into a scaling advantage, with help from AI. We took all those historical conversations of tier one staff helping new users—trained our AI to provide user education based on that training—and saved our Tier 1 staff from answering potential tickets.”

Vikram Dadwal, principal software engineering manager, Microsoft Digital

The result was a context-aware, responsive system capable of resolving common issues in seconds—not hours or days—dramatically easing the load on support teams while improving the user experience.

Built on Azure and Semantic Kernel

MyWorkspace’s core infrastructure is fully built on Azure services. At any given moment, it manages tens of thousands of virtual machines, scaling up and down with demand. That elasticity, combined with our internal developer tooling and AI orchestration capabilities, provided the perfect environment for an AI-powered support layer.

“So that’s where we found this opportunity to turn this scaling challenge into a scaling advantage, with help from AI,” says Vikram Dadwal, a principal software engineering manager within Microsoft Digital. “We took all those historical conversations of tier one staff helping new users—trained our AI to provide user education based on that training—and saved our Tier 1 staff from answering potential tickets.”

To build the assistant, the team used our Microsoft open-source framework, Semantic Kernel. Designed for generative AI integration, Semantic Kernel allows engineers to create prompt-driven, modular systems that can interact with large language models (LLMs) without vendor lock-in.

This approach gave the team several advantages:

  • Flexibility in choosing and switching between LLM providers.
  • Fine-grained control over how prompts were structured and updated.
  • Extensibility through plugins and actions that tie the assistant into the broader ecosystem.

Crucially, the assistant was designed to be part of the platform’s architecture, capable of operating at the same level of scale and responsiveness as the labs it supported. Also, the assistant was initialized with a well-scoped system prompt, limiting its responses strictly to the MyWorkspace domain.
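The domain scoping described above can be sketched as a system prompt that always leads the conversation, optionally paired with a cheap pre-filter. The prompt text, keywords, and message shape below are hypothetical; the real assistant is built with Semantic Kernel and an LLM.

```python
# Hypothetical scoped system prompt for the MyWorkspace assistant.
SYSTEM_PROMPT = (
    "You are the MyWorkspace support assistant. Answer only questions "
    "about the MyWorkspace virtual lab platform. If a question is out "
    "of scope, decline politely."
)

IN_SCOPE_KEYWORDS = ("lab", "workspace", "provision", "vm")

def build_messages(user_question):
    """Assemble the chat messages, always leading with the scoped prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

def is_in_scope(user_question):
    """Cheap pre-filter a real system might pair with the prompt."""
    return any(k in user_question.lower() for k in IN_SCOPE_KEYWORDS)
```

Scoping at the prompt level keeps the assistant from wandering outside its domain even when the underlying model could answer almost anything.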

“On average, we measured these interactions at around 20 minutes from ticket submission to problem resolution. Now compare that with a 30-second AI interaction for resolving the same class of issues—that’s a 98% reduction in resolution time, a number we’ve validated with our support team and continue to track.”

Nathan Prentice, senior product manager, Microsoft Digital

Shifting from tickets to conversations

Whether users had questions about lab types, needed clarification on configuration details, or sought guidance during onboarding, the AI provided accurate, interactive responses without requiring human escalation. The experience was both faster and significantly better. Support engineers saw a noticeable reduction in repeat tickets, as common issues were resolved on the spot. Onboarding friction decreased, and users were confident that they could get the answers they needed instantly—no ticket, no delay, no need to track a support contact.

“On average, we measured these interactions at around 20 minutes from ticket submission to problem resolution,” says Nathan Prentice, a senior product manager within Microsoft Digital. “Now compare that with a 30-second AI interaction for resolving the same class of issues—that’s a 98% reduction in resolution time, a number we’ve validated with our support team and continue to track.”
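The 98% figure follows directly from the numbers Prentice quotes: 20 minutes per ticket down to a 30-second AI interaction.

```python
# Worked check of the reduction quoted above.
before_s = 20 * 60   # 20 minutes per ticket, in seconds
after_s = 30         # seconds per AI interaction
reduction = (before_s - after_s) / before_s
# (1200 - 30) / 1200 = 0.975, i.e., roughly 98% when rounded.
```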

Smart, interactive, and intuitive

Our Microsoft Digital team has recently implemented a new version of the MyWorkspace AI assistant that includes several major enhancements. The assistant now features adaptive cards, polished layouts, and a Microsoft 365 Copilot-aligned user experience, making it feel familiar and trustworthy for internal teams. The assistant can now distinguish between a question and an action. If a user says, “Start a SharePoint lab,” it responds with an interactive card and begins provisioning, bridging the gap between passive support and active enablement.

“One of the primary bottlenecks we previously faced in creating an AI solution to address frequently asked user questions was the lack of technology capable of generating accurate answers for complex technical queries and understanding nuanced user input. With the availability of Azure OpenAI models, we were able to effectively overcome this challenge, enabling our AI solution to deliver precise and context-aware responses at scale.”

A photo of Nair.
Anjali Nair, senior software engineer, Microsoft Digital

To guide our employees and improve discoverability, the assistant offers recommended prompts—just like Copilot does—helping new users understand what they can ask and how to get started.

Users can now rate responses, giving a thumbs up or down. These signals are aggregated and reviewed by the engineering team, ensuring continuous improvement and fine tuning over time.

Intelligent provisioning with multi-agent orchestration 

At Microsoft Digital, we’re reimagining how labs are provisioned by integrating AI-driven intelligence into the process. Traditionally, users are expected to know exactly what kind of lab environment they need. But in complex virtualization and troubleshooting scenarios, these assumptions often fall short. Should a user troubleshooting hybrid issues with Microsoft Exchange spin up a basic Exchange lab, or one that includes Azure AD integration, conditional access policies, and hybrid connectors? To eliminate this guesswork, our team is building a multi-agent system powered by the Semantic Kernel SDK multi-agent framework. This system interprets the user’s support context—often expressed in natural language—and automatically provisions the most relevant lab environment.

For example, a user might say, “I’m seeing sync issues between SharePoint Online and on-prem,” and the assistant will orchestrate the creation of a tailored lab that replicates that exact scenario, enabling faster diagnosis and resolution. With agent orchestration, each agent in the system is specialized: one might handle context interpretation, another lab configuration, and another cost optimization. These agents collaborate to ensure that the lab not only meets technical requirements but is also cost-effective. By leveraging telemetry and historical usage data, the system can recommend leaner configurations—such as using ephemeral VMs, auto-pausing idle resources, or selecting lower-cost SKUs—without compromising diagnostic fidelity. This intelligent provisioning framework is designed to scale, adapt, and continuously learn from usage patterns.
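The flow described above can be sketched as three specialized agents chained together. Everything here is hypothetical, including the scenario names, lab templates, and SKUs; the production system orchestrates real agents through the Semantic Kernel multi-agent framework.

```python
def context_agent(request):
    """Interpret the user's support context from natural language."""
    text = request.lower()
    if "sharepoint" in text and ("sync" in text or "on-prem" in text):
        return "sharepoint-hybrid-sync"
    return "generic"

def config_agent(scenario):
    """Pick a lab template that matches the interpreted scenario."""
    templates = {
        "sharepoint-hybrid-sync": {"vms": 3, "sku": "D4s", "hybrid_connector": True},
        "generic": {"vms": 1, "sku": "D2s", "hybrid_connector": False},
    }
    return dict(templates[scenario])

def cost_agent(config):
    """Trim cost without losing diagnostic fidelity."""
    config.update({"ephemeral": True, "auto_pause_minutes": 30})
    return config

def provision(request):
    """Orchestrate the three agents in sequence."""
    return cost_agent(config_agent(context_agent(request)))
```

Keeping each agent narrow makes the pipeline easy to extend: a new concern, such as compliance checks, becomes another agent in the chain.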

“One of the primary bottlenecks we previously faced in creating an AI solution to address frequently asked user questions was the lack of technology capable of generating accurate answers for complex technical queries and understanding nuanced user input,” says Anjali Nair, a senior software engineer within Microsoft Digital. “With the availability of Azure OpenAI models, we were able to effectively overcome this challenge, enabling our AI solution to deliver precise and context-aware responses at scale.”

With multi-agent orchestration, we’re taking a step towards a future where lab environments are not just automated, but intelligently orchestrated, context-aware, and cost-optimized—empowering engineers to focus on solving problems, not setting up infrastructure.

Scaling support without scaling headcount

The MyWorkspace assistant is a powerful example of how enterprise support can evolve through intelligence. By embedding AI into the support experience, we’ve turned complexity into a competitive edge—reshaping knowledge work and operations through AI’s problem-solving capabilities. As Microsoft advances as a Frontier Firm, MyWorkspace shows how we can scale support on demand, with intelligence built in. Routine queries are offloaded to AI, freeing Tier 1 teams to focus on critical issues and giving Tier 2 engineers space to innovate. Most importantly, support now scales with user demand—not headcount.

But this system does more than just respond—it learns. Every interaction becomes a data point. Each resolved issue feeds back to the assistant, sharpening its accuracy and expanding its knowledge. What started as a reactive Q&A tool is now growing into a proactive orchestrator that surfaces insights and points users to solutions, resolving issues before they ever become tickets.

“We have a lot more telemetry now, so users can provide feedback to our responses—for example, thumbs up or thumbs down feedback,” Deans says. “And we can actually view where the model is giving incorrect or inappropriate information, and we can use that to make adjustments to the prompt provided to the model.”

In this model, support becomes a seamless extension of the user experience. With the right AI architecture in place, it transforms a cost center into a strategic asset. The MyWorkspace assistant fulfills its role as an embedded, intelligent teammate—delivering answers, driving actions, and continuously improving over time.

Ultimately, our journey with MyWorkspace shows that meaningful AI adoption doesn’t have to begin with sweeping transformation. Sometimes, it starts with a helpdesk queue, a recurring issue, and the choice to build something smarter—something that learns, adapts, and empowers at every step.

Key takeaways

Here are some of our top insights from boosting our internal deployment of MyWorkspace with AI and continuous improvement.

  • Start small and specific. Focus on a defined domain—like MyWorkspace—and use existing support logs to train your assistant.
  • Invest in AI infrastructure. Tools like Semantic Kernel provide flexibility, especially in enterprise settings where vendor neutrality and customization matter.
  • Design for trust. Align your assistant’s UI with well-known systems like Microsoft Copilot to build user confidence.
  • Don’t wait for perfection. Launch a V1, gather feedback, and make improvements. AI assistants get better over time if you let them learn.
  • Think outside the ticket queue. The future isn’t just faster support—it’s intelligent, anticipatory systems that eliminate friction before it begins.

The post Smarter labs, faster fixes: How we’re using AI to provision our virtual labs more effectively appeared first on Inside Track Blog.

]]>
19628
Securing the borderless enterprise: How we’re using AI to reinvent our network security http://approjects.co.za/?big=insidetrack/blog/securing-the-borderless-enterprise-how-were-using-ai-to-reinvent-our-network-security/ Thu, 10 Jul 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=19504 The modern enterprise network is complex, to say the least. Enterprises like ours are increasingly adopting hybrid infrastructures that span on-premises data centers, multiple cloud environments, and a diverse array of remote users. In this context, traditional security tools are still playing checkers while the malicious actors are playing chess. To make matters worse, attacks […]

The post Securing the borderless enterprise: How we’re using AI to reinvent our network security appeared first on Inside Track Blog.

]]>
The modern enterprise network is complex, to say the least.

Enterprises like ours are increasingly adopting hybrid infrastructures that span on-premises data centers, multiple cloud environments, and a diverse array of remote users. In this context, traditional security tools are still playing checkers while the malicious actors are playing chess. To make matters worse, attacks are increasingly enabled by AI tools.

That’s why here in Microsoft Digital, the company’s IT organization, we’re using a modern approach and toolset—including AI—to secure our network environment, turning complexity into clarity, one approach, tool, and insight at a time.

Leaving traditional network security behind

For years, traditional network security relied on a simple but increasingly outdated assumption: everything inside the corporate perimeter can be trusted. This model made sense when networks were static, users were on-premises, and applications lived in a centralized data center.

But that world is gone.

A photo of Venkatraman.

“Implicit trust must be replaced with explicit verification. That means rethinking how we monitor, how we respond, and how we design for resilience from the start.”

Raghavendran Venkatraman, principal cloud network engineering manager, Microsoft Digital

Today’s enterprise is dynamic, decentralized, and borderless. Hybrid work has become the norm. Cloud adoption is accelerating. Teams are globally distributed. Devices and data move constantly across environments. In this new reality, the network perimeter hasn’t just shifted—it has effectively vanished.

That’s where the cracks in legacy security models become impossible to ignore.

Visibility becomes fragmented. Security teams struggle to track what’s happening across a sprawling digital estate. Traditional monitoring tools focus on infrastructure uptime or device health—not on the actual experience of the people using the network. That disconnect creates blind spots, and blind spots create risk.

We know that this model no longer meets the needs of a modern, AI-powered enterprise. Every enterprise needs a new approach—one that assumes breach, enforces least-privilege access, and continuously verifies trust.

“Implicit trust must be replaced with explicit verification,” says Raghavendran Venkatraman, a principal cloud network engineering manager in Microsoft Digital. “That means rethinking how we monitor, how we respond, and how we design for resilience from the start.”

This shift is foundational to our security strategy. It’s not just about securing infrastructure—it’s about securing the experience. Because in a world where users, data, and threats are everywhere, trust has to be proved, not assumed.

Building a resilient and adaptive security strategy

To secure hybrid corporate networks effectively, organizations must go beyond traditional perimeter defenses. They need a comprehensive and adaptive security strategy—one that evolves with the threat landscape and aligns with the complexity of modern enterprise environments. The diversity of hybrid networks introduces new vulnerabilities and expands the attack surface. A static, one-size-fits-all approach simply doesn’t work anymore.

At Microsoft Digital, we’ve embraced a layered, cloud-first security model that integrates identity, access, encryption, and monitoring across every layer of the network. It’s embedded in everything we do. This model includes these key strategies, which we’ll expand upon in the following sections:

  • Adopting Zero Trust principles
  • Establishing identity as the new perimeter 
  • Integrating AI and machine learning
  • Enforcing network segmentation
  • Embracing continuous monitoring

Adopting Zero Trust principles

Zero Trust Architecture (ZTA) operates on a strict principle: “never trust, always verify.” That means no user, device, or application—whether inside or outside the corporate network—is inherently trusted, as it would be in the traditional network security model.

A photo of McCleery.

“Zero Trust isn’t a product—it’s a mindset. It’s about assuming breach and designing defenses that minimize impact and maximize resilience.”

Tom McCleery, principal group cloud network engineer, Microsoft Digital

Every access request is evaluated against dynamic policies. These policies consider several factors—like user identity, device health, location, and how sensitive the data being accessed is. For example, if an employee tries to access a financial report from a corporate laptop at the office, they might get in, no problem. But that same request from a personal device in another country could get blocked or trigger extra authentication steps.

At the heart of ZTA are policy enforcement points that authorize every data flow. These checkpoints only grant access when all conditions are met, and they log every interaction for auditing and threat detection. This kind of granular control reduces the attack surface and limits lateral movement if there is a breach.
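The dynamic policy evaluation in the financial-report example above can be sketched as a function over the request’s context. The factors and decisions below are simplified, hypothetical stand-ins for real conditional access policies.

```python
def evaluate_access(identity_verified, device_managed, location_trusted,
                    data_sensitivity):
    """Return 'allow', 'step-up' (extra authentication), or 'deny' for
    an access request, in the spirit of a Zero Trust policy check."""
    if not identity_verified:
        return "deny"
    if data_sensitivity == "high":
        if device_managed and location_trusted:
            return "allow"       # Corporate laptop at the office.
        if device_managed or location_trusted:
            return "step-up"     # One risk factor: require extra auth.
        return "deny"            # Personal device, untrusted location.
    return "allow" if device_managed else "step-up"
```

A real policy enforcement point would also log the decision for auditing, as the paragraph above notes.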

Adopting Zero Trust isn’t just a technical upgrade—it’s a strategic must. It boosts an organization’s ability to defend against modern threats like ransomware, insider attacks, and supply chain compromises.

“Zero Trust isn’t a product—it’s a mindset,” says Tom McCleery, a principal group cloud network engineer in Microsoft Digital. “It’s about assuming breach and designing defenses that minimize impact and maximize resilience.”

By embracing Zero Trust, we strengthen our security posture, lower the risk of data breaches, and respond more effectively to emerging threats.

Establishing identity as the new perimeter

Identity is no longer just a component of security—it has become the new perimeter. Traditional security models focused on defending the network edge, assuming that everything inside the perimeter could be trusted. But in today’s hybrid and cloud-first environments, the perimeter has dissolved, and that assumption is outdated and dangerous. Users, devices, and applications now operate across diverse locations and platforms, making perimeter-based defenses insufficient.

Identity-first security shifts the focus from securing the physical network to securing the identities—both human and machine—that interact with the network. This means every access request is treated as though it originates from an untrusted source, regardless of where it comes from. Whether it’s a remote employee logging in from a personal device or an automated workload accessing cloud resources, the system must verify who or what is making the request, assess the risk, and enforce least-privilege access across the user experience.

This approach enables organizations to implement more granular access controls. For example, a developer might be allowed to access a code repository but not production systems, and only during business hours from a managed device. Similarly, a service account used by a Continuous Integration and Continuous Deployment (CI/CD) pipeline might be restricted to specific APIs and monitored for anomalous behavior. A CI/CD pipeline is an automated workflow that takes code from development through testing and into production.
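The developer example above can be sketched as a default-deny grant table keyed by identity role and resource. The roles, hours, and resources are hypothetical illustrations, not a real access-control system.

```python
from datetime import time

# Hypothetical least-privilege grants: anything not listed is denied.
GRANTS = {
    ("developer", "code-repo"): {"hours": (time(8), time(18)), "managed_only": True},
}

def is_allowed(role, resource, now, device_managed):
    """Default-deny check: grant must exist, time must fall in the
    allowed window, and the device constraint must be satisfied."""
    grant = GRANTS.get((role, resource))
    if grant is None:
        return False  # No grant for production systems, so access is denied.
    start, end = grant["hours"]
    if not (start <= now <= end):
        return False
    return device_managed or not grant["managed_only"]
```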

By anchoring network security around verified identities, organizations reduce their attack surface and improve their ability to detect and respond to threats. This identity-centric model is not just a security enhancement—it’s a strategic shift that aligns with how modern enterprises operate.

Integrating AI and machine learning 

AI and machine learning (ML) are foundational pillars in our network security strategy. Intelligent automation and advanced analytics help us not only detect and respond to threats, but also continuously improve our security posture in an ever-changing landscape. Here’s how we’re using AI and ML in some critical aspects of our approach to modern network security:

  • Threat detection and intelligence. We deploy AI-powered monitoring tools that sift through billions of network signals and logs across our hybrid infrastructure. By applying sophisticated ML algorithms, we can identify abnormal behaviors such as unusual login attempts or unexpected data transfers that could indicate a potential breach. These insights allow our security teams to focus on the most critical alerts, reducing noise and accelerating incident investigation.
  • Automated response and containment. Through automation, our security systems can respond to threats in real time. For example, if our AI models detect suspicious activity on a device, automated workflows can immediately isolate the affected endpoint, block malicious traffic, or revoke access privileges, all without waiting for manual intervention. This rapid response capability is essential for minimizing the potential impact of attacks and protecting our critical assets.
  • Predictive analysis and proactive defense. We use predictive analytics to forecast emerging vulnerabilities before they can be exploited. By continuously training our models on the latest threat intelligence and attack patterns, we can anticipate risks and strengthen our defenses proactively—whether that means patching vulnerable systems, adjusting access controls, or updating our security policies.
  • User experience monitoring. We use AI to assess the real experience of our users, a critical measurement in a network environment where identity is the perimeter. By correlating performance metrics with security signals, we ensure that our security mechanisms don’t degrade productivity and that any anomalies impacting user experience are promptly addressed.
  • Continuous learning and improvement. Our AI and ML systems are designed to learn from every incident, adapt to new attack techniques, and evolve with the threat landscape. This continuous improvement loop enables our teams to stay ahead of sophisticated adversaries and maintain robust, resilient network security.
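The detection idea in the first bullet can be illustrated with a deliberately simple statistical detector. This z-score approach is a stand-in for the far richer ML models described above, not Microsoft's actual implementation; the signal (hourly failed-login counts) and threshold are assumptions for the example.

```python
import statistics

def anomalous_hours(counts, threshold=3.0):
    """Return indices of hours whose count deviates more than `threshold`
    standard deviations from the mean -- e.g. a burst of failed logins."""
    mean = statistics.fmean(counts)
    stdev = statistics.pstdev(counts)
    if stdev == 0:
        return []  # perfectly flat signal: nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mean) / stdev > threshold]
```

In production, a detector like this would be one feature among many; the value of the ML layer is correlating such spikes with device, identity, and data-transfer signals to cut false positives.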

Advanced threats require advanced responses. By integrating AI and ML into our network security strategies, we’re enhancing our ability to detect and respond to threats swiftly, minimize potential damage, and foster a secure environment for innovation and collaboration across our global hybrid infrastructure.

Isolating networks to minimize risk

In a hybrid infrastructure, isolating network segments is a foundational security principle. By segmenting networks, we limit the scope of potential breaches and reduce the risk of lateral movement by attackers. For example, separating employee productivity networks from customer-facing systems ensures that if a vulnerability is exploited in one area, it doesn’t cascade across the entire environment.

This is especially critical in environments where sensitive customer data and internal development systems coexist. Our testing and development environments must remain completely isolated—not only from customer-facing services but also from internal productivity tools like email, collaboration platforms, and identity systems. This prevents test code or experimental configurations from inadvertently exposing production systems to risk.

We also establish policy enforcement points (PEPs) within each network segment. These act as control gates, inspecting and filtering traffic between zones. By placing PEPs at strategic boundaries, we can tightly control what moves between segments and detect anomalies early. This architecture ensures that, if a breach occurs, the “blast radius”—the scope of impact—is minimal and contained.
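The PEP concept above can be sketched as a default-deny gate between segments. The segment names and allowed flows here are hypothetical, chosen to mirror the employee/customer/test separation described in this section.

```python
# Hypothetical allow-list of which traffic may cross segment boundaries.
ALLOWED_FLOWS = {
    ("corp", "prod"): {"https"},   # productivity -> production: HTTPS only
    ("prod", "corp"): set(),       # production never initiates traffic back
    ("test", "prod"): set(),       # test/dev stays fully isolated
}

def pep_permits(src_segment, dst_segment, protocol):
    """Default-deny gate: cross-segment traffic passes only via an explicit rule."""
    if src_segment == dst_segment:
        return True  # intra-segment traffic is governed by host policy, not the PEP
    return protocol in ALLOWED_FLOWS.get((src_segment, dst_segment), set())
```

Because undefined segment pairs fall through to an empty set, a newly added segment is isolated by default, which is exactly the "minimal blast radius" property the architecture aims for.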

This layered approach to segmentation and isolation is essential for maintaining the integrity of our production systems, minimizing risk, and ensuring that our hybrid infrastructure remains resilient in the face of evolving threats.

Embracing continuous monitoring 

We’ve stopped thinking of monitoring as a one-time check. Now, it’s a continuous conversation with our network.

A photo of Singh.

“Conventional network performance monitoring—monitoring the systems and infrastructure that support our network—can only tell part of the story. To truly understand and meet our requirements, we must monitor user experiences directly.”

Ragini Singh, partner group engineering manager in Microsoft Digital

Continuous monitoring is how we stay ahead of issues before they impact our people. It’s how we keep our hybrid infrastructure resilient, performant, and secure—every second of every day.

We’ve built a monitoring ecosystem that spans our entire global network from on-premises offices to cloud-based services in Azure and software-as-a-service (SaaS) platforms. With the mindset that identity is the new perimeter, we’re using signals from all aspects of our environment and focusing on the user experience.

“Conventional network performance monitoring—monitoring the systems and infrastructure that support our network—can only tell part of the story,” says Ragini Singh, a partner group engineering manager in Microsoft Digital. “To truly understand and meet our requirements, we must monitor user experiences directly.”

This isn’t just about tools and dashboards. It’s about insight. We’re using synthetic and native metrics to build a hop-by-hop view of the user experience. That lets us pinpoint where things go wrong—and fix them fast. We’re even layering in automation to enable self-healing responses when thresholds are breached.

Continuous monitoring is a strategic shift that helps us protect our people, power our services, and deliver the seamless experience our employees expect.

Looking to the future

As enterprises continue to navigate the complexities of hybrid infrastructures, securing enterprise networks requires an agile, multifaceted approach that integrates Zero Trust principles, identity-first security, and advanced technologies like AI and ML. By shifting the focus from traditional perimeter defenses to a more holistic and adaptive security model, organizations can better protect their assets, maintain operational continuity, and foster innovation in an increasingly interconnected world.

Implementing these strategies not only enhances security but also positions organizations to leverage the full potential of their hybrid infrastructures, driving growth and success in the digital age.

Key takeaways

Here are five key actions you can take to strengthen your organization's network security and embrace a modern approach:

  • Adopt an identity-first security model. Shift your focus from traditional perimeter-based defenses to verifying and securing every user and device identity—regardless of location or network.
  • Integrate AI and machine learning into your security strategy. Continuously improve your security posture by using intelligent automation and analytics to detect, respond to, and predict threats more effectively.
  • Isolate network segments to minimize risk. Separate critical business functions, customer-facing services, and development environments to contain threats and ensure that any potential breach remains limited in scope.
  • Implement continuous monitoring across your hybrid infrastructure. Move beyond periodic checks by establishing real-time, user-centric monitoring to maintain resilience, performance, and rapid incident response.
  • Embrace a proactive, adaptive mindset. Regularly update your security policies, train your teams, and stay agile to address emerging threats and support innovation as your organization evolves.

The post Securing the borderless enterprise: How we’re using AI to reinvent our network security appeared first on Inside Track Blog.

]]>
The $500-billion challenge: Inside the modernization of Microsoft Treasury’s backend infrastructure http://approjects.co.za/?big=insidetrack/blog/the-500-billion-challenge-inside-the-modernization-of-microsoft-treasurys-backend-infrastructure/ Thu, 19 Jun 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=19379 Editor’s note: This story was created with the help of artificial intelligence. To learn more about how Inside Track is using the power of generative AI to augment our human staff, see our story, Reimagining content creation with our Azure AI-powered Inside Track story bot. Engage with our experts! Customers or Microsoft account team representatives […]

The post The $500-billion challenge: Inside the modernization of Microsoft Treasury’s backend infrastructure appeared first on Inside Track Blog.

]]>
Editor’s note: This story was created with the help of artificial intelligence. To learn more about how Inside Track is using the power of generative AI to augment our human staff, see our story, Reimagining content creation with our Azure AI-powered Inside Track story bot.

When you’re responsible for processing over $500 billion in transactions every year, modernization becomes a mission-critical and highly delicate undertaking, where even the smallest misstep can carry serious financial consequences.

This was the challenge facing the Microsoft Treasury group, whose global operations relied on a patchwork of aging on-premises infrastructure, legacy systems, and leased lines.

Even the most mission-critical and entrenched systems must eventually evolve to meet the modern demands of speed, security, and scale.

A photo of Manikala.

“Modernizing the Treasury Service was not just about adopting new technology. It was about ensuring the uninterrupted operation of vital financial services while collaborating with various teams and meeting all security checks.”

Srinubabu Manikala, principal network engineering manager, Microsoft Digital

For Microsoft Treasury, that time had come—but instead of a straightforward infrastructure upgrade, the company seized the opportunity to go further.

What followed was a bold, strategic transformation that reinvented Microsoft Treasury’s core financial services, phasing out its aging infrastructure and migrating one of the world’s most complex treasury operations to the cloud.

“Modernizing the Treasury Service was not just about adopting new technology,” says Srinubabu Manikala, a principal network engineering manager in Microsoft Digital, the company’s IT organization. “It was about ensuring the uninterrupted operation of vital financial services while collaborating with various teams and meeting all security checks.”

A complex web of legacy, on-premises dependencies

Microsoft Treasury’s legacy infrastructure was initially built around a model where physical presence and dedicated hardware were the norm.

A photo of Shah.

“The legacy network architecture was heavily dependent on on-premises infrastructure and leased lines from third-party partners. This introduced constraints, making the environment complex and difficult to scale.”

Harsh Shah, senior service engineer, Microsoft Digital

It supported a vast network of over 80 banking partners across more than 110 countries, enabling essential financial functions like bank guarantees, supply chain financing, ledger updates, and global cash visibility.

This infrastructure complexity made modernization a highly challenging, costly, and risky endeavor—especially with an architecture that relied heavily on leased lines, aging hardware, and on-premises access methods.

“The legacy network architecture was heavily dependent on on-premises infrastructure and leased lines from third-party partners,” says Harsh Shah, a senior service engineer for Microsoft Digital. “This introduced constraints, making the environment complex and difficult to scale.”

For instance, the “Trading Room” required traders to be on-site to access treasury systems, a model that was quickly disrupted during the COVID-19 pandemic. The growing need for secure remote access only intensified these pressures, especially as seamless integration with cloud-first partners became critical, and downtime was a non-starter. Even brief outages risked financial penalties and could disrupt transactions worth billions.

Navigating the challenge

The Microsoft Digital team responsible for overseeing Microsoft Treasury’s network infrastructure proposed two potential architectural solutions that would meet their modernization requirements while enhancing network infrastructure.

A photo of Griffin.

“Our partners in Treasury ultimately chose the second option, transitioning to a hybrid network. They have a long-term goal of moving entirely to the cloud using Azure.”

Justin Griffin, principal group network engineering manager, Microsoft Digital

The first solution involved refreshing all on-premises infrastructure and implementing robust measures to ensure continuity of services during the transition—a costly but safe bet.

The second, more ambitious solution called for a phased transition to a hybrid network with a long-term goal: go fully cloud-native using Microsoft Azure.

“Our partners in Treasury ultimately chose the second option, transitioning to a hybrid network,” says Justin Griffin, a principal group network engineering manager in Microsoft Digital, who led the team responsible for getting the project off the ground. “They have a long-term goal of moving entirely to the cloud using Azure.”

The decision was influenced by several factors, including the need to eliminate costly hardware and the desire for streamlined network management processes, including the use of Azure for seamless integration with internal and external systems.

Implementing the solution

With the second option chosen, the implementation goals were to eliminate on-premises hardware, cut costs, simplify management, and empower team members and partners to access Microsoft Treasury’s network from anywhere, securely. To this end, Azure would become the new backbone for Microsoft Treasury’s infrastructure.

The modernization effort centered around two cornerstone projects—the SWIFT Alliance SaaS migration and the migration of BlackRock’s Aladdin platform into Azure. The projects would leverage services like Azure VPN for secure remote access, Azure Firewall for enhanced protection, and Azure Virtual WAN (vWAN) for seamless global connectivity.

Modernizing the SWIFT integration

Microsoft Treasury relies on SWIFT for secure international payments. Previously, access to SWIFT required the use of on-premises hardware security modules (HSMs) for attestation and encryption.

The modernization efforts followed a phased migration path:

  • Transitioning connectivity to Azure using vWAN and Site-to-Site VPNs
  • Maintaining security by peering cloud networks with on-prem HSMs
  • Eventually replacing on-premises HSMs with SWIFT’s SaaS-based attestation solution

The result was the retirement of leased lines and aging hardware, a reduced data center footprint, and cost savings of hundreds of thousands of dollars.

Aladdin secure remote access

To enable secure remote access to Aladdin—BlackRock’s investment management platform—Microsoft Digital collaborated with BlackRock and internal finance teams to implement a cloud-native Azure solution based on the following:

  • Azure vWAN hubs with Point-to-Site VPNs for private user access
  • Palo Alto Network virtual appliances for deep traffic monitoring
  • BGP peering over IPsec for encrypted data transfers
  • Geo-redundant routing for automatic failover in case of outages
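The geo-redundant routing in the last bullet boils down to always steering traffic to the highest-priority healthy endpoint and failing over automatically when health changes. The following sketch shows that selection logic only; the hub names and health data are hypothetical, and in the real deployment Azure vWAN performs this routing itself.

```python
# Hypothetical hub inventory with priority (lower number = preferred) and health.
HUBS = [
    {"name": "hub-west-europe", "priority": 1, "healthy": True},
    {"name": "hub-north-europe", "priority": 2, "healthy": True},
]

def select_hub(hubs):
    """Route to the most-preferred healthy hub; return None if all are down."""
    candidates = [h for h in hubs if h["healthy"]]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h["priority"])["name"]
```

When the preferred hub's health flips to False, the next probe of this logic picks the secondary hub, which is the behavior that eliminated the link-failure outages described below.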

Before the migration, outages caused by link failures, power surges, and WAN disruptions were not uncommon. But with the new infrastructure in place, Treasury Services users gained secure, uninterrupted access to Aladdin from anywhere. The move to the cloud, reinforced by availability zones and built-in high availability, effectively put an end to those disruptions.

A team effort: Reducing project risks and bolstering communications

During the transition, the Microsoft Digital team, the Treasury Services team, and their financial partners all played critical roles in executing a highly coordinated and technically demanding transformation.

To maintain continuity, the Treasury Services team temporarily increased its budget to support parallel operations across both the legacy on-premises environment and the new Azure-based infrastructure.

A photo of Ramirez.

“We needed to make sure the communications were clear and acknowledged by each responsible individual to make sure no errors were made that compromised the availability of the system.”

Lionel Ramirez, senior technical program manager, Microsoft Digital Services

They also deployed new VPN clients to enable secure remote access and eventually migrated their HSMs handling critical SWIFT services to the SWIFT-hosted SaaS platform.

For financial partners, the migration meant shifting from traditional on-premises circuits to modern, cloud-based integrations with Azure. This required close collaboration across multiple internal and external teams. To support this shift, Microsoft Digital built a new Azure network infrastructure that integrated with legacy systems while laying the foundation for the fully cloud-hosted Treasury Services infrastructure.

“We needed to make sure the communications were clear and acknowledged by each responsible individual to make sure no errors were made that compromised the availability of the system,” says Lionel Ramirez, a senior technical program manager for Microsoft Digital Services.

Throughout the migration, the Microsoft Digital team ensured clear, continuous communication and required explicit acknowledgements for every critical step to minimize the risk of error and maintain service availability. All changes were carefully timed to occur after market hours and before trading activity resumed, further reducing the risk of disruption or financial penalties. The project team also adhered to stringent security and compliance requirements at every phase of the transition.

The results: Transformations that drive efficiency, security, and savings

By modernizing Microsoft Treasury Services’ network infrastructure—through migrating Aladdin to Azure and transitioning to SWIFT Alliance’s SaaS platform—the teams’ collaborative efforts achieved clear, measurable success.

These initiatives boosted operational efficiency, strengthened security, and unlocked greater flexibility, all while bringing significantly reduced costs:

  • Substantial cost savings: Over $1 million saved by eliminating the need for new network hardware and licenses.
  • Enhanced operational continuity: Azure’s dynamic failover eliminated outages caused by power surges or link failures.
  • Remote accessibility: Employees no longer need to be physically present in the Trading Room, with secure VPN access enabling global remote work.
  • Greater scalability and agility: Treasury services can now scale in real time to meet evolving partner demands.
  • Lower partner costs: Key financial partners like BlackRock were able to terminate expensive contracts for on-premises circuits, realizing further savings.
  • Lower environmental footprint: A smaller data center footprint reduced energy consumption and maintenance overhead.

By using Azure’s powerful capabilities, Treasury Services is well-prepared to navigate the complexities of today’s financial landscape, ensuring resilience and agility in a rapidly evolving, dynamic environment.

Looking ahead

The modernization of Microsoft Treasury’s network infrastructure is a powerful example of what digital transformation can achieve. While the immediate gains—cost savings, improved reliability, and increased efficiency—were substantial, the true value lies in what this transformation made possible.

“The transition to a cloud-based network using Azure has empowered the Treasury team with the ability to scale efficiently in response to partner-related changes or enhancements, thanks to being fully hosted in the cloud.”

Justin Griffin, principal group network engineering manager, Microsoft Digital

By migrating to Azure and retiring legacy systems, the Treasury Services group, in partnership with the Microsoft Digital team, is now equipped to navigate the evolving financial landscape with greater agility, resilience, and confidence. The project not only addressed technical debt but also laid the groundwork for future innovation.

With a fully cloud-hosted treasury network, Treasury Services can more easily onboard new financial services and partners, scale operations on demand, and take full advantage of Azure’s built-in monitoring and security tools.

“The transition to a cloud-based network using Azure has empowered the Treasury team with the ability to scale efficiently in response to partner-related changes or enhancements, thanks to being fully hosted in the cloud,” Griffin says. “My team can now seamlessly adjust the Azure cloud network infrastructure to meet the Treasury team’s evolving demands and business needs.”

This success story also illustrates the impact of strategic collaboration, deliberate planning, and cutting-edge technology. It proves that even the most complex, deeply embedded financial systems—ones that move hundreds of billions of dollars—can be reinvented. What began as a high-stakes infrastructure challenge has become a model for future transformation.

Microsoft Treasury’s network infrastructure modernization isn’t just a technical achievement; it’s a blueprint for how organizations can evolve. The ultimate goal is a world where eliminating the legacy burden, embracing the cloud, and meeting high standards for speed, security, and scalability is the norm, not the exception.

Key takeaways

Here are some of our top insights from moving Microsoft Treasury Services’ network infrastructure to Azure:

  • Embrace cloud migration as an achievable goal: Microsoft Treasury Services, in partnership with the Microsoft Digital team, overcame a significant IT challenge by transitioning from an on-premises system to a cloud-based network using Azure.
  • Untangle complexity: By moving to Azure, Microsoft Treasury Services, in partnership with the Microsoft Digital team, eliminated the need for on-premises hardware, significantly reducing system complexity and network maintenance requirements.
  • Create an adaptable partner ecosystem: In an environment where partners and providers increasingly operate in the cloud, the transition bolstered service continuity for critical financial functions and enabled remote access to financial services.
  • Modernization saves time and money: The modernization resulted in substantial cost savings exceeding $1 million, and annual savings of approximately 200 hours in management time.
  • Embrace migration challenges as opportunities: Microsoft Treasury Services looks forward to using Azure’s robust infrastructure to boost agility, cut costs, and fuel future innovation. Each opportunity to upgrade is a chance to innovate.

The post The $500-billion challenge: Inside the modernization of Microsoft Treasury’s backend infrastructure appeared first on Inside Track Blog.

]]>
How Microsoft kept its underwater datacenter connected while retrieving it from the ocean http://approjects.co.za/?big=insidetrack/blog/how-microsoft-kept-its-underwater-datacenter-connected-while-retrieving-it-from-the-ocean/ Mon, 21 Apr 2025 14:05:03 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=5878 Editor’s note: This story was first published in 2020. We periodically update our stories, but we can’t verify that they represent the full picture of our current situation at Microsoft. We leave them on the site so you can see what our thinking and experience was at the time. When Microsoft announced its plan to […]

The post How Microsoft kept its underwater datacenter connected while retrieving it from the ocean appeared first on Inside Track Blog.

]]>
Editor’s note: This story was first published in 2020. We periodically update our stories, but we can’t verify that they represent the full picture of our current situation at Microsoft. We leave them on the site so you can see what our thinking and experience was at the time.

When Microsoft announced its plan to build an underwater datacenter, Lathish Kumar Chaparala was excited.

“During the initial rollout of Project Natick, I used to log on to their website and watch the live feed of the underwater camera that was mounted on the datacenter,” says Chaparala, a senior program manager on the networking team in Microsoft Digital, the engineering organization at Microsoft that builds and manages the products, processes, and services that Microsoft runs on.

Little did he know that he and his team would later be brought in to extend the network connectivity of this underwater datacenter so it could be safely fished out of the sea.

But the story begins much earlier than that.

We saw the potential benefit [of developing an underwater datacenter] to the industry and Microsoft. People responded to our work as if we were going to the moon. In our eyes, we were just fulfilling our charter—taking on challenging problems and coming up with solutions.

– Mike Shepperd, senior research and development engineer on the Microsoft Research team

The idea of an underwater datacenter came out of ThinkWeek, a Microsoft event where employees shared out-of-the-box ideas that they thought the company should pursue. One creative idea was put forth by employees Sean James and Todd Rawlings, who proposed building an underwater datacenter powered by renewable ocean energy that would provide super-fast cloud services to crowded coastal populations.

Their idea appealed to Norm Whitaker, who led special projects for Microsoft Research at the time.

Out of this, Project Natick was born.

Mike Shepperd and Samuel Ogden stand in the power substation.
Shepperd (right) and Samuel Ogden test the underwater datacenter from the power substation where the datacenter connects to land, just off the coast of the Orkney Islands. (Photo by Scott Eklund | Red Box Pictures)

“Norm’s team was responsible for making the impossible possible, so he started exploring the viability of an underwater datacenter that could be powered by renewable energy,” says Mike Shepperd, a senior research and development engineer on the Microsoft Research team who was brought on to support research on the feasibility of underwater datacenters.

It quickly became a Microsoft-wide effort that spanned engineering, research, and IT.

“We saw the potential benefit to the industry and Microsoft,” Shepperd says. “People responded to our work as if we were going to the moon. In our eyes, we were just fulfilling our charter—taking on challenging problems and coming up with solutions.”

Researchers on the project hypothesized that having a sealed container on the ocean floor with a low-humidity nitrogen environment and cold, stable temperatures would better protect the servers and increase reliability.

“Once you’re down 20 to 30 meters into the water, you’re out of the weather,” Shepperd says. “You could have a hurricane raging above you, and an underwater datacenter will be none the wiser.”

Internal engineering team steps up

The Project Natick team partnered with networking and security teams in Microsoft Digital and Arista to create a secure wide-area network (WAN) connection from the underwater datacenter to the corporate network.

“We needed the connectivity that they provided to finish off our project in the right way,” Shepperd says. “We also needed that connectivity to support the actual decommissioning process, which was very challenging because we had deployed the datacenter in such a remote location.”

In the spring of 2018, they deployed a fully connected and secure datacenter 117 feet below sea level in the Orkney Islands, just off the coast of Scotland. After it was designed, set up, and gently lowered onto the seabed, the goal was to leave it untouched for two years. Chakri Thammineni, a network engineer in Microsoft Digital, supported these efforts.

Chakri Thammineni sits next to his desk and smiles at the camera. His monitor reads “Project Natick– Network Solution.”
Chakri Thammineni, a network engineer at Microsoft Digital, and his team came up with a network redesign to extend the network connectivity of the underwater datacenter. (Photo submitted by Chakri Thammineni | Inside Track)

“Project Natick was my first engagement after I joined Microsoft, and it was a great opportunity to collaborate with many folks to come up with a network solution,” Thammineni says.

Earlier this year, the experiment concluded without interruption. And yes, the team learned that placing a datacenter underwater is indeed a more sustainable and efficient way to bring the cloud to coastal areas, providing better datacenter responsiveness.

With the experiment ending, the team needed to recover the datacenter so it could analyze all the data collected during its time underwater.

That’s where Microsoft’s internal engineering teams came in.

“To make sure we didn’t lose any data, we needed to keep the datacenter connected to Microsoft’s corporate network during our extraction,” Shepperd says. “We accomplished this with a leased line dedicated to our use, one that we used to connect the datacenter with our Microsoft facility in London.”

The extraction also had to be timed just right for the same reasons.

“The seas in Orkney throw up waves that can be as much as 9 to 10 meters high for most of the year,” he says. “The team chose this location because of the extreme conditions, reasoning it was a good place to demonstrate the ability to deploy Natick datacenters just about anywhere.”

And then, like it has for so many other projects, COVID-19 forced the team to change its plans. In the process of coming up with a new datacenter recovery plan, the team realized that the corporate connectivity was being shut down at the end of May 2020 and couldn’t be extended.

“Ordering the gear would’ve taken two to three months, and we were on a much shorter timeline,” Chaparala says.

Shepperd called on the team in Platform Engineering, a division of Microsoft Digital, to quickly remodel the corporate connectivity from the Microsoft London facility to the Natick shore area, all while ensuring that the connection was secured.

The mission?

Ensure that servers were online until the datacenter could be retrieved from the water, all without additional hardware.

Lathish Chaparala sits with his laptop in front of him and looks at the camera.
Lathish Kumar Chaparala, a senior program manager on the networking team in Microsoft Digital, helped extend network connectivity of Microsoft’s underwater datacenter so it could be safely retrieved from the sea. (Photo submitted by Lathish Kumar Chaparala | Inside Track)

“My role was to make sure I understood the criticality of the request in terms of timeline, and to pull in the teams and expertise needed to keep the datacenter online until it was safely pulled out of the water,” Chaparala says.

The stakes were high, especially with the research that was on the line.

“If we lost connectivity and shut down the datacenter, it could have compromised the viability of the research we had done up until that point,” Shepperd says.

A seamless collaboration across Microsoft Research and IT

To solve this problem, the teams in Core Platform Engineering and Microsoft Research had to align their vision and workflows.

“Teams in IT might plan their work out for months or years in advance,” Shepperd says. “Our research is on a different timeline because we don’t know where technology will take us, so we needed to work together, and fast.”

Because they couldn’t bring any hardware to the datacenter site, Chaparala, Thammineni, and the Microsoft Research team needed to come up with a network redesign. This led to the implementation of software-based encryption using a virtual network operating system on Windows virtual machines.

It’s exciting to play a role in bringing the right engineers and program managers together for a common goal, especially so quickly. Once we had the right team, we knew there was nothing we couldn’t handle.

– Chakri Thammineni, a network engineer in Microsoft Digital

With this solution in tow, the team could extend the network connectivity from the Microsoft Docklands facility in London to the Natick datacenter off the coast of Scotland.

“Chakri and Lathish have consistently engaged with us to fill the gaps between what our research team knew and what these networking experts at Microsoft needed in order to take action on the needs of this project,” Shepperd says. “Without help from their teams, we would not have been able to deliver on our research goals as quickly and efficiently as we did.”

Lessons learned from the world’s second underwater datacenter

The research on Project Natick pays dividends in Microsoft’s future work, particularly around running more sustainable datacenters that could power Microsoft Azure cloud services.

“Whether a datacenter is on land or in water, the size and scale of Project Natick is a viable blueprint for datacenters of the future,” Shepperd says. “Instead of putting down acres of land for datacenters, our customers and competitors are all looking for ways to power their compute and to house storage in a more sustainable way.”

This experience taught Chaparala to assess the needs of his partner teams.

“We work with customers to understand their requirements and come up with objectives and key results that align,” Chaparala says.

Ultimately, Project Natick’s story is one of cross-disciplinary collaboration – and just in the nick of time.

“It’s exciting to play a role in bringing the right engineers and program managers together for a common goal, especially so quickly,” Chaparala says. “Once we had the right team, we knew there was nothing we couldn’t handle.”

Related links

The post How Microsoft kept its underwater datacenter connected while retrieving it from the ocean appeared first on Inside Track Blog.

Enhancing VPN performance at Microsoft http://approjects.co.za/?big=insidetrack/blog/enhancing-vpn-performance-at-microsoft/ Sun, 26 Jan 2025 17:00:13 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=8569 [Editor’s note: This content was written to highlight a particular event or moment in time. Although that moment has passed, we’re republishing it here so you can see what our thinking and experience was like at the time.] Modern workers are increasingly mobile and require the flexibility to get work done outside of the office. […]

The post Enhancing VPN performance at Microsoft appeared first on Inside Track Blog.

[Editor’s note: This content was written to highlight a particular event or moment in time. Although that moment has passed, we’re republishing it here so you can see what our thinking and experience was like at the time.]

Modern workers are increasingly mobile and require the flexibility to get work done outside of the office. Here at Microsoft headquarters in the Puget Sound area of Washington State, every weekday an average of 45,000 to 55,000 Microsoft employees use a virtual private network (VPN) connection to remotely connect to the corporate network. As part of our overall Zero Trust Strategy, we have redesigned our VPN infrastructure, something that has simplified our design and let us consolidate our access points. This has enabled us to increase capacity and reliability, while also reducing reliance on VPN by moving services and applications to the cloud.

Providing a seamless remote access experience

Remote access at Microsoft is reliant on the VPN client, our VPN infrastructure, and public cloud services. We have had several iterative designs of the VPN service inside Microsoft. Regional weather events in the past required large increases in employees working from home, heavily taxing the VPN infrastructure and requiring a completely new design. Three years ago, we built an entirely new VPN infrastructure, a hybrid design, using Microsoft Azure Active Directory (Azure AD) load balancing and identity services with gateway appliances across our global sites.

Key to our success in the remote access experience was our decision to deploy a split-tunneled configuration for the majority of employees. We have migrated nearly 100% of previously on-premises resources into Microsoft Azure and Microsoft Office 365. Our continued efforts in application modernization are reducing the traffic on our private corporate networks as cloud-native architectures allow direct internet connections. The shift to internet-accessible applications and a split-tunneled VPN design has dramatically reduced the load on VPN servers in most areas of the world.

Using VPN profiles to improve the user experience

We use Microsoft Endpoint Manager to manage our domain-joined and Microsoft Azure AD–joined computers and mobile devices that have enrolled in the service. In our configuration, VPN profiles are replicated through Microsoft Intune and applied to enrolled devices; these include certificates that we issue through Configuration Manager for Windows 10 devices. We support Mac and Linux device VPN connectivity with a third-party client using SAML-based authentication.

We use certificate-based authentication (public key infrastructure, or PKI) and multi‑factor authentication solutions. When employees first use the Auto-On VPN connection profile, they are prompted to authenticate strongly. Our VPN infrastructure supports Windows Hello for Business and Multi-Factor Authentication. It stores a cryptographically protected certificate upon successful authentication that allows for either persistent or automatic connection.

For more information about how we use Microsoft Intune and Endpoint Manager as part of our device management strategy, see Managing Windows 10 devices with Microsoft Intune.

Configuring and installing VPN connection profiles

We created VPN profiles that contain all the information a device requires to connect to the corporate network, including the supported authentication methods and the VPN gateways that the device should connect to. We created the connection profiles for domain-joined and Microsoft Intune–managed devices using Microsoft Endpoint Manager.

For more information about creating VPN profiles, see VPN profiles in Configuration Manager and How to Create VPN Profiles in Configuration Manager.

The Microsoft Intune custom profile for Intune-managed devices uses Open Mobile Alliance Uniform Resource Identifier (OMA-URI) settings with XML data type, as illustrated below.

Creating a Profile XML and editing the OMA-URI settings to create a connection profile in System Center Configuration Manager.
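As a rough illustration, a custom configuration like this can also be created programmatically. The sketch below builds an Intune custom-profile payload of the kind accepted by the Microsoft Graph deviceConfigurations endpoint; the profile name, server, and XML body are placeholders, not our production values.

```python
import json

# Hypothetical VPN ProfileXML body; a real profile carries servers,
# authentication methods, and routing policy (values here are placeholders).
PROFILE_XML = """<VPNProfile>
  <NativeProfile>
    <Servers>vpn.example.com</Servers>
    <NativeProtocolType>Automatic</NativeProtocolType>
  </NativeProfile>
</VPNProfile>"""

def build_custom_profile(profile_name: str, profile_xml: str) -> dict:
    """Build an Intune custom-configuration payload that delivers a
    VPNv2 ProfileXML through an OMA-URI setting (illustrative sketch)."""
    return {
        "@odata.type": "#microsoft.graph.windows10CustomConfiguration",
        "displayName": f"{profile_name} VPN profile",
        "omaSettings": [
            {
                "@odata.type": "#microsoft.graph.omaSettingString",
                "displayName": "VPN ProfileXML",
                # Node path follows the Windows VPNv2 CSP layout.
                "omaUri": f"./Device/Vendor/MSFT/VPNv2/{profile_name}/ProfileXML",
                "value": profile_xml,
            }
        ],
    }

payload = build_custom_profile("CorpVPN", PROFILE_XML)
print(json.dumps(payload, indent=2))
```

In practice this payload would be POSTed to the Graph deviceConfigurations endpoint and assigned to a device group; the sketch only shows the shape of the setting.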

Installing the VPN connection profile

The VPN connection profile is installed using a script on domain-joined computers running Windows 10, through a policy in Endpoint Manager.

For more information about how we use Microsoft Intune as part of our mobile device management strategy, see Mobile device management at Microsoft.

Conditional Access

We use an optional feature that checks device health and corporate policy compliance before allowing a device to connect. Conditional Access is supported with connection profiles, and we’ve started using this feature in our environment.

Rather than just relying on the managed device certificate for a “pass” or “fail” for VPN connection, Conditional Access places machines in a quarantined state while checking for the latest required security updates and antivirus definitions to help ensure that the system isn’t introducing risk. On every connection attempt, the system health check looks for a certificate confirming that the device is still compliant with corporate policy.

Certificate and device enrollment

We use an Azure AD certificate for single sign-on to the VPN connection profile. And we currently use Simple Certificate Enrollment Protocol (SCEP) and Network Device Enrollment Service (NDES) to deploy certificates to our mobile devices via Microsoft Endpoint Manager. The SCEP certificate we use covers both wireless and VPN. NDES allows software on routers and other network devices running without domain credentials to obtain certificates based on SCEP.

NDES performs the following functions:

  1. It generates and provides one-time enrollment passwords to administrators.
  2. It submits enrollment requests to the certificate authority (CA).
  3. It retrieves enrolled certificates from the CA and forwards them to the network device.

For more information about deploying NDES, including best practices, see Securing and Hardening Network Device Enrollment Service for Microsoft Intune and System Center Configuration Manager.
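The three NDES responsibilities listed above amount to a small enrollment hand-off, which the sketch below models in plain Python. This is an illustrative stand-in, not a real NDES or certificate authority interface; all class and method names are ours.

```python
import secrets
from dataclasses import dataclass, field

@dataclass
class NdesSketch:
    """Toy model of the NDES enrollment hand-off (illustrative only)."""
    issued: dict = field(default_factory=dict)
    passwords: set = field(default_factory=set)

    def generate_enrollment_password(self) -> str:
        # 1. Generate a one-time enrollment password for an administrator.
        otp = secrets.token_hex(8)
        self.passwords.add(otp)
        return otp

    def submit_enrollment_request(self, otp: str, csr: str) -> str:
        # 2. A valid OTP lets the request through to the CA; each is single-use.
        if otp not in self.passwords:
            raise PermissionError("unknown or reused enrollment password")
        self.passwords.remove(otp)
        self.issued[csr] = f"CERT({csr})"  # stand-in for the CA-signed cert
        return csr

    def retrieve_certificate(self, csr: str) -> str:
        # 3. Retrieve the enrolled certificate and forward it to the device.
        return self.issued[csr]
```

The single-use password check is the key property: a captured or replayed enrollment password cannot be used to obtain a second certificate.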

VPN client connection flow

The diagram below illustrates the VPN client-side connection flow.

A graphic representation of the client connection workflow. Sections shown are client components, Azure components, and site components.
The client-side VPN connection flow.

When a device-compliance–enabled VPN connection profile is triggered (either manually or automatically):

  1. The VPN client calls into the Windows 10 Azure AD Token Broker on the local device and identifies itself as a VPN client.
  2. The Azure AD Token Broker authenticates to Azure AD and provides it with information about the device trying to connect. A device check is performed by Azure AD to determine whether the device complies with our VPN policies.
  3. If the device is compliant, Azure AD requests a short-lived certificate. If the device isn’t compliant, we perform remediation steps.
  4. Azure AD pushes down a short-lived certificate to the Certificate Store via the Token Broker. The Token Broker then returns control back over to the VPN client for further connection processing.
  5. The VPN client uses the Azure AD–issued certificate to authenticate with the VPN gateway.

Remote access infrastructure

At Microsoft, we have designed and deployed a hybrid infrastructure to provide remote access for all the supported operating systems—using Azure for load balancing and identity services and specialized VPN appliances. We had several considerations when designing the platform:

  • Redundancy. The service needed to be highly resilient so that it could continue to operate if a single appliance, site, or even large region failed.
  • Capacity. As a worldwide service meant to be used by the entire company and to handle the expected growth of VPN, the solution had to be sized with enough capacity to handle 200,000 concurrent VPN sessions.
  • Homogenized site configuration. A standard hardware and configuration stamp was a necessity both for initial deployment and operational simplicity.
  • Central management and monitoring. We ensured end-to-end visibility through centralized data stores and reporting.
  • Azure AD–based authentication. We moved away from on-premises Active Directory and used Azure AD to authenticate and authorize users.
  • Multi-device support. We had to build a service that could be used by as much of the ecosystem as possible, including Windows, OSX, Linux, and appliances.
  • Automation. Being able to programmatically administer the service was critical. It needed to work with existing automation and monitoring tools.

When we were designing the VPN topology, we considered the location of the resources that employees were accessing when they were connected to the corporate network. If most of the connections from employees at a remote site were to resources located in central datacenters, more consideration was given to bandwidth availability and connection health between that remote site and the destination. In some cases, additional network bandwidth infrastructure has been deployed as needed. The illustration below provides an overview of our remote access infrastructure.

VPN infrastructure. Diagram shows the connection from the internet to Azure traffic manager profiles, then to the VPN site.
Microsoft remote access infrastructure.

VPN tunnel types

Our VPN solution provides network transport over Secure Sockets Layer (SSL). The VPN appliances force Transport Layer Security (TLS) 1.2 for SSL session initiation, and the strongest possible cipher suite negotiated is used for the VPN tunnel encryption. We use several tunnel configurations depending on the locations of users and level of security needed.
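Enforcing a TLS 1.2 floor is typically a one-line setting in a TLS stack. The snippet below shows the equivalent setting in Python’s standard ssl module; it illustrates the policy, not the appliances’ actual configuration.

```python
import ssl

# Build a client context that refuses anything older than TLS 1.2,
# mirroring the floor the VPN appliances enforce for session initiation.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# The strongest mutually supported cipher suite is then negotiated at
# handshake time from this context's cipher list.
print(ctx.minimum_version)
```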

Split tunneling

Split tunneling allows only the traffic destined for the Microsoft corporate network to be routed through the VPN tunnel, and all internet traffic goes directly through the internet without traversing the VPN tunnel or infrastructure. Our migration to Office 365 and Azure has dramatically reduced the need for connections to the corporate network. We rely on the security controls of applications hosted in Azure and services of Office 365 to help secure this traffic. For endpoint protection, we use Microsoft Defender Advanced Threat Protection on all clients. In our VPN connection profile, split tunneling is enabled by default and used by the majority of Microsoft employees. Learn more about Office 365 split tunnel configuration.
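The routing decision at the heart of split tunneling can be sketched simply: destinations inside corporate prefixes ride the tunnel, everything else goes direct. The prefixes below are illustrative placeholders, not our actual route list.

```python
from ipaddress import ip_address, ip_network

# Hypothetical corporate prefixes; a real split-tunnel profile lists the
# routes that should traverse the tunnel, with everything else going direct.
CORP_ROUTES = [ip_network("10.0.0.0/8"), ip_network("192.168.0.0/16")]

def route_for(destination: str) -> str:
    """Return 'tunnel' for corporate destinations, 'direct' otherwise."""
    addr = ip_address(destination)
    if any(addr in net for net in CORP_ROUTES):
        return "tunnel"
    return "direct"
```

Because cloud-hosted services resolve to public addresses outside the corporate prefixes, their traffic never touches the VPN infrastructure — which is exactly why the cloud migration reduced VPN load so dramatically.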

Full tunneling

Full tunneling routes and encrypts all traffic through the VPN. There are some countries and business requirements that make full tunneling necessary. This is accomplished by running a distinct VPN configuration on the same infrastructure as the rest of the VPN service. A separate VPN profile is pushed to the clients who require it, and this profile points to the full-tunnel gateways.

Full tunnel with high security

Our IT employees and some developers access company infrastructure or extremely sensitive data. These users are given Privileged Access Workstations, which are secured, limited, and connect to a separate highly controlled infrastructure.

Applying and enforcing policies

In Microsoft Digital, the Conditional Access administrator is responsible for defining the VPN Compliance Policy for domain-joined Windows 10 desktops, including enterprise laptops and tablets, within the Microsoft Azure Portal administrative experience. This policy is then published so that the enforcement of the applied policy can be managed through Microsoft Endpoint Manager. Microsoft Endpoint Manager provides policy enforcement, as well as certificate enrollment and deployment, on behalf of the client device.

For more information about policies, see VPN and Conditional Access.

Early adopters help validate new policies

With every new Windows 10 update, we rolled out a pre-release version to a group of about 15,000 early adopters a few months before its release. Early adopters validated the new credential functionality and used remote access connection scenarios to provide valuable feedback that we could take back to the product development team. Using early adopters helped validate and improve features and functionality, influenced how we prepared for the broader deployment across Microsoft, and helped us prepare support channels for the types of issues that employees might experience.

Measuring service health

We measure many aspects of the VPN service and report on the number of unique users that connect every month, the number of daily users, and the duration of connections. We have invested heavily in telemetry and automation throughout the Microsoft network environment. Telemetry allows for data-driven decisions in making infrastructure investments and identifying potential bandwidth issues ahead of saturation.

Using Power BI to customize operational insight dashboards

Our service health reporting is centralized using Power BI dashboards to display consolidated data views of VPN performance. Data is aggregated into an Azure SQL data warehouse from VPN appliance logging, network device telemetry, and anonymized device performance data. These dashboards, shown in the next two graphics below, are tailored for the teams using them.

A map is shown with icons depicting the status of each VPN site globally. All are in a good state.
Global VPN status dashboard.

Six graphs are shown to share VPN performance reporting dashboards. They include peak internet usage, peak VPN bandwidth, and peak VPN concurrent sessions.
Microsoft Power BI reporting dashboards.

Key Takeaways

With our optimizations in VPN connection profiles and improvements in the infrastructure, we have seen significant benefits:

  • Reduced VPN requirements. By moving to cloud-based services and applications and implementing split tunneling configurations, we have dramatically reduced our reliance on VPN connections for many users at Microsoft.
  • Auto-connection for improved user experience. VPN connection profiles, automatically configured for connection and authentication types, have improved mobile productivity. They also improve the user experience by giving employees the option to stay connected to VPN—without additional interaction after signing in.
  • Increased capacity and reliability. Reducing the quantity of VPN sites and investing in dedicated VPN hardware has increased our capacity and reliability, now supporting over 500,000 simultaneous connections.
  • Service health visibility. By aggregating data sources and building a single pane of glass in Microsoft Power BI, we have visibility into every aspect of the VPN experience.

Related links

The post Enhancing VPN performance at Microsoft appeared first on Inside Track Blog.

Finding and fixing network outages in minutes—not hours—with real-time telemetry at Microsoft http://approjects.co.za/?big=insidetrack/blog/finding-and-fixing-network-outages-in-minutes-not-hours-with-real-time-telemetry-at-microsoft/ Thu, 29 Aug 2024 15:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=16333 With more than 600 physical worksites around the world, Microsoft has one of the largest network infrastructure footprints on the planet. Managing the thousands of devices that keep those locations connected demands constant attention from a global team of network engineers. It’s their job to monitor and maintain those devices. And when outages occur, they […]

The post Finding and fixing network outages in minutes—not hours—with real-time telemetry at Microsoft appeared first on Inside Track Blog.


With more than 600 physical worksites around the world, Microsoft has one of the largest network infrastructure footprints on the planet.

Managing the thousands of devices that keep those locations connected demands constant attention from a global team of network engineers. It’s their job to monitor and maintain those devices. And when outages occur, they lead the charge to repair and remediate the situation.

To support their work, our Real Time Telemetry team at Microsoft Digital, the company’s IT organization, has introduced new capabilities that help engineers identify network device outages and capture data faster and more extensively than ever before. Through real-time telemetry, network engineers can isolate and remediate issues in minutes—not hours—to keep their colleagues productive and our technology running smoothly.

Immediacy is everything

Dave, Sinha, Vijay, and Menten pose for pictures that have been assembled into a collage.
Aayush Dave, Astha Sinha, Abhijit Vijay, Daniel Menten, and Martin O’Flaherty (not pictured) are part of the Microsoft Digital Real Time Telemetry team enabling more up-to-date and extensive network device data.

Conventional network monitoring uses the Simple Network Management Protocol (SNMP) architecture, which retrieves network telemetry through periodic, pull-based polls and other legacy technologies. At Microsoft, that polling interval typically ranges between five minutes and six hours.

SNMP is a foundational telemetry architecture with decades of legacy. It’s ubiquitous, but it doesn’t allow for the most up-to-date data possible.

“The biggest pain point we’ve always heard from network engineers is latency in the data,” says Astha Sinha, senior product manager for the Infrastructure and Engineering Services team in Microsoft Digital. “When data is stale, engineers can’t react quickly to outages, and that has implications for security and productivity.”

Serious vulnerabilities and liabilities arise when a network device outage occurs. But because of lags between polling intervals, a network engineer might not receive information or alerts about the situation until long after it happens.

We assembled the Real Time Telemetry team as part of our Infrastructure and Engineering Services to close that gap.

“We build the tools and automations that network engineers use to better manage their networks,” says Martin O’Flaherty, principal product manager for the Infrastructure and Engineering Services team in Microsoft Digital. “To do that, we need to make sure they have the right signals as early and as consistently as possible.”

The technology that powers these possibilities is known as streaming telemetry. It relies on network devices compatible with the more modern gRPC Network Management Interface (gNMI) telemetry protocol and other technologies to support a push-based approach to network monitoring where network devices stream data constantly.

This architecture isn’t new, but our team is scaling and programmatizing how that data becomes available by creating a real-time telemetry apparatus that collects, stores, and delivers network information to service engineers. These capabilities offer several benefits.

The advantages of real-time network device telemetry

  • Superior anomaly detection, reduced intent and configuration drift, a foundation for large-scale automation, and less network downtime.

  • Better detection of breaches, vulnerabilities, and bugs through automated scans for OS stalls, lateral device hijacking, malware, and other common vulnerabilities.

  • Visibility into real-time utilization data on network device stats, as well as steady replacement of current data collection technology and more scalable network growth and evolution.

  • More rapid network fixes, leading to a reduction in baseline time-to-detection and time-to-mitigation for incidents.
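The difference between pull-based polling and push-based streaming comes down to worst-case detection latency, which a back-of-envelope calculation makes concrete. The interval values here are illustrative, not measured figures from our network.

```python
# With pull-based polling, an event can go unnoticed for up to a full
# polling interval; with push-based streaming, the wait is bounded by
# the (much shorter) export cadence.
def worst_case_detection_delay(event_offset_s: float, interval_s: float) -> float:
    """Seconds until the next collection after an event at event_offset_s."""
    return interval_s - (event_offset_s % interval_s)

# Event occurs 10 s after the last collection cycle:
poll_delay = worst_case_detection_delay(10.0, 300.0)  # 5-minute SNMP poll
stream_delay = worst_case_detection_delay(10.0, 1.0)  # ~1 s streaming export
print(poll_delay, stream_delay)
```

An outage just after a five-minute poll sits invisible for nearly five minutes; a streaming subscription surfaces it within seconds, which is the gap the Real Time Telemetry team set out to close.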

“Devices are proactively sending data without having to wait for requests, so they function more efficiently and facilitate timely troubleshooting and optimization,” says Abhijit Vijay, principal software engineering manager with the Infrastructure and Engineering Services team in Microsoft Digital. “Since this approach pushes data continuously rather than at specific intervals, it also reduces the additional network traffic and scales better in larger, more complex environments.”

At any given time, Microsoft operates 25,000 to 30,000 network devices, managed by engineers working across 10 different service lines. Accounting for all their needs while keeping data collection manageable and efficient requires extensive collaboration and prioritization.

We also had to account for compatibility. With so many network devices in operation, replacement lifecycles vary. Not all of them are currently gNMI-compatible.

Working with our service lines, we identified the use cases that would provide the best possible ROI, largely based on where we would find the greatest benefits for security and where networks offered a meaningful number of gNMI-compatible devices. We also zeroed in on the types of data that would be the most broadly useful. Being selective helped us preserve resources and avoid overwhelming engineers with too much data.

We built our internal solution entirely using Azure components, including Azure Functions, Azure Kubernetes Service (AKS), Azure Cosmos DB, Redis, and Azure Data Lake. The result is a platform that network engineers can use to access real-time telemetry data.

With key service lines, use cases, and a base of technology in place, we worked with network engineers to onboard the relevant devices. From there, their service lines were free to experiment with our solution on real-world incidents.

Better response times, greater network reliability

Service lines are already experiencing big wins.

In one case, a heating and cooling system went offline for a building in the company’s Millennium Campus in Redmond, Washington. A lack of environmental management has the potential to cause structural damage to buildings if left unchecked, so it was important to resolve this issue as quickly as possible. The service line for wired onsite connections sprang into action as soon as they received a network support ticket.

With real-time telemetry enabled, the team created a Kusto query to compare DOT1X access-session data for the day of the outage with a period before the outage started. Almost immediately, they spotted problematic VLAN switching, including the exact time and duration of the outage. By correlating the timestamps, they determined that the RADIUS registrations of the device owner had expired, which caused the devices to switch into the guest network as part of the zero-trust network implementation.

As a result, the team was able to resolve the registration issues and restore the heating and cooling systems in 10 minutes—a process that might have taken hours using other collection methods due to the lag-time between polling intervals.
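The diagnostic logic amounts to a diff of session state before and during the outage. The team ran it as a Kusto query in practice; the sketch below expresses the same comparison in Python, with illustrative device and VLAN names.

```python
# Line up access-session records from before and during the outage and
# flag devices whose VLAN assignment flipped (names are illustrative).
def vlan_changes(before: dict, during: dict) -> dict:
    """Map device -> (old_vlan, new_vlan) for devices that moved VLANs."""
    return {
        dev: (before[dev], during[dev])
        for dev in before
        if dev in during and before[dev] != during[dev]
    }

before = {"hvac-ctrl-01": "corp", "printer-07": "corp"}
during = {"hvac-ctrl-01": "guest", "printer-07": "corp"}
# hvac-ctrl-01 dropped to the guest VLAN -> expired device registration.
print(vlan_changes(before, during))
```

Correlating the timestamp of that flip with the start of the support ticket is what let the team pin the outage on expired RADIUS registrations within minutes.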

“This has the potential to improve alerting, reduce outages, and enhance security,” says Daniel Menten, senior cloud network engineer for site infrastructure management on the Site Wired team. “One of the benefits of real-time telemetry is that it lets us capture information that wasn’t previously available—or that we received too slowly to take action.”

It’s about speeding up how we identify issues and how we then respond to them.  

“With this level of observability, engineers that monitor issues and outages benefit from enhanced experiences,” says Aayush Dave, a product manager on the Infrastructure and Engineering Services team in Microsoft Digital. “And that’s going to make our network more reliable and performant in a world where security issues and outages can have a global impact.”

The future is in real time

Now that real-time telemetry has demonstrated its value, our efforts are focused on broadening and deepening the experience.

“More devices mean more impact,” Dave says. “By increasing the number of network devices that facilitate real-time telemetry, we’re giving our engineers the tools to accelerate their response to these incidents and outages, all leading to enhanced performance and a more robust network reliability posture.”

It’s also about layering on new ways of accessing and using the data.

We’ve just released a preview UI that provides a quick look at essential data, as well as an all-up view of devices in an engineer’s service line. This dashboard will enable a self-service model that makes it even easier to isolate essential telemetry without the need for engineers to create or integrate their own interfaces.

That kind of observability isn’t only about outages. It also enables optimization by helping engineers understand and influence how devices work together.

The depth and quality of real-time telemetry data also provides a wealth of information for training AI models. With enough data spread across enough devices, predictive analysis might be able to provide preemptive alerts when the kinds of network signals that tend to accompany outages appear.

“We’re paving the way for an AIOps future where the system won’t just predict potential issues, but initiate self-healing actions,” says Rob Beneson, partner director of software engineering on the Infrastructure and Engineering Services team in Microsoft Digital.

It’s work that aligns with our company mission.

“This transformation is enhancing our internal user experience and maintaining the network connectivity that’s critical for our ultimate goal,” Beneson says. “We want to empower every person and organization on the planet to achieve more.”

Key Takeaways

Here are some tips for getting started with real-time telemetry at your company:

  • Start with your users. Ask them about pain points, what scares them, and what they need.
  • Start small and go step by step to get the core architecture in place, then work up to the glossier UI and UX elements.
  • Be mindful of onboarding challenges like bugs in vendor hardware and software, especially around security controls.
  • You’ll find plenty of edge cases and code fails, so be prepared to invest in revisiting challenges and fixing problems that arise.
  • Make sure you have a use case and a problem to solve. Have a plan to guide your adoption and use before you turn on real-time telemetry.
  • Make sure you have the proper data infrastructure in place and an apparatus for storing your data.
  • Communicate and demonstrate the value of this solution to the teams who need to invest resources into onboarding it.
  • Prioritize visibility into the devices and data you’ve onboarded through pilots and hero scenarios, then scale onboarding further according to your teams’ needs.
  • Integrate as much as possible. Consider visualizations and pushing into existing network graphs and tools to surface data where engineers already work.

The post Finding and fixing network outages in minutes—not hours—with real-time telemetry at Microsoft appeared first on Inside Track Blog.

Enhancing space management internally at Microsoft with Wi-Fi data http://approjects.co.za/?big=insidetrack/blog/enhancing-space-management-internally-at-microsoft-with-wi-fi-data/ Thu, 18 Jul 2024 16:00:55 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=15346 Space management and employee engagement are two critical aspects of any modern workplace, including internally here at Microsoft. Figuring out how to get both right leads to important questions: How can organizations understand the best use of their building spaces, including offices and common spaces, while providing better experiences for their employees? How can they […]

The post Enhancing space management internally at Microsoft with Wi-Fi data appeared first on Inside Track Blog.

Space management and employee engagement are two critical aspects of any modern workplace, including internally here at Microsoft.

Figuring out how to get both right leads to important questions:

How can organizations understand the best use of their building spaces, including offices and common spaces, while providing better experiences for their employees? How can they reduce the cost and complexity of installing and maintaining IoT sensors to measure people density in different areas? How can they protect the privacy of employees and their devices and comply with privacy regulations?

This is what we asked ourselves when we set out to enhance both space utilization and the experience our employees have when they go into the office in our brand-new buildings here at Microsoft headquarters in Redmond, Washington.

We in Microsoft Digital, the company’s IT organization, knew that each new building would come with a wireless access point (WAP) system that employees use to access Wi-Fi. We knew the data from the access points could be used to measure the people density in different areas. The question was, how could we use this data to provide real-time insights to employees and facility managers privately and securely?

We identified an opportunity to reuse the existing devices and the data that we already had from these devices. It was a cost-optimized way of handling our requirements.

— Nritya Reddy, senior product manager, Microsoft Digital

Using WAP data to measure space utilization

From left to right, a composite image of Reddy, Lee, Chimbili, Kothamasu, Sadasivuni, and Kumar.
Improving our space management using Microsoft Azure and AI is the focus for (top row, from left) Nritya Reddy, Daniel Lee, and Veeren Kumar Chimbili; and (bottom row, from left) Lakshmi Kothamasu, Sudhakar Sadasivuni, and Bharath Kumar.

Our solution, Space Busyness Insights, uses our standard Wi-Fi WAP devices located throughout each building to calculate data on space utilization. This data includes identifying unused areas, occupied spaces and the crowd density, and the availability and use of common areas. By analyzing this data, we can make informed decisions about how to best allocate additional space or repurpose existing areas for more effective use. Additionally, we can plan for future real estate requirements.

“We identified an opportunity to reuse the existing devices and the data that we already had from these devices,” says Nritya Reddy, a senior product manager on the Microsoft Digital team. “It was a cost-optimized way of handling our requirements.”

This solution benefits our employees by letting them view, in real time, the availability and activity in shared spaces such as kitchenettes and conference rooms. To implement this solution, we collaborated with our infrastructure and security team, InnerSpace (a third-party vendor), and Microsoft facilities managers. We integrated AI to enhance our data measurement and analysis capabilities, enabling us to create actionable plans for space management.

“The era of modern smart experiences with IoT hardware demands innovative solutions that can be stitched across multiple devices and protocols with a cost-efficient design and architecture. I consider this as an opportunity to use the signals from two ecosystems to build secure, privacy-protected, smart building experiences. This gives us further opportunities to explore various use cases with WAP technology without additional hardware integrations,” says Sudhakar Sadasivuni, principal group engineering manager, Microsoft Digital.

Our innovative approach of repurposing existing devices for new requirements emphasized cost optimization and helped us be frugal with our resources.

“We have an existing Wi-Fi infrastructure in all our buildings, provisioned via WAP devices by different vendors. They can provide a list of devices that are Wi-Fi-connectable and in the discoverable range of the given WAP device,” says Reddy. “By employing artificial intelligence and machine learning on this raw data, we can triangulate people density. Meaning, you would know how many people are in that specific area based on some of the devices that these people are carrying, either a laptop or a mobile phone, which are discovered by these WAP data points.”

We sell primarily to the largest enterprises, so we needed to build on a robust, highly secure, highly scalable, and universally trusted cloud infrastructure.

— Matt MacGillivray, co-founder and VP of Research and Development at InnerSpace

We partnered with InnerSpace, a vendor whose system has the logic and AI/ML capabilities to make sense of the raw data coming from the WAPs and provide meaningful people-count data.

“We sell primarily to the largest enterprises, so we needed to build on a robust, highly secure, highly scalable, and universally trusted cloud infrastructure,” says Matt MacGillivray, co-founder and VP of Research and Development at InnerSpace.

He shared how they used Microsoft Azure services to run their logic and provide the output.

“We used Azure Kubernetes to provide elastic capacity for our ingest pipeline and datastore, Azure App Services to run our client-facing web-based tooling, and Azure Container Instances to deploy containerized subsystems without needing to manage the machines running them,” MacGillivray says.

InnerSpace also uses proprietary AI logic to ensure that people aren’t double-counted because they might be carrying more than one device. Based on the proximity of those devices and other logic and rules in place, they can help us determine space usage.
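InnerSpace’s deduplication logic is proprietary, but the general idea can be sketched: devices that show up with very similar signal strength at the same access point are likely traveling together (one person’s laptop and phone), so they collapse into a single cluster. Here’s a minimal, purely illustrative heuristic in Python; the function name and threshold are hypothetical, not InnerSpace’s actual algorithm:

```python
def estimate_people(observations, rssi_threshold=6):
    """Estimate a people count from device observations at one access point.

    observations: list of (device_hash, rssi_dbm) tuples seen in one scan.
    Devices whose signal strengths fall within `rssi_threshold` dB of the
    previous device (after sorting) are assumed to travel together, e.g.
    one person's laptop plus phone, and count as a single cluster.
    """
    # Sort by signal strength so physically co-located devices are adjacent.
    ordered = sorted(observations, key=lambda o: o[1])
    clusters = 0
    last_rssi = None
    for _, rssi in ordered:
        if last_rssi is None or rssi - last_rssi > rssi_threshold:
            clusters += 1  # start a new cluster -> one more estimated person
        last_rssi = rssi
    return clusters

# Four devices, two pairs with near-identical signal strength.
scan = [("h1", -42), ("h2", -43), ("h3", -67), ("h4", -65)]
print(estimate_people(scan))  # -> 2
```

A production system would combine many scans over time and across access points; this sketch only shows why raw device counts overstate people counts.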

The device identifiers are shared between systems in a hashed way. This ensures that specific devices discovered cannot be identified and personal identification information (PII) is protected. We performed stringent Microsoft architectural and data privacy reviews to ensure that no private data is being leaked at any stage. In addition to privacy, scaling and security are other key aspects considered when exchanging data with external systems.

— Lakshmi Kothamasu, principal software engineering manager, Microsoft Digital

We implemented this solution in our Redmond East Campus buildings, and through this process we gather the space utilization information we need while keeping two goals in mind for our employees:

  • Protect our employees’ personal information and privacy
  • Comply with privacy regulations

To make our solution work with these two goals in mind, we hash the media access control (MAC) addresses of the devices to anonymize the data we send to the third parties, and we perform Microsoft privacy reviews. We only provide InnerSpace with information that they need to analyze the data and make sure to protect everything else. Any data that can be identifiable and linked to a specific device or person(s) is hashed.
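The hashing step described above can be illustrated with a keyed one-way hash: with a secret key that never leaves Microsoft, the same device always maps to the same token (so repeat sightings can still be correlated), but the MAC address can’t be recovered or brute-forced from the token. The key value and function name here are hypothetical; the actual internal scheme isn’t public:

```python
import hashlib
import hmac

# Secret key held internally only; hypothetical value for illustration.
SECRET_KEY = b"rotate-me-regularly"

def anonymize_mac(mac: str) -> str:
    """Return a keyed, one-way hash of a MAC address (illustrative sketch)."""
    # Normalize separators and case so every format of the same MAC
    # produces the same token.
    normalized = mac.lower().replace("-", ":").encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()

# Both spellings of the same device yield one stable, irreversible token.
print(anonymize_mac("AA:BB:CC:DD:EE:FF") == anonymize_mac("aa-bb-cc-dd-ee-ff"))  # True
```

Only tokens like these need to cross the boundary to the vendor; the raw MAC addresses stay inside Microsoft.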

After hashing, the data is pushed by our internal Microsoft team to our Device Management Services (DMS) Azure Event Hub.

From there, we have a federated authentication mechanism in place for our vendor, InnerSpace, for them to access the anonymized data from our Azure Event Hub. InnerSpace then runs their logic over that data and provides the people count in a space context back to Microsoft.

We also ensure that InnerSpace has our building maps with the access point (AP) IDs and locations on them so they can run their triangulation algorithms to pinpoint the number of people in any space at a given point in time.
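InnerSpace’s triangulation algorithms are their own, but the basic idea of locating a device from AP positions and signal strengths can be sketched as a signal-weighted centroid: stronger (less negative) RSSI readings pull the estimate toward that access point. This is a simplified stand-in, not the vendor’s actual method, and all names and values below are hypothetical:

```python
def locate(ap_positions, readings):
    """Approximate a device position as an RSSI-weighted centroid of APs.

    ap_positions: {ap_id: (x, y)} taken from the building map.
    readings: {ap_id: rssi_dbm} for one device at one moment.
    Real triangulation uses calibrated signal-propagation models; this
    weighted centroid is only an illustrative sketch.
    """
    # Convert dBm to a rough linear weight: stronger signal -> larger weight.
    weights = {ap: 10 ** (rssi / 20) for ap, rssi in readings.items()}
    total = sum(weights.values())
    x = sum(ap_positions[ap][0] * w for ap, w in weights.items()) / total
    y = sum(ap_positions[ap][1] * w for ap, w in weights.items()) / total
    return (x, y)

aps = {"ap1": (0.0, 0.0), "ap2": (10.0, 0.0)}
# Much stronger signal at ap1, so the estimate lands well inside ap1's half.
x, y = locate(aps, {"ap1": -40, "ap2": -70})
print(round(x, 1))  # -> 0.3
```

Aggregating many such estimates per space, over time, is what turns device positions into the people-density data the facilities team uses.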

When we get that information back, we can then use that data to review and analyze the information and make space utilization decisions.

“The device identifiers are shared between systems in a hashed way. This ensures that specific devices discovered cannot be identified and personal identification information (PII) is protected. We performed stringent Microsoft architectural and data privacy reviews to ensure that no private data is being leaked at any stage. In addition to privacy, scaling and security are other key aspects considered when exchanging data with external systems,” says Lakshmi Kothamasu, principal software engineering manager, Microsoft Digital.

A diagram of how our network traffic architecture flows from the WAP system to InnerSpace.

“We underwent a review where the subject matter experts on the privacy team reviewed the entire architecture and made sure that no device identifier or personally identifiable information of the employee was directly or indirectly being passed on,” Kothamasu says.

A diagram of how we follow the process flow to get information from the WAP system to InnerSpace.

Using our solution to plan for future needs

The benefit of this solution is that it enables real estate and facilities managers to optimize the space utilization and plan for future needs, and it empowers employees to make informed decisions about where and when to use common areas, such as the kitchenettes and the meeting rooms, in real time. We also use a smart building kiosk that allows employees to access the data in a simple and intuitive way.

“The smart building kiosk can be used to open an app, look at a map on the web, or go to a kiosk to see a map. When the employee zooms in, they can see if a space is busy or not,” Reddy says.

Maximizing cost savings

By using the existing WAP system instead of installing new sensors, we saved around $3 million in hardware costs for the East Campus buildings. Because the WAP system exists in all buildings, we can easily enable this solution in other buildings without additional hardware costs.

The cost avoidance isn’t just about not having to buy those IoT sensors and installing them, but also the continued maintenance and security of those devices. You have firmware updates and security updates in the future, so the life cycle costs come down to quite a bit of savings from not having to implement duplicative infrastructure.

— Daniel Lee, regional lead, Center of Innovation, Microsoft Global Workplaces Services

The cost savings go beyond just the hardware. In every building, the fundamental IT infrastructure includes WAP devices, which are essential for providing Wi-Fi connectivity. Our internal Microsoft team has developed a highly configurable solution that doesn’t require any code changes. To integrate a new building, we simply need to update the configuration with the new AP layout, and the system operates seamlessly. While the initial implementation at the East Campus took about three months, the process has been significantly streamlined for other locations and can now be completed in just a week or two.
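The configuration-driven onboarding described above can be imagined as a per-building entry plus a sanity check before go-live. Every field name below is hypothetical, not the actual internal schema:

```python
# Hypothetical per-building configuration entry: onboarding a new building
# means adding an entry like this rather than changing code.
BUILDING_CONFIG = {
    "building_id": "redmond-east-b1",           # assumed identifier format
    "event_hub": "dms-space-busyness",          # assumed destination name
    "access_points": [
        # ap_id and map coordinates come from the building's AP layout.
        {"ap_id": "ap-101", "floor": 1, "x": 12.5, "y": 4.0},
        {"ap_id": "ap-102", "floor": 1, "x": 30.0, "y": 4.0},
    ],
}

def validate(config):
    """Minimal sanity check before a new building's config goes live."""
    assert config["building_id"] and config["access_points"]
    for ap in config["access_points"]:
        # Each AP needs an ID and map coordinates for triangulation.
        assert {"ap_id", "floor", "x", "y"} <= ap.keys()
    return True

print(validate(BUILDING_CONFIG))  # -> True
```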

“The cost avoidance isn’t just about not having to buy those IoT sensors and installing them, but also the continued maintenance and security of those devices. You have firmware updates and security updates in the future, so the life cycle costs come down to quite a bit of savings from not having to implement duplicative infrastructure,” says Daniel Lee, a regional lead on the Center of Innovation team in Microsoft Global Workplaces Services.

When considering a building, whether it’s a leased space for customers or a company’s own property, optimizing the use of space is crucial. Real estate comes with significant costs, not only in acquisition but also in ongoing maintenance. We need to ensure that employees are making effective use of these spaces. If they’re not, it’s important to understand why, so that we can address any issues and improve space utilization.

Gaining additional benefits

We’ve talked a lot about the benefits for space planning and cost reduction, but let’s also look at other benefits of using the solution:

  • Data-driven decisions: Removing emotional guesswork from space planning with clear-cut data on actual space usage.
  • Holistic analysis: Combining WAP data with other sensor signals like lighting and air quality for comprehensive space planning.
  • Rapid deployment: Streamlined process for implementing the solution in new locations.

By gathering and using the WAP device data, we can not only optimize space utilization but also gain insights into what our employees need from us to optimize their experience.

How other companies can benefit from our solution

We have an aspiration of rolling our solution into the upcoming product Microsoft Places and making it self-sustained and scalable. Places is a product that aims to provide a holistic view of the physical and digital spaces in an organization and how they’re used by the employees.

I believe the key advantages of our solution are, first, the enhanced security that comes with not having to add extra hardware or devices. Second, we’ve managed to reduce the number of devices installed across the buildings. And third, because of these improvements, we’ve achieved additional cost savings for Microsoft. That’s the significant impact this solution has delivered to Microsoft.

— Bharath Kumar, principal PM manager, Microsoft Digital

We’re currently using this solution in seven buildings and our goal is to continue implementing this solution in our other buildings.

“I believe the key advantages of our solution are, first, the enhanced security that comes with not having to add extra hardware or devices. Second, we’ve managed to reduce the number of devices installed across the buildings. And third, because of these improvements, we’ve achieved additional cost savings for Microsoft. That’s the significant impact this solution has delivered to Microsoft,” says Bharath Kumar, principal PM manager, Microsoft Digital.  

Other companies that have similar space management and employee engagement needs could benefit from Microsoft’s solution, because it uses existing Wi-Fi infrastructure, reduces the dependency on external sensors, protects the privacy of the employees, and provides a simple and intuitive way to access the data.

“Our aspiration, as we productize this solution, is to eliminate the dependency on anything but the actual product itself. One product we used is Azure Digital Twins, which gets the whole experience lighted by making sense of people count against the space and processing that information,” Reddy says.

Key takeaways

Here are some tips on getting started at your company:

  • Consider implementing a similar solution to optimize space utilization and improve employee experience in your own buildings.
  • Use existing Wi-Fi infrastructure to reduce cost and dependency on external sensors and vendors.
  • Ensure that the solution protects employee privacy and complies with privacy regulations.
  • Stay informed about the latest developments and best practices in the field of space utilization and employee experience.

Try it out

Create your own Azure free account today on the Microsoft Azure product page.

We’d like to hear from you!

Want more information? Email us and include a link to this story and we’ll get back to you.

The post Enhancing space management internally at Microsoft with Wi-Fi data appeared first on Inside Track Blog.
