Microsoft Azure Archives - Inside Track Blog
http://approjects.co.za/?big=insidetrack/blog/tag/microsoft-azure/
How Microsoft does IT
Wed, 10 Jun 2026 23:57:01 +0000

Microsoft CISO advice: Securing AI with full stack red teaming
http://approjects.co.za/?big=insidetrack/blog/microsoft-ciso-advice-securing-ai-with-full-stack-red-teaming/
Thu, 04 Jun 2026 15:30:00 +0000

The post Microsoft CISO advice: Securing AI with full stack red teaming appeared first on Inside Track Blog.

At Microsoft, we approach security for AI systems holistically, using full stack red teaming that goes beyond just testing an AI model.

Craig Nelson, corporate vice president of red teaming at Microsoft, describes what he looks for with this method: “I’m interested in the model, but I’m also interested in how that model connects with underlying additional data. And then how that model also executes automation from the back end.”

In this video, Nelson explains why securing AI requires more than testing the model alone.

Watch this video to see Craig Nelson describe how Microsoft approaches full stack red teaming. (For a transcript, please view the video on YouTube: https://www.youtube.com/watch?v=68MmP084rXA.)

Key takeaways

When you apply full stack red teaming to AI, here are some key questions to answer:

  • How are AI models connecting to data sources?
  • What backend automation do we allow AI to execute?
  • What security credentials do we require?
  • Do we have the logs we need to understand how the model works with our backend infrastructure?

Transforming our approach to sensitivity labels at Microsoft with Microsoft Entra
http://approjects.co.za/?big=insidetrack/blog/transforming-our-approach-to-sensitivity-labels-at-microsoft-with-microsoft-entra/
Thu, 28 May 2026 17:30:00 +0000

The post Transforming our approach to sensitivity labels at Microsoft with Microsoft Entra appeared first on Inside Track Blog.

Security groups serve as the backbone of our approach to access control across the Microsoft corporate tenant. These groups determine who has access to different resources across our network, including Azure subscriptions, Power BI reports, SharePoint sites, and more.

For years, our security groups operated without consistent, policy‑based guardrails. As a result, we couldn’t uniformly control guest access to sensitive resources or apply governance consistently across different group types.

Addressing this required a complex, coordinated effort by our team here in Microsoft Digital, the company’s IT organization, and the Microsoft Entra product team.

The result is a new approach to sensitivity labels across the organization that strengthens our security posture, which benefits Microsoft and our customers.

“Because IT security is our highest priority at Microsoft, we knew we needed a better approach to limiting access to groups within our tenant,” says David Johnson, a principal product manager architect in Microsoft Digital. “And we realized that Microsoft Entra was a powerful in-house solution that represented our best path forward to solve for this challenge.”

Closing the security gap

Sensitivity labels for Microsoft 365 groups apply restrictions on group membership and sharing. They have been a product feature since 2020. But security groups had no equivalent: There were no labels that enforce rules about who can join a security group.

This meant that organizations that wanted to govern who could join a security group, whether guests are permitted, and how group membership is managed had to either lock down the group creation process entirely or rely on reactive scanning after the fact.

“Security groups are a key piece of our efforts to secure sensitive resources,” says Mohit Bhargava, a principal product manager on the Microsoft Entra team, which manages the Entra family of identity and network access products. “We wanted to apply policies to protect who could be in security groups so that the sensitive resources in those groups would remain secure.”

The security risk is real. If an unauthorized guest account ends up as a member of a security group that governs access to an Azure subscription, that guest gains access to every resource inside that subscription.

“Whoever gets into an Azure security group can have access to all the resources associated with the Azure subscription,” says Basanth Kakumani, a software engineer II in Microsoft Digital. “That’s a potential high-severity threat.”

Another priority was the need for consistency across experiences.

“Microsoft 365 groups have supported labeling for a very long time,” Bhargava says. “Customers have an expectation that there’s parity across group types, so that they can govern them uniformly. That was another driving factor for this work.”

Security groups reuse the same sensitivity labels already configured for Microsoft 365 groups and SharePoint sites in Microsoft Purview—so admins don’t need to create or manage a separate set of labels. This reuse reduces configuration overhead and supports a more consistent governance model across group types.

Security workarounds, and why they fell short

Without sensitivity label support, we had to make do with alternative solutions. The most common one was simply preventing certain users from creating any security groups at all.

In the Microsoft tenant, this meant that employees who needed a security group had to fill out a form that had custom business logic behind it.

“We had on-premises, Active Directory, synchronization, tooling, and customization,” Johnson says. “This caused latency, from the time you created your group to the time it would show cloud membership. If you wanted to manage your membership, you had to do it on premises, AD, and then wait for it to sync to Entra.”

Neither centralized control nor reactive governance was a satisfying solution to prevent policy violations.

Typically, IT is going to manage this in one of two ways: Either we turn off self-service and manage everything on behalf of users, or we do reactive governance, which includes scanning groups and looking for policy violations.

Neither approach is very effective at preempting violations.

“This is really about making reactive things more proactive,” says John Begley, a principal software engineer in Microsoft Digital. “We want to catch problems before they occur.”

A collaborative solution

Coming up with a solution to this challenge required a genuine partnership.

We at Microsoft Digital approached the Entra product team and explained the problem we were trying to solve. Rather than simply handling this as a feature request, the two teams agreed to a co-development arrangement.

Microsoft Digital team members would work alongside Entra engineers as the feature was built, serving simultaneously as implementation partner, design critic, and test environment—what we like to call our Customer Zero role.

Bhargava found the partnership equally illuminating from the product side.

“Having access to a very large customer who cares deeply about security was extremely helpful,” he says. “If it works for Microsoft, which is so complicated and huge, it’s going to work for smaller-sized tenants too.”

For Begley and his team, working closely with the product team revealed how complex the solution actually was.

“Both the product team and Microsoft Digital walked into this thinking a fix was going to be simpler than what it turned out to be,” Begley says. “It’s been eye-opening to see how the product is built, how it runs, what all the moving parts are. We learned early on that there was significant co‑development happening within Entra itself, across teams with very different areas of expertise.”

That dynamic played out in specific feature decisions. The team’s original plan did not include agent access controls, such as the ability to prevent AI agents from joining sensitive security groups. The product group quickly addressed and resolved this gap after our team in Microsoft Digital raised it as a concern.

“One of the first customers who raised it was Microsoft Digital,” Bhargava says. “They said we needed to start thinking about it ahead of time to get ahead of the problem.”

Sensitivity labels for Microsoft Entra cloud security groups are now in public preview. The same labels you publish in Microsoft Purview for Microsoft 365 groups and sites now apply to Entra security groups. Visit Microsoft Learn for scope, supported scenarios, and current preview behaviors.

Changes afoot for IT admins and employees

The practical impact of this solution lands on both sides of the relationship between Microsoft Digital and the company’s employees.

For IT admins, the shift is from reactive remediation to proactive prevention. For employees, it means self-service management of security groups becomes viable again, without the security risks that made organizations reluctant to enable it before.

“Now I can’t accidentally have guests in an internal-only group, which changes the dynamic,” Johnson says. “Employees can create their own Entra security groups now, without us having to worry that they’ll be inviting guests where they shouldn’t be.”

Johnson underscores the broader ambition behind the shift, which is to allow employees to create and manage groups directly in Entra.

“A company that can unblock self-service action by its employees with confidence, knowing that there’s an additional level of protection—that’s very important,” he says.

Looking ahead: AI and the expanding policy surface

Labeling support for security groups is already being extended across the organization, with AI governance in mind.

Adding the ability to block agents from joining sensitive security groups is our next logical step. Guest membership is enforced via the allow-to-add guest policy, but agents won’t join in the same way. Rather, we will set policies in Purview and then use labels to control whether an agent can join a group.
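
To make the idea of label-driven membership control concrete, here is a minimal Python sketch of a check that runs when someone tries to add a member to a group. It is an illustration only and not how Entra or Purview implement enforcement. The label names, the `LabelPolicy` fields, and the `can_join` function are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class MemberType(Enum):
    EMPLOYEE = "employee"
    GUEST = "guest"
    AGENT = "agent"


@dataclass(frozen=True)
class LabelPolicy:
    """Hypothetical policy attached to a sensitivity label."""
    allow_guests: bool
    allow_agents: bool


# Illustrative label names and settings, not real Purview configuration.
POLICIES = {
    "General": LabelPolicy(allow_guests=True, allow_agents=True),
    "Confidential - Internal only": LabelPolicy(allow_guests=False, allow_agents=True),
    "Highly Confidential": LabelPolicy(allow_guests=False, allow_agents=False),
}


def can_join(member_type: MemberType, label: str) -> bool:
    """Decide at add time whether a member may join a group with this label."""
    policy = POLICIES.get(label)
    if policy is None:
        # Unknown or missing label: fail closed rather than allow by default.
        return False
    if member_type is MemberType.GUEST:
        return policy.allow_guests
    if member_type is MemberType.AGENT:
        return policy.allow_agents
    return True
```

The point of the sketch is that the decision happens before the membership change, which is what separates proactive prevention from scanning groups for violations afterward.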

The longer-term vision involves extending oversharing prevention beyond Entra itself. This will make it impossible (not just detectable) to accidentally assign a highly confidential resource to an unlabeled or inappropriately scoped security group. The foundation we’ve built with labeling in Entra is what makes this vital step possible.

“We want to get into the preventative aspect,” Johnson says. “The goal is to make it so it’s not possible to overshare in the first place.”

Key takeaways

Here are some tips as you consider how to manage your own security labeling practices:

  • Reuse existing labels—no extra setup required. Security groups reuse the same sensitivity labels already configured for Microsoft 365 Groups and SharePoint sites in Microsoft Purview, eliminating duplicate configuration and helping admins apply a consistent governance model across group types.
  • Understand label immutability at launch. Unlike Microsoft 365 Groups, sensitivity labels on security groups are initially immutable—a deliberate design choice to ensure protections are enforced from the moment a group is created. Controlled label mutability will be introduced in a subsequent update.
  • Know what’s in scope today. Labeling currently applies to static, non–mail-enabled security groups. Dynamic membership groups, mail-enabled security groups, and distribution lists aren’t supported at launch, so admins should plan accordingly.
  • Shift from reactive cleanup to proactive protection. Label-driven membership controls prevent policy violations—such as unintended guest access—before they occur, reducing the need for post-creation audits and remediation.
  • Enable safe self-service with guardrails. With labels enforcing access rules automatically, employees can create and manage security groups without increasing risk, restoring self-service without sacrificing control.
  • Lay the foundation for future governance scenarios. Using sensitivity labels as the backbone of access policy creates a scalable framework that can extend to additional protections over time, including broader enforcement and emerging governance needs.
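
When planning a rollout, it can help to check which existing groups fall inside the scope described above (static, non-mail-enabled security groups). The following Python sketch assumes you already have an inventory of groups with these properties. The `GroupInfo` record, its field names, and the sample groups are hypothetical and do not represent an Entra API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupInfo:
    """Hypothetical inventory record; field names are illustrative."""
    name: str
    security_enabled: bool
    mail_enabled: bool
    dynamic_membership: bool


def in_labeling_scope(group: GroupInfo) -> bool:
    """True for static, non-mail-enabled security groups only."""
    return (
        group.security_enabled
        and not group.mail_enabled
        and not group.dynamic_membership
    )


# Made-up sample inventory covering each group type named above.
groups = [
    GroupInfo("azure-sub-owners", True, False, False),   # static security group
    GroupInfo("all-sales-dynamic", True, False, True),   # dynamic membership
    GroupInfo("finance-mail-sg", True, True, False),     # mail-enabled security group
    GroupInfo("newsletter-dl", False, True, False),      # distribution list
]

eligible = [g.name for g in groups if in_labeling_scope(g)]
print(eligible)  # ['azure-sub-owners']
```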

Reinventing hybrid cloud integration at Microsoft—from months to one day
http://approjects.co.za/?big=insidetrack/blog/reinventing-hybrid-cloud-integration-at-microsoft-from-months-to-one-day/
Thu, 28 May 2026 16:00:00 +0000

The post Reinventing hybrid cloud integration at Microsoft—from months to one day appeared first on Inside Track Blog.

For years, network engineering teams at Microsoft have faced a paradox: They can spin up a full Microsoft Azure cloud environment in a matter of hours, but connecting that environment to on-premises labs and private networks can take up to nine months.

Now, a team in Microsoft Digital—the company’s IT organization—is working to shrink that lengthy nine-month timeline to a single day.

The problem is architectural.

As our cloud footprint has grown, it has evolved into something richly segmented, tightly secured, and increasingly automated—a far cry from the relatively flat, monolithic corporate network that we originally extended into the cloud.

Getting those two worlds—on-premises and the cloud—to talk to each other securely and efficiently has become one of our most stubborn infrastructure challenges.

The solution we’re building is a fundamentally new operating model for hybrid cloud integration. It’s powered by AI-driven intake, end-to-end automation, and a set of repeatable patterns that treat the cloud as the new core of the network, rather than a distant branch of the old one.

The gap between cloud speed and network complexity

To understand the problem our team in Microsoft Digital set out to solve, it helps to understand how our company’s network architecture evolved over the past decade. When Microsoft first embraced Azure, the cloud was conceived as an extension of the existing corporate network.

But the cloud grew faster than anyone anticipated.

Product engineering teams, drawn by the speed and flexibility of cloud-native tooling, began self-organizing their systems in Azure. They built segmented, purpose-built environments optimized for security and automation that looked nothing like the sprawling on-premises network they were supposed to connect to.

This shift had real consequences for Microsoft developers.

A software engineer sitting in building 32 on campus, for example, might have her Azure environment provisioned in half a day. But if she needed network connectivity to a physical Azure Stack lab down the hallway, getting that connection established—through firewalls, virtual routing frameworks, access control lists, and cross-team coordination—could take weeks or months.

“We have a development assembly line, and our goal is to give engineers the most efficient, frictionless experience doing software development for the company,” says Tom McCleery, principal group cloud network engineering manager in Microsoft Digital. “Every day we delay solving this issue systemically is another day for the problem to get bigger.”

Why on-premises networks aren’t going away

Why not simply move everything to the cloud?

For Microsoft, the answer comes in many forms. As a company, we build physical hardware, which requires hundreds of on-premises labs for software and hardware testing. We operate conference rooms, badge readers, thermostats, and wireless access points that will always require a physical network presence.

More fundamentally, Microsoft as a company hosts the cloud itself. If Azure were ever to go offline, our engineers responsible for recovery would need robust on-premises access that doesn’t rely on the very infrastructure they’re trying to restore.

Compounding all of these challenges are security requirements introduced by our Secure Future Initiative (SFI). The drive to reduce lateral threat movement across our network (limiting how far an attacker could reach if they compromised a single identity or device) has pushed our teams toward increasingly segmented environments. For our developers, that segmentation has meant navigating multiple networks, maintaining multiple identities, and juggling YubiKeys, smart cards, and authenticator apps just to move from one system to another.

The challenge, in short, is not that our network has too many pieces to be easily connected; it’s that those pieces weren’t designed to talk to each other efficiently.

This is what we had to fix.

Automation, patterns, and the path to ‘A Customer a Day’

Raghavendran Venkatraman is the principal engineering manager in Microsoft Digital who first pitched the vision of delivering a hybrid infrastructure in a single day.

The concept, which the team calls “A Customer a Day,” is built around the idea that it’s possible to deliver hybrid connectivity within 24 hours of finalizing requirements. Gathering, validating, and completing those requirements is where the team had to put their focus.

“If we are not fast enough, our customers are going to outpace us and do it themselves—and they may not be adhering to all our enterprise security standards,” Venkatraman says. “The faster we deliver reliable infrastructure, the higher their confidence in us.”

We identified three sequential domains of opportunity, each a distinct bottleneck in the process with significant potential for improvement:

  • AI-driven unified intake: Customer describes requirements once. AI interprets and routes to the right pattern, with no human coordination needed. Replaces: Weeks of cross-team meetings before any build begins.
  • Predefined network patterns: A catalog of validated blueprints matches each request to a proven solution, with no custom work from scratch. Replaces: One-off negotiations restarted for every customer engagement.
  • End-to-end automation: A single workflow deploys from Azure all the way to the on-premises endpoint, with no manual handoffs between teams. Replaces: Days or weeks of manual steps after the cloud build is finished.

The result of these three innovations was the ability to make hybrid infrastructure live in one day, not months.

AI-driven unified intake. Today, when an engineering team needs hybrid connectivity, they become the conduit between multiple groups—networking teams, architecture teams, program managers, and security reviewers—that each have their own requirements, timelines, and vocabularies. The intake process alone can consume weeks of meetings before any actual implementation begins. The new model replaces that with an AI-powered interface that captures requirements directly from the customer, interprets them, and routes them to a predefined deployment pattern.

Predefined network patterns. Most hybrid workloads map to a small set of repeatable architectures. Rather than treating each onboarding as a custom engagement, the team has catalogued the most common hybrid connectivity scenarios and translated them into repeatable, validated patterns. The patterns drive both the AI intake and the automation layer, creating a system where the right solution can be identified and deployed without starting from scratch each time.
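
As a rough illustration of how a pattern catalog can drive routing, the Python sketch below matches the traits of a request against a small ordered catalog. The pattern names, trait names, and the `match_pattern` function are invented for this example and are not the team's actual catalog or intake logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pattern:
    name: str
    required: frozenset  # traits a request must have to use this pattern


# Hypothetical catalog, ordered from most to least specific.
CATALOG = [
    Pattern("lab-to-azure-private-link", frozenset({"on_prem_lab", "private_endpoint"})),
    Pattern("lab-to-azure-vnet", frozenset({"on_prem_lab"})),
    Pattern("corpnet-to-azure-vnet", frozenset({"corpnet"})),
]


def match_pattern(traits: set) -> Optional[Pattern]:
    """Return the first catalog pattern whose required traits are all present."""
    for pattern in CATALOG:
        if pattern.required <= traits:  # subset test
            return pattern
    return None  # no paved path; route to an engineer for custom design
```

Ordering the catalog from most to least specific means a request that satisfies several patterns gets the most specific one, and a request that matches nothing falls through to human review.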

“The long pole in the tent used to be just getting the infrastructure up and running, but we are now able to do that pretty fast,” McCleery says. “Now, the challenge is sitting down with our customers, figuring out their requirements, and interpreting those into tasks that we can go implement in a matter of hours.”

End-to-end automation. On-premises, transport, and cloud network automation operate separately, but one-day delivery requires unified, pattern-aware orchestration. An AI orchestration agent manages sequencing, dependencies, and exceptions, enabling the hybrid stack to deploy as a single pipeline instead of in fragmented steps.
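
The sequencing and dependency part of that orchestration can be sketched with a plain dependency graph. The Python example below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to run steps only after their prerequisites. It is a simplified stand-in, not the AI orchestration agent described above, and the step names and the `execute` callback are assumptions.

```python
from graphlib import TopologicalSorter

# Hypothetical deployment steps: each key depends on the steps in its set.
STEPS = {
    "validate_requirements": set(),
    "create_azure_vnet": {"validate_requirements"},
    "configure_transport_link": {"validate_requirements"},
    "configure_onprem_firewall": {"configure_transport_link"},
    "verify_end_to_end": {"create_azure_vnet", "configure_onprem_firewall"},
}


def run_pipeline(steps, execute):
    """Run every step after its dependencies; stop at the first failure.

    Returns (completed_steps, failed_step), where failed_step is None
    when the whole pipeline succeeds.
    """
    completed = []
    for step in TopologicalSorter(steps).static_order():
        if not execute(step):
            return completed, step  # surface the exception for human review
        completed.append(step)
    return completed, None
```

Calling `run_pipeline(STEPS, execute)` with a function that performs each step and returns `True` on success deploys the stack as a single ordered pipeline instead of separate cloud, transport, and on-premises runs.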

This is the work that Juan Jimenez, a principal cloud network engineer on the team, has been driving with multiple engineering cohorts.

“The key architectural insight we reached is that any code touching device configuration should come from the service lines that own those devices,” Jimenez says. “That’s a DevOps boundary—you own the customer experience, you specify the requirements, and then you call upon what we’ve built to interact with the backend. That’s a fundamentally different way of thinking about hybrid automation, and it’s what makes the end-to-end build possible.”

Building consensus across the network stack

Perhaps the hardest part of getting to “A Customer a Day” has been organizational. Bringing together cloud networking teams, on-premises network engineers, identity teams, security stakeholders, and program managers around a common framework requires a level of cross-disciplinary alignment that is extremely difficult to achieve.

What has helped is having a clear, human-scale goal that everyone can immediately understand and rally behind. When Venkatraman first named the initiative “A Customer a Day,” something shifted.

“You go over to the identity folks and say we’re trying to get a customer onboarded in a day—they’re like, ‘That would be great!’” McCleery says. “Same thing with on-premises networking. That message is easier to land than going in and saying, ‘Your engineers need to learn more about cloud.’ That’s when people start taking mental health days.”

One of the deeper mindset shifts the team has also been working to drive is a redefinition of what connectivity means. Historically, connectivity meant simply the network. In a cloud-first, AI-accelerated world, that definition is no longer sufficient.

“Connectivity means network and identity—together,” Venkatraman says. “That is the new definition, but it is not prevalent everywhere yet. Any CIO or CTO should pivot their entire organization to think about it that way. Don’t have two separate teams making decisions in silos and then trying to integrate. Get them in the room together from the start.”

Where we are today, and what comes next

Our Microsoft Digital team is candid about where we are in the journey: We’ve made meaningful progress, but we’re not yet at the finish line. The near-term goal is to complete the first customer launch scenarios within the next quarter, followed by broader adoption of the pattern framework in the quarter after that.

The goal isn’t 100% automation. The team is clear that a portion of hybrid networking will always require the custom work that complex or security-sensitive scenarios demand.

“We’re always going to have a long tail of scenarios that need human judgment,” McCleery says. “But for the 80% of common scenarios, if a customer is going down the compliant, paved path, things should happen a lot faster.”

For a team that’s spent years watching the gap between cloud and on-premises connectivity grow wider, the prospect of closing it—one customer, one day at a time—feels less like a moonshot and more like a welcome, needed correction.

Key takeaways

If your organization is wrestling with hybrid cloud integration, here are concrete steps you can act on today, informed by what we’ve learned on our journey:

  • Audit your hybrid integration timeline. If connecting a new cloud environment to on-premises networks takes more than a few weeks, map where the delays actually live: requirements gathering, cross-team handoffs, on-premises automation gaps, or other issues. You can’t fix what you haven’t measured.
  • Redefine connectivity to include identity. Bring your network and identity teams into the same room before any hybrid integration project begins. Treating these as separate workstreams is a primary source of rework, security gaps, and delay.
  • Identify your most common connectivity scenarios and document them as repeatable patterns. Even before you build automation, codifying your top five to ten hybrid connectivity use cases into standard blueprints gives every team a shared vocabulary and an accelerated starting point.
  • Set a single, human-scale goal your teams can align on. A unifying outcome (like “integrate a new environment in one day”) is more effective at driving cross-team alignment than a technical mandate. Find the shared aspiration before prescribing the solution.
  • Extend cloud tooling and automation frameworks to your on-premises teams. Don’t wait for on-premises engineers to independently upskill on cloud-native tooling. Invest in democratizing that capability deliberately, or the automation gap between your two environments will continue to widen.
  • Design intake around your systems, not your customers. Any hybrid integration process that requires an internal team to act as coordinator between multiple groups is a bottleneck by design. Use AI-assisted intake to make requirements capture self-service and routing automatic.
  • Promote the framework before the tooling is finished. Publishing your architectural principles and patterns early (even when implementation is still in progress) aligns teams, accelerates buy-in, and gives other organizations a head start on their own journey.
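
For the first takeaway, auditing a timeline can start with something as simple as computing the elapsed time between milestones of a past request. The Python sketch below does that for one request. The milestone names and dates are made up for illustration and are not Microsoft data.

```python
from datetime import date

# Hypothetical milestones for one past integration request.
milestones = [
    ("request_submitted", date(2025, 1, 6)),
    ("requirements_final", date(2025, 2, 17)),
    ("cloud_build_done", date(2025, 2, 18)),
    ("onprem_config_done", date(2025, 3, 25)),
    ("connectivity_verified", date(2025, 4, 3)),
]


def stage_durations(events):
    """Return (stage label, days) for each consecutive pair of milestones."""
    return [
        (f"{a} -> {b}", (tb - ta).days)
        for (a, ta), (b, tb) in zip(events, events[1:])
    ]


durations = stage_durations(milestones)
slowest = max(durations, key=lambda item: item[1])
print(slowest)  # ('request_submitted -> requirements_final', 42)
```

In this invented example, requirements gathering takes 42 days while the cloud build takes one, which mirrors the kind of imbalance the audit is meant to expose.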

Supercharging network operations at Microsoft with AI-based unified network intelligence
http://approjects.co.za/?big=insidetrack/blog/supercharging-network-operations-at-microsoft-with-ai-based-unified-network-intelligence/
Thu, 21 May 2026 15:30:00 +0000

The post Supercharging network operations at Microsoft with AI-based unified network intelligence appeared first on Inside Track Blog.

At Microsoft, our network engineers work across multiple systems, including topology views, telemetry dashboards, logs, incidents, tickets, and fragmented tools. They piece together signals from these sources to understand what’s happening during an incident, often under considerable time pressure.

But this kind of fragmentation slows down reasoning. Engineers spend more time navigating tools than diagnosing issues.

To address this, the Microsoft Infrastructure, Networking, and Tenant organization in Microsoft Digital, the company’s IT organization, is building Infrastructure Graph (IGraph), a unified platform that brings topology, real-time telemetry, and operational context into a single view.

On top of this foundation, agentic capabilities enable AI agents to reason across these signals, surfacing insights, explaining issues, and recommending next steps. This shifts the experience from exploring data to making decisions faster and with greater confidence.

This visualization layer and intelligence platform provides a view of our entire Microsoft enterprise network—including more than 20,000 on-premises devices across 900 sites worldwide—to instantly surface the most critical issues and offer proactive recommendations to our engineers.

“Engineers increasingly face fragmented visibility,” says Astha Sinha, a product manager in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “We wanted to unify live telemetry, topology, and context into one single intelligent visualization experience and show engineers what’s really important, so they don’t have to dive into oceans of data.”

Network insight at speed

IGraph displays the following in a single pane-of-glass view for a given site:

  • Topology and dependency context: Visualizes routers, switches, access points, client devices, and their relationships, enriched with path and dependency awareness to localize impact areas
  • Real-time health and telemetry insights: Surfaces live performance signals (utilization, errors, abnormal behavior) and correlates them directly onto the topology to highlight where the network is degraded or “running hot”
  • Operational and incident context: Integrates incidents, tickets, and change signals into the graph, enabling engineers to understand what is happening, where, and which systems are affected in a single view
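As a rough illustration of how these three kinds of data can sit on one graph, here is a minimal Python sketch. It is not the IGraph implementation: the `Device` fields, the `build_site_view` and `running_hot` helpers, the sample device names, and the 85 percent utilization threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: the article does not define what "running hot" means.
HOT_UTILIZATION = 0.85


@dataclass
class Device:
    name: str
    kind: str                                   # e.g. "router", "switch", "access_point"
    neighbors: set[str] = field(default_factory=set)
    utilization: float = 0.0                    # latest sample, 0.0 to 1.0
    error_rate: float = 0.0                     # latest sample, errors per second
    incidents: list[str] = field(default_factory=list)


def build_site_view(links, telemetry, incidents):
    """Merge topology links, telemetry samples, and incidents into one view."""
    devices: dict[str, Device] = {}
    for (a, kind_a), (b, kind_b) in links:
        devices.setdefault(a, Device(a, kind_a)).neighbors.add(b)
        devices.setdefault(b, Device(b, kind_b)).neighbors.add(a)
    for name, sample in telemetry.items():
        if name in devices:
            devices[name].utilization = sample.get("utilization", 0.0)
            devices[name].error_rate = sample.get("error_rate", 0.0)
    for name, incident_id in incidents:
        if name in devices:
            devices[name].incidents.append(incident_id)
    return devices


def running_hot(devices):
    """Return the sorted names of devices at or above the utilization threshold."""
    return sorted(d.name for d in devices.values() if d.utilization >= HOT_UTILIZATION)


# Made-up sample data for one small site.
links = [
    (("rtr-1", "router"), ("sw-1", "switch")),
    (("sw-1", "switch"), ("ap-1", "access_point")),
]
telemetry = {
    "sw-1": {"utilization": 0.93, "error_rate": 2.5},
    "ap-1": {"utilization": 0.40},
}
incidents = [("sw-1", "INC-001")]

view = build_site_view(links, telemetry, incidents)
print(running_hot(view))  # ['sw-1']
```

Keeping telemetry and incidents as attributes of the same node that carries the neighbor list is what lets a single lookup answer "what is wrong, where, and what is next to it."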

On top of this visualization layer, the team is building an agentic layer using Azure Foundry that allows AI agents to discover and use external tools and data sources.

Without IGraph Agent, accessing data involves pulling from multiple existing sources, including servers and logs, with mixed latency (from minutes to hours). This fragmentation makes near-real-time reasoning almost impossible, as agents lack a unified, low-latency view of topology and telemetry.

“Fragmentation across operational data sources was only part of the problem,” says Vinod Kumar Singh, a principal software engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “The harder challenge was externalizing and structuring the implicit domain knowledge engineers rely on, then integrating it with real-time telemetry and topology to enable low-latency, context-aware reasoning in the agentic layer.”

How IGraph works

The user starts in context. Say they’re on the IGraph UI for Building 32. They can already see the building topology, recent incidents, support tickets, and live health and performance metrics.

The engineer can ask a natural language question such as, “The internet is not working in Building 32—what’s going on?”

The AI agent begins reasoning across UI context (location, devices, open incidents), topology (involved devices and neighbors), historical metrics, and real-time device calls. It works with specialized Model Context Protocol (MCP) servers and agents to identify impacted devices, test live responsiveness, measure neighboring impact, verify data flow, and flag abnormal utilization or error trends.

Using this context, IGraph pulls in the relevant logs, real-time telemetry, and incident history to complete the analysis.

Instead of raw metrics and hundreds of rows of data, the agent returns a clean summary that provides a view of the failing device, the health of neighboring devices, and the blast radius. It shows what’s broken, what’s still healthy, the likely causes, and next actions.
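One way to picture the "blast radius" in such a summary is as the set of devices that lose every path to the site uplink once the failing device stops forwarding. The Python sketch below computes that with a breadth-first search. It is an assumption-laden illustration, not the agent's actual logic: the function names, the adjacency map, and the device names are hypothetical.

```python
from collections import deque


def reachable_from(root, adjacency, removed):
    """Breadth-first search from root, skipping the removed device."""
    if root == removed:
        return set()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def blast_radius(root, adjacency, failed):
    """Devices cut off from root when the failed device stops forwarding."""
    all_devices = set(adjacency)
    return sorted(all_devices - reachable_from(root, adjacency, failed) - {failed})


def summarize(root, adjacency, failed, healthy):
    """Build a short incident summary: failing device, neighbor health, blast radius."""
    neighbors = sorted(adjacency.get(failed, ()))
    return {
        "failing_device": failed,
        "neighbor_health": {
            n: ("healthy" if healthy.get(n, False) else "degraded") for n in neighbors
        },
        "blast_radius": blast_radius(root, adjacency, failed),
    }


# Made-up topology: acc-2 is dual-homed, acc-1 depends on dist-1 alone.
adjacency = {
    "core-1": ["dist-1", "dist-2"],
    "dist-1": ["core-1", "acc-1", "acc-2"],
    "dist-2": ["core-1", "acc-2"],
    "acc-1": ["dist-1"],
    "acc-2": ["dist-1", "dist-2"],
}
healthy = {"core-1": True, "acc-2": True, "acc-1": False}

report = summarize("core-1", adjacency, "dist-1", healthy)
print(report["blast_radius"])  # ['acc-1']
```

In this sample, the dual-homed switch stays reachable through its second uplink, so only the single-homed switch lands in the blast radius.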

The engineer stays in one UI for all this, and isn’t forced to use different tools or manually correlate data.

“Engineers spend a lot of time firefighting,” says Abhijit Vijay, a principal software engineer manager on the team in Microsoft Digital. “The visualization layer gives them the view they need to quickly solve the incidents. It helps free up their time to engage in more systemic improvements on their applications.”

The impact of incident visibility

IGraph offers a new real-time telemetry layer that:

  • Uses a UI that surfaces telemetry and topology by correlating data from upstream systems
  • Decreases effective latency for users, enabling near-real-time insights (often within seconds)
  • Provides near-real-time signals in the UI on health, performance, routing state, and neighboring device relationships

Combined, these capabilities give network engineers an up-to-the-moment view of what’s happening across the network, before small issues can cascade into larger incidents.

By making live telemetry easier to access and interpret, IGraph helps teams move from reactive troubleshooting to proactive prevention.

“Our goal is to accelerate how network engineers understand what’s happening, enabling them to shift from reactive troubleshooting to proactive prevention—identifying and mitigating issues before they occur,” says Nevedita Mallick, a principal product manager for the Infrastructure, Networking, and Tenant team in Microsoft Digital.

That speed and clarity are especially important for new engineers.

Complex networks rely on unwritten knowledge and experience built up over time, which can slow onboarding and make troubleshooting harder than it needs to be. IGraph shortens that learning curve by making the network’s relationships and current state immediately visible.

“The tool delivers value right away, especially for newer engineers,” says Manjiri Keskar, a principal cloud network engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “Instead of having to piece things together, they get an instant view of the network that shows how devices are connected and displays the already-surfaced incidents directly on the graph.”

What’s next for IGraph Agent

Without IGraph Agent, network analysis is largely reactive.

Teams often address failures after customers have already felt the impact, instead of preventing issues by acting when early warning signs appear.

“Agentic AI is transforming networking DevOps from manual, reactive operations into intelligent, intent-driven systems that can provision, validate, and troubleshoot networks autonomously,” says Sonika Munde, a senior network engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “Looking ahead, it will power self-healing networks and dramatically accelerate buildouts, allowing engineers to focus on architecture, strategy, and innovation.”

That unified network intelligence will let IGraph Agent communicate with multiple lightweight agents that continuously analyze network conditions, dramatically compressing response times.

“What used to happen in hours will happen in minutes,” Munde says.

Now, the team is pushing further. One example is layering in weather intelligence to help engineers anticipate issues before they materialize, as big storms can trigger power fluctuations that ripple through the network. By visualizing this data, engineers can proactively communicate with customers and take mitigation steps that protect operational workloads.

Overall, IGraph lets teams focus on prevention. Engineers spend less time navigating dashboards and cross-checking data and more time detecting patterns and surfacing emerging risks. Manual analysis is reduced as the agent highlights insights in real time.

The technology is poised to go even further. IGraph will eventually help power self-healing networks and speed up network build-outs, freeing engineers to focus on architecture and innovation. The future vision for the tool includes fully automated predictive network intelligence across all Microsoft campuses, with agents that monitor, reason, recommend responses, and safely take action.

“By bringing telemetry, topology, and AI together in one intelligent layer, we’re turning fragmented signals into real-time intelligence so teams can move faster, act earlier, and protect the critical workloads that power Microsoft,” says Jason Thompson, a principal group product manager for the Infrastructure, Networking, and Tenant team in Microsoft Digital.

Key takeaways

To move from reactive operations to proactive AI-supported network management, we recommend starting with these steps:

  • Start consolidating real-time telemetry into a single view. Even a lightweight dashboard is enough to prepare for AI-driven insights later.
  • Identify high-frequency incident types to target for AI triage. Pick the most common or disruptive scenarios and map out what data engineers currently review for them.
  • Document the decision logic your engineers use today. Before implementing AI, capture the human reasoning steps to help guide your approach.
  • Pilot an agentic solution with one network segment or site. Start with one building, one lab, or a small testbed.

The post Supercharging network operations at Microsoft with AI-based unified network intelligence appeared first on Inside Track Blog.

]]>
23737
Microsoft CISO advice: Apply engineering fundamentals to securing AI http://approjects.co.za/?big=insidetrack/blog/microsoft-ciso-advice-apply-engineering-fundamentals-to-securing-ai/ Thu, 30 Apr 2026 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=23334 Agentic AI, like any software, is just one part of a business solution. It is not the only element that needs to be secured. Engineers need to approach securing agentic AI in the corporate IT ecosystem the same way they would consider any security problem—from end to end. Yonatan Zunger, CVP and deputy CISO for […]

The post Microsoft CISO advice: Apply engineering fundamentals to securing AI appeared first on Inside Track Blog.

]]>
Agentic AI, like any software, is just one part of a business solution. It is not the only element that needs to be secured. Engineers need to approach securing agentic AI in the corporate IT ecosystem the same way they would consider any security problem—from end to end.

Yonatan Zunger, CVP and deputy CISO for Microsoft, suggests that focusing exclusively on hardening a piece of software against security threats may make it difficult to use and introduce a new risk when users get frustrated and try to bypass controls. This is why engineers need to consider not just individual components but how they work together to maintain productivity.

“Think of every system as a socio-technical system containing many parts, and all of them working together in unison have to be secured,” Zunger says.

Watch this video to see Yonatan Zunger explain why engineering fundamentals are critical to building resilient AI systems. (For a transcript, please view the video on YouTube: https://www.youtube.com/watch?v=YU-8lpwPtm0.)

The post Microsoft CISO advice: Apply engineering fundamentals to securing AI appeared first on Inside Track Blog.

]]>
23334
Reclaiming engineering time with AI in Azure DevOps at Microsoft http://approjects.co.za/?big=insidetrack/blog/reclaiming-engineering-time-with-ai-in-azure-devops-at-microsoft/ Thu, 16 Apr 2026 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=23161 At Microsoft Digital, the company’s IT organization, we’re reimagining how engineers, product managers, and program managers work. Microsoft Azure DevOps (ADO) is our company’s end-to-end software development lifecycle (SDLC) solution for planning, coding, testing, and delivery. It combines tools for work tracking, source control, pipelines, and artifacts so teams can manage the entire SDLC in […]

The post Reclaiming engineering time with AI in Azure DevOps at Microsoft appeared first on Inside Track Blog.

]]>
At Microsoft Digital, the company’s IT organization, we’re reimagining how engineers, product managers, and program managers work.

Microsoft Azure DevOps (ADO) is our company’s end-to-end software development lifecycle (SDLC) solution for planning, coding, testing, and delivery. It combines tools for work tracking, source control, pipelines, and artifacts so teams can manage the entire SDLC in one environment.

Although ADO excels at streamlining the development process, we found that users were still spending significant time performing repetitive administrative tasks, like creating and breaking down work items, writing and managing queries for reporting, and reclaiming lost permissions.

Our Engineering Systems Platform team successfully embedded AI into ADO, resulting in ADO experiences that replace manual workflows and free up our IT professionals to concentrate on work that makes a real impact.

Identifying the opportunity

The Engineering Systems Platform team supports 15,000 active users across one of the largest ADO platforms at Microsoft.

Three years ago, the team began exploring opportunities to automate repetitive ADO tasks like creating and updating work items, navigating project data, gathering statuses, and breaking large initiatives into sprint-ready work.

While they found ways to automate some of these tasks, they discovered that decision-making and information synthesis still consumed valuable time and occasionally introduced human error.

“We saw the toll these processes took on users, whether they were compiling information or performing manual tasks,” says Gopal Panigrahy, a principal product manager in Microsoft Digital. “Even with automation, there was still an opportunity to give time back to engineers.”

Adding AI to ADO workflows

ADO spans a vast area at Microsoft, serving a wide range of enterprise use cases and personas. What these workers have in common is heavy workloads. With this in mind, different categories of ADO users expressed the desire for AI-powered experiences that could help streamline workflows and speed up day-to-day development tasks.

As generative AI matured, our team explored whether they could embed AI technology inside ADO to act as a real-time assistant, handling administrative work and answering contextual questions using natural language.

The guiding principles of the experiment were simple: Stay in context and preserve user control while aligning with existing ADO permissions and processes.

That vision led to the creation of two complementary Microsoft Copilot agents: the DevOps Assistant and the AI Work Item Assistant.

“We saw it as a win-win experiment,” says Debashis Sahoo, a principal group engineering manager in Microsoft Digital. “If we could give engineers time back in ADO, they could spend it building, not managing artifacts.”

What makes this initiative distinctive is that it brings AI closer to the core ADO product and its users. It allows confidential, context-rich ADO data to be used safely and securely for meaningful AI-powered experiences.

DevOps Assistant offers conversational, in-context support

DevOps Assistant is a chat‑based experience present in the ADO user interface (UI). It’s activated in a side panel where users can ask natural language questions to retrieve information, check project statuses, and run common DevOps actions without navigating away from their main ADO display.

The DevOps Assistant enables cross-source discovery, which cuts the time engineers and product managers spend switching contexts and searching for information. That lowers their cognitive load and helps ADO users move faster and stay focused on product delivery.

Under the hood, the DevOps Assistant is a constellation of specialized agents, each of which is focused on a different segment of the DevOps lifecycle:

  • Work Item Agent creates, refines, and scopes work into sprint-ready backlogs
  • Knowledge Board Agent surfaces the right DevOps knowledge at the right moment
  • Permission Agent handles access and permission requests
  • Bulk Complete Agent runs repetitive, large-scale updates
  • Sprint Board Agent summarizes sprint status and provides instant, prompt‑driven insights

The agents are built in Copilot Studio and coordinated by the Orchestrator Agent, Copilot Studio’s front door.

For example, if a user asks to create or refine work items, the Orchestrator Agent routes the request to the Work Item Agent to handle. If the question is about permissions, then it delegates the work to the Permission Agent. It does this for each task.
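The routing pattern described above can be sketched in a few lines of Python. This is only an illustration of the delegation idea: the handler functions are hypothetical stand-ins for the specialized agents, and the keyword matching is an assumption made to keep the example runnable. The article does not say how the Orchestrator Agent actually detects intent.

```python
from typing import Callable


# Hypothetical handlers standing in for the specialized agents named in the article.
def work_item_agent(request: str) -> str:
    return f"[Work Item Agent] drafting work items for: {request}"


def permission_agent(request: str) -> str:
    return f"[Permission Agent] checking access for: {request}"


def sprint_board_agent(request: str) -> str:
    return f"[Sprint Board Agent] summarizing sprint for: {request}"


# Keyword routing is an assumption for illustration only; a production
# orchestrator would more likely classify intent with a language model.
ROUTES: list[tuple[tuple[str, ...], Callable[[str], str]]] = [
    (("permission", "access"), permission_agent),
    (("sprint", "status"), sprint_board_agent),
    (("work item", "backlog", "user story"), work_item_agent),
]


def orchestrate(request: str) -> str:
    """Send the request to the first agent whose keywords match."""
    text = request.lower()
    for keywords, handler in ROUTES:
        if any(keyword in text for keyword in keywords):
            return handler(request)
    return "[Orchestrator] no matching agent; asking the user to clarify"


print(orchestrate("I lost access to the repo"))
print(orchestrate("Break this user story into tasks"))
```

Keeping the routing table separate from the handlers means a new specialized agent can be added by registering one more entry, without touching the existing agents.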

“We didn’t just build a chatbot,” says Apoorv Gupta, a principal software engineer in Microsoft Digital. “We built a distributed system of agents that understands the intent of the DevOps user and acts on it securely and in context.”

At present, the DevOps Assistant is available across all our internal ADO environments at Microsoft. The plan is to make it available to external customers soon.

AI Work Item Assistant provides inline assistance

The AI Work Item Assistant is a real-time embedded experience within ADO work items. Powered by Microsoft Foundry, it helps users create and refine work items using context and business requirements.

The assistant works immersively, keeping users focused and within ADO as they structure work items or generate child items from the parent.

For product and program managers who start with high‑level ideas, the assistant understands intent. It can automatically suggest logical, sprint‑ready breakdowns, helping to dramatically reduce the time spent on planning, sorting, and prioritizing work items.

Screenshot showing the “Use AI to edit this item” button in the Azure DevOps UI.
The AI Work Item Assistant is just a click away in Azure DevOps work items.

Turning newfound time into innovation

The key to reclaiming time for your workforce isn’t just the introduction of new AI-driven features. It’s using the technology to enforce structure and quality at the beginning, so that everything downstream moves faster.

Panigrahy describes the practice as three reinforcing feedback loops.

The first loop is upstream quality amplification. AI agents help consistently structure work items with clear acceptance criteria and templates. The structure then feeds other tools (such as GitHub Copilot), allowing them to generate higher-quality code and more predictable outcomes—shortening the overall software development lifecycle.
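To make "consistently structured" concrete, the Python sketch below checks a work item against a minimal template before it moves downstream. The `WorkItem` fields and the `missing_fields` helper are hypothetical and chosen for illustration. They do not reflect the actual ADO schema or the assistant's own checks.

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    # Hypothetical minimal template; real ADO work items carry many more fields.
    title: str
    description: str = ""
    acceptance_criteria: list[str] = field(default_factory=list)


def missing_fields(item: WorkItem) -> list[str]:
    """List the template fields that are still empty."""
    gaps = []
    if not item.title.strip():
        gaps.append("title")
    if not item.description.strip():
        gaps.append("description")
    if not any(criterion.strip() for criterion in item.acceptance_criteria):
        gaps.append("acceptance_criteria")
    return gaps


draft = WorkItem(title="Add retry to upload")
ready = WorkItem(
    title="Add retry to upload",
    description="Retry failed uploads with backoff.",
    acceptance_criteria=["Upload retries up to three times", "Failures are logged"],
)

print(missing_fields(draft))  # ['description', 'acceptance_criteria']
print(missing_fields(ready))  # []
```

A check like this is cheap to run at creation time, which is where the article argues structure pays off most.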

The second feedback loop is acceleration of execution. In a typical sprint planning session, a team of eight engineers might:

  • Take an hour (or more) to manually break user stories into more than 100 tasks
  • Create different tasks in their own style, introducing inconsistency and ambiguity
  • Generate uneven details, then spend time clarifying data later

With DevOps Assistant and AI Work Item Assistant, that same task breakdown turns into a prompt-driven action that no longer requires hours of work.

“It burns a lot of time for everyone to manually create each item in their own way, making sure they’re using the correct inputs from the product manager and confirming they aren’t missing anything,” Panigrahy says. “Now, with AI magic, it takes less than three minutes.”

The third feedback loop is capacity reinvestment. Instead of spending hours on tactical DevOps mechanics, teams can now spend more time on engineering judgment, resulting in better estimation, technical decisions, and design. They can use these reclaimed hours to learn new tools, experiment with new agents, and innovate on the SDLC.

“Capacity saving keeps giving back, in a loop,” Gupta says. “You get more capacity back. You innovate. You learn. You do better.”

What’s next on the AI-in-ADO journey

The DevOps Assistant and the AI Work Item Assistant can help change user behavior, shifting time away from tactical DevOps tasks and toward higher‑value, judgment-based work. These tools can help teams increase work quality and reduce wasted time.

“Our next chapter is about making AI smarter, more action-oriented, and truly agentic,” Sahoo says. “The goal is to reduce cognitive load and allow the experience to live wherever users are—from Azure DevOps to Microsoft Teams and Microsoft 365—so the agent works seamlessly across their workflow.”

AI-driven productivity gains are arguably the biggest opportunity in the industry. They’re fundamentally redefining the engineering experience at an unprecedented pace.

“While we’ve made huge strides embedding AI into the everyday Azure DevOps experience, it still feels like we’re just getting started,” Sahoo says. “Staying relevant means continuously evolving to deliver ever-greater value and efficiency to engineers.”

Key takeaways

Keep these tips in mind as you get started on your own journey with AI and Microsoft ADO:

  • Treat AI as a strategic accelerator, not as an add-on. Identify where your engineering process can use AI to move from simple assistance to transforming your workflows.
  • Target high-effort, high-volume tasks first. Analyze where your teams are spending significant manual time, even if AI tools are already in place in those workflows.
  • Validate productivity with measurable data, not intuition. Track time reclaimed, workflow efficiency, reduction in manual steps, and user satisfaction. Tangible data can help your initiative earn trust and justify the expansion of AI tool use on your team.

The post Reclaiming engineering time with AI in Azure DevOps at Microsoft appeared first on Inside Track Blog.

]]>
23161
Olutunde Makinde: From Lagos to Redmond, a Microsoft IT engineer’s journey http://approjects.co.za/?big=insidetrack/blog/olutunde-makinde-from-lagos-to-redmond-a-microsoft-it-engineers-journey/ Thu, 02 Apr 2026 16:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=22855 A career in Microsoft Digital, the company’s internal IT organization, puts employees at the center of one of the world’s most complex and forward‑leaning enterprise environments. This is the team that runs Microsoft on Microsoft technology and services—maintaining more than a million computing devices, enabling global collaboration, and shaping the employee experience for more than […]

The post Olutunde Makinde: From Lagos to Redmond, a Microsoft IT engineer’s journey appeared first on Inside Track Blog.

]]>
A career in Microsoft Digital, the company’s internal IT organization, puts employees at the center of one of the world’s most complex and forward‑leaning enterprise environments. This is the team that runs Microsoft on Microsoft technology and services—maintaining more than a million computing devices, enabling global collaboration, and shaping the employee experience for more than 200,000 people.

To accomplish these huge tasks, it’s essential to cultivate a range of perspectives, expertise, and lived experiences.

Olutunde Makinde is an example of this.

Makinde, a senior service engineer in Microsoft Digital, came to the company the long way around—roughly 7,000 miles away from the Redmond, Washington, headquarters, in fact. He’s originally from Lagos, Nigeria.

As a global organization, Microsoft builds teams where people with different experiences and life journeys actively influence how products, services, and internal platforms are designed. Makinde, commonly known around the office as “Tunde” (“rhymes with Sunday,” he notes), embodies that diverse approach, bringing his unique insights and experiences to critical work at the company.

“A friend once laughed at me back in college when I said I wanted to work at Microsoft, like it was impossible,” Makinde says. “But I knew I could achieve the impossible if I could just be focused. I never gave up.”

Launching an IT career in Nigeria

Makinde’s journey to Microsoft began with earning a degree in computer engineering in Lagos, after which he found work as a network engineer. He spent the next several years developing his skills through certifications and other learning opportunities.

“I did a lot of self-paced training, learning how to configure Cisco routers. Eventually I became a Cisco-certified network professional (CCNP),” Makinde says. “Around that time, I had a friend who was preparing for Windows Server 2008 certifications, and through his study materials I started learning more about Microsoft and its products.”

Makinde’s first direct encounter with Microsoft came in 2014, when the company he worked for received a contract to deploy the first Microsoft Azure cloud installation in Nigeria.

“I spent the last day of 2014 and the first day of 2015 at the customer site, figuring out how to connect their on-premises network to Azure,” Makinde says. “It had never been done before in Nigeria, and taking up that challenge really propelled me into the world of Microsoft-specific technology.”

From there, Makinde set his sights on a career at Microsoft. He parlayed his initial exposure to cloud architecture into a focus on Azure, as well as Amazon Web Services. After spending some time in the United Kingdom, he achieved his goal when he was hired by the Microsoft Digital team in 2022. He moved to the United States in 2025.

He credits support from his family, especially his wife, with helping him achieve his dreams.

“My wife was a pillar of support through every career transition, from Nigeria to the UK to the United States,” Makinde says. “She believed in me when I faced rejections, celebrated with me when I finally got the offer, and now keeps me grounded whenever work gets intense. I couldn’t have made this journey without her.”

Making an impact from day one

Kathren Korsky, a principal technical program manager in Microsoft Digital and Makinde’s hiring manager, remembers the impression he made right away. It was clear that Makinde’s experience and technical background were major assets.

“What caught my attention was how well-prepared he was for the conversation and how well he communicated,” Korsky says. “The stories he shared about his work with Azure deployment in Nigeria really drew my interest. But I was also intrigued by how he was able to bridge technology with the business world, working with different banks across the continent to gather requirements, understand them, and build solutions.”

Upon being hired at Microsoft, Makinde initially worked remotely from the UK on a Redmond-based device and application management team. The team was looking to deploy Cloud PC internally and needed a system in which employees could request access and get approvals to use Cloud PCs.

“He was able to stand up a full Power Automate workflow within a short period, and with a very high degree of quality,” Korsky says. “Rarely did anyone find any defects or bugs in his system.”

Makinde’s designs drove value moving forward as well, as the team made updates to his initial workflows.

ServiceNow was more commonly used for systems that involved access requests and approvals, but when a platform update from Power Automate was initiated, the team found that Makinde’s original design was durable enough to weather the shift.

“His design was so strong that we were basically able to follow exactly what he had created in Power Platform and build that exact same design in ServiceNow,” Korsky says. “It really expedited that whole process.”

Driving efficiency and managing change

Since moving to the United States to work at company headquarters, Makinde has continued to push important projects forward—working with different stakeholders to deploy policy changes across Microsoft, managing the Change Advisory Board (CAB) intake process, and driving configuration updates for security and first-party product deployments.

Makinde learned how to assess change requests and understand risk profiles, as well as enforce best practices for managing change within the security environment. Within about a year, he was able to take the lead in the space and own the deployment process.

A single misconfigured policy can cause major disruption. Makinde’s role puts him in position to be the checkpoint that prevents incidents before they happen.

“There’s a lot of diligence required to see the edge cases happening, to pay attention to them, and to watch out for potential problems,” says Jeff Duncan, principal service engineering manager in Microsoft Digital and Makinde’s manager. “Tunde stops rollouts regularly to flag potential defects or risks, which prevents issues from interrupting our work and reducing productivity.”

Softer skills like transparency, collaboration, and clear communication across levels and teams are key aspects of Makinde’s work as well.

“Tunde is thoughtful and detail-oriented, and he’s very good at explaining the decision-making process when he provides overviews for leadership,” Duncan says. “There’s rational, logical reasoning behind the decisions he makes.”

Makinde has implemented new efficiencies in how he manages the CAB and deployment service using AI. This includes CABBIE—an AI-powered agent that automates CAB communications. For Intune deployments, he uses AI to streamline deployment coordination and package reviews. These innovations reflect our Customer Zero approach to AI adoption here in Microsoft Digital.

“We run weekly CAB meetings to review change requests. That comes with a lot of communication work — status updates, follow-ups, coordination with stakeholders. It was all manual,” Makinde says. “CABBIE pulls the data from Azure DevOps, generates the emails, updates requests, and logs approvals automatically. It saves time and reduces errors.”

Success at Microsoft Digital: Aptitude and curiosity

As the organization at the center of the company’s own digital transformation, we in Microsoft Digital function as a living showcase of what’s possible with Microsoft technology. Our team tests new capabilities at enterprise scale as Customer Zero for Microsoft, identifying gaps and providing insights to ensure our customers benefit from what we’ve learned.

Because the impact of Microsoft Digital extends far beyond internal systems, team members have to set the standard for digital excellence. They must demonstrate what enterprise transformation looks like in practice and empower customers with the confidence to pursue their own modernization journeys.

Hiring talented people like Makinde is essential to this mission.

“There are three core traits I look for when hiring—aptitude, attitude, and curiosity,” Korsky says. “Aptitude is not only what you currently know, but your propensity and desire to learn and grow those skills. Attitude goes hand in hand with that—are you willing to demonstrate grit and perseverance? And then curiosity, because so much of what we do from an innovation perspective requires a willingness to challenge assumptions and think of completely new ways of doing things.”

Makinde’s journey here at Microsoft Digital embodies and illustrates the company’s larger story: how technical expertise, innovative thinking, and a commitment to continuous learning combine to deliver world-class results.

“I’m now up to 25 certifications, and I continue to learn how to do more at Microsoft to positively impact the organization and protect our employees’ experience across applications and devices.”

Olutunde Makinde, senior service engineer, Microsoft Digital

That attitude of persistent curiosity and the willingness to keep learning continue to fuel Makinde’s experience at Microsoft. 

“Self-improvement is a way of life for me that has driven my career forward,” Makinde says. “At an early stage in my career, I did a lot of self-training—from learning how to configure Cisco routers and switches, to migrating on-premises workloads to Azure and managing cloud resources. I’m now up to 25 certifications, and I continue to learn how to do more at Microsoft to positively impact the organization and protect our employees’ experience across applications and devices.”

Key takeaways

Olutunde Makinde’s career experience here in Microsoft Digital offers some important insights that you can apply to your own organizational development:

  • AI adoption starts with practical problems. Makinde’s use of AI to streamline CAB communications and deployment coordination shows how Customer Zero teams find real-world applications for emerging technology.
  • Different experiences and perspectives contribute to business success. Achieving ambitious goals as an organization is dependent upon attracting talented people like Makinde from a range of backgrounds, disciplines, and lived experiences.
  • Strong technical skills paired with innovative thinking drive value. Makinde’s contributions to flexible cloud deployment workflows are an example of how this combination pays dividends.
  • Proactive risk management and attention to detail can prevent large-scale disruptions. By being willing to stop rollouts and flag risks before they become problems, Makinde’s approach to his work exemplifies how thoughtful decision-making safeguards productivity and security.
  • Persistence, curiosity, and continuous learning are critical career accelerators. Having a long and successful career at a company like Microsoft goes beyond just technical aptitude; it also requires perseverance and a passion for learning. Makinde’s self-driven training efforts and his refusal to give up have enabled him to achieve what once seemed impossible.


]]>
22855
Protecting anonymity at scale: How we built cloud-first hidden membership groups at Microsoft http://approjects.co.za/?big=insidetrack/blog/protecting-anonymity-at-scale-how-we-built-cloud-first-hidden-membership-groups-at-microsoft/ Thu, 26 Feb 2026 17:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=22465 Some Microsoft employee groups can’t afford to be visible. For years, we supported email‑based communities internally here at Microsoft whose very existence depends on anonymity. These include employee resource groups, confidential project teams, and other sensitive audiences where simply revealing who belongs can create real‑world risk. Traditional distribution groups make membership discoverable by default. Owners […]

The post Protecting anonymity at scale: How we built cloud-first hidden membership groups at Microsoft appeared first on Inside Track Blog.

]]>
Some Microsoft employee groups can’t afford to be visible.

For years, we supported email‑based communities internally here at Microsoft whose very existence depends on anonymity. These include employee resource groups, confidential project teams, and other sensitive audiences where simply revealing who belongs can create real‑world risk.

Traditional distribution groups make membership discoverable by default. Owners can see members. Admins can see members. In some cases, other users can infer membership through directory queries or tooling.

That model doesn’t work when anonymity is a requirement.

A photo of Reifers.

“When the SFI wave hit, it was made clear to us that we needed to keep our people safe, and to do that, we needed to build a new hidden memberships group MVP. We needed to raise the bar with modern groups, and we needed to do it in six months or miss meeting our goals.”

Brett Reifers, senior product manager, Microsoft Digital

For over 15 years, we relied on a custom, on‑premises solution that enabled employees to send and receive messages through groups with fully hidden memberships.

The system worked, but we were deprecating the Microsoft Exchange servers that it ran on. At the same time, we were also deploying our Secure Future Initiative (SFI), which required us to reassess legacy systems that could expose sensitive data or slow incident response, including hidden membership groups.

The system wasn’t broken, but it represented concentrated risk simply by existing outside our modern cloud controls and monitoring.

“When the SFI wave hit, it was made clear to us that we needed to keep our people safe, and to do that, we needed to build a new hidden memberships group MVP,” says Brett Reifers, a senior product manager in Microsoft Digital, the company’s IT organization. “We needed to raise the bar with modern groups, and we needed to do it in six months or miss meeting our goals.”

The mandate was clear. Preserve anonymity, eliminate on‑premises dependencies, and do it quickly.

A photo of Carson.

“Our solution would enable us to deprecate our legacy on-premises Exchange hardware while maintaining the privacy of our employee groups, and it would do so in a cloud-first manner.”

Nate Carson, principal service engineer, Microsoft Digital

Instead of retrofitting hidden membership into standard Microsoft 365 groups, we asked a different question: What if the group lived somewhere else entirely? What if users interacted with a simple, secure front end, while all membership expansion and mail flow occurred in a locked‑down tenant built specifically for this purpose?

That idea became the foundation for Hidden Membership Groups: a new cloud‑first architecture that would separate the user experience from membership data, leverage first‑party Microsoft services, and keep our group memberships hidden from everyone, including owners and administrators, by design.

“Our solution would enable us to deprecate our legacy on-premises Exchange hardware while maintaining the privacy of our employee groups, and it would do so in a cloud-first manner,” says Nate Carson, a principal service engineer in Microsoft Digital.

Once we settled on a solution, our next step was to get support for solving a problem not many people thought much about.

“Not everyone was aware of how serious of a situation we were in,” Carson says. “We had to show everyone what was at stake, and to share our solution with them.”

After taking their plan on the road, the team got the buy-in it needed, and that’s when the real work started.

Planning to solve business problems with security built-in

Before we designed anything, we had to be clear about what success meant.

Hidden Membership Groups aren’t just another collaboration feature. They support scenarios where anonymity isn’t optional; it’s foundational. That reality shaped every requirement that we built into our solution, including:

1. Absolute privacy

Group membership couldn’t be directly visible to users, group owners, or administrators under any circumstances. That requirement immediately ruled out standard group models.

2. Cloud only

Any new solution had to live entirely in our cloud, use first‑party services, and align with modern identity, security, and compliance practices. On‑premises infrastructure wasn’t an option.

3. Scale

Some groups had a handful of members. Others had tens of thousands. Membership changed frequently, and those changes had to propagate safely and predictably without exposing data or degrading performance.

4. Separation of concerns

User interaction and membership truth couldn’t live in the same place. Employees needed a simple way to discover groups, request access, and manage participation, without ever interacting with the system that stored or expanded membership.

5. Self‑service with guardrails

The solution needed to reduce operational overhead, not introduce a new bottleneck. Group lifecycle management had to be automated, auditable, and secure, while still giving teams flexibility.

6. Simple to use

Employees shouldn’t need special training. They shouldn’t need to understand tenants, identity synchronization, or mail routing. The experience needed to be intuitive, consistent, and accessible—without compromising security.

Once those requirements were clear, our solution started to emerge. Incremental changes wouldn’t be enough. A traditional group model wouldn’t work. The solution required a new architecture—one designed around isolation, automation, and intentional limitation.

That’s when we started the engineering work.

Creating a cloud-first architecture

Designing for hidden membership meant eliminating ambiguity. If any surface could reveal membership, even indirectly, it didn’t belong in the design.

That constraint led us toward a model built on strict isolation, explicit APIs, and intentionally narrow interfaces. The result is straightforward to use, but deliberately difficult to interrogate.

Two tenants, with sharply separated responsibilities

At the foundation of the solution is a two‑tenant model.

Our primary Microsoft 365 tenant is where employees authenticate, discover groups, and initiate actions. A secondary, isolated tenant hosts the distribution lists and performs mail expansion for Hidden Membership Groups.

A photo of Mace.

“Tenant isolation is what makes the privacy guarantee real. By moving membership expansion to a tenant that users and owners can’t access, we removed the possibility of accidental exposure. The system simply doesn’t give you a place where membership can be seen.”

Chad Mace, principal architect, Microsoft Digital

That separation matters because the secondary tenant isn’t designed for interactive use. Only Exchange and the minimum directory constructs required for mail routing and expansion are enabled.

Operationally, when an employee sends email to a Hidden Membership Group, they send to a mail contact visible in the corporate tenant. That contact routes to the corresponding distribution group in the isolated tenant, where membership expansion occurs. Expanded messages are then delivered back to recipients’ inboxes in the corporate tenant, so sent and received mail lives where users already work.
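The routing described above can be modeled in a few lines of Python. This is a toy sketch for illustration only, not the production implementation: the class names, methods, and addresses are hypothetical, and real mail flow is handled by Exchange rather than by application code like this.

```python
class IsolatedTenant:
    """Toy stand-in for the secondary tenant that owns group membership."""

    def __init__(self):
        self._groups = {}  # group address -> set of member addresses

    def add_group(self, address, members):
        self._groups[address] = set(members)

    def expand_and_deliver(self, group_address, message, inboxes):
        # Expansion happens only here; nothing about the members is returned.
        for member in self._groups.get(group_address, set()):
            inboxes.setdefault(member, []).append(message)


class CorporateTenant:
    """Toy stand-in for the primary tenant where employees send and read mail."""

    def __init__(self, contacts):
        self.contacts = dict(contacts)  # visible contact -> isolated group address
        self.inboxes = {}               # recipient address -> delivered messages

    def send(self, to_contact, message, isolated):
        # The sender only ever addresses the mail contact, never the group itself.
        target = self.contacts[to_contact]
        isolated.expand_and_deliver(target, message, self.inboxes)


isolated = IsolatedTenant()
isolated.add_group("erg@isolated.example", {"ana@corp.example", "raj@corp.example"})

corp = CorporateTenant({"erg@corp.example": "erg@isolated.example"})
corp.send("erg@corp.example", "Meeting at 3 PM", isolated)
```

In this sketch the corporate side holds only the contact mapping and the inboxes. Membership lives in one object that offers no method for listing it, which mirrors the intent of keeping expansion in a tenant that users can't browse.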

“Tenant isolation is what makes the privacy guarantee real,” says Chad Mace, a principal architect in Microsoft Digital. “By moving membership expansion to a tenant that users and owners can’t access, we removed the possibility of accidental exposure. The system simply doesn’t give you a place where membership can be seen.”

Identity without interactive access

This isolated tenant only works if it can resolve recipients. To enable that, our development team used Microsoft Entra ID multi‑tenant organization identity sync to represent corporate users in the secondary tenant.

These identities are treated as business guest identities, and we disable sign‑in to prevent interactive access. The tenant can perform expansion, but nothing more.

However, complete isolation wasn’t technically possible. Privileged access always exists at some level. The design response was to minimize that exposure. Access to the isolated tenant is tightly restricted, and membership changes flow through automation rather than broad UI-based administration.

The goal: reduce exposure to the smallest viable operational group.

API-first automation as the control plane

With the tenancy and identity model established, the team needed a single, consistent way to create groups, connect objects across tenants, and manage changes without introducing new administrative workflows. That’s where the APIs come in.

A photo of Pena II.

“We split the backend into multiple APIs so the system could scale without becoming fragile. That let us separate everyday operations from high-volume membership work and keep performance predictable.”

John Pena II, principal software engineer, Microsoft Digital

The backend is intentionally modular, split into three distinct APIs:

  • The control API handles group creation, configuration, and cross‑tenant coordination.
  • The membership API handles standard add and remove operations.
  • The bulk membership API handles large‑scale operations involving tens of thousands of users, with services designed to run long‑lived jobs, manage throttling, and recover from partial failures.
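As a rough illustration of what running long‑lived jobs, managing throttling, and recovering from partial failures can look like, here is a minimal Python sketch. It is not the team's code: `run_bulk_add`, `ThrottledError`, the batch size, and the retry settings are all assumptions made for this example.

```python
import time


class ThrottledError(Exception):
    """Raised by the (hypothetical) directory call when it is rate-limited."""


def run_bulk_add(users, add_batch, batch_size=500, max_retries=3,
                 base_delay=0.0, checkpoint=0):
    """Add users in batches, starting at index `checkpoint`.

    Returns the index of the first user not yet processed, so a job that
    stops early can be resumed by passing that value back as `checkpoint`.
    """
    position = checkpoint
    while position < len(users):
        batch = users[position:position + batch_size]
        for attempt in range(max_retries + 1):
            try:
                add_batch(batch)
                break
            except ThrottledError:
                if attempt == max_retries:
                    return position  # stop here; the caller can resume later
                # Exponential backoff; base_delay is 0 so the demo runs instantly.
                time.sleep(base_delay * (2 ** attempt))
        position += len(batch)
    return position


added = []
calls = {"n": 0}


def flaky_add_batch(batch):
    calls["n"] += 1
    if calls["n"] == 2:  # simulate a single throttled call
        raise ThrottledError()
    added.extend(batch)


users = [f"user{i}" for i in range(1200)]
done = run_bulk_add(users, flaky_add_batch, batch_size=500)
```

Returning a checkpoint rather than raising lets a long job stop cleanly and pick up where it left off, which is one simple way to handle partial failure without reprocessing users that were already added.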

“We split the backend into multiple APIs so the system could scale without becoming fragile,” says John Pena II, a principal software engineer in Microsoft Digital. “That let us separate everyday operations from high-volume membership work and keep performance predictable.”

The APIs run as PowerShell-based Azure Functions and use managed identity patterns, including federated identity credentials, to securely connect across tenants.

Creating the user experience with Power Apps

For the front end, we built a Canvas app in Power Apps, backed by Dataverse. The goal was speed and flexibility, without compromising strict privacy boundaries.

By using Power Apps as the primary interaction layer, we deliver a secure, modern experience without unnecessary custom infrastructure. The Canvas app provides a single, focused surface for discovering, joining, and managing hidden membership groups, while all sensitive operations remain behind controlled APIs and tenant boundaries. This separation allows the team to iterate quickly on experience design without weakening the privacy guarantees that the solution depends on.

Power Platform also simplifies how security is enforced across the solution. Dataverse enables fine‑grained, role‑based access, ensuring users only see data they’re entitled to see, while keeping sensitive membership information entirely out of the client layer. That reduces long‑term maintenance overhead and makes it easier to evolve the solution as requirements change.

“From the beginning, we designed everything with security roles and workflows in mind,” says Shiva Krishna Gollapelly, senior software engineer in Microsoft Digital. “Dataverse let us control who could see or change data without building additional APIs or storage layers, and keeping everything inside the Power Apps ecosystem saved us a lot of maintenance over time.”

Dataverse plays a precise role here: it maintains the datastore the app needs to function without becoming a secondary membership repository.

A photo of Amanishahrak.

“Using the Power Platform let us move fast, integrate deeply with Microsoft identity, and enforce security without building a full web stack from scratch.”

Bita Amanishahrak, software engineer II, Microsoft Digital

From a security posture perspective, Dataverse security is used intentionally to restrict what different users can see and do, and the Power App was developed with security roles and workflows in mind.

Short version: the app brokers intent, the APIs execute it, and all the pieces that need to stay separate do exactly that.
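One way to picture that split is the Python sketch below. It is a simplified model under stated assumptions, not how the Canvas app or Dataverse work internally: `AppStore`, `MembershipApi`, the request queue, and the record fields are hypothetical names invented for this example.

```python
class AppStore:
    """Stand-in for the app's datastore: a group catalog and pending requests."""

    def __init__(self, groups):
        self.groups = dict(groups)  # group id -> display name
        self.requests = []          # pending intents; never a member list

    def request(self, user, group_id, action):
        if group_id not in self.groups:
            raise KeyError(f"unknown group: {group_id}")
        if action not in ("join", "leave"):
            raise ValueError(f"unsupported action: {action}")
        self.requests.append({"user": user, "group": group_id, "action": action})


class MembershipApi:
    """Stand-in for the backend: the only component that touches membership."""

    def __init__(self):
        self._members = {}  # group id -> set of users, kept out of the app store

    def process(self, store):
        handled = 0
        while store.requests:
            req = store.requests.pop(0)
            members = self._members.setdefault(req["group"], set())
            if req["action"] == "join":
                members.add(req["user"])
            else:
                members.discard(req["user"])
            handled += 1
        return handled


store = AppStore({"g1": "Example Community"})
store.request("ana@corp.example", "g1", "join")
store.request("raj@corp.example", "g1", "join")
store.request("raj@corp.example", "g1", "leave")

api = MembershipApi()
handled = api.process(store)
```

After processing, the app-side store holds only the catalog and an empty queue. The resulting membership exists solely inside the backend object, which is the property the design depends on.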

“Using the Power Platform let us move fast, integrate deeply with Microsoft identity, and enforce security without building a full web stack from scratch,” says Bita Amanishahrak, a software engineer in Microsoft Digital.

The architectural intent is consistent throughout—isolate the sensitive plane and ensure the user plane operates only through controlled interfaces.

Benefits and impact

The most important outcome of the new architecture is also the simplest: Hidden membership stays hidden.

Anonymity isn’t enforced by policy. It’s enforced by architecture. Membership data never appears in the user experience or administrative tooling, and it doesn’t surface as a side effect of scale.

“We’re no longer asking people to trust that we’ll handle sensitive membership carefully through process,” Reifers says. “The system makes exposure structurally impossible.”

The impact was immediate.

At launch, we migrated more than 2,200 hidden membership groups, representing over 200,000 users, from the legacy on‑premises system into the new cloud‑first architecture. Groups ranged from small, tightly controlled communities to audiences with tens of thousands of members, all supported without special handling.

“Some of these groups are massive,” Pena says. “We knew from the beginning we were dealing with memberships in the tens of thousands, which is why we designed bulk operations as a first‑class capability instead of an afterthought.”

The separation between routine APIs and bulk‑membership APIs proved critical, enabling large migrations and ongoing changes without degrading day-to-day performance.

Operationally, moving to a cloud‑only model reduced both risk and complexity. Decommissioning the on‑premises Exchange infrastructure eliminated specialized maintenance requirements and brought monitoring, auditing, and access controls into alignment with our modern cloud standards.

Delivery speed also mattered. Driven by Secure Future Initiative urgency and strong executive sponsorship, the team designed and delivered a minimum viable product in less than six months.

“That timeline forced discipline,” Reifers says. “We focused on what mattered: Security, privacy guarantees, scale, and a UX that wouldn’t disrupt group owners and/or members that had relied on a 15-year old tool.”

Everything else was secondary.

A photo of Gollapelly.

“Most users never think about tenants or APIs. They just see a clean experience that does what they need, without exposing anything it shouldn’t.”

Shiva Krishna Gollapelly, senior software engineer, Microsoft Digital

From an employee perspective, the experience became simpler and safer. Users now interact through a Power Platform app consistent with the rest of Microsoft 365.

Discovering a group, requesting access, or leaving a group no longer requires understanding the architecture behind it.

“Most users never think about tenants or APIs,” Gollapelly says. “They just see a clean experience that does what they need, without exposing anything it shouldn’t.”

The result is sustainable. The platform protects anonymity at scale, simplifies operations, boosts resiliency, and can evolve without reopening core privacy questions.

Moving forward

Delivering the initial solution was only the beginning.

The team sees Hidden Membership Groups as more than a single solution. It’s a reusable pattern for sensitive collaboration in a cloud‑first world: isolate what matters most, automate everything else, and design experiences that don’t require trust to be safe.

As adoption grows, the team plans to support additional anonymity-sensitive scenarios while maintaining the same underlying model.

“We don’t want every sensitive scenario inventing its own workaround,” Mace says. “This gives us a pattern we can reuse confidently.”

Future priorities include improving lifecycle and ownership experiences, strengthening auditing and reporting for approved administrators, and enhancing self‑service workflows—without compromising membership privacy. If it risks exposing membership, it doesn’t ship.

With the legacy system fully retired, Reifers reflects on what the team accomplished to get here.

“We shipped a new enterprise pattern in six months using our first party tools,” Reifers says. “We achieved this because a stellar team cared about the mission. That’s the takeaway.”

Key takeaways

Use these tips to strengthen your privacy, simplify your operations, and future-proof your organization’s collaboration systems:

  • Prioritize privacy by design. Embed privacy considerations from the start to protect sensitive information in all collaboration scenarios.
  • Architect for scale. Treat bulk operations as a first-class capability so that large groups are supported efficiently.
  • Automate and modernize workflows. Replace legacy systems with cloud-native solutions to reduce risk, improve transparency, and enable continuous improvement.
  • Streamline user experience. Provide intuitive, consistent interfaces that make it easy for users to access, join, or leave groups without requiring technical knowledge.
  • Enforce strict access and auditing controls. Align monitoring and administration with modern cloud standards to maintain security and accountability.
  • Create reusable patterns. Establish and share successful privacy patterns to avoid reinventing solutions for each new case.
  • Focus on operational simplicity and resilience. Design systems that are easy to maintain and improve, freeing up teams to concentrate on innovation rather than upkeep.


]]>
22465
Powering data governance at Microsoft with Purview Unified Catalog http://approjects.co.za/?big=insidetrack/blog/powering-data-governance-at-microsoft-with-purview-unified-catalog/ Thu, 05 Feb 2026 17:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=22272 Data fuels everything that we do here at Microsoft, from the daily operations that keep the business running to the innovations that shape the future. But as data sprawls across teams, systems, and borders, the task of ensuring that it remains secure, accurate, and well-governed is a daunting one. A sound approach to data governance […]

The post Powering data governance at Microsoft with Purview Unified Catalog appeared first on Inside Track Blog.

]]>
Data fuels everything that we do here at Microsoft, from the daily operations that keep the business running to the innovations that shape the future.

But as data sprawls across teams, systems, and borders, the task of ensuring that it remains secure, accurate, and well-governed is a daunting one. A sound approach to data governance is the backbone of responsible data use across the enterprise, creating clarity around data ownership and access.

In an organization the size of Microsoft, no single team can carry this responsibility on its own. Effective data governance must be a distributed effort across all departments and functions.

This story explains how our marketing organization uses the Microsoft Purview Unified Catalog to organize and standardize the data we rely on daily. By putting clear ownership, consistent definitions, and reliable governance in place, we’re turning fragmented, unreliable data into an advantage that supports faster decisions and more effective campaigns.

Data governance at scale

As companies grow, their data governance becomes increasingly complex, with different teams creating their own versions of key data concepts, often without realizing it. The complexity is most visible in the way users across an organization define foundational terms.

A photo of Doughty.

“We found adoption to be much easier when helping teams focus on building more value in their data instead of driving governance like a compliance effort.”

Nick Doughty, senior product manager, Microsoft Purview Unified Catalog

Examples in marketing include what counts as a customer (active vs. inactive, marketing- or sales-qualified), what constitutes sensitive data (personally identifiable information, behavioral data, partner data), and what a metric means (conversion, engagement, attribution windows).
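A small Python sketch shows how quickly such definitions diverge. The records, field names, and the 365-day window are invented for illustration and do not reflect our actual marketing data or glossary.

```python
from datetime import date

# Hypothetical customer records; the fields are invented for this example.
records = [
    {"id": 1, "last_purchase": date(2025, 11, 3), "marketing_qualified": True},
    {"id": 2, "last_purchase": date(2023, 2, 14), "marketing_qualified": True},
    {"id": 3, "last_purchase": date(2025, 8, 21), "marketing_qualified": False},
    {"id": 4, "last_purchase": None, "marketing_qualified": True},
]
as_of = date(2026, 1, 1)


def is_active(record, window_days=365):
    """One team's definition: purchased within the window."""
    last = record["last_purchase"]
    return last is not None and (as_of - last).days <= window_days


def is_qualified(record):
    """Another team's definition: flagged as marketing-qualified."""
    return record["marketing_qualified"]


team_a_count = sum(is_active(r) for r in records)
team_b_count = sum(is_qualified(r) for r in records)

# A shared glossary entry pins one definition that every team reuses.
glossary = {"customer": lambda r: is_active(r) and is_qualified(r)}
shared_count = sum(glossary["customer"](r) for r in records)
```

The same four records yield three different customer counts depending on which definition is applied, which is why a single agreed definition matters before any metric is compared across teams.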

When inconsistent practices take hold, ownership becomes murky. With the increasing demands that managing data quality and integrity put on our leaders and their teams, effective data governance becomes one more hurdle to productivity.

“We started off implementing data governance like an issue register,” says Nick Doughty, a senior product manager within Microsoft Purview Unified Catalog. “Then we progressed to more of an enforcement method, similar to how we were doing security at the time. We found that when we started to push really hard on teams, similar to how we drove other compliance efforts, it was difficult for them to justify or understand why they would want the added governance.”

The introduction of Microsoft Azure Purview in 2020 marked a turning point.

A unified platform for data governance, security, and compliance, Purview helps organizations understand, protect, and manage data across environments. It also addresses fragmented data, lack of visibility into where sensitive data lives and how it moves, compliance complexity with regulations (including GDPR and HIPAA), and security risks.

A photo of Mathur.

“Our marketing teams used to spend hours hunting for the right customer list because multiple versions lived in different locations, each with unclear owners and inconsistent labels. Now our marketers can trust they are working from current information, while avoiding compliance risks associated with incorrect or unauthorized data.”

Sourabh Mathur, principal engineering lead, Global Marketing Engines and Experiences

The Purview Unified Catalog serves as the AI-powered backbone, automatically discovering, classifying, and organizing information so users can easily find and trust the data they need.

By launching the unified catalog, we gave our users a consistent way to understand and use their data, while reinforcing strong governance and compliance practices. The result is data that’s more discoverable, reliable, and actionable. (Azure Purview was renamed Microsoft Purview in 2022, when it was brought together with the Microsoft 365 compliance tools.)

“Our marketing teams used to spend hours hunting for the right customer list because multiple versions lived in different locations, each with unclear owners and inconsistent labels,” says Sourabh Mathur, a principal engineering lead in Global Marketing Engines and Experiences, who helped set up Purview for our marketing organization.

With the unified catalog in place, Purview surfaces the dataset, shows its lineage, and applies the correct sensitivity classifications.

“Now our marketers can trust they are working from current information, while avoiding compliance risks associated with incorrect or unauthorized customer data,” Mathur says.

Powering marketing at Microsoft with Purview

With more than 200 Microsoft Azure subscriptions, our marketing organization manages one of the largest data estates at the company. The team faces the constant challenge of scattered data, unclear data ownership, and inconsistent governance practices that slow down campaigns and increase compliance risk.

A photo of Biswal.

“Marketing can now scale governance across hundreds of data products, support self-service data collection with guardrails, automate access decisions, and enable AI workloads on trusted data.”

Deepak Kumar Biswal, principal software engineering lead, Global Marketing Engines and Experiences

By adopting Purview, our marketing team gained unified visibility, clearer classification standards, and smoother collaboration with other departments, like IT and legal. This reduces friction while strengthening data protection.

The result is an organization that moves faster with greater confidence in how it handles customer and campaign data.

Instead of relying on legacy knowledge, forcing users to dig through different servers and SharePoint sites, or constantly sending queries to the engineering teams, our marketing professionals can now explore the curated Purview Unified Catalog, making streamlined, efficient data discovery possible.

“Marketing can now scale governance across hundreds of data products, support self-service data collection with guardrails, automate access decisions, and enable AI workloads on trusted data,” says Deepak Kumar Biswal, a principal software engineering lead in Global Marketing Engines and Experiences. “Purview turns responsible data use into everyday practice, not extra work.”

Data governance and security: Two sides of the same coin

For our marketing organization, data governance and security are inseparable concepts. As soon as you have customer information, you need to make sure it’s secure—sensitive data must be carefully defined, consistently managed, and protected from misuse or breach.

Purview supports this goal by combining governance capabilities with security and compliance controls that provide added layers of protection.

Within marketing, the governance and security teams work closely together. Good governance measures ensure our data is properly defined and standardized, while strong security policies ensure it’s handled with proper safeguards. By pairing governance with strong security practices, our marketing team can remain compliant with data privacy laws, prevent misuse of sensitive information, and foster trust across their organization.

When our marketing team began its Purview journey five years ago, it adopted a centralized governance model. Much like the structure of a government—where federal, state, and local entities each play a role—our approach allows both centralized standards and local autonomy. This creates consistency across the organization without stifling agility.

Our Data Governance team took on the role of steward, defining standards, onboarding systems, and collaborating with its IT partners to connect data environments. Existing assets like data dictionaries and process flows were used to seed the catalog, ensuring the team started from known ground rather than reinventing definitions from scratch.

This deliberate, incremental approach allowed our marketing team to thoughtfully build out healthy governance practices. By moving slowly, the team learned from each step on its journey, refining processes and establishing consistent practices as it moved along.

For example, working closely with our team in Microsoft Digital allowed them to experiment with different ways of discovering and cataloging their data. This involved taking time to learn and refine how Purview tuned their data before they rolled anything out broadly.

Our goal is to transition to a completely federated model in which responsibility shifts outward. Rather than the marketing governance team doing all the stewardship, individual groups will take ownership of their data within Purview. This shift distributes accountability, embeds governance deeper into daily operations, and makes it easier for teams to monitor data quality and enforce standards on their own.

Impact across the enterprise

Since adopting Purview Unified Catalog, we’ve seen tangible results in our data estate and data governance practices, both in marketing and across every vertical in the company. Here are some companywide highlights:

  • Better consolidation: We’ve unified five catalogs into one.
  • Increased scale: We onboarded 250 data sources in six months, representing roughly 10 million assets.
  • Higher internal adoption: We set up more than 50 governance domains, an effort we supported with reusable training assets, guides, and onboarding materials.

The benefits also extend beyond marketing:

  • Teams across the company are gaining increased confidence in their data definitions.
  • Compliance and privacy obligations are being met more effectively.
  • Business value is being generated through better, more trusted use of data.
  • Organizations are benefiting from faster time-to-insight.

Launching the marketing governance domain

We’re using Purview to combine essential capabilities like data governance, classification, and quality checks across our Microsoft services, which creates a unified foundation for our enterprise-wide metadata management. These unified capabilities make Purview an indispensable tool for us, and for large-scale enterprises.

As early adopters of Purview Unified Catalog, the group launched the Marketing Governance domain, registering more than 200 data products using the Unified Catalog’s data map.

The products, spanning various datasets, are aligned with strict internal governance standards. This gives marketing the ability to govern, classify, and track data across its ecosystem—ensuring adherence to GDPR and other regulatory compliance measures.

“With various role types like data curator and data reader, we can add more visibility into our data—where it lives, how it’s being used, and who are its primary owners,” says Vinny Singh, a principal program manager in Global Marketing Engines and Experiences. “Clearly defining these parameters helps us use the data governance framework as a starting point and improve our data governance capabilities.”

Key takeaways

Our journey with Microsoft Purview Unified Catalog has generated key insights that you can apply to your own data governance efforts. These include:

  • Start small: Don’t try to “boil the ocean.” Begin with three to five governance domains and scale from there.
  • Leverage what you have: Data dictionaries, glossaries, and existing documentation provide a strong starting point for a governance platform founded on the Purview Unified Catalog.
  • Focus on value, not enforcement: Governance resonates when teams see how it helps them, not when it’s mandated.
  • Adapt to your organization: Each team at your company will use Purview differently. Flexibility helps encourage adoption.
  • Build community: Data governance is not a solo effort. Collaboration among stakeholders produces stronger standards and better results.

The post Powering data governance at Microsoft with Purview Unified Catalog appeared first on Inside Track Blog.

Moving from a ‘Scream Test’ to holistic lifecycle management: How we manage our Azure services at Microsoft
http://approjects.co.za/?big=insidetrack/blog/moving-from-a-scream-test-to-holistic-lifecycle-management-how-we-manage-our-azure-services-at-microsoft/
Thu, 20 Nov 2025 17:05:00 +0000

The post Moving from a ‘Scream Test’ to holistic lifecycle management: How we manage our Azure services at Microsoft appeared first on Inside Track Blog.

Nearly a decade ago, as we began our journey from relying on on-premises physical computing infrastructure to being a cloud-first organization, our engineers came up with a simple but effective technique to see if a relatively inactive server was really needed.

They dubbed it the “Scream Test.”

“We didn’t have a great server inventory and tracking system, and we didn’t always know who owned a server,” says Brent Burtness, a principal software engineer in Commerce Financial Platforms, who was one of the leaders for the effort in his group. “So, we essentially just turned them off. If someone screamed—‘Hey, why’d you turn off my server?’—then we’d know it was still being used.”

Today, the basic idea behind the Scream Test is being used across the company, but in a more holistic way. Importantly, it’s been incorporated into the overall lifecycle management of our computing infrastructure. And, through the automation tools provided by Microsoft Azure, we have a much more efficient process for making sure that we’re saving time and money by reducing the number of underused machines we operate, monitor, and maintain.

Uncovering more than expected

The Scream Test was part of the huge effort to evaluate our on-premises compute resources before we began moving to the Azure cloud. After all, why spend resources moving something that isn’t needed?

Pete Apple, who helped develop the concept of the Scream Test, is a cloud network engineering architect in Microsoft Digital, the company’s IT organization. Looking back, he remembers the surprising results that emerged when they began shutting down specific servers to see who noticed.

“We thought we were going to get rid of a small number of machines that weren’t being used,” Apple says. “But we found the actual share was about 15% of all machines, which saved us a lot of effort of moving those unused machines to the cloud. In other words, we downsized on the way to the cloud, rather than after the fact.”

As part of this process, Apple explains, our engineers looked at two related factors to reduce inefficiencies in our usage of computing resources.

The first was to identify systems that were used infrequently, at a very low level of CPU utilization (sometimes called “cold” servers). From that, we could determine which systems in our on-premises environments were oversized, meaning someone had purchased physical machines according to what they thought the load would be, but either that estimate was incorrect or the load diminished over time. We took this data and created a set of recommended Microsoft Azure Virtual Machine (VM) sizes for every on-premises system to be migrated.

“We learned that there’s a lot of orphaned, or underutilized, resources out there,” Burtness says. “These were cases where the workload was so small on a server—like under 5% CPU—that it didn’t make sense to host it on its own machine. We could then move the task or application and get it down to just one or two CPUs on a virtual machine.”
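
The “cold server” screen described above can be sketched in a few lines. This is a minimal illustration rather than the tooling the team used: the `Server` fields, the 50% target utilization, and the sample values are assumptions, and only the 5% threshold comes from the text. It flags servers whose average CPU falls below the threshold and suggests a smaller vCPU count.

```python
import math
from dataclasses import dataclass
from statistics import mean

# Percent; mirrors the "under 5% CPU" figure quoted above.
COLD_CPU_THRESHOLD = 5.0


@dataclass
class Server:
    name: str                 # hypothetical server name
    cores: int                # physical cores on the on-premises machine
    cpu_samples: list[float]  # average CPU utilization per sample, in percent


def recommend(server: Server, target_util: float = 50.0) -> dict:
    """Flag cold servers and suggest a vCPU count (assumed sizing rule)."""
    avg = mean(server.cpu_samples)
    # Number of cores the workload keeps busy on average.
    busy_cores = server.cores * avg / 100.0
    # Size the VM so those busy cores would run at roughly target_util.
    vcpus = max(1, math.ceil(busy_cores / (target_util / 100.0)))
    return {
        "name": server.name,
        "avg_cpu": round(avg, 1),
        "cold": avg < COLD_CPU_THRESHOLD,
        # Never recommend more vCPUs than the machine has cores today.
        "recommended_vcpus": min(vcpus, server.cores),
    }


if __name__ == "__main__":
    fleet = [
        Server("legacy-app-01", 16, [3.0, 4.0, 2.0, 3.0]),
        Server("build-02", 8, [60.0, 70.0, 65.0, 65.0]),
    ]
    for server in fleet:
        print(recommend(server))
```

Sizing on the average alone ignores peaks, so a real assessment would also weigh peak load, memory, and disk before picking a VM size.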

At the time, we did much of this work manually, because we were early adopters. The company now has a number of products available to assist with this review of your on-premises environment, led by Azure Migrate.

Another part of the process was determining which systems were being used for only a few days a month or at certain busy times of the year. These development machines, test/QA machines, and user acceptance testing machines (reserved for final verification before moving code to production) were running continuously in the datacenter but were really only needed during limited windows. For these situations, we applied the tools available in Azure Resource Manager Templates and Azure Automation to ensure the machines would only run when needed.
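
The “run only when needed” idea comes down to a schedule check. The sketch below is an assumption-laden illustration: the `RunWindow` type, the day-of-month windows, and the action names are hypothetical, and the actual start or stop call (for example from an Azure Automation runbook) is left out. It only decides what should happen to a machine on a given date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RunWindow:
    first_day: int  # day of month, inclusive
    last_day: int   # day of month, inclusive


def should_be_running(today: date, windows: list[RunWindow]) -> bool:
    """True when today falls inside any of the machine's run windows."""
    return any(w.first_day <= today.day <= w.last_day for w in windows)


def plan_action(today: date, windows: list[RunWindow], currently_running: bool) -> str:
    """Return 'start', 'deallocate', or 'none' for a single machine."""
    wanted = should_be_running(today, windows)
    if wanted and not currently_running:
        return "start"
    if not wanted and currently_running:
        return "deallocate"
    return "none"


if __name__ == "__main__":
    # Hypothetical UAT machine that is only needed at month end.
    month_end = [RunWindow(26, 31)]
    print(plan_action(date(2025, 11, 10), month_end, currently_running=True))
    print(plan_action(date(2025, 11, 28), month_end, currently_running=False))
```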

Automating with Azure

Today, we don’t have to rely on anything as crude as the Scream Test to find unused and underused computing resources. With 98% of our IT resources operating in the Azure cloud, we have much greater insight into how efficient our network is, so much of the process can be automated.

“We’ve found this effort much easier to manage in the cloud, because all our computing resources are integrated with the Azure portal,” Apple says. “They have an API system and offer various tools within Azure Update Manager and Azure Advisor to help with cost efficiency. It’s kind of like a modern version of Clippy—‘Hey, it looks like your VM isn’t being used much. Do you want to downsize that or turn it off?’”

(For the uninitiated, Clippy was the Microsoft Office animated paperclip assistant introduced in the late 1990s. It offered tips and help with tasks, like writing and formatting documents. Clippy became iconic for its quirky suggestions, including recommending that you remove things from your desktop that you weren’t using.)

And simply taking the step of turning off stuff that we weren’t using turned out to be very effective. Thanks, Clippy!

Today, we approach this challenge in a more efficient and sophisticated way, taking advantage of Azure tools like Update Manager and Advisor.

“With everything being in the Azure portal or in Azure Resource Graph, it’s much more streamlined, and makes it easier to get that data out to the teams,” Burtness says. “We can run automated queries with Azure Resource Graph. Then we bring that information into our internal Service 360 tool, which we use to give action items to our developers. Each item gives them a link to Azure portal, and they can then go into the portal and clean up the resource.”
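
The flow Burtness describes (query, then action items with portal links) might look roughly like the sketch below. Everything here is illustrative: the query is one example of an Azure Resource Graph query for unattached managed disks, the action-item shape is hypothetical (Service 360 is an internal tool and its format is not described in the source), and the portal deep-link prefix is an assumption to verify for your tenant. The rows are passed in as plain dictionaries, so no Azure SDK is needed to follow the logic.

```python
# Illustrative Resource Graph (KQL) query: managed disks with no owning VM.
UNATTACHED_DISKS_QUERY = """
Resources
| where type =~ 'microsoft.compute/disks'
| where isempty(managedBy)
| project id, name, resourceGroup, subscriptionId
"""

# Assumed deep-link prefix; a resource ID is appended to it.
PORTAL_BASE = "https://portal.azure.com/#resource"


def to_action_items(rows: list[dict], owners: dict[str, str]) -> list[dict]:
    """Turn query result rows into action items (hypothetical format)."""
    items = []
    for row in rows:
        items.append(
            {
                "title": f"Review unattached disk '{row['name']}'",
                # Owner lookup by subscription is an assumed convention.
                "owner": owners.get(row["subscriptionId"], "unassigned"),
                "link": PORTAL_BASE + row["id"],
            }
        )
    return items


if __name__ == "__main__":
    sample_rows = [
        {
            "id": "/subscriptions/0000/resourceGroups/rg-demo/providers/Microsoft.Compute/disks/disk-old",
            "name": "disk-old",
            "resourceGroup": "rg-demo",
            "subscriptionId": "0000",
        }
    ]
    for item in to_action_items(sample_rows, {"0000": "team-a"}):
        print(item)
```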

Managing for the lifecycle

One of the most important things we learned by using the Scream Test to identify inefficiencies and moving our systems from on-premises servers to the cloud was that it’s an ongoing process, not a fixed-end project.

“We had this idea that it was going to be a one-time event, that we’ll move to the cloud and then we’ll be done,” Apple says. “A better understanding is that it’s a lifecycle. We have integrated this concept of continual evaluation into our processes around everything that’s still on-premises, because we still have labs, we still have physical infrastructure.”

We continue to do this evaluation on a regular basis with both physical and virtual computing resources, because needs and usage are constantly changing.

Cutting our cloud costs

A text graphic shows the savings that one group at Microsoft achieved by becoming more efficient in their compute usage.
In a pilot set of Azure subscriptions, the Commerce Financial Platforms team reduced usage by 233 resources across 36 subscriptions and 17 services in 6 team groups, saving more than $15,000 in monthly operating costs.

“Now we have a basic process around a six-month cycle,” Apple says. “So, every six months we ask, does this still need to be on-premises or should we start moving it to the cloud? And we do the same thing with our cloud resources. Who’s still using these VMs? And we still go through the same review process to see if it’s needed, or if we can shut it down or move it.”

This has resulted in significant cost savings for the company. “We’re up to about 15% to 20% less compute cost, depending on the organization, because of this much better understanding of our business needs,” Apple says.

Better governance, increased security

Another major benefit of this process was establishing much stronger governance of compute resources across the entire organization.

“When we first did the Scream Test, we weren’t always really sure who owned what, in some cases,” Apple says. “We’ve fixed that as part of this process. This governance aspect is a key part of being more efficient with our resources.”

Burtness explains why this is so important.

“It’s critical to know exactly who to contact when there’s something wrong with the server,” Burtness says. “Now, with clearer ownership, clearer accountability, and better inventory, it’s a much better experience.”

Better governance also means tighter security, according to both Apple and Burtness.

“This is really important when it comes to threat-actor response,” Apple says. “Unused servers can often be an entry point for hackers. Or, say we discover that a machine or server is getting hacked; you need to talk to who owns it. If you don’t know, it takes you longer to track them down and combat the hack. That’s not great. Improving our governance has definitely made securing our environment easier.”

Key takeaways

Here are some things to keep in mind when managing your own enterprise compute resources for greater efficiency:

  • It’s not a one-time exercise. For the best results, you should be evaluating your computing resources on a regular schedule to identify “cold” servers and unused infrastructure.
  • Adjust for variable usage patterns. It’s not just about unused servers. Some machines may only be needed for a business function during certain busy times of the year. Consider turning the machines on just to handle the load during those periods and turning them off the rest of the year.
  • Use Azure tools for greater insight. If you’re operating your infrastructure in the Azure cloud, you can much more easily monitor and address orphaned resources using automated tools such as Azure Advisor, Azure Resource Graph, and the Azure portal.
  • Apply your savings to other priorities. “The more efficient you are, the more savings can be applied to other projects or given back to your manager—who is going to be very happy with you,” Apple says.
  • Saving money is not the only benefit. You’ll not only save operating costs, you’ll have a reduced maintenance and monitoring load, better governance, and fewer security vulnerabilities.

Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft
http://approjects.co.za/?big=insidetrack/blog/vuln-ai-our-ai-powered-leap-into-vulnerability-management-at-microsoft/
Thu, 16 Oct 2025 16:05:00 +0000

The post Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft appeared first on Inside Track Blog.

In today’s hyperconnected enterprise landscape, vulnerability management is no longer a back-office function—it’s a frontline defense. With thousands of devices from a multitude of vendors, and a relentless stream of Common Vulnerabilities and Exposures (CVEs), here at Microsoft we faced a challenge familiar to every IT decision maker: how to scale vulnerability response without scaling cost, complexity, or risk.

Enter Vuln.AI, an intelligent agentic system developed by our team in Microsoft Digital—the company’s IT organization—to transform how we identify, prioritize, and resolve vulnerabilities across our enterprise network.

Manual methods can’t keep up

As a company, we detect over 600 million cybersecurity threats every day, according to our latest Digital Defense Report. Some of those signals are bad actors probing our internal network and infrastructure looking for unpatched vulnerabilities. Our infrastructure supports over 300,000 employees and vendors, 25,000 network devices, and over 560 buildings across 102 countries. This scale means we face a constant stream of vulnerabilities—each requiring triage, impact analysis, and remediation.

“While AI enables amazing capabilities for knowledge workers, it also increases the threat landscape, since bad actors using AI are constantly probing for vulnerabilities. Vuln.AI helps keep Microsoft safe by identifying and accelerating the mitigation of vulnerabilities in our environment,” says Brian Fielder, a vice president within Microsoft Digital. 

Historically, our Infrastructure, Networking, and Tenant team here in Microsoft Digital relied on manual assessments to determine which network devices were impacted by new vulnerabilities. Traditional vulnerability scanning tools generate a lot of false positives and false negatives, and a significant amount of analysis still falls to security engineers, requiring manual validation before any vulnerability impact can be communicated to device owners. These manual methods were time-consuming, error-prone, and reactive—our security engineers were spending hours on each vulnerability, at times missing critical threats or sinking too much time into false alarms.

With the vast number of vulnerabilities coming in every day, security engineers needed a scalable way to quickly analyze, prioritize, and respond.

The solution: Vuln.AI

We’ve already achieved dramatic impact with our AI Ops and Network Infrastructure Copilot, which is on track to save us over 11,000 hours of network service management time per year. We built Vuln.AI on top of that investment, using two agents:

  1. The Research Agent analyzes vulnerability feeds and network metadata from our Infrastructure Data Lakehouse (IDL) built on top of Azure Data Explorer, which regularly ingests data from our device vendors and other sources. Once new vulnerabilities are detected, it automates the identification of impacted devices and integrates with other internal tooling for validation and reporting.
  2. The Interactive Agent acts as a gateway for engineers and device owners to ask follow-up questions and initiate remediation. Through agent-to-agent interaction, it leverages our Network Infrastructure Copilot to query the Research Agent’s findings. This agentic interface enables real-time decision-making and contextual insights.

Together, these agents are significantly improving our network security operations. The results we’re seeing so far are compelling:

  • A 70% reduction in time to vulnerability insights, enabling faster prioritization and mitigation, minimizing exposure windows.
  • Lower risk of compromise through increased accuracy, quicker detection, and containment of threats.
  • A stronger compliance posture that supports adherence to financial, legal, and regulatory requirements.
  • Higher accuracy in identifying vulnerable devices, reducing false positives and missed threats.
  • Engineering hours saved and reduced fatigue, significantly improving productivity.

Our gains translate to lower operational risk, faster response times, and more resilient infrastructure—critical outcomes for any enterprise navigating today’s threat landscape.

“AI’s true power lies in the problem it’s applied to,” says Ankit Bansal, a senior product manager within Microsoft Digital. “Start by identifying the most time-consuming or painful task in your organization, then explore how AI can augment or improve it. Begin with a small, targeted enhancement and iterate continuously.”

How Vuln.AI works

The system continuously ingests CVE data from our device suppliers’ API feeds and from a publicly available database of known cybersecurity vulnerabilities. It correlates that data with device attributes, such as hardware model and OS, to identify the potential impact on the network and surface actionable insights.
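
To make the correlation step concrete, here is a minimal sketch that matches an advisory against a device inventory by hardware model and OS version. It is not how Vuln.AI is implemented: the `Advisory` and `Device` types, the “fixed in” version rule, and all sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    cve_id: str
    model: str                 # hardware model the advisory applies to
    fixed_in: tuple[int, ...]  # first OS version that contains the fix


@dataclass(frozen=True)
class Device:
    hostname: str
    model: str
    os_version: str  # dotted numeric version, e.g. "17.3.1"


def parse_version(text: str) -> tuple[int, ...]:
    """Convert '17.3.1' into (17, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def impacted_devices(advisory: Advisory, inventory: list[Device]) -> list[Device]:
    """Devices of the affected model running an OS older than the fix."""
    return [
        device
        for device in inventory
        if device.model == advisory.model
        and parse_version(device.os_version) < advisory.fixed_in
    ]


if __name__ == "__main__":
    advisory = Advisory("CVE-2025-XXXX", "switch-model-a", (17, 6, 1))
    inventory = [
        Device("sw-bldg1-01", "switch-model-a", "17.3.1"),
        Device("sw-bldg1-02", "switch-model-a", "17.6.1"),
        Device("sw-bldg2-01", "switch-model-b", "17.3.1"),
    ]
    hits = impacted_devices(advisory, inventory)
    print(f"{len(hits)} devices impacted by {advisory.cve_id}")
```

Comparing versions as integer tuples avoids the string-comparison trap where "17.10.1" sorts before "17.9.1".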

Engineers interact with the system via Copilot, Teams, or custom tooling, which allows seamless integration with our network security teams’ daily workflows.

“We built a hybrid approach in Vuln.AI to guide LLMs through complex security advisories,” says Blaze Kotsenburg, a software engineer in Microsoft Digital. “By combining structured function calls, templated prompts, and data validation, we keep the model focused on producing reliable, actionable insights for vulnerability mitigation.”
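
One way to picture the combination of templated prompts and data validation is the sketch below. It is an assumption, not the team’s code: the prompt wording, the expected JSON keys, and the validation rules are hypothetical, and the model call itself is omitted. The point is that the model’s reply is treated as untrusted input and checked before anything downstream uses it.

```python
import json
from string import Template

# Hypothetical prompt template; $advisory is filled in per advisory.
PROMPT = Template(
    "You are analyzing a security advisory.\n"
    "Advisory text:\n$advisory\n\n"
    "Reply with JSON only, using the keys: cve_id, affected_models, fixed_version."
)

# Expected reply fields and their types (assumed schema).
REQUIRED_FIELDS = {"cve_id": str, "affected_models": list, "fixed_version": str}


def build_prompt(advisory_text: str) -> str:
    return PROMPT.substitute(advisory=advisory_text)


def validate_reply(raw: str) -> dict:
    """Parse a model reply and reject anything that does not fit the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    if not data["cve_id"].startswith("CVE-"):
        raise ValueError("cve_id does not look like a CVE identifier")
    return data


if __name__ == "__main__":
    print(build_prompt("Example advisory text."))
    reply = '{"cve_id": "CVE-2025-XXXX", "affected_models": ["switch-model-a"], "fixed_version": "17.6.1"}'
    print(validate_reply(reply))
```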

When it came to building Vuln.AI, we relied heavily on our own technology platforms, including: 

  • Azure AI Foundry for model development and deployment
  • Azure Data Explorer to store device metadata and CVEs
  • Agent-to-agent interaction with Network Copilot to query our database for device and inventory knowledge
  • Azure OpenAI models for natural language processing and classification
  • Azure Durable Functions for fine-grained orchestration and custom LLM workflows

“We chose Durable Functions for Vuln.AI because it allowed us to confidently orchestrate complex, stateful research,” says Mike Lollis, a senior software engineer in Microsoft Digital.  “The reliability and simplicity of the framework meant we could shift our focus to engineering the intelligence behind the agent, especially the prompting strategies used in Vuln.AI’s backend processing.”
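
Stateful orchestration frameworks of this kind generally work by checkpointing each completed step so that a restarted run can pick up where it left off. The sketch below illustrates that idea in plain Python. It does not use the Durable Functions API, and the step names and `history` dictionary are hypothetical.

```python
from typing import Callable

# A step is a name plus a function that receives the results so far.
Step = tuple[str, Callable[[dict], object]]


def run_orchestration(steps: list[Step], history: dict) -> dict:
    """Run steps in order, skipping any whose result is already recorded.

    `history` stands in for durable storage: if a step raises, the results
    of earlier steps stay in it, and a later call resumes after them.
    """
    for name, step in steps:
        if name in history:
            continue  # already completed in an earlier run
        history[name] = step(history)  # record the result as a checkpoint
    return history


# Hypothetical research steps for illustration only.
def fetch_advisories(_history: dict) -> list[str]:
    return ["advisory-1", "advisory-2"]


def count_advisories(history: dict) -> int:
    return len(history["fetch"])


STEPS: list[Step] = [("fetch", fetch_advisories), ("count", count_advisories)]

if __name__ == "__main__":
    print(run_orchestration(STEPS, {}))
```

Because completed results are reused rather than recomputed, a transient failure in a late step does not repeat the expensive earlier ones.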

Vuln.AI in action

Consider a common scenario: a new CVE that affects a network switch has just been published. Vuln.AI’s research agent immediately flags the vulnerability, maps it to potentially affected devices in our network inventory, and pushes the findings to an internal database.

This data then becomes immediately accessible in our internal tools, where it is validated and approved by security engineers. Following this, network engineers are provided with precise information about their vulnerable devices.

Engineers can prompt Vuln.AI’s interactive agent to instantly retrieve the following information:

“12 devices impacted by CVE-2025-XXXX. Would you like me to suggest some next steps for mitigation or remediation?”

With Vuln.AI, network engineers can now begin vulnerability response operations much more quickly—no spreadsheet wrangling and no delays.

“AI is only as good as the data you provide,” says Linda Lee, a product manager II within Microsoft Digital. “Much of the success with Vuln.AI came from our dedicated efforts to source comprehensive vulnerability data and device attributes. For effective AI-powered solutions, you really need to invest in a strong data foundation and a strategy for how to integrate into the rest of your infrastructure.”

It’s about automating manual workflows and research.

“Vuln.AI has reduced our triage time by over 50%,” says Vincent Bersagol, a principal security engineer in Microsoft Digital.

This is allowing our engineers to focus on deeper analysis.

“The synergy between security and AI engineering has unlocked a new level of precision in vulnerability insights,” Bersagol says. “This is just the beginning.”

The journey ahead

Our journey with AI-powered vulnerability management has only just begun. Looking ahead, our roadmap for Vuln.AI includes:

  • Extending data coverage to include more hardware suppliers
  • Integrating more detailed device profiles for more targeted vulnerability response
  • Supporting autonomous workflows to streamline network engineers’ remediation efforts
  • Incorporating other AI agents to support more security use cases

These enhancements will further reduce risk, accelerate response times, and empower engineers to focus on more strategic initiatives.

“Trust is the foundation of everything we do in Microsoft Digital,” Bansal says. “Securing our network is essential to upholding that trust. Intelligent solutions like Vuln.AI not only help us stay ahead of emerging threats—they also establish the blueprint for integrating AI more deeply into our security operations.”

For IT leaders, Vuln.AI offers a blueprint for modern vulnerability management:

  • Scalable: Handles thousands of devices and vulnerabilities with ease
  • Accurate: Reduces false positives and missed threats
  • Efficient: Saves time, money, and resources
  • Secure: Built on Microsoft’s trusted AI and security frameworks

In a world where every second counts and any threat can be costly, Vuln.AI transforms vulnerability management from a bottleneck into a competitive advantage for Microsoft.

Key takeaways

As your organization looks for ways to improve security and threat response in a fast-changing landscape, consider the following insights on how AI is reshaping vulnerability management at Microsoft:

  • Fight fire with fire: The threat landscape has broadened dramatically due to bad actors using AI. Supplementing your own efforts with AI can help you manage your risk more effectively than traditional vulnerability management.
  • Agility is key: Effective vulnerability response hinges on acting fast. An AI-powered solution like Vuln.AI can cut the time needed to analyze and mitigate vulnerabilities by over 50%, enabling organizations to enhance security operations at scale.
  • The future is now: Looking ahead, Microsoft Digital will integrate agentic workflows into more security operations, boosting efficiency in risk prevention, threat detection and response, thereby enabling security practitioners and developers to focus on more strategic projects.

Keeping our in-house optical network safe with a Zero Trust mentality
http://approjects.co.za/?big=insidetrack/blog/keeping-our-in-house-optical-network-safe-with-a-zero-trust-mentality/
Thu, 16 Oct 2025 16:00:00 +0000

The post Keeping our in-house optical network safe with a Zero Trust mentality appeared first on Inside Track Blog.

When it comes to corporate connectivity at Microsoft, a minute of lost connection can lead to catastrophic disruptions for our product teams, sleepless nights for our network engineers, and millions of dollars of lost value for the company.

That’s why we built our own optical network at our headquarters in Washington state, and that’s why we’re building similar networks at other regional campuses around the United States and the rest of the world.

With so much on the line, we need to make sure these in-house networks never go down.

But how are we doing that?

We’re applying the same robust Zero Trust approach we take to security and identity. While our optical networks are extremely reliable, any complex system can be knocked offline. In alignment with the Zero Trust mentality we have as a company, we trust the integrity of what we’ve built, but we still needed a backup system that went beyond redundancy to provide true resilience.

Driven by this goal, we created a Zero Trust Optical Business Continuity Disaster Recovery (BCDR) network that combines two fully independent optical systems designed to sustain uninterrupted services, even during systemic failures. The result is more confidence for our employees and vendors, less pressure on our network engineers, and comprehensive network resilience that will protect us against a major outage.

The urgency of resilience

In 2021, our team in Microsoft Digital, the company’s IT organization, deployed our first next-generation optical network to serve the exclusive network needs of our Puget Sound metro campuses. It offers more bandwidth on less fiber for a lower operational cost than leasing from traditional carriers.

“Puget Sound is a highly concentrated developer network where we need to provide very high throughput,” says Patrick Alverio, principal group software engineering manager for Infrastructure and Engineering Services within Microsoft Digital. “Our optical system is the backbone of all that traffic.”

Our state-of-the-art optical network fulfills our need for fast and reliable connectivity at up to 400 Gbps between core sites, labs, data centers, and the internet edge. We built this network on Reconfigurable Optical Add/Drop Multiplexer (ROADM) technology, which delivers dynamic reconfiguration; colorless, directionless, contentionless (CDC) capabilities; flexible grid support; remote provisioning; and automation. It also features a full-mesh topology that provides a layer of redundancy.

But what if the entire ROADM-based system fails?

There are plenty of operational risks that can derail even the most robust network. Anything from misconfigured automation scripts to policy changes to misaligned software versioning to simple human error can cause outages.

A photo of Elangovan

“We don’t want even a second of downtime. We needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.”

Vinoth Elangovan, senior network engineer, Hybrid Core Network Services, Microsoft Digital

To some degree, those kinds of minor disruptions are inevitable. But catastrophic events like fiber cuts, failures in the ROADM operating system, or even natural disasters have the potential for even more wide-ranging disruption.

During a catastrophic outage, thousands of engineers, developers, researchers, and other technical employees who need access to crucial lab environments and data centers could lose connectivity. That can sabotage feature delivery, disrupt product patches, interrupt updates, and halt all kinds of core product functions.

On top of normal software development operations, new AI tools demand massive bandwidth and consistent uptime. Our hybrid networks also feature paths integrated with Microsoft Azure that consume on-premises resources, so they stand to benefit from increased resilience as well.

A catastrophic network outage can cause incredible damage to all of these business functions. In fact, we experienced exactly that in 2022.

A fiber cut combined with a ROADM system hardware reboot caused a five-minute outage in our Puget Sound metro region. In this environment, every minute of lost connectivity can result in significant financial impact, making network resilience absolutely essential.

“We don’t want even a second of downtime,” says Vinoth Elangovan, a senior network engineer for Hybrid Core Network Services in Microsoft Digital, who designed and implemented the Zero Trust Optical business continuity and disaster recovery (BCDR) network for Microsoft. “We needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.”

Delivering greater network resilience

To ensure we could deliver uninterrupted network connectivity even in the midst of a catastrophic outage, we needed to consider the technical demands of a truly resilient system. Five design pillars helped us assemble our architectural criteria:

  1. Independent optical systems: To provide true resilience, our primary and BCDR platforms needed to operate autonomously.
  2. Physically independent paths: Circuits should avoid shared conduits, fibers, and splices to operate completely independently.
  3. Separate control software: The primary and backup networks should operate through dedicated network management systems (NMSs), automation, and provisioning domains.
  4. Unified client interface: Both systems needed to terminate into the same interface to unify service for clients and applications.
  5. Survivability by design: We couldn’t assume that any system would be immune to failure. Instead, we built for the best possible outcomes.
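
The second pillar, physically independent paths, lends itself to an automated check. The following is a minimal sketch, not our production tooling: it assumes a hypothetical inventory in which each circuit lists the conduit, fiber, and splice identifiers it depends on, and it reports any physical resource that a primary and a backup circuit have in common. The `Circuit` class, its field names, and the sample identifiers are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Circuit:
    """A hypothetical optical circuit and the physical resources it rides on."""
    name: str
    conduits: frozenset = field(default_factory=frozenset)
    fibers: frozenset = field(default_factory=frozenset)
    splices: frozenset = field(default_factory=frozenset)


def shared_resources(primary: Circuit, backup: Circuit) -> dict:
    """Return the physical resources both circuits depend on, grouped by type.

    An empty dict means the two circuits share nothing in this inventory.
    """
    overlap = {
        "conduits": primary.conduits & backup.conduits,
        "fibers": primary.fibers & backup.fibers,
        "splices": primary.splices & backup.splices,
    }
    # Keep only the resource types where an overlap actually exists.
    return {kind: ids for kind, ids in overlap.items() if ids}


def is_physically_independent(primary: Circuit, backup: Circuit) -> bool:
    """True when the circuits share no conduit, fiber, or splice."""
    return not shared_resources(primary, backup)


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    primary = Circuit(
        "roadm-site-a-to-edge-1",
        conduits=frozenset({"conduit-101"}),
        fibers=frozenset({"fiber-7"}),
        splices=frozenset({"splice-33"}),
    )
    backup = Circuit(
        "mux-site-a-to-edge-1",
        conduits=frozenset({"conduit-101"}),  # shared conduit: a hidden risk
        fibers=frozenset({"fiber-12"}),
        splices=frozenset({"splice-40"}),
    )
    print(is_physically_independent(primary, backup))
    print(shared_resources(primary, backup))
```

In this example the two circuits use different fibers and splices but run through the same conduit, so a single cut could take out both. A check like this is only as good as the inventory behind it, which is one reason physical path records need to be kept accurate.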

The result was the Zero Trust Optical BCDR architecture, a layered approach to optical networking. It consists of our primary, ROADM-based transport layer and a secondary, MUX-based transport layer, both terminating into a single logical port channel.

Both systems are live and active, which means they deliver production services through their own independent fibers, power supplies, and software stacks. By layering fully independent optical domains and logically unifying them at the Ethernet edge, the network can sustain a complete failure of one system and maintain continuity.

That physical and operational independence is the difference between simple redundancy and robust resilience.
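
A short calculation shows why independence matters so much. This is a sketch under a simplifying assumption: the two systems fail independently of each other, which is exactly what separate fibers, power supplies, and software stacks are meant to achieve. The availability figures are hypothetical and are not measurements from our network.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of parallel systems where service survives if any one is up.

    Each argument is a fraction between 0 and 1. The formula
    1 - product(1 - a_i) holds only if failures are statistically independent.
    """
    all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        all_down *= 1.0 - a  # probability that this system is also down
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical figures: two systems that are each up 99.9% of the time.
    single = 0.999
    both = combined_availability(single, single)
    print(f"one system:  {single:.6f}")
    print(f"two systems: {both:.6f}")
```

Under these assumptions, two 99.9% systems in parallel reach roughly 99.9999%. If the systems share a conduit or a control plane, their failures are correlated and the real figure falls back toward that of a single system.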

“Our core responsibility is the employee experience, so our main design thrust was making sure it’s seamless and uninterrupted—even during an outage,” Elangovan says.

Optical network backed by a BCDR network

A schematic of an optical network running between different nodes and backed up by a BCDR network.
The optical network in our Puget Sound region connects core sites to labs, datacenters, and the internet edge, while the BCDR network provides backup connections to deliver resilience in case of a catastrophic network failure.

A typical ROADM optical network connects campus and datacenter sites to the internet edge. Our design features three interconnected optical rings, with two internet edges as multi-directional nodes, while other sites operate as dual-degree nodes with bidirectional redundancy. Meanwhile, our campuses and datacenters are designated as critical sites and equipped with Optical BCDR links to ensure enhanced resiliency. In the event of a complete Optical ROADM line failure, these critical sites retain connectivity.

In the event of an outage on the primary network, the port channel maintains forwarding continuity automatically, shifting WAN traffic between optical paths in real time.

The transition occurs seamlessly and transparently, with no noticeable impact to clients.
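
To make the failover behavior concrete, here is a simplified model of a port channel with two member links. It is an illustration only: real switches hash on packet header fields in hardware, whereas this sketch hashes a flow label with CRC32, and the `PortChannel` class and the member names `roadm` and `mux` are assumptions made for the example.

```python
import zlib


class PortChannel:
    """A toy logical interface that spreads flows across its active member links."""

    def __init__(self, members):
        # Every member link starts in the "up" state.
        self.link_up = {name: True for name in members}

    def set_link_state(self, name: str, up: bool) -> None:
        if name not in self.link_up:
            raise KeyError(f"unknown member link: {name}")
        self.link_up[name] = up

    def active_members(self):
        return sorted(name for name, up in self.link_up.items() if up)

    def pick_member(self, flow: str) -> str:
        """Map a flow to one active member link using a deterministic hash."""
        active = self.active_members()
        if not active:
            raise RuntimeError("no active member links")
        digest = zlib.crc32(flow.encode("utf-8"))
        return active[digest % len(active)]


if __name__ == "__main__":
    channel = PortChannel(["roadm", "mux"])
    flows = [f"lab-flow-{i}" for i in range(8)]

    # Both optical domains are live, so flows are spread across them.
    print({flow: channel.pick_member(flow) for flow in flows})

    # Simulate a complete failure of the primary (ROADM) domain.
    channel.set_link_state("roadm", False)
    print({flow: channel.pick_member(flow) for flow in flows})
```

After the simulated failure, every flow maps to the remaining `mux` member without any change to the logical interface that clients see. That is the property the design relies on: the failure is absorbed below the interface that applications use.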

Coupling at the Ethernet layer provides clients and applications with one logical interface, automatic load balancing and traffic distribution, and seamless failover, regardless of which optical domain is providing service.

“Our initial goal was to provide high-throughput connectivity for major labs, with less than six minutes of downtime per year,” says Blaine Martin, principal engineering manager for Hybrid Core Network Services in Microsoft Digital. “That represents a service level of 99.999% network continuity, and we’re aiming for even better moving forward.”
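
The relationship between an availability percentage and a yearly downtime budget is simple arithmetic. The sketch below assumes a 365-day year; the function name is ours, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, that an availability target allows."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {budget:.2f} minutes per year")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, which matches the goal of less than six minutes of downtime.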

A new era of confidence for network engineers

For the network engineers who keep Microsoft employees and resources connected, the Zero Trust Optical BCDR network relieves much of the pressure that comes from resolving outages.

When a network goes down, engineers have an enormous set of responsibilities to manage: processing the incident report, assigning severity, performing checks, notifying internal teams, providing updates, and engaging with physical support teams—all with a profound urgency to restore productivity.

Dialing those pressures back has been a huge benefit.

“Before, we were dependent on a single system, even with redundancies, so the human experience was like firefighting,” says Kevin Bullard, Microsoft Digital principal cloud network engineering manager responsible for maintaining WAN interconnectivity between labs. “Now, if the primary optical network is having a problem, I don’t even see it.”

There will always be pressure on network engineers to restore connectivity during an outage, but they can breathe easier knowing it won’t cost the company millions of dollars as the time to resolve ticks away. And in non-emergency situations like core site migrations, the BCDR network provides a much easier way to shunt services while the main network is offline.

“Our internal users have become more confident that they can stay connected, no matter what,” says Chakri Thammineni, principal cloud network engineer for Infrastructure and Engineering Services in Microsoft Digital. “That gives the people responsible for maintaining our enterprise networks incredible peace of mind.”

Fortunately, there hasn’t been a substantial network outage in the Puget Sound metro area since 2022. But our network engineering teams know that if and when it happens, the BCDR network will be ready to maintain service continuity.

With our Puget Sound network protected, we have plans in place to extend this model to other metro areas. Naturally, we have to balance population, criticality, and the knowledge that elevated reliability and availability come with a cost.

Our selection criteria for new BCDR networks have largely centered around two factors: expansions of AI-critical infrastructure and concentrations of secure access workspaces (SAWs) for technical employees. With these criteria in mind, we’re planning new BCDR networks first in the Bay Area and Dublin, then in Virginia, Atlanta, and London.

The Zero Trust Optical BCDR architecture represents a paradigm shift in enterprise network resilience, and we’re committed to expanding the model to benefit both conventional workloads and the expanding infrastructure demands of AI.

“We’re always looking ahead into industry trends to stay at the bleeding edge, whether that’s in the technology we provide for our customers or the networks we use to do our own work,” Alverio says. “We refuse to accept the status quo, and we’re elevating the experience for employees across Puget Sound and Microsoft as a whole.”

Driving AI innovation in optical network resilience

Our journey towards an AI-driven optical network is gaining momentum.

As part of our Secure Future Initiative, we’ve automated our Optical Management Platform credential rotation and are actively developing intelligent incident management ticket enrichment, auto-remediation, link provisioning, deployment validation, and capacity planning.

AI plays a central role in this transformation.

With Microsoft 365 Copilot and GitHub Copilot integrated into our engineering workflows, we’re accelerating development cycles, improving code accuracy, and uncovering optimization opportunities that would otherwise take hours of manual effort.

These Copilots are also helping our engineers analyze network patterns, simulate outcomes, and validate deployment logic before execution, reducing human error and strengthening our Zero Trust posture. Over time, we’re evolving toward a system where AI not only assists but proactively predicts potential disruptions, recommends remediations, and continuously learns from operational telemetry.

These advancements are paving the way for a future where our optical infrastructure can anticipate issues, recover faster, and operate with the agility and assurance expected in a Zero Trust environment.

Key takeaways

If you’re planning to implement your own optical and BCDR networks, consider these tips:

  • Understand the technical components of resilience: Independent optical systems, physically independent paths, separate control software, a unified client interface, and survivability by design are the key technical components of true resilience.
  • Plan from a preparedness and value perspective: Evaluate the critical points in your infrastructure and determine where you can get the most value out of resilient connectivity.
  • Ensure your teams have the right skillset: Carefully consider the right workforce to run those systems and be accountable for their operation.

The post Keeping our in-house optical network safe with a Zero Trust mentality appeared first on Inside Track Blog.
