Network and infrastructure Archives - Inside Track Blog
http://approjects.co.za/?big=insidetrack/blog/tag/network-and-infrastructure/
How Microsoft does IT
Wed, 10 Jun 2026 23:57:01 +0000

Microsoft Build 2026: Empowering our developers to adopt agentic AI at Microsoft
http://approjects.co.za/?big=insidetrack/blog/microsoft-build-2026-empowering-our-developers-to-adopt-agentic-ai-at-microsoft/
Tue, 02 Jun 2026 19:15:00 +0000

The post Microsoft Build 2026: Empowering our developers to adopt agentic AI at Microsoft appeared first on Inside Track Blog.

In Microsoft Digital, the company’s IT organization, our journey to agentic AI has been an evolution—one that began with early experimentation in AI-powered productivity and has grown into a coordinated effort to enable intelligent, scalable solutions across the enterprise.

As AI capabilities advanced, we saw an opportunity to move beyond individual productivity gains and toward something more transformative: Empowering our developers to build intelligent agents that can automate workflows, streamline operations, and create new business value.

Realizing this vision required more than new tools. We needed to rethink how we foster development, govern innovation, and operate at scale.

Today, we’re sharing the foundation we built that supports this shift.

We’re driving employees across Microsoft to create and use AI agents—from simple, task-focused solutions to enterprise-grade applications available across the company. It’s all supported by a secure, governed, and extensible platform.

“We’ve made a lot of progress enabling our developers to build agents that make us more productive,” says Brian Fielder, vice president of Microsoft Digital, the company’s IT organization. “We’re Customer Zero at Microsoft, which means we’re the first to deploy and use the technology and services that we later sell to our customers. Those learnings give us a unique perspective and story to share about the journey our developers have been on with AI and agents.”

Within the context of Microsoft Build 2026, we’re sharing what it really takes to move from experimentation to impact. Through this collection of stories and resources, we highlight how we’re empowering our developers to build with agentic AI—from establishing governance and platform capabilities to driving adoption and delivering real-world outcomes. Our goal is to provide practical insights you can use to accelerate your own AI journey.

“We hope you find the journey we’ve been on practical and useful,” Fielder says. “When it comes to agents, we’re accelerating fast and scaling at an enterprise level. As our story continues to evolve, we look forward to sharing it with you.”

Guidance for developers: How we manage agentic AI at Microsoft

These articles outline our vision for agentic AI, showing how we’re building a secure, governed, and extensible foundation for AI agents—from Work IQ and Copilot Studio to Agent 365, Azure DevOps, and Model Context Protocol—so developers can create scalable, high-value solutions across the enterprise.

Our IT guide to becoming a Frontier Firm

These stories share our IT playbook for becoming a Frontier Firm, highlighting a practical path to enterprise AI maturity through agentic transformation, operational scale, responsible innovation, and partnership—showing how IT leaders can balance governance, modernization, and employee engagement while building an AI-first organization.

Working as a developer in IT at Microsoft in the era of AI

These stories explore what it means to work in Microsoft Digital during the AI era, showing how developers and knowledge workers are reshaping engineering, the employee experience, and their own career growth through AI-powered tools, new ways of working, and personal journeys that reflect the evolving culture of IT at Microsoft.

Key takeaways

From our journey enabling agentic AI across Microsoft Digital, several key principles have emerged to help organizations move from experimentation to scalable, enterprise-wide impact.

  • Treat your organization as Customer Zero. Use your own AI capabilities first to generate real-world insights, validate scenarios, and build credibility before scaling to customers.
  • Build a foundation for scale. Establish a secure, governed, and extensible platform that enables developers to create AI agents—from simple solutions to enterprise-grade applications.
  • Empower developers to drive transformation. Move beyond productivity gains by enabling developers to build intelligent agents that automate workflows and unlock new business value.
  • Align governance with innovation. Rethink how you enable development, govern AI, and operate at scale to balance flexibility with responsible use.
  • Connect tools, platforms, and workflows. Integrate AI capabilities across your ecosystem—linking platforms, governance models, and development tools to support consistent, scalable adoption.
  • Translate experimentation into impact. Focus on turning early AI exploration into coordinated, enterprise-wide efforts that deliver measurable outcomes.

Visualizing success: Steering your AI deployment with a strategy council
http://approjects.co.za/?big=insidetrack/blog/visualizing-success-steering-your-ai-deployment-with-a-strategy-council/
Thu, 28 May 2026 16:05:00 +0000

The post Visualizing success: Steering your AI deployment with a strategy council appeared first on Inside Track Blog.

The pace of change when it comes to AI’s impact on business today is astounding. Companies are scrambling to develop and maintain a cohesive strategy for managing this impact and getting the most out of this revolutionary technology.

At Microsoft Digital, the company’s IT organization, we’re using a set of employee councils to guide how we deploy and adopt AI across our organization. We took this approach for a simple reason: We need a model that can keep pace with technological change while staying grounded in business value.

Our baseline expectation for AI at Microsoft is practical.

Our AI initiatives need to deliver value every quarter, and we track progress through KPIs reviewed monthly at the leadership level. That standard creates healthy pressure. It also exposes a common gap many organizations experience in the beginning stages of their AI efforts: It’s easy to generate a lot of activity without producing business results.

In our council-based approach to AI, different councils focus on different needs. Together, they help us move from experimentation to repeatable, enterprise-grade outcomes. We think of these councils as building blocks that we can combine and evolve as the technology, the business, and our operating model change.

In this model, AI strategy needs its own council to help guide the overall approach and align our efforts across the enterprise. At the highest level, the strategy council is where we prioritize what matters most, decide how it maps to the outcomes we’re accountable for, and determine how we’ll judge progress month over month.

“Our strategy council is how we separate signal from noise in our AI acceleration,” says Don Campbell, a principal group technical program manager in Microsoft Digital. “It identifies the top scenarios with the greatest enterprise leverage, sharpens our executive focus on what truly matters, and enforces a one-to-one alignment between the work we resource and the outcomes we’re accountable to deliver.”

Strategy keeps our AI conversation at Microsoft from getting bogged down in discussions of tools and technology and forces us to keep our focus on the main goal: What are we trying to change in the business, and how will we know if we’ve succeeded?

AI strategy in action: Focus, alignment, and a monthly cadence

As our AI work at Microsoft accelerates, we continuously balance two truths. We want broad experimentation, because it’s how teams and employees learn fast. We also want our people to focus on what matters most to our enterprise, and we want to identify and reduce redundant work.

Maintaining this balance is the core work of our AI strategy council. It helps us identify the AI-enabled scenarios that will deliver the most value, then keeps us honest about whether we’re delivering against the outcomes we’ve committed to.

“We need a single cohesive story to bring together what’s happening across the organization and how those efforts contribute to real impact,” says Mohit Chand, a principal group engineering manager in Microsoft Digital. “The goal is to stitch that story together and solve for redundancies—if one part of the org has already solved a problem, another team shouldn’t have to reinvent the solution.”

We have a detailed process that relies on engaging with our subject matter experts to keep the most impactful AI portfolio visible and actionable. We use it to summarize and track our top scenarios. Our AI strategy council treats the resulting portfolio as a work in progress: a living view that changes as products ship and priorities shift. Delivered items come off, emerging bets go on, and the continuing discussion stays anchored to our goals.

A tight rhythm and monthly cadence ensure that our conversations stay focused on whether the biggest bets are moving the needle on the outcomes we care about. That cadence helps us answer the questions leaders and customers are asking on a regular basis:

  • Where are you investing?
  • Why?
  • What’s working?
  • What would you do differently next time?
  • What did you learn along the way?
  • Where are we reinvesting and creating additional agency or capabilities for our employees?

When these questions frame the conversation, the outcomes naturally align to the direction our enterprise wants to go.

Structuring strategy and execution

To make our strategy council effective, we needed more than just a monthly meeting. We needed a way to organize work, assign accountability, and compare progress across very different teams without forcing everyone into the same mold.

We use three practices to accomplish this:

  • Group work into clear focus areas
  • Rely on product owners to drive execution
  • Use a shared approach for measuring value

“The pace right now is incredible,” says Myron Wan, a principal group product manager in Microsoft Digital. “There’s a lot of excitement, but there’s also a risk if it’s not sustainable. A big part of our focus is figuring out how to take churn out of the system and make this work long‑term—for the business and for our people.”

Grouping work into focus areas

When we started to scale our initial AI efforts, our first challenge was simple: Everyone was building, but not always toward the same destination. That’s why we split the work into two primary focus areas that match how an IT organization operates:

  • AI for corporate functions. Our AI work supports teams like finance, legal, and HR. We focus on removing friction from core processes and helping people make faster, better decisions.
  • AI for IT. We support AI initiatives across our IT operations in several areas:
    • Network and devices. We’re using AI for faster network device lifecycle management, more efficient incident management and remediations, and lower costs.
    • Employee experience. We want to enable Microsoft employees to contribute real business value and enjoy how they do it.
    • Support. We’re reducing tickets, resolving issues faster, and helping support teams stay ahead instead of reacting.
    • Tenant management and security. Our AI investments strengthen how we run and protect our Microsoft 365 tenant.

From there, we map AI initiatives into those focus areas so we can see what’s happening across the landscape and spot gaps, overlaps, and opportunities to reuse what already exists.

This step sounds basic, but it changes the conversation. It moves us away from a list of disconnected projects and toward a portfolio view, where we can figure out which scenarios matter most, where we have duplication, and where we need to invest more.

Keeping execution with product owners

While our AI strategy council sets direction, execution lies strictly with our product owners. A strategy council can’t run delivery. If it tries, it slows everything down. We avoid that trap by separating direction from doing.

“We operate a council which helps set direction, but product management oversees execution of the solutions,” says Bill O’Brien, a principal group product manager in Microsoft Digital. “Without product management’s ownership, our council would degrade into just an approval step, which quickly makes us a roadblock instead of an enabler.”

This clarity on roles and responsibilities helps teams work fast and ensures the council remains strategic. Product owners can prioritize week by week, learning from usage, adjusting product features, and shipping value. The council can stay focused on the portfolio and which bets rise to the top, what tradeoffs to make, and how we communicate progress and business outcomes to leadership.

Using a common value framework

Once we can see the portfolio and have identified clear ownership, we still need one more thing: A shared language for determining value. Early in our journey, we were tempted to declare success simply based on activity—how many pilots we launched, how many tools we built, or how many demos we could show.

That activity is critical for innovation, but it doesn’t help us understand and drive business value. We needed teams to define the value they expect to deliver, explain why, and show how they’ll measure it.

“The first part of our strategy was all about getting people to a point where they could identify what they were trying to accomplish and report on how they’re getting there,” says Keith Bunge, a principal software engineer at Microsoft. “We created a value measurement framework in partnership across multiple key players to give teams an idea of what’s valuable to the organization.”

That framework helps in two ways:

  1. It forces upfront discipline: Teams clarify what value they’re chasing and how they’ll prove they’ve achieved it.
  2. It allows for fair comparison across very different initiatives: Everyone is describing impact in consistent categories, rather than inventing a new scorecard each time.

As our approach matures, we’re also pushing past raw savings metrics to the harder question: What did we do with the time or money we saved, and how did this create increased agency or capabilities?

Combining strategy and execution: A practical example

Here’s how that looks when we apply this approach to a real-world scenario.

Say one of our teams is proposing an AI solution to automate energy management in buildings. On day one, the idea sounds great: use signals from internal temperature and movement sensors to automatically adjust HVAC usage across large buildings. But the role of the strategy council isn’t just to approve great ideas. We ask for a clear value claim and a measurement plan.

Bunge provides a solid value claim for the example above.

“I’m going to come up with an automation that allows me to automatically turn off air conditioning in a building based on signals that we have from our internal sensors,” he says. “I think I’m going to be able to save $100,000 a quarter with this project because of my usage projections overlaid on the HVAC costs over the past five years.”

That kind of statement is useful, because it’s specific. It also forces the next question: How do you prove it? We’re asking teams to explain what data they’ll use as a baseline, what counts as savings, and how they’ll report progress over time.
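
One way to make a value claim like this testable is to agree on the baseline and the savings rule before the project ships. The sketch below is illustrative only: the `ValueClaim` class, its field names, the averaging rule, and every cost figure other than the $100,000 quarterly target are assumptions, not part of the value measurement framework described here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ValueClaim:
    """Hypothetical record of what a team promises and how it will be measured."""
    name: str
    target_savings_per_quarter: float
    baseline_quarterly_costs: list  # historical costs agreed on as the baseline

    def baseline(self) -> float:
        # Assumption: the baseline is the plain average of the historical quarters.
        return mean(self.baseline_quarterly_costs)

    def measured_savings(self, actual_quarterly_cost: float) -> float:
        # Only the drop below the agreed baseline counts as savings.
        return self.baseline() - actual_quarterly_cost

    def on_track(self, actual_quarterly_cost: float) -> bool:
        # The claim holds when measured savings meet or exceed the target.
        return self.measured_savings(actual_quarterly_cost) >= self.target_savings_per_quarter


# Made-up historical costs, for illustration only.
claim = ValueClaim(
    name="HVAC automation",
    target_savings_per_quarter=100_000,
    baseline_quarterly_costs=[900_000, 950_000, 1_000_000, 950_000],
)
print(claim.measured_savings(830_000), claim.on_track(830_000))
```

Fixing the baseline up front means a later report can show savings against the same reference each quarter, rather than against a number chosen after the fact.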

We’re also raising the bar as the program matures.

Early on, teams may be able to prove that they saved time or reduced effort. As we get more rigorous, we’re pushing the “so what” conversation: What happens with the time saved, and what changes in the business as a result? It’s all part of moving from value measures to business outcomes, including what gets reinvested and where impact actually accrues.

Connecting AI strategy to the rest of our councils

Our AI strategy council is not the final measure or a standalone solution. We use it as the front door to a broader ecosystem that helps us move AI from ideas to enterprise outcomes.

Here’s how it fits together in practice. We use the strategy council to set our direction, and we keep a short list of top scenarios visible. Then we rely on complementary councils and capability groups to make those scenarios real: teams are building skills and patterns through enablement, strengthening foundations through data readiness, and applying Responsible AI practices so solutions scale safely.

We use process improvement and change management to drive adoption, because a strong model doesn’t matter if people don’t change how they work. And we use metrics and value tracking to keep the entire system accountable.

We’re also keeping a clear principle at the center: Business strategy leads, AI follows.

“Business strategy needs to lead the AI strategy,” says Qingsu Wu, a principal group product manager in Microsoft Digital. “Business strategy defines the ‘what and why.’ AI defines the ‘how’ to get the business strategy implemented with real value. We need to use AI to help us achieve the business strategy, not the other way around.”

That distinction matters as AI capabilities keep expanding and as teams continue to move faster.

Moving forward

As this work matures, one thing is clear: Strategy isn’t something we finish and move on from. It’s something we’re actively maintaining as AI adoption accelerates.

What we’ll do next is consistent with that mindset.

We plan to keep scaling what works while tightening and improving the system around it. We’re strengthening alignment across teams, pushing for more consistent measurement of impact, and sharpening how we choose the right approach for the right problem. We’re also treating strategy as a living motion, not an annual document, because business and technology are constantly changing.

We know that what got us here isn’t going to get us where we need to go next. We’re excited about the continued evolution of AI strategy here at Microsoft Digital as we focus on scale, alignment to real business problems, and making sure the pace is sustainable for our business.

Key takeaways

Leaders who are scaling AI across IT can apply these lessons from our experience to stay focused, move faster, and deliver measurable business impact.

  • Treat strategy as an ongoing practice. We’re revisiting priorities regularly to keep our AI work aligned with changing business goals.
  • Separate direction from execution. We’re using a small strategy group to set focus and expectations while product teams remain accountable for delivery.
  • Create a shared language for value. A consistent way to describe impact helps leaders compare initiatives, make tradeoffs, and explain progress with confidence.
  • Let experimentation mature into focus. Early exploration builds capability, but scaling requires narrowing attention to the AI scenarios that matter most.
  • Design for scale and sustainability. We’re paying as much attention to reuse, data readiness, and team sustainability as we are to speed and innovation.

Reinventing hybrid cloud integration at Microsoft—from months to one day
http://approjects.co.za/?big=insidetrack/blog/reinventing-hybrid-cloud-integration-at-microsoft-from-months-to-one-day/
Thu, 28 May 2026 16:00:00 +0000

The post Reinventing hybrid cloud integration at Microsoft—from months to one day appeared first on Inside Track Blog.

For years, network engineering teams at Microsoft have faced a paradox: They can spin up a full Microsoft Azure cloud environment in a matter of hours but connecting that environment to on-premises labs and private networks can take up to nine months.

Now, a team in Microsoft Digital—the company’s IT organization—is working to shrink that lengthy nine-month timeline to a single day.

The problem is architectural.

As our cloud footprint has grown, it has evolved into something richly segmented, tightly secured, and increasingly automated—a far cry from the relatively flat, monolithic corporate network that we originally extended into the cloud.

Getting those two worlds—on-premises and the cloud—to talk to each other securely and efficiently has become one of our most stubborn infrastructure challenges.

The solution we’re building is a fundamentally new operating model for hybrid cloud integration. It’s powered by AI-driven intake, end-to-end automation, and a set of repeatable patterns that treat the cloud as the new core of the network, rather than a distant branch of the old one.

The gap between cloud speed and network complexity

To understand the problem our team in Microsoft Digital set out to solve, it helps to understand how our company’s network architecture evolved over the past decade. When Microsoft first embraced Azure, the cloud was conceived as an extension of the existing corporate network.

But the cloud grew faster than anyone anticipated.

Product engineering teams, drawn by the speed and flexibility of cloud-native tooling, began self-organizing their systems in Azure. They built segmented, purpose-built environments optimized for security and automation that looked nothing like the sprawling on-premises network they were supposed to connect to.

This shift had real consequences for Microsoft developers.

A software engineer sitting in building 32 on campus, for example, might have her Azure environment provisioned in half a day. But if she needed network connectivity to a physical Azure Stack lab down the hallway, getting that connection established—through firewalls, virtual routing frameworks, access control lists, and cross-team coordination—could take weeks or months.

“We have a development assembly line, and our goal is to give engineers the most efficient, frictionless experience doing software development for the company,” says Tom McCleery, principal group cloud network engineering manager in Microsoft Digital. “Every day we delay on solving this issue systemically is another day for the problem to get bigger.”

Why on-premises networks aren’t going away

Why not simply move everything to the cloud?

For Microsoft, the answer comes in many forms. As a company, we build physical hardware, which requires hundreds of on-premises labs for software and hardware testing. We operate conference rooms, badge readers, thermostats, and wireless access points that will always require a physical network presence.

More fundamentally, Microsoft as a company hosts the cloud itself. If Azure were ever to go offline, our engineers responsible for recovery would need robust on-premises access that doesn’t rely on the very infrastructure they’re trying to restore.

Compounding all of these challenges are security requirements introduced by our Secure Future Initiative (SFI). The drive to reduce lateral threat movement across our network—limiting how far an attacker could reach if they compromised a single identity or device—has pushed our teams toward increasingly segmented environments. For our developers, that segmentation has meant navigating multiple networks, maintaining multiple identities, and juggling Yubikeys, smart cards, and authenticator apps just to move from one system to another.

The challenge, in short, is not that our network has too many pieces to be easily connected. It’s that those pieces weren’t designed to talk to each other efficiently.

This is what we had to fix.

Automation, patterns, and the path to ‘A Customer a Day’

Raghavendran Venkatraman is the principal engineering manager in Microsoft Digital who first pitched the vision of delivering a hybrid infrastructure in a single day.

The concept, which the team calls “A Customer a Day,” is built around the idea that it’s possible to deliver hybrid connectivity within 24 hours of finalizing requirements. Gathering, validating, and completing those requirements is where the team had to put their focus.

“If we are not fast enough, our customers are going to outpace us and do it themselves—and they may not be adhering to all our enterprise security standards,” Venkatraman says. “The faster we deliver reliable infrastructure, the higher their confidence in us.”

The team identified three sequential domains of opportunity, each a distinct bottleneck in the process with significant room for improvement:

  • AI-driven unified intake. The customer describes requirements once. AI interprets them and routes the request to the right pattern, with no human coordination needed. Replaces: Weeks of cross-team meetings before any build begins.
  • Predefined network patterns. A catalog of validated blueprints matches each request to a proven solution, with no custom work from scratch. Replaces: One-off negotiations restarted for every customer engagement.
  • End-to-end automation. A single workflow deploys from Azure all the way to the on-premises endpoint, with no manual handoffs between teams. Replaces: Days or weeks of manual steps after the cloud build is finished.

The result of these three innovations was the ability to make hybrid infrastructure live in one day, not months.

AI-driven unified intake. Today, when an engineering team needs hybrid connectivity, they become the conduit between multiple groups—networking teams, architecture teams, program managers, and security reviewers—that each have their own requirements, timelines, and vocabularies. The intake process alone can consume weeks of meetings before any actual implementation begins. The new model replaces that with an AI-powered interface that captures requirements directly from the customer, interprets them, and routes them to a predefined deployment pattern.

Predefined network patterns. Most hybrid workloads map to a small set of repeatable architectures. Rather than treating each onboarding as a custom engagement, the team has catalogued the most common hybrid connectivity scenarios and translated them into repeatable, validated patterns. The patterns drive both the AI intake and the automation layer, creating a system where the right solution can be identified and deployed without starting from scratch each time.
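
As a rough illustration of pattern-driven intake, the sketch below matches a structured request against a small catalog. The pattern names, capability labels, and the "smallest covering pattern" rule are all hypothetical; the team's actual catalog and the AI interpretation step are not described in detail here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkPattern:
    """Hypothetical validated blueprint in the pattern catalog."""
    name: str
    capabilities: frozenset  # what this pattern is able to provide


# Illustrative catalog; real pattern names and attributes are assumptions.
CATALOG = [
    NetworkPattern("lab-to-cloud-basic", frozenset({"on_prem_lab", "azure_vnet"})),
    NetworkPattern(
        "lab-to-cloud-isolated",
        frozenset({"on_prem_lab", "azure_vnet", "segmented"}),
    ),
]


def match_pattern(request: set) -> Optional[NetworkPattern]:
    """Return the smallest catalog pattern that covers every requested capability."""
    candidates = [p for p in CATALOG if request <= p.capabilities]
    if not candidates:
        return None  # no validated pattern fits, so the request needs human review
    return min(candidates, key=lambda p: len(p.capabilities))


selected = match_pattern({"on_prem_lab", "azure_vnet"})
print(selected.name if selected else "needs review")
```

Returning `None` for an unmatched request keeps the one-off cases visible instead of forcing them into a pattern that was never validated for them.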

“The long pole in the tent used to be just getting the infrastructure up and running, but we are now able to do that pretty fast,” McCleery says. “Now, the challenge is sitting down with our customers, figuring out their requirements, and interpreting those into tasks that we can go implement in a matter of hours.”

End-to-end automation. On-premises, transport, and cloud network automation operate separately, but one-day delivery requires unified, pattern-aware orchestration. An AI orchestration agent manages sequencing, dependencies, and exceptions, enabling the hybrid stack to deploy as a single pipeline instead of in fragmented steps.
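
The sequencing part of that orchestration can be sketched with a dependency graph. This is a minimal example under stated assumptions: the step names are invented, and it uses Python's standard `graphlib.TopologicalSorter` only to show dependency ordering. It does not represent the AI orchestration agent itself or its exception handling.

```python
from graphlib import TopologicalSorter

# Hypothetical deployment steps, each mapped to the steps it depends on.
STEPS = {
    "provision_azure_vnet": set(),
    "configure_transport_link": {"provision_azure_vnet"},
    "update_on_prem_firewall": {"configure_transport_link"},
    "validate_end_to_end": {"update_on_prem_firewall", "provision_azure_vnet"},
}


def run_pipeline(steps, execute):
    """Run every step once, in an order that respects its dependencies."""
    completed = []
    for step in TopologicalSorter(steps).static_order():
        execute(step)  # a real orchestrator would also handle failures here
        completed.append(step)
    return completed


# The executor is a no-op here; it stands in for calls to each owning team's automation.
order = run_pipeline(STEPS, execute=lambda step: None)
print(order)
```

Declaring dependencies as data, rather than hard-coding the order, lets cloud, transport, and on-premises steps from different owners be combined into one pipeline.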

This is the work that Juan Jimenez, a principal cloud network engineer on the team, has been driving with multiple engineering cohorts.

“The key architectural insight we reached is that any code touching device configuration should come from the service lines that own those devices,” Jimenez says. “That’s a DevOps boundary—you own the customer experience, you specify the requirements, and then you call upon what we’ve built to interact with the backend. That’s a fundamentally different way of thinking about hybrid automation, and it’s what makes the end-to-end build possible.”

Building consensus across the network stack

Perhaps the hardest part of getting to “A Customer a Day” has been organizational. Bringing together cloud networking teams, on-premises network engineers, identity teams, security stakeholders, and program managers around a common framework requires a level of cross-disciplinary alignment that is extremely difficult to achieve.

What has helped is having a clear, human-scale goal that everyone can immediately understand and rally behind. When Venkatraman first named the initiative “A Customer a Day,” something shifted.

“You go over to the identity folks and say we’re trying to get a customer onboarded in a day—they’re like, ‘That would be great!’” McCleery says. “Same thing with on-premises networking. That message is easier to land than going in and saying, ‘Your engineers need to learn more about cloud.’ That’s when people start taking mental health days.”

One of the deeper mindset shifts the team has also been working to drive is a redefinition of what connectivity means. Historically, connectivity meant simply the network. In a cloud-first, AI-accelerated world, that definition is no longer sufficient.

“Connectivity means network and identity—together,” Venkatraman says. “That is the new definition, but it is not prevalent everywhere yet. Any CIO or CTO should pivot their entire organization to think about it that way. Don’t have two separate teams making decisions in silos and then trying to integrate. Get them in the room together from the start.”

Where we are today, and what comes next

Our Microsoft Digital team is candid about where we are in the journey: We’ve made meaningful progress, but we’re not yet at the finish line. The near-term goal is to complete the first customer launch scenarios within the next quarter, followed by broader adoption of the pattern framework in the quarter after that.

The goal isn’t 100% automation. The team is clear that a portion of hybrid networking will always require the custom work that complex or security-sensitive scenarios demand.

“We’re always going to have a long tail of scenarios that need human judgment,” McCleery says. “But for the 80% of common scenarios, if a customer is going down the compliant, paved path, things should happen a lot faster.”

For a team that’s spent years watching the gap between cloud and on-premises connectivity grow wider, the prospect of closing it—one customer, one day at a time—feels less like a moonshot and more like a welcome, needed correction.

Key takeaways

If your organization is wrestling with hybrid cloud integration, here are concrete steps you can act on today, informed by what we’ve learned on our journey:

  • Audit your hybrid integration timeline. If connecting a new cloud environment to on-premises networks takes more than a few weeks, map where the delays actually live: requirements gathering, cross-team handoffs, on-premises automation gaps, or other issues. You can’t fix what you haven’t measured.
  • Redefine connectivity to include identity. Bring your network and identity teams into the same room before any hybrid integration project begins. Treating these as separate workstreams is a primary source of rework, security gaps, and delay.
  • Identify your most common connectivity scenarios and document them as repeatable patterns. Even before you build automation, codifying your top five to ten hybrid connectivity use cases into standard blueprints gives every team a shared vocabulary and an accelerated starting point.
  • Set a single, human-scale goal your teams can align on. A unifying outcome (like “integrate a new environment in one day”) is more effective at driving cross-team alignment than a technical mandate. Find the shared aspiration before prescribing the solution.
  • Extend cloud tooling and automation frameworks to your on-premises teams. Don’t wait for on-premises engineers to independently upskill on cloud-native tooling. Invest in democratizing that capability deliberately, or the automation gap between your two environments will continue to widen.
  • Design intake around your customers, not your systems. Any hybrid integration process that requires an internal team to act as coordinator between multiple groups is a bottleneck by design. Use AI-assisted intake to make requirements capture self-service and routing automatic.
  • Promote the framework before the tooling is finished. Publishing your architectural principles and patterns early (even when implementation is still in progress) aligns teams, accelerates buy-in, and gives other organizations a head start on their own journey.

The post Reinventing hybrid cloud integration at Microsoft—from months to one day appeared first on Inside Track Blog.

]]>
23843
Supercharging network operations at Microsoft with AI-based unified network intelligence http://approjects.co.za/?big=insidetrack/blog/supercharging-network-operations-at-microsoft-with-ai-based-unified-network-intelligence/ Thu, 21 May 2026 15:30:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=23737 At Microsoft, our network engineers work across multiple systems, including topology views, telemetry dashboards, logs, incidents, tickets, and fragmented tools. They piece together signals from these sources to understand what’s happening during an incident, often under considerable time pressure. But this kind of fragmentation slows down reasoning. Engineers spend more time navigating tools than diagnosing […]

The post Supercharging network operations at Microsoft with AI-based unified network intelligence appeared first on Inside Track Blog.

]]>
At Microsoft, our network engineers work across multiple systems, including topology views, telemetry dashboards, logs, incidents, tickets, and fragmented tools. They piece together signals from these sources to understand what’s happening during an incident, often under considerable time pressure.

But this kind of fragmentation slows down reasoning. Engineers spend more time navigating tools than diagnosing issues.

To address this, the Microsoft Infrastructure, Networking, and Tenant organization in Microsoft Digital, the company’s IT organization, is building Infrastructure Graph (IGraph), a unified platform that brings topology, real-time telemetry, and operational context into a single view.

On top of this foundation, agentic capabilities enable AI agents to reason across these signals, surfacing insights, explaining issues, and recommending next steps. This shifts the experience from exploring data to making decisions faster and with greater confidence.

A photo of Sinha.

“Engineers increasingly face fragmented visibility. We wanted to unify live telemetry, topology, and context into one single intelligent visualization experience and show engineers what’s really important, so they don’t have to dive into oceans of data.”

Astha Sinha, product manager, Infrastructure, Networking, and Tenant team, Microsoft Digital

This visualization layer and intelligence platform provides a view of our entire Microsoft enterprise network—including more than 20,000 on-premises devices across 900 sites worldwide—to instantly surface the most critical issues and offer proactive recommendations to our engineers.

“Engineers increasingly face fragmented visibility,” says Astha Sinha, a product manager in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “We wanted to unify live telemetry, topology, and context into one single intelligent visualization experience and show engineers what’s really important, so they don’t have to dive into oceans of data.”

Network insight at speed

IGraph displays the following in a single pane-of-glass view for a given site:

  • Topology and dependency context: Visualizes routers, switches, access points, client devices, and their relationships, enriched with path and dependency awareness to localize impact areas
  • Real-time health and telemetry insights: Surfaces live performance signals (utilization, errors, abnormal behavior) correlated directly onto the topology to highlight where the network is degraded or “running hot”
  • Operational and incident context: Integrates incidents, tickets, and change signals into the graph, enabling engineers to understand what is happening, where, and which systems are affected in a single view
A photo of Kumar Singh.

“Fragmentation across operational data sources was only part of the problem. The harder challenge was externalizing and structuring the implicit domain knowledge engineers rely on, then integrating it with real-time telemetry and topology to enable low-latency, context-aware reasoning in the agentic layer.”

Vinod Kumar Singh, principal software engineer, Infrastructure, Networking, and Tenant team, Microsoft Digital

On top of this visualization layer, the team is building an agentic layer using Azure Foundry that allows AI agents to discover and use external tools and data sources.

Without IGraph Agent, accessing data involves pulling from multiple existing sources, including servers and logs, with mixed latency (from minutes to hours). This fragmentation makes near-real-time reasoning almost impossible, as agents lack a unified, low-latency view of topology and telemetry.

“Fragmentation across operational data sources was only part of the problem,” says Vinod Kumar Singh, a principal software engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “The harder challenge was externalizing and structuring the implicit domain knowledge engineers rely on, then integrating it with real-time telemetry and topology to enable low-latency, context-aware reasoning in the agentic layer.”

How IGraph works

The user starts in context. Say they’re on the IGraph UI for Building 32. They can already see the building topology, recent incidents, support tickets, and live health and performance metrics.

The engineer can ask a natural language question such as, “The internet is not working in Building 32—what’s going on?”

The AI agent begins reasoning across UI context (location, devices, open incidents), topology (involved devices and neighbors), historical metrics, and real-time device calls. It works with specialized Model Context Protocol (MCP) servers and agents to identify impacted devices, test live responsiveness, measure neighboring impact, verify data flow, and flag abnormal utilization or error trends.

A photo of Vijay.

“Engineers spend a lot of time firefighting. The visualization layer gives them the view they need to quickly solve the incidents. It helps free up their time to engage in more systemic improvements on their applications.”

Abhijit Vijay, principal software engineer manager, Infrastructure, Networking, and Tenant team, Microsoft Digital

Using this context, IGraph pulls in the relevant logs, real-time telemetry, and incident history to complete the analysis.

Instead of raw metrics and hundreds of rows of data, the agent returns a clean summary that provides a view of the failing device, the health of neighboring devices, and the blast radius. It shows what’s broken, what’s still healthy, the likely causes, and next actions.
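The "blast radius" in that summary can be thought of as a graph traversal: start at the failing device and walk everything downstream of it. The Python sketch below shows the idea on a toy topology. The device names, the adjacency-list shape, and the `summarize` helper are assumptions for illustration, not IGraph's actual data model or API.

```python
from collections import deque

# Hypothetical topology: each device maps to the devices directly downstream.
TOPOLOGY = {
    "core-rtr-1": ["dist-sw-1", "dist-sw-2"],
    "dist-sw-1": ["ap-101", "ap-102"],
    "dist-sw-2": ["ap-201"],
    "ap-101": [],
    "ap-102": [],
    "ap-201": [],
}


def blast_radius(topology, failed):
    """Return every device downstream of the failed one (breadth-first)."""
    seen = set()
    queue = deque(topology.get(failed, []))
    while queue:
        device = queue.popleft()
        if device in seen:
            continue
        seen.add(device)
        queue.extend(topology.get(device, []))
    seen.discard(failed)  # guard against loops that lead back to the start
    return sorted(seen)


def summarize(topology, failed):
    """Split the site into failing, impacted, and still-healthy devices."""
    impacted = blast_radius(topology, failed)
    healthy = sorted(set(topology) - set(impacted) - {failed})
    return {"failing": failed, "impacted": impacted, "healthy": healthy}
```

For example, `summarize(TOPOLOGY, "dist-sw-1")` reports the two access points behind that switch as impacted and leaves the rest of the site marked healthy.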

The engineer stays in one UI for all this, and isn’t forced to use different tools or manually correlate data.

“Engineers spend a lot of time firefighting,” says Abhijit Vijay, a principal software engineer manager on the team in Microsoft Digital. “The visualization layer gives them the view they need to quickly solve the incidents. It helps free up their time to engage in more systemic improvements on their applications.”

The impact of incident visibility

IGraph offers a new real-time telemetry layer that:

  • Uses a UI that surfaces telemetry and topology by correlating data from upstream systems
  • Decreases effective latency for users, enabling near-real-time insights (often within seconds)
  • Provides near-real-time signals in the UI on health, performance, routing state, and neighboring device relationships
A photo of Mallick.

“Our goal is to accelerate how network engineers understand what’s happening, enabling them to shift from reactive troubleshooting to proactive prevention—identifying and mitigating issues before they occur.”

Nevedita Mallick, principal product manager, Infrastructure, Networking, and Tenant team, Microsoft Digital

Combined, these capabilities give network engineers an up-to-the-moment view of what’s happening across the network, before small issues can cascade into larger incidents.

By making live telemetry easier to access and interpret, IGraph helps teams move from reactive troubleshooting to proactive prevention.

“Our goal is to accelerate how network engineers understand what’s happening, enabling them to shift from reactive troubleshooting to proactive prevention—identifying and mitigating issues before they occur,” says Nevedita Mallick, a principal product manager for the Infrastructure, Networking, and Tenant team in Microsoft Digital.

That speed and clarity are especially important for new engineers.

A photo of Keskar.

“The tool delivers value right away, especially for newer engineers. Instead of having to piece things together, they get an instant view of the network that shows how devices are connected and displays the already-surfaced incidents directly on the graph.”

Manjiri Keskar, principal cloud network engineer, Infrastructure, Networking, and Tenant team, Microsoft Digital

Complex networks rely on unwritten knowledge and experience built up over time, which can slow onboarding and make troubleshooting harder than it needs to be. IGraph shortens that learning curve by making the network’s relationships and current state immediately visible.

“The tool delivers value right away, especially for newer engineers,” says Manjiri Keskar, a principal cloud network engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “Instead of having to piece things together, they get an instant view of the network that shows how devices are connected and displays the already-surfaced incidents directly on the graph.”

What’s next for IGraph Agent

Without IGraph Agent, network analysis is largely reactive.

Teams often address failures after customers have already felt the impact, instead of preventing issues by acting when early warning signs appear.

A photo of Munde.

“Agentic AI is transforming networking DevOps from manual, reactive operations into intelligent, intent-driven systems that can provision, validate, and troubleshoot networks autonomously. Looking ahead, it will power self-healing networks and dramatically accelerate buildouts, allowing engineers to focus on architecture, strategy, and innovation.”

Sonika Munde, senior network engineer, Infrastructure, Networking, and Tenant team, Microsoft Digital

“Agentic AI is transforming networking DevOps from manual, reactive operations into intelligent, intent-driven systems that can provision, validate, and troubleshoot networks autonomously,” says Sonika Munde, a senior network engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “Looking ahead, it will power self-healing networks and dramatically accelerate buildouts, allowing engineers to focus on architecture, strategy, and innovation.”

That unified network intelligence will let IGraph Agent communicate with multiple lightweight agents that continuously analyze network conditions, dramatically compressing response times.

“What used to happen in hours will happen in minutes,” Munde says.

Now, the team is pushing further. One example is layering in weather intelligence to help engineers anticipate issues before they materialize, as big storms can trigger power fluctuations that ripple through the network. By visualizing this data, engineers can proactively communicate with customers and take mitigation steps that protect operational workloads.

Overall, IGraph lets teams focus on prevention. Engineers spend less time navigating dashboards and cross-checking data and more time detecting patterns and surfacing emerging risks. Manual analysis is reduced as the agent highlights insights in real time.

A photo of Thompson.

“By bringing telemetry, topology, and AI together in one intelligent layer, we’re turning fragmented signals into real-time intelligence so teams can move faster, act earlier, and protect the critical workloads that power Microsoft.”

Jason Thompson, principal group product manager, Infrastructure, Networking, and Tenant team, Microsoft Digital

The technology is poised to go even further. IGraph will eventually help power self-healing networks and speed up network build-outs, freeing engineers to focus on architecture and innovation. The future vision for the tool includes fully automated predictive network intelligence across all Microsoft campuses, with agents that monitor, reason, recommend responses, and safely take action.

“By bringing telemetry, topology, and AI together in one intelligent layer, we’re turning fragmented signals into real-time intelligence so teams can move faster, act earlier, and protect the critical workloads that power Microsoft,” says Jason Thompson, a principal group product manager for the Infrastructure, Networking, and Tenant team in Microsoft Digital.

Key takeaways

To move from reactive operations to proactive AI-supported network management, we recommend starting with these steps:

  • Start consolidating real-time telemetry into a single view. Even a lightweight dashboard is enough to prepare for AI-driven insights later.
  • Identify high-frequency incident types to target for AI triage. Pick the most common or disruptive scenarios and map out what data engineers currently review for them.
  • Document the decision logic your engineers use today. Before implementing AI, capture the human reasoning steps to help guide your approach.
  • Pilot an agentic solution with one network segment or site. Start with one building, one lab, or a small testbed.

The post Supercharging network operations at Microsoft with AI-based unified network intelligence appeared first on Inside Track Blog.

]]>
23737
25 Years of SharePoint at Microsoft: Our lessons learned as Customer Zero http://approjects.co.za/?big=insidetrack/blog/25-years-of-sharepoint-at-microsoft-our-lessons-learned-as-customer-zero/ Thu, 14 May 2026 16:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=23570 For more than two decades, SharePoint has been a foundational part of how work happens at Microsoft. This pivotal application supports everything we do, including companywide communications, day‑to‑day collaboration, and empowering our employees to create, share, and manage information. In 2026, we’re celebrating 25 years of SharePoint at Microsoft. Microsoft Digital, the company’s IT organization, […]

The post 25 Years of SharePoint at Microsoft: Our lessons learned as Customer Zero appeared first on Inside Track Blog.

]]>
For more than two decades, SharePoint has been a foundational part of how work happens at Microsoft. This pivotal application supports everything we do, including companywide communications, day‑to‑day collaboration, and empowering our employees to create, share, and manage information.

In 2026, we’re celebrating 25 years of SharePoint at Microsoft. Microsoft Digital, the company’s IT organization, is commemorating this anniversary by reflecting on the journey we’ve taken with the product over the last quarter-century.

In this article, we’ll share our journey as SharePoint’s Customer Zero and step through the lessons we’ve learned building and maintaining an IT stack in the age of agentic AI.

Why SharePoint?

In the early 2000s, we faced a technical challenge familiar to just about any organization: We had important documents and data scattered across siloed file shares, institutional knowledge hidden away in email attachments, and access challenges preventing different teams from collaborating across geographical borders and departmental boundaries.

SharePoint offered the solution to these challenges.

Its flexible, web-based platform gave us the ability to collaborate using shared sites, centralized document libraries, and widely accessible workspaces. The application also fundamentally reshaped our corporate communications and publishing capabilities, providing features that would power key internal portals like Microsoft Web (our longtime internal company homepage, often called MSW), HRWeb, and MS Library.

A photo of Crewdson.

“At the time, because there were so few customers running SharePoint at scale, the product was in many ways directly built to meet our IT needs.”

Sam Crewdson, principal program manager, Microsoft Digital

The evolution of how we used SharePoint in Microsoft Digital can best be described in three phases:

  1. Our on-premises expansion and optimization
  2. Our migration to the cloud, self-service growth, and modernization
  3. Our incorporation of agentic AI

On-premises expansion and growing pains

When we first adopted on-premises SharePoint at scale, it became indispensable almost immediately. Internal teams used SharePoint to replace their existing file shares, publish information internally, and create many custom workflows and applications tailored to their needs.

Our team at Microsoft Digital was responsible for deploying SharePoint on an enterprise scale. Because we were one of the first enterprise customers to fully use SharePoint’s capabilities, we worked closely with the SharePoint product team from the beginning of its existence as a product. This meant we played a sizable role in influencing what SharePoint ultimately became.

“At the time, because there were so few customers running SharePoint at scale, the product was in many ways directly built to meet our IT needs,” says Sam Crewdson, a principal program manager in Microsoft Digital. “A result of our being their first and best customer at the time was that the SharePoint team often built capabilities for us that no one else was asking for yet, such as specific portal features and supportability needs.”

Our initial adoption of SharePoint exposed some structural limitations and gaps. To meet the goals of our internal customers, we often relied on custom code, which made upgrades more difficult. And data governance and lifecycle management could be challenging, with our internal teams creating thousands of sites with little or no ownership tracking.

Using SharePoint in this way meant rapidly accumulating abandoned sites and outdated content. Trying to conduct even routine maintenance became difficult because there was no reliable way to contact site owners.

A photo of Snyder.

“Because of the initial difficulties, SharePoint was frustrating at first, especially for admins. But then I realized how important it was for our users—the product saved them so much time, and they were so happy that it was available. It was a complete 180-degree shift in my mindset towards SharePoint.”

Thomas Snyder, principal service engineer, Microsoft Digital

These challenges meant tensions often ran high for the IT team during the initial adoption phase. Tempers sometimes flared as we navigated this period in SharePoint’s evolution at Microsoft.

However, the time and effort we put into overcoming these growing pains—time and effort our customers didn’t have to invest themselves—made the frustrations well worth it.

“Because of the initial difficulties, SharePoint was frustrating at first, especially for admins,” says Thomas Snyder, a principal service engineer in Microsoft Digital. “But then I realized how important it was for our users—the product saved them so much time, and they were so happy that it was available. It was a complete 180-degree shift in my mindset towards SharePoint.”

Scalable self-service, effective governance, and the cloud

SharePoint’s role at Microsoft quickly expanded from a collaboration platform into a more powerful application where our teams could build workflows, forms, dashboards, and other solutions.

Thanks to a decision to enable SharePoint’s self-service site creation capabilities, our internal customers were able to use it to build the sites they needed without having to wait for us in IT. By removing the friction of having to work with IT, they innovated faster and built new capabilities on their own using SharePoint’s out-of-the-box technology.

However, this self-service power we gave to our users also drove some sprawl that we were not initially ready to manage. By the late 2000s, the information explosion that SharePoint sparked at the company was increasing our operational and governance burden. The rapid growth in sites delayed upgrades and introduced security and compliance issues stemming from a lack of clear ownership when site owners changed jobs or left the company.

As a result of this growth, we made the decision to invest heavily in building up our governance and lifecycle management for SharePoint. We prioritized defining clear ownership for all SharePoint sites, establishing best practices around data cleanup, and building the guardrails necessary to make widespread adoption and use more manageable.

Moving SharePoint to the cloud

Our cloud migration started in late 2010 and quickly became the driving force for us in IT. Rather than see the migration as a simple lift-and-shift activity, we took the opportunity to strategically reconfigure the architecture and customization level of our SharePoint instance.

This was a huge undertaking.

We had to think globally across all our sites in different regions and countries. The tooling suite for migration was immature at the time, meaning some of our portals and sites would require refactoring. We also had to contend with the constraints of varied and sometimes conflicting regional data residency requirements.

A photo of Johnson.

“It’s effectively filtering, so you don’t migrate everything. You’re cleaning your house before you move. You don’t move everything in your garage—you clean it out first. The easiest move is the one you don’t have to do.”

David Johnson, principal product manager architect, Microsoft Digital

Our approach to moving SharePoint to the cloud proceeded in several phases.

First, early adopters who expressed active interest in migrating were provisioned the first sites in the cloud. By harnessing their enthusiasm for cloud services, we allowed them to self-migrate their own site content.

Second, we did extensive analysis of all sites to identify the ones in active use. Sites with no recent usage were backed up, stored offline, and deleted. If nobody screamed, we didn’t move them to the cloud.

Third, we moved the zero- and low-customization sites. These were sites using out-of-the-box features that had the highest likelihood of a successful migration.

Finally, all we had left were the highly customized sites, which often used customization approaches that were not supported in the cloud. These we chose to manually rebuild and often to refactor as part of our migration approach.
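The four phases amount to a triage rule applied to every site. A minimal Python sketch of that rule follows. The `Site` fields, the 180-day inactivity cutoff, and the "two or fewer custom solutions" threshold are assumptions made up for illustration; the actual criteria are not stated here.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Site:
    url: str
    last_activity: date
    custom_solutions: int          # count of custom-code solutions deployed
    owner_opted_in: bool = False   # owner volunteered to self-migrate


def migration_wave(site, today, inactive_after_days=180, low_custom_max=2):
    """Assign a site to one of the four phases described above.

    Both thresholds are hypothetical defaults, not documented values.
    """
    if site.owner_opted_in:
        return "1-self-migrate"
    if today - site.last_activity > timedelta(days=inactive_after_days):
        return "2-back-up-and-delete"
    if site.custom_solutions <= low_custom_max:
        return "3-migrate-as-is"
    return "4-rebuild-in-cloud"
```

Ordering the checks this way means an inactive site is archived even if it is heavily customized, so nobody spends effort rebuilding something no one uses.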

While we were making these first-in-the-world migrations, we spent a lot of time with our SharePoint product team partners to learn how best to move sites and to document the approaches for the millions of sites that would follow. Sites which had high levels of customization or features that the cloud couldn’t support were instead rebuilt in the cloud environment from the ground up.

We treated our SharePoint cloud migration as an opportunity to take stock of what we had and decide what we didn’t want to bring with us into the new age of SharePoint at Microsoft. We cleaned our data and retired unused sites based on which content and functions employees told us they regularly used and relied on.

“It’s effectively filtering, so you don’t migrate everything. You’re cleaning your house before you move,” says David Johnson, a principal product manager architect in Microsoft Digital. “You don’t move everything in your garage—you clean it out first. The easiest move is the one you don’t have to do.”

Cloud migration also presented fresh governance challenges for our team. Governance practices had to be established for this new environment that would allow for effective self-service across multiple sites.

Building governance around lifecycle management, attestation, ownership policies, and guarding against oversharing required a significant amount of effort from the team, but it was necessary to ensure a smooth transition from an on-premises tool to the cloud.

Site modernization: Reducing the need for customization

Around 2016, SharePoint rolled out what came to be known as SharePoint Modern. This new version was a game changer for our major portals, as it reduced the need for heavy, developer-driven customization and replaced it with powerful out-of-the-box page creation capabilities, responsive design, and improved accessibility. The product also eventually added seamless built-in integration with solutions like Microsoft Teams and OneDrive.

Less custom code meant we could upgrade faster and dramatically lower our development, support, and maintenance costs. But the best part was the improved user experience and better navigability of the new version. Before this, our IT team fielded numerous questions about SharePoint on a weekly basis. The more intuitive, user-friendly experience of modern SharePoint reduced the volume of inquiries and service requests drastically. Our internal users were happier, and so were we.

SharePoint in the age of agentic AI

We see SharePoint as a key “knowledge platform” for AI. It’s a critical enterprise-scale repository for our documents and data and other information that we use to power our global enterprise.

“Security through obscurity is dead. It’s the double-edged sword of semantic search.”

Thomas Snyder, principal service engineer, Microsoft Digital

That role has made SharePoint the launching point for many of our AI-powered experiences.

AI is only as effective as the quality of the data it can access, which is why we’ve prioritized governance best practices as we make this transition. With these new tools, we’ve had to overcome new challenges. For example, in these early days of AI, the discovery of previously well-buried personal data is becoming a common occurrence.

“Security through obscurity is dead,” Snyder says. “It’s the double-edged sword of semantic search.”

Prioritizing good governance helps ensure agentic AI only has access to the data it’s permitted to use, avoiding accidental oversharing and related hallucinations.

As an AI-driven Frontier Firm, we’re empowering our non-technical users and engineering and development teams alike to begin building custom AI agents to drive innovation at Microsoft. Our teams can now use agents in SharePoint for tasks like creating applications, knowledge repositories, and sites, saving huge amounts of time and effort.

Many of these agents will eventually be available in Azure DevOps and GitHub, so we’re focused on helping SharePoint site owners put the appropriate data ownership and permissions in place to effectively manage and govern the data for use by agentic AI.

After 25 years, SharePoint remains a core part of IT operations across Microsoft. We look forward to growing alongside it as it continues to evolve and improve.

Key takeaways

These insights can help you mature and transform how you use SharePoint at your company:

  • Self-service and good governance go together. Without solid guardrails for your SharePoint instance, your organization could contend with information sprawl and internal friction between departments.
  • Cloud migration is a golden opportunity. Before you migrate from on-premises IT to the cloud, take the time to clean your data to avoid carrying technical debt and outdated information into the future.
  • Out-of-the-box capabilities are your friend. Customization is useful, but too much of it can be unwieldy and expensive to maintain.
  • Make data hygiene a priority. Poorly governed data can undermine users’ trust in AI, expose sensitive information, and delay widespread adoption.

The post 25 Years of SharePoint at Microsoft: Our lessons learned as Customer Zero appeared first on Inside Track Blog.

Harnessing AI: How a data council is powering our unified data strategy at Microsoft http://approjects.co.za/?big=insidetrack/blog/harnessing-ai-how-a-data-council-is-powering-our-unified-data-strategy-at-microsoft/ Thu, 09 Apr 2026 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=23030

Information technology is an ever-evolving landscape. Artificial Intelligence is accelerating that evolution, providing employees with unprecedented access to information and insights. Data-driven decision making has never been more critical for businesses to achieve their goals.

In light of this priority, we have established a data council to help accelerate our companywide AI-powered transformation.

Our data council is a cross-functional team with representation from multiple domains within Microsoft, including Microsoft Digital, the company’s IT organization; Corporate, External, and Legal Affairs (CELA); and Finance.

Our data council’s mission is to drive transformative business impact by establishing a cohesive data strategy across Microsoft Digital, empowering interconnected analytics and AI at scale. Our vision is to guide our organization toward Frontier Firm maturity through a clear blueprint for high-quality, reliable, AI-ready data delivered on trusted, scalable platforms.

“By championing robust data governance, literacy, and responsible data practices, our data council is a crucial part of our AI-powered transformation,” says Naval Tripathi, principal engineering manager in Microsoft Digital. “It turns enterprise data into a strategic capability that fuels predictive insights and intelligent outcomes across the organization.”

Our evolving data strategy

Over the past two decades, we at Microsoft—along with other large enterprises—have continuously evolved our data strategies in search of the right balance between control and agility. Early approaches were highly decentralized, with different teams owning and managing their own data assets. While this enabled local optimization, it also resulted in inconsistent quality and limited enterprise-wide insight.

Our subsequent shift toward centralized data platforms brought much-needed standardization, security, and scalability. However, as data platforms grew more sophisticated, ownership often drifted away from the business domains closest to the data, slowing responsiveness and diluting accountability.

Today, we and other leading companies are embracing a more balanced, federated approach, often described as a data mesh. Rather than forcing all our data into a single centralized system or allowing unchecked decentralization, the data mesh formalizes domain ownership while embedding governance, quality, and interoperability directly into shared platforms.

With this approach, our domain teams publish data as well-defined, discoverable products, while common standards for security, metadata, and compliance are enforced through automation rather than manual processes. This model preserves enterprise trust and consistency without sacrificing speed or autonomy.
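To make the idea of standards "enforced through automation" concrete, here is a minimal sketch of a publish-time check for a data product. Everything in it is an illustrative assumption: the `DataProduct` class, the required metadata fields, and the sensitivity labels are hypothetical and do not describe any Microsoft platform or API.

```python
from dataclasses import dataclass, field

# Hypothetical metadata standard; a real platform would define its own fields.
REQUIRED_METADATA = ("owner", "domain", "description", "sensitivity")
ALLOWED_SENSITIVITY = {"public", "internal", "confidential"}


@dataclass
class DataProduct:
    name: str
    metadata: dict = field(default_factory=dict)


def policy_violations(product: DataProduct) -> list[str]:
    """Return human-readable violations; an empty list means publishable."""
    problems = []
    for key in REQUIRED_METADATA:
        if not product.metadata.get(key):
            problems.append(f"missing metadata: {key}")
    sensitivity = product.metadata.get("sensitivity")
    if sensitivity and sensitivity not in ALLOWED_SENSITIVITY:
        problems.append(f"unknown sensitivity label: {sensitivity}")
    return problems


def publish(product: DataProduct, catalog: dict) -> bool:
    """Register the product only if it passes every automated check."""
    if policy_violations(product):
        return False
    catalog[product.name] = product
    return True
```

In a sketch like this, the domain team keeps ownership of the product, while the shared check decides whether it is discoverable. No manual review step sits between the two.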

By adopting a data mesh mindset, we can scale analytics and AI more effectively across the organization while still keeping ownership closely connected to the business focus. The result is a system that supports innovation at the edges, strong governance at the core, and seamless collaboration across domains, enabling the transformation of data from a technical asset to a strategic, enterprise-wide capability.

Quality, accessibility, and governance

To scale enterprise data and AI, organizations must first ensure their data is trusted, discoverable, and responsibly governed. At Microsoft Digital, our data strategy is designed to create data foundations that power intelligent applications and effective decision making across the company.

By implementing a data mesh strategy at scale, we aim to unlock valuable data insights and analytics, enabling advanced AI scenarios. Our data council focuses on three core dimensions that make AI-ready data possible:

  • Quality: Making sure enterprise data is reliable and complete
  • Accessibility: Enabling secure and discoverable access to data
  • Governance: Protecting and managing our data responsibly

Together, these dimensions form the foundation for scalable innovation and AI-powered data use. They connect data silos and ensure consistent, high‑quality access across the enterprise—enabling both humans and AI systems to work from the same trusted data foundation. As AI use cases mature, this foundation allows AI agents to retrieve and reason over data through enterprise endpoints, while supporting advanced analytics, data science, and broader technology.

“High-quality, well-governed data is essential to accelerate implementation and adoption of AI tools,” says Miguel Uribe, a principal PM manager in Microsoft Digital. “Data quality, accessibility, and governance are imperatives for AI systems to function effectively, and recognizing that is propelling our data strategy.”

Quality

AI-ready data is available, complete, accurate, and high-quality. By adopting this standard, our data scientists, engineers, and even our AI agents are better able to locate, process, and govern the information needed to drive our organization and maximize AI efficiencies.

Using Microsoft Purview, our data council oversees the monitoring of data attributes to ensure fidelity. It also tracks the parameters used to enforce standards for accuracy and completeness.
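As a rough illustration of what a completeness standard can look like in practice, the sketch below scores a batch of records against a threshold. It is plain Python, not Microsoft Purview's API; the function names, the required fields, and the 0.95 default threshold are all assumptions chosen for the example.

```python
def completeness(rows, required_fields):
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(name) not in (None, "") for name in required_fields)
        for row in rows
    )
    return complete / len(rows)


def meets_standard(rows, required_fields, threshold=0.95):
    """True when the batch's completeness reaches the (hypothetical) threshold."""
    return completeness(rows, required_fields) >= threshold
```

A score like this is only one signal. Accuracy checks, such as validating values against a reference source, would need their own rules.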

Accessibility

Ensuring that our employees get access to the information they need while prioritizing security is a foundational element of our enterprise data strategy. Microsoft Fabric allows us to unify our organization’s siloed data in a single “mesh” that enables advanced analytics, data science, data visualization, and other connected scenarios.

Microsoft Purview then gives us the ability to democratize that data responsibly. With a data mesh architecture in place, our employees can work confidently, unencumbered by siloed or inaccessible data, and with the assurance that the data they’re working with is secure.

A graphic shows how the data mesh architecture allows employees to access data they need, with platform services and data management zones surrounding this architecture.
The data mesh architecture enables our employees to do their work efficiently while preventing the data they’re working on from becoming siloed.

The data mesh connects and distributes data products across domains, enabling shared data access and compute while scaling beyond centralized architectures.

Platform services are standardized blueprints that embed security, interoperability, policies, standards, and core capabilities—providing guardrails that enable speed without fragmentation.

Data management zones provide centralized governance capabilities for policy enforcement, lineage, observability, compliance, and enterprise-wide trust.  

Governance

As organizations scale AI capabilities, strong governance becomes essential to ensure security, compliance, and ethical data use. Data governance, which includes establishing data policies, ensuring data privacy and security, and promoting ethical AI usage, is critical. So is compliance with regulations such as the General Data Protection Regulation (GDPR) and the Consumer Data Protection Act (CDPA).

However, governance is not only a technical capability; it’s also a cultural commitment.

Responsible data use must be embedded into the way teams manage data and build AI solutions. Through Microsoft Purview, we implemented an end-to-end governance framework that automates the discovery, classification, and protection of sensitive data across the enterprise data landscape.

This unified approach allows teams to innovate confidently, knowing that the data powering their insights and AI systems is trusted and protected, as well as responsibly managed.

“AI systems are only as reliable as the data that powers them,” Uribe says. “By investing in trusted and well-managed data, we accelerate not only the adoption of AI tools but our ability to generate meaningful insights and intelligent outcomes.”

The data catalog as the discovery layer

By serving as a common discovery layer for humans and AI, the data catalog ensures that governance translates directly into speed, accuracy, and trust at scale.

A unified data strategy only succeeds if both people and AI systems can consistently find the right data. At Microsoft, this is enabled by our enterprise data catalog, which operationalizes the standards set by our data council. 

For business users, the catalog provides intuitive search, ownership transparency, and trust signals—enabling confident self‑service analytics. For AI agents, the same catalog exposes machine‑readable metadata, allowing agents to programmatically discover canonical datasets, validate schema and freshness, and respect governance constraints.
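The agent-side checks described above (canonical status, schema, freshness, and governance constraints) can be sketched as a simple filter over catalog metadata. This is a hypothetical model for illustration only: `CatalogEntry`, its fields, and `find_usable` are invented names and do not reflect the actual enterprise catalog or any Microsoft API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CatalogEntry:
    name: str
    canonical: bool            # marked as the authoritative copy
    schema: dict               # column name -> type name
    last_refreshed: datetime   # timezone-aware timestamp
    allowed_roles: frozenset   # roles permitted to read the dataset


def find_usable(entries, needed_columns, role, max_age, now=None):
    """Return entries that are canonical, permitted, complete, and fresh."""
    now = now or datetime.now(timezone.utc)
    usable = []
    for entry in entries:
        if not entry.canonical:
            continue  # skip copies and derived extracts
        if role not in entry.allowed_roles:
            continue  # respect governance constraints
        if not set(needed_columns) <= set(entry.schema):
            continue  # schema lacks a required column
        if now - entry.last_refreshed > max_age:
            continue  # data is too stale for this use
        usable.append(entry)
    return usable
```

The point of the sketch is that the same metadata a person reads as trust signals can be evaluated mechanically, so an agent declines a dataset for the same reasons a careful analyst would.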

Our role as Customer Zero

In Microsoft Digital, we operate as Customer Zero for the company’s enterprise solutions, so that our customers don’t have to.

That means we do more than adopt new products early. We deploy them at enterprise scale, operate them under real‑world constraints, and hold them to the same standards our customers expect. The result is more resilient, ready‑to‑use solutions and a higher quality bar for every enterprise customer we serve.

Our data council embodies this Customer Zero mindset through its Enterprise Readiness initiative. By engaging product engineering as a unified enterprise voice, the council drives strategic conversations that surface operational blockers, influence roadmap prioritization, and ensure new and existing data solutions are truly ready for enterprise use.

These learnings are then shared broadly across Microsoft Digital to accelerate adoption, reduce duplication, and scale proven patterns across teams.

“When we engage product teams with real telemetry from how data is created, governed, and consumed at scale, we move the conversation from theory to execution,” says Diego Baccino, a principal software engineering manager in Microsoft Digital and a member of the council. “That’s how enterprise readiness becomes real.”

This work is deeply integrated with our AI Center of Excellence (CoE), where Customer Zero principles are applied to accelerate AI outcomes responsibly. Together, the AI CoE and the data council focus on improving data documentation and quality—foundational capabilities that are required to make AI feasible, trustworthy, and scalable across the enterprise.

By grounding AI innovation in measurable data quality and governance standards, Microsoft Digital ensures that experimentation can safely mature into production‑ready solutions. The partnership between our data council, our AI CoE, and our Responsible AI (RAI) Council is essential to our broader data and AI strategy.

“AI readiness isn’t aspirational—it’s operational,” Baccino says. “By measuring the health of our data, setting clear quality baselines, and using those signals to guide product and platform decisions, we turn data into a strategic asset and AI into a repeatable capability.”

Together, these teams exemplify what it means to be Customer Zero: Transforming enterprise experience into action, governance into acceleration, and data into durable competitive advantage.

Advancing our data culture

Our data council plays a pivotal role in advancing the organization’s transition from data literacy to enterprise data and AI capability. In conjunction with our AI CoE, it creates curricula and sponsors learning pathways, operational practices, and community programs to equip our employees with the skills and mindset required to thrive in a data- and AI-centric world.

While early efforts focused on improving data literacy, our data council’s mission has evolved. Together with our AI CoE, it now works to enable data and AI capability at scale, where employees not only understand data but can effectively apply it to build, operate, and govern intelligent solutions.

Our curriculum includes high-level courses on data concepts, applications, and extensibility of AI tools like Microsoft 365 Copilot, as well as data products like Microsoft Purview and Microsoft Fabric.

By facilitating AI and data training, offering internally focused data and AI certifications, and fostering internal community engagement, our council ensures that employees develop the capabilities required to responsibly build and operate AI-powered solutions. Achieving data and AI certifications not only promotes career development through improved data literacy, it also enhances the broader data-driven culture within our organization.

“We recognize that AI capability is built when data skills are applied directly to real AI scenarios and business outcomes—not when learning exists in isolation,” Uribe says. “Our focus is not just teaching our teams about data; it is enabling employees to apply data to create AI‑driven outcomes. When teams understand how data powers AI systems, they can make better decisions, design better products, and build more responsible AI experiences.”

Lessons learned

Our data council was created to develop and execute a cohesive data strategy across Microsoft Digital and to foster a strong data culture within our organization. Over time, several critical lessons have emerged.

Executive sponsorship enables transformation

Executive sponsorship is key to ensuring the implementation and adoption of a data strategy. Our leaders are committed to delivering and sustaining a robust data strategy and culture and have been effective champions of the council’s work.

“Leadership provides support and reinforcement of the council’s mission, as well as guidance and clarity related to diverse organizational priorities,” Baccino says.

Cross-functional collaboration accelerates impact

Our council’s work has also benefited from the diverse representation offered by different disciplines across our organization. Embracing diverse perspectives and understanding various organizational priorities is critical to implementing a successful data strategy and culture in a large and complex organization like Microsoft Digital.

Modern platforms allow for scalable AI productivity

Technology and architecture also play a critical role in enabling enterprise data and AI capability. Platforms like Microsoft Purview and Microsoft Fabric provide the governance, discovery, and analytics infrastructure required to create trusted, AI-ready data ecosystems.

Combined with strong leadership support and community engagement, these platforms allow our organization to move beyond isolated data projects toward connected, enterprise-wide intelligence.

As our organization continues to evolve, our data council’s strategic work and valuable insights will be crucial in shaping the future of data-driven decision making and AI transformation at Microsoft.

Key takeaways

Here are some things to keep in mind as you contemplate forming a data council to help you manage and scale AI impacts responsibly at your own organization:

  • A data mesh strikes the balance enterprises have been chasing. By formalizing domain ownership while enforcing standards through shared platforms, you avoid both chaotic decentralization and slow, over-centralized control.
  • Governance is an accelerator when it’s automated and embedded. Using platforms like Microsoft Purview and Microsoft Fabric, governance shifts from a manual gatekeeping function to a built‑in capability that enables faster, trusted analytics and AI.
  • AI systems are only as strong as their discovery layer. A unified enterprise data catalog allows both people and AI agents to find, trust, and use data consistently—turning standards into operational speed.
  • Customer Zero turns theory into enterprise‑ready execution. By operating its own data and AI platforms at scale, Microsoft Digital provides real telemetry and practical feedback that directly shapes product readiness.
  • Building AI capability is a cultural effort, not just a technical one. Our data council’s focus on applied learning, certification, and real-world AI scenarios ensures data skills translate into durable business outcomes.
  • AI scale exposes the cost of fragmented data ownership. A data council cuts through silos by aligning priorities, resolving tradeoffs, and concentrating investment on the data assets that matter most for AI impact.
  • Shared metrics create shared ownership. Publishing data quality and AI‑readiness scores at the leadership level reinforces accountability and positions data as a core enterprise asset.

The post Harnessing AI: How a data council is powering our unified data strategy at Microsoft appeared first on Inside Track Blog.

Powering the new age of AI-led engineering in IT at Microsoft http://approjects.co.za/?big=insidetrack/blog/powering-the-new-age-of-ai-led-engineering-in-it-at-microsoft/ Thu, 05 Mar 2026 17:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=22539

When generative AI burst into the mainstream, it landed in our IT engineering organization like a shockwave.

There was excitement, curiosity, skepticism, and no shortage of questions about what this technology meant for the future of IT.

At Microsoft Digital—the company’s IT organization—we didn’t start with a grand transformation plan. Instead, we started with a realization: AI wasn’t just another tool to roll out. It was a fundamental shift in how engineering work could happen.

For years, our IT teams have been focused on scale, reliability, and operational excellence. Those priorities didn’t change. What changed were the possibilities.

Suddenly, engineers could draft code in seconds, summarize complex systems instantly, or automate work that had once consumed hours or days. It was an opportunity to take the skills and capabilities of our people and amplify them with AI.

That realization forced us to step back and ask harder questions.

How do you help thousands of engineers understand what AI can actually do to impact their day-to-day work? How do you move from experimentation to trust? And how do you adopt AI in a way that strengthens engineering fundamentals instead of eroding them?

The answer came in the form of a phased journey grounded in people, culture, and continuous learning.

Phase 1: Awareness and access

It might sound surprising when speaking about engineering processes, but our first challenge wasn’t technology; it was understanding.

When generative AI entered the conversation, most engineers saw the headlines and dabbled in various tools, but few understood fully what it meant for their work. Some were excited, others were wary. Many simply didn’t know where to start. That gap between awareness and practical value was the first barrier we had to address.

We realized early that top-down mandates wouldn’t work. Telling engineers to “use AI” without context or relevance would only deepen skepticism. Instead, we focused on something both simpler and more difficult: Exposure.

We started by making AI visible and accessible in the tools engineers already used. GitHub Copilot. Microsoft 365 Copilot. Early copilots embedded directly into engineering workflows. The goal wasn’t immediate productivity gains. It was familiarity. Letting engineers see, firsthand, what AI could and couldn’t do.

Just as important, we talked openly about limitations.

AI wasn’t perfect. It hallucinated. It made confident mistakes. And that honesty mattered. By framing AI as an assistant, we reinforced the role of engineering judgment. Engineers didn’t need to fear losing control. They needed to understand how to stay in control.

We also made experimentation safe.

No quotas. No forced adoption metrics. Engineers were encouraged to try AI on low‑risk tasks: summarizing documentation, generating test cases, or exploring unfamiliar codebases. Small wins built confidence, confidence built curiosity, and curiosity drove organic adoption.

As that experimentation took hold, the mindset began to shift.

“We encouraged tool usage and adoption so people would at least play around with AI,” says Mukul Singhal, a partner group engineering manager in Microsoft Digital. “And once they did, they started seeing the value. That’s when the mindset shifted from ‘AI might replace me’ to ‘AI can be my companion.’”

Over time, conversations changed from ‘Should we use AI?’ to ‘Where does AI help most?’

Engineers began sharing prompts, tips, and lessons learned with one another. What started as individual exploration turned into community learning. Awareness gave way to momentum.

Phase one was about providing access to explore, to question, and to learn. And that foundation made everything that followed possible.

Phase 2: Culture shift

Access created awareness and awareness created curiosity.

As more engineers began experimenting with AI, we noticed a pattern. Some teams were moving faster, learning faster, and reducing friction in their day‑to‑day work. Others stalled after initial trials. The difference wasn’t technical skill or capability; it was mindset.

To move forward, we had to shift how AI was perceived from something optional or experimental to something that was simply part of how modern engineering gets done.

That meant normalizing AI as a trusted partner in the engineering process.

Leaders played a critical role in that shift. Rather than positioning AI as a productivity shortcut, they framed it as a way to strengthen engineering fundamentals: clearer design discussions, better documentation, faster feedback loops, and more time for deep problem‑solving. The message was intentional and consistent. Using AI wasn’t about cutting corners; it was about reimagining how work gets done.

We also had to address a fear that surfaced early: that AI adoption was a signal of replacement rather than empowerment.

“People started shifting from the mindset of ‘Will AI work?’ to ‘AI is working for me,’” says Veera Mamilla, a principal group engineering manager in Microsoft Digital. “I think that was a very transformational shift, to where I believe a lot of engineers in the organization started believing in AI.”

That framing mattered.

As engineers incorporated AI into their workflows, success stopped being measured by output alone. The focus shifted to outcomes. Did AI help you understand a system faster? Did it surface risks earlier? Did it free up time to focus on higher‑value work?

Over time, AI stopped feeling like a novelty. It became part of the engineering fabric. We reinforced it through leadership modeling, peer learning, and shared success stories. Teams no longer asked whether AI belonged in their workflows. They asked how to use it responsibly and effectively.

Phase 3: Upskilling and role evolution

Once AI moved from curiosity to expectation, the challenge of skill building became unavoidable.

From the start, we made a deliberate choice: This would be an upskilling and reskilling journey, not a wholesale replacement of roles. The goal wasn’t a new workforce. It was an investment in the one we had.

That decision shaped everything that followed.

Early upskilling efforts focused on practical entry points. Prompt engineering. Tool literacy. Understanding how copilots and early agents behaved in real engineering workflows. We treated these as something every engineer needed to experiment with, regardless of discipline.

But it quickly became clear that skills alone weren’t the full story. Roles themselves were starting to evolve.

Across software development, service engineering, and cloud network engineering, the work was shifting from manual execution toward orchestration and oversight. Engineers were no longer expected to do every task end‑to‑end by hand. Instead, they were learning how to guide AI, review its output, and decide where automation made sense and where it didn’t.

As part of this shift, we began researching how the industry itself was redefining engineering roles. Leaders examined emerging job descriptions from across the market and compared them with Microsoft’s own role frameworks. At the time, there was no formal “AI engineer” role in the internal job library. Rather than creating a new title, the focus stayed on evolving expectations within existing roles.

The idea of an “AI‑native engineer” emerged not as a job description, but as a mindset.

An AI‑native engineer still understands systems, architecture, and risk. What’s different is how that expertise gets applied. Routine tasks are delegated to AI. Judgment, design, and accountability stay with the human. Engineers move from doing all the work themselves to supervising work done in partnership with AI.

“Your title might still be software engineer or principal engineer,” says Ragini Singh, a partner group engineering manager in Microsoft Digital. “But if you’re acting like an AI engineer, what does that actually mean? That question helped us start defining how these roles were evolving.”

This evolution looked different across disciplines. Software engineers focused on AI‑assisted coding, test generation, and spec‑driven development. Service engineers leaned into AI for incident response, knowledge capture, and operational decision support. Cloud network engineers began moving from manual intervention toward intelligent orchestration and agent‑assisted troubleshooting. The common thread wasn’t identical tooling; it was a shared shift toward higher‑order work and reduced toil.

Phase 4: Embedding AI across the engineering lifecycle

By this phase, we knew individual productivity gains were simply the starting point for larger and broader benefits.

Early on, most AI usage showed up in familiar places: Code suggestions, documentation summaries, quick answers. Useful, but fragmented. The bigger opportunity emerged when we stepped back and asked a harder question: What would it look like if AI were embedded across the entire engineering lifecycle, not just used at isolated moments?

We stopped thinking in terms of tools and started thinking in terms of flow. Design. Build. Test. Deploy. Operate. Improve. AI needed to show up across all of it, in ways that reinforced how engineers already worked.

In software engineering, that meant pulling AI earlier into the process. We began using it to help draft requirements, reason through design options, and review code with broader system context to accelerate how quickly we could get to informed decisions. Coding assistance mattered, but it was no longer the center of gravity.

Testing and quality followed a similar pattern. AI supported test generation, defect analysis, and code review, reducing repetitive effort and helping issues surface sooner. That gave engineers more time to focus on quality and architecture instead of cleanup.

In service engineering, we embedded AI into incident management and operational workflows. Engineers used it to summarize incidents, surface relevant knowledge, and analyze signals across systems. In cloud network engineering, AI helped shift work away from manual intervention toward orchestration and intelligent troubleshooting. Across disciplines, the principle stayed the same: AI should reduce friction, not introduce it.

As we scaled this approach, one thing became clear. Embedding AI wasn’t just a technical exercise. It was a systems change.

“If AI is only showing up at one step, you don’t get the full value,” says Sudhakar Sadasivuni, a principal group engineering manager in Microsoft Digital. “The real impact comes when it’s integrated across the lifecycle, where engineers can design, build, operate, and learn faster as a system.”

As AI became part of core workflows, engineers remained accountable for outcomes. AI output was reviewed, tested, and validated like any other engineering input. Embedding AI didn’t lower the bar for rigor. It raised expectations around judgment, oversight, and data quality. We became more deliberate about responsibility and governance.

Over time, these integrations created compound benefits.

Faster design cycles reduced downstream rework. Better testing lowered operational noise. Improved operational insight shortened recovery times. AI stopped being something we used occasionally and became something the engineering system itself was built around.

Phase 5: Eliminating toil and accelerating outcomes

At some point, every AI story hits the same test. Does it actually make engineers’ days better? For us, that proof showed up fastest in the elimination of toil.

Across Microsoft Digital, engineers have always spent time on work that was necessary but draining. That work included manual troubleshooting, repetitive diagnostics, log analysis, and routine operational tasks that kept systems running but didn’t move the organization forward.

AI gave us a chance to change that.

A photo of Garrison.

“Toil reduction is the biggest thing. That’s where engineers’ eyes light up. If we can eliminate toil, engineers will flock to use AI. I really believe it.”

Beth Garrison, principal cloud network engineer, Microsoft Digital

In cloud network engineering, for example, troubleshooting used to require manually reconstructing what happened, such as logging into devices, chasing configurations, and piecing together context after the fact. As we began introducing agents and machine learning into these workflows, that work shifted. Instead of spending time assembling the picture, engineers could generate the views they needed faster and focus on resolving issues.

The same shift showed up in how we used operational data.

Rather than reacting to incidents after impact, we started using machine learning to analyze logs, identify patterns, and surface anomalies earlier. That moved teams from reactive response toward proactive monitoring and prevention.

One thing became clear very quickly: Toil reduction wasn’t just a benefit; it was the catalyst for adoption.

“Toil reduction is the biggest thing. That’s where engineers’ eyes light up,” says Beth Garrison, a principal cloud network engineer at Microsoft Digital. “If we can eliminate toil, engineers will flock to use AI. I really believe it.”

Service engineering followed a similar arc.

Across governance, operations, productivity, and cost management, we began applying agents and automation to simplify complex work and reduce manual review cycles. Governance and compliance workflows became faster and more consistent. Operational processes benefited from guided remediation and earlier insight. Knowledge capture improved as documentation and remediation guidance could be generated and updated automatically.

When we removed repetitive work such as manual triage, rote diagnostics, and endless documentation cleanup, we transformed how engineers spent their time. More focus on design. More proactive problem‑solving. More energy directed toward improving systems instead of just maintaining them.

Toil reduction made the value of AI tangible. It was the moment AI stopped being merely interesting and became indispensable, and our engineering teams started asking where else we could apply it.

Measuring what matters

By the time AI was embedded across our engineering lifecycle, a new question came into focus: “How do we know it’s working?”

In the early days, we paid close attention to usage: which tools engineers were trying, where adoption was growing, and where it stalled. Those signals mattered, and adoption was the leading indicator that people were getting comfortable and starting to integrate AI into real work.

“Adoption was always the starting point. But we were clear from the beginning that usage isn’t the destination. The real goal is impact; more time for engineers to focus on the work that truly matters.”

Ullas Kumble, principal group software engineering manager, Microsoft Digital

But using AI doesn’t automatically mean better outcomes. So, we shifted the conversation and started asking, “What’s different now that our engineers are using AI?”

That change reframed how we thought about measurement. We began looking beyond tool activity to understand impact across the engineering system. Faster design cycles. Earlier defect detection. Reduced time spent on repetitive operational work. Shorter incident resolution. Clearer documentation. Fewer handoffs. Less rework.

These weren’t abstract metrics. They showed up in the flow of work.

We were intentional about not forcing a single definition of value across every role. Software engineers, service engineers, and cloud network engineers experience impact differently. What mattered was that each team could point to tangible improvements in how work moved through the system.

That perspective shaped how leadership talked about success.

“Adoption was always the starting point,” says Ullas Kumble, a principal group software engineering manager at Microsoft Digital. “But we were clear from the beginning that usage isn’t the destination. The real goal is impact; more time for engineers to focus on the work that truly matters.”

Over time, this approach changed the quality of our conversations. Instead of debating whether AI was worth the investment, teams talked about where it was removing friction and where it still wasn’t delivering enough value. Measurement became a tool for learning and prioritization.

Moving forward

Looking ahead, one lesson stands out: this journey isn’t complete.

AI tools will continue to evolve. Agents will become more capable. Roles will keep shifting. What it means to be an engineer will continue to change. And that means our approach must stay grounded in the same principles that guided us from the start: invest in people, reinforce fundamentals, embed AI into real workflows, and stay honest about what’s working and what isn’t.

We didn’t set out to build an AI‑driven engineering organization overnight. We built it phase by phase.

By meeting engineers where they were.
By reshaping culture before redefining roles.
By embedding AI across the lifecycle, not bolting it on.
By reducing toil and measuring impact where it mattered most.

The result is better engineering: powered by AI, guided by human judgment, and built to keep evolving.

Key takeaways

Here’s a set of approaches you can take to establish AI-led engineering for your organization:

  • Start with access and understanding. Give engineers safe, easy access to AI in the tools they already use so curiosity and confidence can develop organically before you push for outcomes.
  • Frame AI as a partner, not a replacement. Position AI as an assistant that strengthens engineering judgment and fundamentals rather than a shortcut or a threat to roles.
  • Normalize experimentation without pressure. Encourage low‑risk experimentation and peer sharing instead of mandates, allowing adoption to grow through visible, practical wins.
  • Invest in upskilling. Focus on evolving skills and expectations within existing roles so engineers learn how to guide, review, and stay accountable for AI‑assisted work.
  • Embed AI across the full engineering lifecycle. Look beyond isolated productivity gains and integrate AI into design, build, test, operate, and improve workflows to unlock system‑level impact.
  • Measure impact where engineers feel it. Move past usage metrics and track outcomes like reduced toil, faster feedback, and improved flow so teams can see where AI is truly making work better.

Try it out

Try GitHub Copilot.

The post Powering the new age of AI-led engineering in IT at Microsoft appeared first on Inside Track Blog.

Protecting anonymity at scale: How we built cloud-first hidden membership groups at Microsoft http://approjects.co.za/?big=insidetrack/blog/protecting-anonymity-at-scale-how-we-built-cloud-first-hidden-membership-groups-at-microsoft/ Thu, 26 Feb 2026 17:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=22465 Some Microsoft employee groups can’t afford to be visible. For years, we supported email‑based communities internally here at Microsoft whose very existence depends on anonymity. These include employee resource groups, confidential project teams, and other sensitive audiences where simply revealing who belongs can create real‑world risk. Traditional distribution groups make membership discoverable by default. Owners […]

The post Protecting anonymity at scale: How we built cloud-first hidden membership groups at Microsoft appeared first on Inside Track Blog.

Some Microsoft employee groups can’t afford to be visible.

For years, we supported email‑based communities internally here at Microsoft whose very existence depends on anonymity. These include employee resource groups, confidential project teams, and other sensitive audiences where simply revealing who belongs can create real‑world risk.

Traditional distribution groups make membership discoverable by default. Owners can see members. Admins can see members. In some cases, other users can infer membership through directory queries or tooling.

That model doesn’t work when anonymity is a requirement.

A photo of Reifers.

“When the SFI wave hit, it was made clear to us that we needed to keep our people safe, and to do that, we needed to build a new hidden memberships group MVP. We needed to raise the bar with modern groups, and we needed to do it in six months or miss meeting our goals.”

Brett Reifers, senior product manager, Microsoft Digital

For over 15 years, we relied on a custom, on‑premises solution that enabled employees to send and receive messages through groups with fully hidden memberships.

The system worked, but we were deprecating the Microsoft Exchange servers that it ran on. At the same time, we were also deploying our Secure Future Initiative (SFI), which required us to reassess legacy systems that could expose sensitive data or slow incident response, including hidden membership groups.

The system wasn’t broken, but it represented concentrated risk simply by existing outside our modern cloud controls and monitoring.

“When the SFI wave hit, it was made clear to us that we needed to keep our people safe, and to do that, we needed to build a new hidden memberships group MVP,” says Brett Reifers, a product manager in Microsoft Digital, the company’s IT organization. “We needed to raise the bar with modern groups, and we needed to do it in six months or miss meeting our goals.”

The mandate was clear. Preserve anonymity, eliminate on‑premises dependencies, and do it quickly.

A photo of Carson.

“Our solution would enable us to deprecate our legacy on-premises Exchange hardware while maintaining the privacy of our employee groups, and it would do so in a cloud-first manner.”

Nate Carson, principal service engineer, Microsoft Digital

Instead of retrofitting hidden membership into standard Microsoft 365 groups, we asked a different question: What if the group lived somewhere else entirely? What if users interacted with a simple, secure front end, while all membership expansion and mail flow occurred in a locked‑down tenant built specifically for this purpose?

That idea became the foundation for Hidden Membership Groups: A new cloud‑first architecture that would separate user experience, leverage first‑party Microsoft services, and keep our group memberships hidden from everyone—including owners and administrators—by design.

“Our solution would enable us to deprecate our legacy on-premises Exchange hardware while maintaining the privacy of our employee groups, and it would do so in a cloud-first manner,” says Nate Carson, a principal service engineer in Microsoft Digital.

Once we settled on a solution, our next step was to get support for solving a problem not many people thought much about.

“Not everyone was aware of how serious of a situation we were in,” Carson says. “We had to show everyone what was at stake, and to share our solution with them.”

After taking their plan on the road, the team got the buy-in it needed, and that’s when the real work started.

Planning to solve business problems with security built-in

Before we designed anything, we had to be clear about what success meant.

Hidden Membership Groups aren’t just another collaboration feature. They support scenarios where anonymity isn’t optional; it’s foundational. That reality shaped every requirement that we built into our solution, including:

1. Absolute privacy

Group membership couldn’t be immediately visible to users, group owners, or administrators, under any circumstances. That requirement ruled out standard group models from the start.

2. Cloud only

Any new solution had to live entirely in our cloud, use first‑party services, and align with modern identity, security, and compliance practices. On‑premises infrastructure wasn’t an option.

3. Scale

Some groups had a handful of members. Others had tens of thousands. Membership changed frequently, and those changes had to propagate safely and predictably without exposing data or degrading performance.

4. Separation of concerns

User interaction and membership truth couldn’t live in the same place. Employees needed a simple way to discover groups, request access, and manage participation, without ever interacting with the system that stored or expanded membership.

5. Self‑service with guardrails

The solution needed to reduce operational overhead, not introduce a new bottleneck. Group lifecycle management had to be automated, auditable, and secure, while still giving teams flexibility.

6. Simple to use

Employees shouldn’t need special training. They shouldn’t need to understand tenants, identity synchronization, or mail routing. The experience needed to be intuitive, consistent, and accessible—without compromising security.

Once those requirements were clear, our solution started to emerge. Incremental changes wouldn’t be enough. A traditional group model wouldn’t work. The solution required a new architecture—one designed around isolation, automation, and intentional limitation.

That’s when we started the engineering work.

Creating a cloud-first architecture

Designing for hidden membership meant eliminating ambiguity. If any surface could reveal membership, even indirectly, it didn’t belong in the design.

That constraint led us toward a model built on strict isolation, explicit APIs, and intentionally narrow interfaces. The result is straightforward to use, but deliberately difficult to interrogate.

Two tenants, with sharply separated responsibilities

At the foundation of the solution is a two‑tenant model.

Our primary Microsoft 365 tenant is where employees authenticate, discover groups, and initiate actions. A secondary, isolated tenant hosts the distribution lists and performs mail expansion for Hidden Membership Groups.

A photo of Mace.

“Tenant isolation is what makes the privacy guarantee real. By moving membership expansion to a tenant that users and owners can’t access, we removed the possibility of accidental exposure. The system simply doesn’t give you a place where membership can be seen.”

Chad Mace, principal architect, Microsoft Digital

That separation matters because the secondary tenant isn’t designed for interactive use. Only Exchange and the minimum directory constructs required for mail routing and expansion are enabled.

Operationally, when an employee sends email to a Hidden Membership Group, they send to a mail contact visible in the corporate tenant. That contact routes to the corresponding distribution group in the isolated tenant, where membership expansion occurs. Expanded messages are then delivered back to recipients’ inboxes in the corporate tenant, so sent and received mail lives where users already work.
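The routing path described above can be modeled in a few lines. The following Python sketch is illustrative only: the class names, addresses, and in-memory dictionaries are assumptions made for this example, not the production implementation or any Exchange API.

```python
from dataclasses import dataclass, field


@dataclass
class IsolatedTenant:
    """Hypothetical stand-in for the locked-down tenant that holds membership."""
    groups: dict[str, set[str]] = field(default_factory=dict)

    def set_members(self, group_address: str, members: set[str]) -> None:
        self.groups[group_address] = set(members)

    def expand(self, group_address: str) -> set[str]:
        # Expansion happens only here. The result is used for delivery and
        # is never stored in the corporate tenant.
        return set(self.groups.get(group_address, set()))


@dataclass
class CorporateTenant:
    """Hypothetical stand-in for the tenant where employees send and read mail."""
    isolated: IsolatedTenant
    # Mail contacts map a visible address to a group in the isolated tenant.
    contacts: dict[str, str] = field(default_factory=dict)
    inboxes: dict[str, list[str]] = field(default_factory=dict)

    def send(self, sender: str, to_contact: str, body: str) -> None:
        target_group = self.contacts[to_contact]         # 1. contact lookup
        recipients = self.isolated.expand(target_group)  # 2. expansion
        for recipient in recipients:                     # 3. delivery back
            self.inboxes.setdefault(recipient, []).append(f"{sender}: {body}")
```

In this model, the corporate side holds only the contact and the delivered mail. The member list exists solely inside `IsolatedTenant`, which mirrors the separation of user interaction from membership truth.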

“Tenant isolation is what makes the privacy guarantee real,” says Chad Mace, a principal architect in Microsoft Digital. “By moving membership expansion to a tenant that users and owners can’t access, we removed the possibility of accidental exposure. The system simply doesn’t give you a place where membership can be seen.”

Identity without interactive access

This isolated tenant only works if it can resolve recipients. To enable that, our development team used Microsoft Entra ID multi‑tenant organization identity sync to represent corporate users in the secondary tenant.

These identities are treated as business guest identities, and we disable sign‑in to prevent interactive access. The tenant can perform expansion, but nothing more.

However, complete isolation wasn’t technically possible. Privileged access always exists at some level. The design response was to minimize that exposure. Access to the isolated tenant is tightly restricted, and membership changes flow through automation rather than broad UI-based administration.

The goal: reduce exposure to the smallest viable operational group.

API-first automation as the control plane

With the tenancy and identity model established, the team needed a single, consistent way to create groups, connect objects across tenants, and manage changes without introducing new administrative workflows. That’s where the APIs come in.

A photo of Pena II.

“We split the backend into multiple APIs so the system could scale without becoming fragile. That let us separate everyday operations from high-volume membership work and keep performance predictable.”

John Pena II, principal software engineer, Microsoft Digital

The backend is intentionally modular, split into three distinct APIs:

  • The control API handles group creation, configuration, and cross‑tenant coordination.
  • The membership API handles standard add and remove operations.
  • The bulk membership API handles large‑scale operations involving tens of thousands of users, with services designed to run long‑lived jobs, manage throttling, and recover from partial failures.

“We split the backend into multiple APIs so the system could scale without becoming fragile,” says John Pena II, a principal software engineer in Microsoft Digital. “That let us separate everyday operations from high-volume membership work and keep performance predictable.”

The APIs run as PowerShell-based Azure Functions and use managed identity patterns, including federated identity credentials, to securely connect across tenants.
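To illustrate why routine and bulk membership work are separated, here’s a minimal sketch of the pattern. The production APIs are PowerShell-based Azure Functions; this example uses Python purely for illustration. The threshold, batch size, retry budget, `ThrottledError`, and the `add_batch` callback are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

BULK_THRESHOLD = 100   # assumed cutoff between routine and bulk requests
BATCH_SIZE = 500       # assumed batch size for long-running jobs
MAX_ATTEMPTS = 3       # assumed retry budget per batch


class ThrottledError(Exception):
    """Raised by the (hypothetical) directory client when it is rate limited."""


@dataclass
class BulkJobResult:
    succeeded: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)  # candidates for a resume run


def choose_api(member_count: int) -> str:
    """Route a request to the routine or bulk path by size."""
    return "bulk-membership" if member_count > BULK_THRESHOLD else "membership"


def batches(items: list[str], size: int) -> Iterable[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_bulk_add(members: list[str],
                 add_batch: Callable[[list[str]], None],
                 batch_size: int = BATCH_SIZE) -> BulkJobResult:
    """Add members batch by batch, retrying throttled batches and recording
    batches that never succeed so the job can be resumed later."""
    result = BulkJobResult()
    for batch in batches(members, batch_size):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                add_batch(batch)
                result.succeeded.extend(batch)
                break
            except ThrottledError:
                if attempt == MAX_ATTEMPTS:
                    result.failed.extend(batch)
                # A real job would wait (back off) here before retrying.
    return result
```

Keeping the retry and resume logic in the bulk path means a small add or remove never waits behind a job that is processing tens of thousands of users.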

Creating the user experience with Power Apps

For the front end, we built a Canvas app in Power Apps, backed by Dataverse. The goal was speed and flexibility, without compromising strict privacy boundaries.

By using Power Apps as the primary interaction layer, we deliver a secure, modern experience without unnecessary custom infrastructure. The Canvas app provides a single, focused surface for discovering, joining, and managing hidden membership groups, while all sensitive operations remain behind controlled APIs and tenant boundaries. This separation allows the team to iterate quickly on experience design without weakening the privacy guarantees that the solution depends on.

Power Platform also simplifies how security is enforced across the solution. Dataverse enables fine‑grained, role‑based access, so users see only the data they’re entitled to see, while sensitive membership information stays entirely out of the client layer. That reduces long‑term maintenance overhead and makes it easier to evolve the solution as requirements change.

“From the beginning, we designed everything with security roles and workflows in mind,” says Shiva Krishna Gollapelly, senior software engineer in Microsoft Digital. “Dataverse let us control who could see or change data without building additional APIs or storage layers, and keeping everything inside the Power Apps ecosystem saved us a lot of maintenance over time.”

Dataverse plays a precise role here: it maintains the datastore the app needs to function without becoming a secondary membership repository.

A photo of Amanishahrak.

“Using the Power Platform let us move fast, integrate deeply with Microsoft identity, and enforce security without building a full web stack from scratch.”

Bita Amanishahrak, software engineer II, Microsoft Digital

From a security posture perspective, Dataverse security is used intentionally to restrict what different users can see and do, and the Power App was developed with security roles and workflows in mind.

Short version: the app brokers intent, the APIs execute it, and all the pieces that need to stay separate do exactly that.

“Using the Power Platform let us move fast, integrate deeply with Microsoft identity, and enforce security without building a full web stack from scratch,” says Bita Amanishahrak, a software engineer in Microsoft Digital.

The architectural intent is consistent throughout—isolate the sensitive plane and ensure the user plane operates only through controlled interfaces.

Benefits and impact

The most important outcome of the new architecture is also the simplest: Hidden membership stays hidden.

Anonymity isn’t enforced by policy. It’s enforced by architecture. Membership data never appears in the user experience or administrative tooling, and it doesn’t surface as a side effect of scale.

“We’re no longer asking people to trust that we’ll handle sensitive membership carefully through process,” Reifers says. “The system makes exposure structurally impossible.”

The impact was immediate.

At launch, we migrated more than 2,200 hidden membership groups, representing over 200,000 users, from the legacy on‑premises system into the new cloud‑first architecture. Groups ranged from small, tightly controlled communities to audiences with tens of thousands of members, all supported without special handling.

“Some of these groups are massive,” Pena says. “We knew from the beginning we were dealing with memberships in the tens of thousands, which is why we designed bulk operations as a first‑class capability instead of an afterthought.”

The separation between routine APIs and bulk‑membership APIs proved critical, enabling large migrations and ongoing changes without degrading day-to-day performance.

Operationally, moving to a cloud‑only model reduced both risk and complexity. Decommissioning the on‑premises Exchange infrastructure eliminated specialized maintenance requirements and brought monitoring, auditing, and access controls into alignment with our modern cloud standards.

Delivery speed also mattered. Driven by Secure Future Initiative urgency and strong executive sponsorship, the team designed and delivered a minimum viable product in less than six months.

“That timeline forced discipline,” Reifers says. “We focused on what mattered: Security, privacy guarantees, scale, and a UX that wouldn’t disrupt group owners and/or members that had relied on a 15-year-old tool.”

Everything else was secondary.

A photo of Gollapelly.

“Most users never think about tenants or APIs. They just see a clean experience that does what they need, without exposing anything it shouldn’t.”

Shiva Krishna Gollapelly, senior software engineer, Microsoft Digital

From an employee perspective, the experience became simpler and safer. Users now interact through a Power Platform app consistent with the rest of Microsoft 365.

Discovering a group, requesting access, or leaving a group no longer requires understanding the architecture behind it.

“Most users never think about tenants or APIs,” Gollapelly says. “They just see a clean experience that does what they need, without exposing anything it shouldn’t.”

The result is sustainable. The platform protects anonymity at scale, simplifies operations, boosts resiliency, and can evolve without reopening core privacy questions.

Moving forward

Delivering the initial solution was only the beginning.

The team sees Hidden Membership Groups as more than a single solution. It’s a reusable pattern for sensitive collaboration in a cloud‑first world: isolate what matters most, automate everything else, and design experiences that don’t require trust to be safe.

As adoption grows, the team plans to support additional anonymity-sensitive scenarios while maintaining the same underlying model.

“We don’t want every sensitive scenario inventing its own workaround,” Mace says. “This gives us a pattern we can reuse confidently.”

Future priorities include improving lifecycle and ownership experiences, strengthening auditing and reporting for approved administrators, and enhancing self‑service workflows—without compromising membership privacy. If it risks exposing membership, it doesn’t ship.

With the legacy system fully retired, Reifers reflects on what the team accomplished to get here.

“We shipped a new enterprise pattern in six months using our first party tools,” Reifers says. “We achieved this because a stellar team cared about the mission. That’s the takeaway.”

Key takeaways

Use these tips to strengthen your privacy, simplify your operations, and future-proof your organization’s collaboration systems:

  • Prioritize privacy by design. Embed privacy considerations from the start to protect sensitive information in all collaboration scenarios.
  • Architect for scale. Treat bulk operations as a first-class capability so large groups are supported efficiently.
  • Automate and modernize workflows. Replace legacy systems with cloud-native solutions to reduce risk, improve transparency, and enable continuous improvement.
  • Streamline user experience. Provide intuitive, consistent interfaces that make it easy for users to access, join, or leave groups without requiring technical knowledge.
  • Enforce strict access and auditing controls. Align monitoring and administration with modern cloud standards to maintain security and accountability.
  • Create reusable patterns. Establish and share successful privacy patterns to avoid reinventing solutions for each new case.
  • Focus on operational simplicity and resilience. Design systems that are easy to maintain and improve, freeing up teams to concentrate on innovation rather than upkeep.

Protecting AI conversations at Microsoft with Model Context Protocol security and governance http://approjects.co.za/?big=insidetrack/blog/protecting-ai-conversations-at-microsoft-with-model-context-protocol-security-and-governance/ Thu, 12 Feb 2026 17:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=22324 When we gave our Microsoft 365 Copilot agents a simple way to connect to tools and data with Model Context Protocol (MCP), the work spoke for itself. Answers got sharper. Delivery sped up. New patterns of development emerged across teams working with Copilot agents. That ease of communication, however, comes with a responsibility: Protect the […]

The post Protecting AI conversations at Microsoft with Model Context Protocol security and governance appeared first on Inside Track Blog.

When we gave our Microsoft 365 Copilot agents a simple way to connect to tools and data with Model Context Protocol (MCP), the work spoke for itself.

Answers got sharper. Delivery sped up. New patterns of development emerged across teams working with Copilot agents.

That ease of communication, however, comes with a responsibility: Protect the conversation.

Questions came up, such as: Who’s allowed to speak? What can they say? And what should never leave the room?

Microsoft Digital, the company’s IT organization, and the Chief Information Security Officer (CISO) team, our internal security organization, are leaning on those questions to help us shape our strategy and tooling around MCP internally at Microsoft.

A photo of Kumar.

“With MCP, the problem is not the inherent design; it’s that every improper server implementation becomes a potential vulnerability. Even one misconfigured server can give the AI the keys to your data.”

Swetha Kumar, security assurance engineer, Microsoft CISO

Our approach is intentionally straightforward.

Start secure by default. Use trusted servers. Keep a living catalog so we always know which voices are in the room. Shape how agents communicate by requiring consent before making changes.

We minimize what’s shared outside our walls, watch for drift, and act when something looks off. Our goal is practical governance that lets builders move fast while keeping our data safe.

The risk we design for is a misconfigured or improperly implemented server, and it’s why our controls prioritize clear ownership, simple choices, and visible guardrails.

“With MCP, the problem is not the inherent design; it’s that every improper server implementation becomes a potential vulnerability,” says Swetha Kumar, a security assurance engineer in the Microsoft CISO organization. “Even one misconfigured server can give the AI the keys to your data.”

Understanding MCP and the need for security

MCP is a simple standard that lets AI systems “talk” to the right tools and data without custom integration work. Think of it like USB‑C for AI. Instead of building a new connection every time, teams plug into a common pattern. That standardization delivers speed and flexibility—but it also changes the security equation.

Before MCP, every integration was its own isolated conversation.

“Now, one pattern can unlock many systems,” Kumar says. “It’s a win and a risk. When AI can reach more systems with less effort, we must be precise about who’s allowed to speak, what they can say, and how much gets shared.”

We frame this as communications security.

The question isn’t just, “Is this API secure?” It’s “Is this a conversation we trust?” We want to know which servers are in the room, what actions they’re permitted to take, and how we’ll notice if something changes. At the same time, we keep the cognitive load low for builders. They choose from trusted options, see clear prompts before an agent makes edits, and move on. Simple choices lead to safer outcomes.
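As a rough illustration of those three checks (is the server in the room, is the action permitted, and has the user agreed to an edit), here’s a minimal Python sketch. The catalog shape, the `CatalogEntry` fields, the action names, and the `authorize` function are hypothetical and exist only for this example. They aren’t part of MCP or of our internal tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One vetted server in a hypothetical catalog."""
    server_id: str
    owner: str
    allowed_actions: frozenset[str]


# Assumed set of actions that change state and therefore need consent.
WRITE_ACTIONS = frozenset({"create", "update", "delete"})


def authorize(server_id: str,
              action: str,
              user_consented: bool,
              catalog: dict[str, CatalogEntry]) -> tuple[bool, str]:
    """Return (allowed, reason) for a requested action on a server."""
    entry = catalog.get(server_id)
    if entry is None:
        return False, "server is not in the catalog"
    if action not in entry.allowed_actions:
        return False, "action is not permitted for this server"
    if action in WRITE_ACTIONS and not user_consented:
        return False, "user consent is required before making changes"
    return True, "allowed"
```

Returning a reason alongside the decision keeps the builder experience simple: a blocked request can show a clear prompt instead of failing silently.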

“MCP enables granular control over the tools and resources exposed to the Large Language Model,” Kumar says. “But that means the developer is responsible for configuring it correctly—which tools an agent can see, what actions a server can take, and what context is shared.”

This approach helps both sides.

Product teams get a consistent way to extend their agents while security teams get consistent places to add guardrails—at discovery, access, and throughout the flow of requests and responses. Everyone operates from the same playbook.

When we treat MCP this way, we protect the conversation without slowing it down. We know who’s speaking. We know what they can do. And we can prove it.

Assessing MCP security across four layers

Every MCP session creates a conversation graph. An agent discovers a server, ingests its tool descriptions, adds credentials and context, and starts sending requests. Each step—metadata, identity, content, and code—introduces potential risk.

We evaluate those risks across four layers so we can catch failures early, contain blast radius, and keep conversations in bounds.

However, the big picture is just as important as the details.

“We take a holistic view of MCP security: start with the ecosystem, then specify controls across the four layers,” Kumar says. “The layers make the work concrete, but the goal stays the same—unified governance, shared education, and faster detect-and-mitigate when a server is at risk.”

Applications and agents layer

This is where user intent meets execution. Agents parse prompts, discover tools, select actions, and request changes. MCP clients live here, deciding which servers to trust and when to ask for user consent.

  • What can go wrong
    • Tool poisoning or shadowing. A server advertises safe‑looking actions but performs something else.
    • Silent swaps. A tool’s metadata changes and the client keeps trusting an altered “voice.”
    • No sandbox. The agent can request edits or run code without strong guardrails.
  • What we watch for
    • Unexpected tool descriptions or capabilities at connect time.
    • Edit attempts on critical resources without explicit user consent.
    • Abnormal tool‑selection patterns across sessions.

AI platform layer

The AI platform layer includes the AI models and runtimes that interpret prompts and call tools, along with orchestration logic and safety features.

  • What can go wrong
    • Model supply‑chain drift. Unvetted models, unsafe updates, or compromised fine‑tunes change behavior.
    • Prompt injection via tool text. Descriptions and responses steer the model toward unsafe actions.
  • What we watch for
    • Model provenance and update cadence tied to agent behavior changes.
    • Signals of jailbreaks or instruction overrides in prompts and intermediate messages.
    • Output drift linked to specific tools or servers.

Data layer

This layer covers business data, files, and secrets the conversation can touch.

  • What can go wrong
    • Context oversharing. Session data, files, or secrets get packed into the model’s context and leak to a third‑party server.
    • Over‑scoped credentials. Long‑lived tokens, broad scopes, or wrong audience claims enable lateral movement.
  • What we watch for
    • Size and sensitivity of context passed to tools.
    • Token hygiene, including short lifetimes, least‑privilege scopes, and correct audience claims.
    • Data egress patterns that don’t match a tool’s declared purpose.

Infrastructure layer

The infrastructure layer includes compute, network, and runtime environments.

  • What can go wrong
    • Local servers with too much reach. Excessive access to environment variables, file systems, or system processes.
    • Cloud endpoints without a gateway. No TLS enforcement, rate limiting, or centralized logging.
    • Open egress. Servers call out to the internet where they shouldn’t.
  • What we watch for
    • All remote MCP servers registered behind the API gateway.
    • Runtime signals, such as authentication failures, burst traffic, or unusual geographies.
    • Network policies that restrict outbound calls to certain targets.

Across all four layers, the throughline is AI communications security. We decide who can speak and verify what was said—and keep listening for change.

Establishing a secure-by-default strategy

We start by closing the front door. We recommend that every remote MCP server sit behind our API gateway, giving us a single place to authenticate, authorize, rate‑limit, and log. There are no direct calls and no blind spots.

A photo of Enjeti

“Everything we do starts with securing the MCP server by default and that begins by registering it in API Center for easier discovery. We rely solely on vetted and attested MCP servers, ensuring every call comes from a trusted footprint.”

Prathiba Enjeti, principal PM manager, Microsoft CISO

Next, we decide who gets a voice.

Teams choose from a vetted list of MCP servers. If someone connects to an unapproved endpoint, they receive a friendly nudge and a clear path to register it. No shaming—just fast correction and a better inventory the next time around.

Identity comes next. Servers expect short‑lived, least‑privilege tokens with the right scopes and audience. Admin paths require strong authentication, and where possible, we use proof‑of‑possession to bind tokens to the client and reduce replay risk. Secrets don’t live in code, keys rotate, and audit trails are in place.

“Everything we do starts with making the MCP server secure by default, and that begins by registering it in API Center for easier discovery,” says Prathiba Enjeti, a principal product manager in the Microsoft CISO organization. “We only use vetted and attested MCP servers. That’s how we keep the conversation safe without slowing it down.”

On the client side, we slow agents at the right moments. Agents can’t touch high‑risk tools without explicit consent. Tool descriptions are verified on connection and compared to approved contracts. If a tool’s “voice” drifts, we block the call.
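As an illustration of comparing a tool's "voice" to an approved contract, here is a minimal Python sketch. It assumes the contract is stored as a SHA-256 fingerprint per tool name and that the fields `name`, `description`, and `inputSchema` define the voice. The function names and the storage format are hypothetical, not a description of our internal tooling.

```python
import hashlib
import json


def tool_fingerprint(tool: dict) -> str:
    """Hash the fields of a tool description that define its 'voice'."""
    canonical = json.dumps(
        {key: tool.get(key) for key in ("name", "description", "inputSchema")},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(advertised: list, approved: dict) -> list:
    """Return names of tools that are unapproved or whose fingerprint changed."""
    drifted = []
    for tool in advertised:
        name = tool.get("name", "")
        # A missing entry and a changed fingerprint are both treated as drift.
        if approved.get(name) != tool_fingerprint(tool):
            drifted.append(name)
    return drifted


# Hypothetical tool and approved contract, for illustration only.
get_case = {
    "name": "get_case",
    "description": "Read case history",
    "inputSchema": {"type": "object"},
}
approved_contract = {"get_case": tool_fingerprint(get_case)}
altered = dict(get_case, description="Read case history and delete it")
print(find_drift([altered], approved_contract))
```

A client could call `find_drift` at connect time and refuse any tool whose name is returned.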

We also minimize what’s shared.

Context is trimmed to what the task requires. Sensitive data isn’t included by default, and third‑party servers get only what they need—not the whole transcript. Output filters and prompt shields sit alongside the model to prevent risky inputs from becoming risky actions.
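A minimal sketch of context trimming, assuming the session context is a flat dictionary and that each task declares the fields it needs. The field names and the `minimize_context` helper are hypothetical.

```python
def minimize_context(context: dict, needed: set, sensitive: set) -> dict:
    """Keep only fields the task needs, and never fields marked sensitive."""
    return {
        key: value
        for key, value in context.items()
        if key in needed and key not in sensitive
    }


# Hypothetical session data, for illustration only.
session = {
    "case_id": "C-1001",
    "summary": "Printer offline",
    "customer_email": "user@example.com",
    "full_transcript": "long conversation text",
}
shared = minimize_context(
    session,
    needed={"case_id", "summary", "customer_email"},
    sensitive={"customer_email"},
)
print(shared)
```

In this sketch the sensitive set wins over the needed set, so a field is excluded by default even when a task asks for it.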

Isolation completes the design. Local servers run in containers with tight file and network permissions. Hosted servers allow only the outbound calls they need, and inbound traffic flows through the gateway, with TLS and logging enforced.

Simple rules with visible guardrails.

“We only use vetted MCP servers,” Enjeti says. “That’s how we keep the conversation safe without slowing it down.”

How we run MCP at scale: architecture, vetting, and inventory

We keep MCP safe by making three things intentionally boring: architecture, vetting, and inventory. One defined path. One vetting flow. One living catalog.

Architecture

We recommend remote MCP servers sit behind an API gateway, giving us a single place to authenticate, authorize, validate, rate‑limit, and log. Transport Layer Security (TLS) is required by default, and for sensitive endpoints, we can require mutual TLS. Outbound egress is pinned to approved destinations using private endpoints and firewall rules, so servers can’t “call anywhere.” Runtime protection continuously watches for credential abuse, injection patterns, burst traffic, and odd geographies.
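In practice, egress pinning is enforced with network controls such as firewall rules and private endpoints. The Python sketch below only illustrates the underlying allowlist decision; the host names are hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical approved destinations, for illustration only.
APPROVED_HOSTS = {"api.contoso.example", "graph.contoso.example"}


def egress_allowed(url: str, approved_hosts: set = APPROVED_HOSTS) -> bool:
    """Allow an outbound call only when the exact host is on the approved list."""
    host = urlparse(url).hostname  # lowercased by urlparse; None if no host
    return host is not None and host in approved_hosts


print(egress_allowed("https://api.contoso.example/v1/cases"))
print(egress_allowed("https://api.contoso.example.evil.test/"))
```

Matching on the exact host, rather than on a substring or prefix, avoids approving look-alike domains.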

Identity is established up front. We issue short‑lived, least‑privilege tokens with the correct audience and scopes, and admin paths require strong authentication. Where supported, tokens are bound to the client to reduce replay risk. Services use managed identities or signed credentials; secrets don’t live in code, and keys rotate on schedule.
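A sketch of the token hygiene checks described above, assuming the token signature has already been verified and the claims decoded into a dictionary. The one-hour ceiling and the use of a space-separated `scp` claim are assumptions for illustration; `aud`, `iat`, and `exp` are standard JWT claims.

```python
import time
from typing import Optional

MAX_LIFETIME_SECONDS = 3600  # assumed policy ceiling, not a documented value


def token_problems(
    claims: dict,
    expected_audience: str,
    allowed_scopes: set,
    now: Optional[float] = None,
) -> list:
    """Return a list of policy problems found in already-verified claims."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("aud") != expected_audience:
        problems.append("wrong audience")
    exp, iat = claims.get("exp"), claims.get("iat")
    if exp is None or iat is None:
        problems.append("missing exp or iat")
    else:
        if exp <= now:
            problems.append("expired")
        if exp - iat > MAX_LIFETIME_SECONDS:
            problems.append("lifetime too long")
    # Any scope beyond what the server expects counts as over-scoping.
    extra = set(claims.get("scp", "").split()) - allowed_scopes
    if extra:
        problems.append("over-scoped: " + ", ".join(sorted(extra)))
    return problems


good = {"aud": "api://mcp", "iat": 1000, "exp": 1600, "scp": "cases.read"}
print(token_problems(good, "api://mcp", {"cases.read"}, now=1200))
```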

Model‑side safety travels with every conversation. Content safety and prompt shields help models ignore risky inputs, while orchestration enforces a per‑tool allowlist, so an agent can’t call tools that aren’t in policy—even if the model suggests it. We also track model versions, allowing behavior changes to be correlated with updates.
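The per-tool allowlist can be pictured as a check that runs in the orchestration layer before any tool call is dispatched, independent of what the model proposes. This is a minimal sketch; the agent and tool names are hypothetical.

```python
class ToolNotAllowed(Exception):
    """Raised when a model suggests a tool outside the agent's policy."""


def authorize_tool_call(agent_id: str, tool_name: str, allowlist: dict) -> None:
    """Raise unless the tool is explicitly allowed for this agent."""
    allowed = allowlist.get(agent_id, set())
    if tool_name not in allowed:
        raise ToolNotAllowed(f"{agent_id} may not call {tool_name}")


# Hypothetical policy, for illustration only.
ALLOWLIST = {"support-agent": {"get_case_history", "search_kb"}}

authorize_tool_call("support-agent", "search_kb", ALLOWLIST)  # passes silently
```

Because the default for an unknown agent is an empty set, anything not listed is denied.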

Clients enforce consent at the edge. “Ask before edits” is enabled by default for write, delete, and configuration changes. When an agent connects, it verifies tool descriptions against the approved contract.
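A sketch of an "ask before edits" gate, assuming each tool action is labeled with a kind and that the client supplies a callback that prompts the user. The `run_with_consent` helper and the kind labels are hypothetical.

```python
from typing import Callable

GATED_KINDS = {"write", "delete", "configure"}


def run_with_consent(
    kind: str,
    description: str,
    execute: Callable[[], object],
    ask_user: Callable[[str], bool],
) -> dict:
    """Run execute() only if the action is ungated or the user approves it."""
    if kind in GATED_KINDS and not ask_user(f"Allow the agent to {description}?"):
        return {"ran": False, "result": None}
    return {"ran": True, "result": execute()}


# A read runs without prompting; a delete is skipped when the user declines.
print(run_with_consent("read", "read case C-1001", lambda: "history", lambda p: False))
print(run_with_consent("delete", "delete case C-1001", lambda: "deleted", lambda p: False))
```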

Observability ties it all together. We’re working toward logging tool calls, resource access, and authorization decisions end‑to‑end with correlation IDs. Detections flag abnormal tool selection, unexpected data egress, or edits without consent. Every server has an owner, a contract, and an approval record, and metadata changes automatically trigger re‑review. Kill switches live at both the client and the gateway when we need them.

Vetting

We don’t “connect and hope.”

Before any MCP server can speak in our environment, it earns trust. Owners declare what the server does (tools and actions), what it touches (data categories and exports), how callers authenticate (scopes and audience), and where it runs (runtime and on‑call ownership).

We start with static checks: manifests must match the contract, side‑effecting actions must be consent‑gated, and tokens must be short‑lived and properly scoped. A software bill of materials (SBOM) must be present, dependencies must be current, and no credentials can be embedded in code.

Then we test like a client would. We snapshot tool metadata on connect and compare it to the approved contract, probe for prompt‑injection and tool‑poisoning, and verify that “ask before edits” triggers for destructive actions.

We also confirm context minimization, validate that egress is pinned to approved hosts, and test resilience under load, including health checks, retry behavior, and isolation using containers with least‑privilege file and network access. Servers are published only when security, privacy, and responsible AI reviews are complete, runbooks and on‑call are in place, and the registry entry is created and pinned.

Inventory

A photo of Janardhanan

“Inventory is the foundation—if we miss a server, we miss the conversation. Every server, regardless of where it’s running or how it’s deployed, must be accounted for in our system.”

Priya Janardhanan, principal security assurance engineering manager, Microsoft CISO

You can’t govern what you can’t see, and MCP shows up in more places than a single system of record. To solve that, we’re building the map from signals and stitching them into one catalog.

“Inventory is the foundation—if we miss a server, we miss the conversation,” says Priya Janardhanan, a principal security assurance engineering manager at Microsoft CISO Operations. “Every server, regardless of where it’s running or how it’s deployed, must be accounted for in our system. Without a complete inventory, we lose visibility into critical operations, risk exposing sensitive data, and undermine our ability to ensure compliance and security.”

In our goal state, endpoint telemetry catches developer‑run servers on laptops and workstations. Repos and CI pipelines reveal intent before anything ships. IDEs (Integrated Development Environments) surface local extensions and configured endpoints. The gateway and our registries anchor what’s approved for business data, while low‑code environments tell us which connectors are in use and where they point.

We normalize and correlate those signals with stable IDs for servers, tools, and owners. Ownership is proven through repositories, gateway services, and environment administrators—on‑call contacts included. Exposure is scored based on data touches, scopes requested, egress rules, and change history, so high‑risk items rise to the top of the queue.
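To make the idea of exposure scoring concrete, here is a small sketch. The weights and field names are invented for illustration and are not our actual scoring model; they only show how data touches, scopes, egress rules, and change history might combine into one sortable number.

```python
# Assumed weights, for illustration only.
SENSITIVITY_POINTS = {"public": 0, "internal": 1, "confidential": 3}


def exposure_score(server: dict) -> int:
    """Combine data sensitivity, scopes, egress, and change history."""
    score = SENSITIVITY_POINTS[server["data_sensitivity"]]
    score += len(server["scopes"])              # one point per requested scope
    score += 3 if server["open_egress"] else 0  # unrestricted outbound calls
    score += min(server["recent_changes"], 5)   # cap the change-history term
    return score


def review_queue(servers: list) -> list:
    """Order servers so the highest exposure is reviewed first."""
    return sorted(servers, key=exposure_score, reverse=True)


server_a = {"id": "a", "data_sensitivity": "confidential",
            "scopes": ["cases.read", "cases.write"],
            "open_egress": True, "recent_changes": 1}
server_b = {"id": "b", "data_sensitivity": "internal",
            "scopes": ["kb.read"], "open_egress": False, "recent_changes": 0}
print([s["id"] for s in review_queue([server_b, server_a])])
```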

Freshness is tracked with last‑seen timestamps, and stale entries are retired over time. Builders can discover and reuse approved servers; reviewers can see what changed since the last approval, and admins get instant visibility into coverage and hotspots.
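A sketch of the freshness check, assuming the catalog keeps a last-seen timestamp per server ID. The 30-day threshold is an assumption chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def stale_entries(
    last_seen: dict,
    now: datetime,
    max_age: timedelta = timedelta(days=30),  # assumed retirement threshold
) -> list:
    """Return server IDs not seen within max_age, sorted for stable output."""
    return sorted(sid for sid, seen in last_seen.items() if now - seen > max_age)


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
catalog_last_seen = {
    "case-history": datetime(2026, 5, 30, tzinfo=timezone.utc),
    "old-prototype": datetime(2026, 3, 1, tzinfo=timezone.utc),
}
print(stale_entries(catalog_last_seen, now))
```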

We’re working toward automated identification and notification for unknown servers. In the ideal state, a registration stub is created when we detect an unknown server on an endpoint. Then, the likely owner is notified, and direct calls are blocked until the server is vetted through an automated process. If tool metadata changes after approval, high-risk actions are paused and routed for re-review, then auto-resumed once approved.
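The ideal-state flow for unknown servers could be sketched as follows. The observation format, the stub fields, and the `registration_stubs` helper are hypothetical; notification and blocking would be handled by other systems and are represented here only as fields on the stub.

```python
def registration_stubs(detected: list, catalog: set) -> list:
    """Create a pending registration stub for each server not in the catalog."""
    stubs = []
    for observation in detected:
        if observation["endpoint"] in catalog:
            continue  # already registered, nothing to do
        stubs.append({
            "endpoint": observation["endpoint"],
            "likely_owner": observation.get("user", "unknown"),
            "status": "pending_vetting",
            "direct_calls_blocked": True,
        })
    return stubs


# Hypothetical telemetry observations, for illustration only.
observations = [
    {"endpoint": "https://mcp.contoso.example/cases", "user": "alice"},
    {"endpoint": "http://localhost:8931", "user": "bob"},
]
known = {"https://mcp.contoso.example/cases"}
print(registration_stubs(observations, known))
```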

“It all revolves around inventory as the foundation,” Janardhanan says. “If we miss a server, we miss the conversation.”

A photo of Hasan

“Agent 365 tooling servers will allow centralized governance for IT admins. That means a single pane where they can see what’s approved, who owns it, what data it touches, and then apply policy.”

Aisha Hasan, principal product manager, Microsoft Digital

Architecture gives us stable choke points. Vetting keeps weak servers out. Inventory keeps our map current. It’s a single pattern for builders and a unified playbook for security.

Governing agents in low‑code and pro-code scenarios

Makers move fast—that’s the point. A Customer Support team needed a Copilot action to pull case history, so they opened Copilot Studio, selected an approved MCP connector, and shipped a first version before lunch. No tickets. No detours. Governance showed up in the flow, not as a blocker.

“Agent 365 tooling servers will allow centralized governance for IT admins,” says Aisha Hasan, a principal product manager at Microsoft Digital. “That means a single pane where they can see what’s approved, who owns it, what data it touches, and then apply policy. We’re moving toward that consolidation so innovation continues while governance gets simpler and more consistent.”

We place guardrails where makers already work. In Copilot Studio, trusted and verified first-party MCP servers are allowed in developer environments to accelerate innovation and encourage experimentation. Riskier or more complex MCP integrations are available in Copilot Studio custom environments and other pro-code tools such as Microsoft 365 Agents Toolkit in VS Code and Microsoft Foundry, but only with clear checks: service ownership, security and privacy review, responsible AI assessment, and consent gating for high‑impact actions.

The allowlist is our north star.

Approved MCP servers and connectors live in one catalog with documented owners, scopes, and data boundaries. Makers choose from that shelf. If an MCP server uses an unverified tool, we enforce endpoint filtering. If there is a misconfiguration, we open a task for the owner and help them build securely.

Permissions stay tight without adding cognitive load. Tokens are short‑lived and scoped to the task. Context is trimmed so only the necessary fields flow to the tool. Third‑party servers never get the full transcript. If a connector’s capabilities change, the runtime compares the new “voice” to what we approved. MCP clients should pause risky actions, notify the owner, and resume automatically once reviewed.

With agent inventory in Power Platform Admin Center and registry in Agent 365, admins get a clean view of which connectors are active, who owns them, what data they touch, and how often they’re called. Organization policies such as DLP and MIP can be enforced in a unified way, with a re‑review when capabilities change. The goal is simple: let builders innovate confidently while maintaining security and compliance.

“MCP servers are powerful AI tools that enable agents to seamlessly integrate and interact with enterprise data and transform business workflows,” Hasan says. “That means the same enterprise data and governance principles are applied equally to MCP servers and other connectors. A robust inventory, an agile policy framework, and an automated workflow for enforcement are cornerstones for successfully governing agents at scale.”

Securing MCP at scale: Operating, monitoring, and enabling

Our work doesn’t stop at go‑live. Once an MCP server is in the catalog, we operate the conversation like a service: measurable, observable, and responsive. Identity and policy guard the front door, but runtime is where we prove the controls work without slowing anyone down.

In practice, operating MCP at scale comes down to four motions:

Observe every tool call end to end. We make the flow observable. Every tool call carries a correlation ID from client to gateway to server and back. Prompts, tool selections, authorization decisions, and resource access should be logged with consistent schemas. Golden signals (latency, errors, saturation) sit alongside safety signals like unexpected egress or edits without consent. Owners and security teams see the same dashboards.
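A sketch of what a consistent log schema with a shared correlation ID might look like. The field names and hop labels are assumptions for illustration; real telemetry would also apply the privacy rules described later in this section.

```python
import json
import uuid
from datetime import datetime, timezone


def new_correlation_id() -> str:
    """Generate one ID at the client and reuse it at every hop."""
    return str(uuid.uuid4())


def log_event(correlation_id: str, hop: str, event: str, decision: str) -> str:
    """Serialize one log record using the same keys at every hop."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "correlation_id": correlation_id,
        "hop": hop,  # "client", "gateway", or "server"
        "event": event,
        "decision": decision,
    }
    return json.dumps(record, sort_keys=True)


cid = new_correlation_id()
log_lines = [
    log_event(cid, "client", "tool_selected:get_case_history", "allow"),
    log_event(cid, "gateway", "token_validated", "allow"),
    log_event(cid, "server", "resource_read:cases", "allow"),
]
for line in log_lines:
    print(line)
```

Because every record carries the same `correlation_id`, a single query can reconstruct the full path of one tool call.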

Detect drift and abnormal behavior early. Detection lives close to the work. We flag abnormal tool patterns, spikes in write operations, burst traffic from new geographies, and context sizes that don’t fit a task. We continuously compare a tool’s “voice” at connect time to the approved version; drift automatically pauses risky actions and pings the owner. Cost controls double as guardrails, using rate limits and budgets to cap blast radius and surface runaway loops early.

Respond with precision instead of blunt shutdowns. Response is graded, not binary. We can block destructive actions and allow reads, or throttle a noisy client without killing the session. Kill switches exist at both the client and the gateway. Playbooks are pre‑approved and integrated into the consoles owners already use, and dry runs are part of muscle memory, so the first switch flip doesn’t happen during an incident.
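A graded response can be thought of as a small decision function rather than a single on/off switch. In this sketch the inputs, the default rate limit, and the three outcomes are assumptions chosen to mirror the behaviors described above.

```python
DESTRUCTIVE_KINDS = {"write", "delete", "configure"}


def decide(
    action_kind: str,
    drift_detected: bool,
    calls_last_minute: int,
    kill_switch: bool = False,
    rate_limit: int = 60,  # assumed per-client limit
) -> str:
    """Return 'block', 'throttle', or 'allow' for a single tool call."""
    if kill_switch:
        return "block"  # the blunt option, reserved for emergencies
    if drift_detected and action_kind in DESTRUCTIVE_KINDS:
        return "block"  # stop destructive actions, keep reads flowing
    if calls_last_minute > rate_limit:
        return "throttle"  # slow a noisy client without ending the session
    return "allow"


print(decide("read", drift_detected=True, calls_last_minute=5))
print(decide("delete", drift_detected=True, calls_last_minute=5))
```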

We treat model behavior as part of operations. Content safety and prompt shields run in production, not just in tests. We pin model versions and watch for output drift after updates. If a model starts suggesting tools out of character, the owner gets paged with the exact prompts and calls that triggered it.

Telemetry respects privacy. Logs avoid sensitive payloads by default and mask what must pass through for forensics. Access is role‑based, retention follows policy, and audit readiness is designed in on day one.
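A minimal sketch of masking sensitive values before a payload is logged. The list of sensitive keys is an assumption for illustration; a production system would likely rely on data classification rather than key names alone.

```python
# Assumed key names to mask, for illustration only.
SENSITIVE_KEYS = {"authorization", "password", "token", "secret"}


def mask_payload(payload: dict) -> dict:
    """Return a copy of the payload with sensitive values replaced."""
    masked = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_KEYS:
            masked[key] = "***"
        elif isinstance(value, dict):
            masked[key] = mask_payload(value)  # recurse into nested objects
        else:
            masked[key] = value
    return masked


request = {
    "tool": "get_case",
    "headers": {"Authorization": "Bearer abc123"},
    "args": {"case_id": "C-1001"},
}
print(mask_payload(request))
```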

Enable builders through templates, education, and reuse. Adoption and education run in parallel. Builders get templates that enable best practices: sample manifests with consent gates, CI checks for token scope and SBOMs, and gateway stubs with sane defaults. A “ten‑minute preflight” runs locally to verify contracts, test consent flows, and check egress before a pull request is opened. IDE lint rules catch common issues early.

“This is how we operate MCP at scale,” says Janardhanan. “Observe the conversation, detect drift early, respond with precision, and teach habits that make the right path the easy path. We run it like a product because that’s what it is.”

Measuring results and moving forward

This program has changed how we build. Reviews move faster because every server follows the same path. Drift is caught early because clients compare a tool’s “voice” on connection. Shadow servers decline as inventory fills in from endpoint, repo, IDE, and gateway signals. Reuse increases because teams can discover trusted servers instead of creating new ones. Incidents resolve faster with correlation IDs across the conversation and kill switches at both the client and the gateway.

It’s also changed how our admins work. One gateway means one perimeter to manage. Policies land once and apply everywhere. Owners see the same telemetry security sees, so fixes happen where the work happens.

Going forward, we’re focused on more consolidation and automation. We’re moving toward a single pane for MCP governance—approve, monitor, and pause from one place. Policy-as-code will keep allowlists, consent rules, and egress boundaries versioned and testable in CI.
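One way to picture policy-as-code is a policy file checked into a repository with a validation function that runs in CI. The policy shape and the rules below are hypothetical examples, not our actual policy format.

```python
# Hypothetical policy document, for illustration only.
POLICY = {
    "allowlist": {
        "case-history-server": {
            "owner": "support-eng",
            "egress": ["api.contoso.example"],
        },
    },
    "consent_required": ["write", "delete", "configure"],
}


def policy_errors(policy: dict) -> list:
    """Return human-readable violations; an empty list means the policy passes."""
    errors = []
    for name, entry in policy.get("allowlist", {}).items():
        if not entry.get("owner"):
            errors.append(f"{name}: missing owner")
        for host in entry.get("egress", []):
            if "*" in host:
                errors.append(f"{name}: wildcard egress host {host}")
    for kind in ("write", "delete"):
        if kind not in policy.get("consent_required", []):
            errors.append(f"consent not required for {kind}")
    return errors


print(policy_errors(POLICY))
```

A CI job could fail the build whenever `policy_errors` returns a non-empty list, which keeps every policy change reviewable and testable.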

Our preflight checks will get smarter, with stronger injection tests, automatic egress validation, and environment‑aware templates. We’ll expand consent patterns so high‑impact actions remain explicit and auditable, even across multi‑tool chains. And we’ll keep shrinking re‑review time, so drift is measured in minutes, not days.

AI conversations are now part of how we build every day. MCP standardizes how agents talk to tools and data. Secure‑by‑default architecture, rigorous vetting, and a living inventory ensure the right voices stay in the room, only what’s needed is shared, and drift is caught early.

The result is simple: teams ship faster with fewer surprises, and governance stays visible without getting in the way. We’ll keep tightening the loop, so saying yes remains both easy and safe.

Key takeaways

If you’re implementing MCP security, consider these key actions to ensure secure, efficient adoption in your organization:

  • Build governance into the maker flow. Embed security, consent, and responsible AI checks directly where teams build—so protection shows up by default, not as an afterthought.
  • Maintain a single allowlist and catalog. Centralize approved MCP servers and connectors with clear ownership, scope, and data boundaries.
  • Enforce scoped, short-lived permissions by default. Automatically limit token scope and duration to minimize risk and exposure.
  • Monitor continuously and detect drift early. Observe activity, flag deviations, and pause risky actions until reviewed and approved by owners.
  • Automate incident response and controls. Leverage pre-approved playbooks, kill switches, and rate limits for fast, precise action.
  • Design for privacy and auditability from day one. Mask sensitive data, restrict log access by role, and ensure audit readiness.
  • Promote education and reuse. Provide templates, training, and feedback loops to encourage safe development and adoption of trusted servers.

The post Protecting AI conversations at Microsoft with Model Context Protocol security and governance appeared first on Inside Track Blog.

Moving from a ‘Scream Test’ to holistic lifecycle management: How we manage our Azure services at Microsoft http://approjects.co.za/?big=insidetrack/blog/moving-from-a-scream-test-to-holistic-lifecycle-management-how-we-manage-our-azure-services-at-microsoft/ Thu, 20 Nov 2025 17:05:00 +0000

The post Moving from a ‘Scream Test’ to holistic lifecycle management: How we manage our Azure services at Microsoft appeared first on Inside Track Blog.

Nearly a decade ago, as we began our journey from relying on on-premises physical computing infrastructure to being a cloud-first organization, our engineers came up with a simple but effective technique to see if a relatively inactive server was really needed.

They dubbed it the “Scream Test.”

“We didn’t have a great server inventory and tracking system, and we didn’t always know who owned a server,” says Brent Burtness, a principal software engineer in Commerce Financial Platforms, who was one of the leaders for the effort in his group. “So, we essentially just turned them off. If someone screamed—‘Hey, why’d you turn off my server?’—then we’d know it was still being used.”

Today, the basic idea behind the Scream Test is being used across the company, but in a more holistic way. Importantly, it’s been incorporated into the overall lifecycle management of our computing infrastructure. And, through the automation tools provided by Microsoft Azure, we have a much more efficient process for making sure that we’re saving time and money by reducing the number of underused machines we operate, monitor, and maintain.

A photo of Apple

“We thought we were going to get rid of a small number of machines that weren’t being used. But we found the actual share was about 15% of all machines, which saved us a lot of effort of moving those unused machines to the cloud. In other words, we downsized on the way to the cloud, rather than after the fact.”

Pete Apple, cloud network engineering architect, Microsoft Digital

Uncovering more than expected

The Scream Test was part of the huge effort to evaluate our on-premises compute resources before we began moving to the Azure cloud. After all, why spend resources moving something that isn’t needed?

Pete Apple, who helped develop the concept of the Scream Test, is a cloud network engineering architect in Microsoft Digital, the company’s IT organization. Looking back, he remembers the surprising results that emerged when they began shutting down specific servers to see who noticed.

“We thought we were going to get rid of a small number of machines that weren’t being used,” Apple says. “But we found the actual share was about 15% of all machines, which saved us a lot of effort of moving those unused machines to the cloud. In other words, we downsized on the way to the cloud, rather than after the fact.”

As part of this process, Apple explains, our engineers looked at two related factors to reduce inefficiencies in our usage of computing resources.

The first was to identify systems that were used infrequently and ran at very low CPU utilization (sometimes called “cold” servers). From that, we could determine which systems in our on-premises environments were oversized, meaning someone had purchased physical machines according to what they thought the load would be, but either that estimate was incorrect or the load diminished over time. We took this data and created a set of recommended Microsoft Azure Virtual Machine (VM) sizes for every on-premises system to be migrated.

“We learned that there’s a lot of orphaned, or underutilized, resources out there,” Burtness says. “These were cases where the workload was so small on a server—like under 5% CPU—that it didn’t make sense to host it on its own machine. We could then move the task or application and get it down to just one or two CPUs on a virtual machine.”
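The right-sizing reasoning can be sketched with simple arithmetic. The 5% threshold comes from the quote above; the 50% target utilization and the function names are assumptions for illustration, and real sizing would also consider memory, disk, and peak load.

```python
import math

COLD_CPU_PERCENT = 5.0    # threshold mentioned in the article
TARGET_UTILIZATION = 0.5  # assumed headroom target, not a documented value


def is_cold(avg_cpu_percent: float) -> bool:
    """A server averaging under the threshold is a consolidation candidate."""
    return avg_cpu_percent < COLD_CPU_PERCENT


def recommended_vcpus(physical_cores: int, avg_cpu_percent: float) -> int:
    """Estimate vCPUs needed to run the same load at the target utilization."""
    busy_cores = physical_cores * avg_cpu_percent / 100
    return max(1, math.ceil(busy_cores / TARGET_UTILIZATION))


# A 16-core machine averaging 4% CPU keeps about 0.64 cores busy.
print(is_cold(4.0), recommended_vcpus(16, 4.0))
```

Under these assumptions, a 16-core server at 4% average CPU maps to a two-vCPU virtual machine, in line with the "one or two CPUs" outcome described above.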

At the time, we did much of this work manually, because we were early adopters. The company now has a number of products available to assist with this review of your on-premises environment, led by Azure Migrate.

Another part of the process was determining which systems were being used for only a few days a month or at certain busy times of the year. These development machines, test/QA machines, and user acceptance testing machines (reserved for final verification before moving code to production) were running continuously in the datacenter but were really only needed during limited windows. For these situations, we applied the tools available in Azure Resource Manager Templates and Azure Automation to ensure the machines would only run when needed.
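The scheduling decision behind running machines only when needed can be sketched as below. This shows only the decision logic; the actual start and stop operations would be carried out by automation such as Azure Automation, which is not shown here. The window dates and function names are hypothetical.

```python
from datetime import date


def should_be_running(today: date, windows: list) -> bool:
    """True when today falls inside any (start, end) window, inclusive."""
    return any(start <= today <= end for start, end in windows)


def desired_action(today: date, windows: list, currently_running: bool) -> str:
    """Return 'start', 'deallocate', or 'none' for one machine."""
    wanted = should_be_running(today, windows)
    if wanted and not currently_running:
        return "start"
    if not wanted and currently_running:
        return "deallocate"
    return "none"


# Hypothetical user acceptance testing window at month end.
uat_windows = [(date(2025, 11, 25), date(2025, 11, 30))]
print(desired_action(date(2025, 11, 26), uat_windows, currently_running=False))
print(desired_action(date(2025, 12, 2), uat_windows, currently_running=True))
```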

Automating with Azure

Today, we don’t have to rely on anything as crude as the Scream Test to find unused and underused computing resources. With 98% of our IT resources operating in the Azure cloud, we have much greater insight into how efficient our network is, so much of the process can be automated.

“We’ve found this effort much easier to manage in the cloud, because all our computing resources are integrated with the Azure portal,” Apple says. “They have an API system and offer various tools within Azure Update Manager and Azure Advisor to help with cost efficiency. It’s kind of like a modern version of Clippy: ‘Hey, it looks like your VM isn’t being used much. Do you want to downsize that or turn it off?’”

(For the uninitiated, Clippy was the Microsoft Office animated paperclip assistant introduced in the late 1990s. It offered tips and help with tasks, like writing and formatting documents. Clippy became iconic for its quirky suggestions, including recommending that you remove things from your desktop that you weren’t using.)

Burtness smiles in a portrait photo.

“With everything being in the Azure portal or in Azure Resource Graph, it’s much more streamlined, and makes it easier to get that data out to the teams. They can then go into the portal and clean up the resource.”

Brent Burtness, principal software engineer, Commerce Financial Platforms

And simply taking the step of turning off stuff that we weren’t using turned out to be very effective. Thanks, Clippy!

Today, we approach this challenge in a more efficient and sophisticated way, taking advantage of Azure tools like Update Manager and Advisor.

“With everything being in the Azure portal or in Azure Resource Graph, it’s much more streamlined, and makes it easier to get that data out to the teams,” Burtness says. “We can run automated queries with Azure Resource Graph. Then we bring that information into our internal Service 360 tool, which we use to give action items to our developers. Each item gives them a link to Azure portal, and they can then go into the portal and clean up the resource.”

Managing for the lifecycle

One of the most important things we learned by using the Scream Test to identify inefficiencies and moving our systems from on-premises servers to the cloud was that it’s an ongoing process, not a fixed-end project.

“We had this idea that it was going to be a one-time event, that we’ll move to the cloud and then we’ll be done,” Apple says. “A better understanding is that it’s a lifecycle. We have integrated this concept of continual evaluation into our processes around everything that’s still on-premises, because we still have labs, we still have physical infrastructure.”

We continue to do this evaluation on a regular basis with both physical and virtual computing resources, because needs and usage are constantly changing.

Cutting our cloud costs

A text graphic shows the savings that one group at Microsoft achieved by becoming more efficient in their compute usage.
In a pilot set of Azure subscriptions, the Commerce Financial Platforms team reduced usage by 233 resources across 36 subscriptions and 17 services in 6 team groups, saving more than $15,000 in monthly operating costs.

“Now we have a basic process around a six-month cycle,” Apple says. “So, every six months we ask, does this still need to be on-premises or should we start moving it to the cloud? And we do the same thing with our cloud resources. Who’s still using these VMs? And we still go through the same review process to see if it’s needed, or if we can shut it down or move it.”

This has resulted in significant cost savings for the company. “We’re up to about 15% to 20% less compute cost, depending on the organization, because of this much better understanding of our business needs,” Apple says.

Better governance, increased security

Another major benefit of this process was establishing much stronger governance of compute resources across the entire organization.

“When we first did the Scream Test, we weren’t always really sure who owned what, in some cases,” Apple says. “We’ve fixed that as part of this process. This governance aspect is a key part of being more efficient with our resources.”

Burtness explains why this is so important.

“It’s critical to know exactly who to contact when there’s something wrong with the server,” Burtness says. “Now, with clearer ownership, clearer accountability, and better inventory, it’s a much better experience.”

Better governance also means tighter security, according to both Apple and Burtness.

“This is really important when it comes to threat-actor response,” Apple says. “Unused servers can often be an entry point for hackers. Or, say we discover that a machine or server is getting hacked; you need to talk to who owns it. If you don’t know, it takes you longer to track them down and combat the hack. That’s not great. Improving our governance has definitely made securing our environment easier.”

Key takeaways

Here are some things to keep in mind when managing your own enterprise compute resources for greater efficiency:

  • It’s not a one-time exercise. For the best results, you should be evaluating your computing resources on a regular schedule to identify “cold” servers and unused infrastructure.
  • Adjust for variable usage patterns. It’s not just about unused servers. Some machines may only be needed for a business function during certain busy times of the year. Consider turning the machines on just to handle the load during those periods and turning them off the rest of the year.
  • Use Azure tools for greater insight. If you’re operating your infrastructure in the Azure cloud, you can much more easily monitor and address orphaned resources using automated tools such as Azure Advisor, Azure Resource Graph, and the Azure portal.
  • Apply your savings to other priorities. “The more efficient you are, the more savings can be applied to other projects or given back to your manager—who is going to be very happy with you,” Apple says.
  • Saving money is not the only benefit. You’ll not only save operating costs, you’ll have a reduced maintenance and monitoring load, better governance, and fewer security vulnerabilities.

Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft http://approjects.co.za/?big=insidetrack/blog/vuln-ai-our-ai-powered-leap-into-vulnerability-management-at-microsoft/ Thu, 16 Oct 2025 16:05:00 +0000

The post Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft appeared first on Inside Track Blog.

]]>
In today’s hyperconnected enterprise landscape, vulnerability management is no longer a back-office function—it’s a frontline defense. With thousands of devices from a multitude of vendors, and a relentless stream of Common Vulnerabilities and Exposures (CVEs), here at Microsoft we faced a challenge familiar to every IT decision maker: how to scale vulnerability response without scaling cost, complexity, or risk.

Enter Vuln.AI, an intelligent agentic system developed by our team in Microsoft Digital—the company’s IT organization—to transform how we identify, prioritize, and resolve vulnerabilities across our enterprise network.

Manual methods can’t keep up

As a company, we detect over 600 million cybersecurity threats every day, according to our latest Digital Defense Report. Some of those signals are bad actors probing our internal network and infrastructure looking for unpatched vulnerabilities. Our infrastructure supports over 300,000 employees and vendors, 25,000 network devices, and over 560 buildings across 102 countries. This scale means we face a constant stream of vulnerabilities—each requiring triage, impact analysis, and remediation.

“While AI enables amazing capabilities for knowledge workers, it also increases the threat landscape, since bad actors using AI are constantly probing for vulnerabilities. Vuln.AI helps keep Microsoft safe by identifying and accelerating the mitigation of vulnerabilities in our environment,” says Brian Fielder, a vice president within Microsoft Digital. 

Historically, our Infrastructure, Networking, and Tenant team here in Microsoft Digital relied on manual assessments to determine which network devices were impacted by new vulnerabilities. Traditional vulnerability scanning tools generate a lot of false positives and false negatives, and a significant amount of analysis still falls to security engineers, requiring manual validation before any vulnerability impact can be communicated to device owners. These manual methods were time-consuming, error-prone, and reactive—our security engineers were spending hours on each vulnerability, at times missing critical threats or sinking too much time into false alarms.

With the vast number of vulnerabilities coming in every day, security engineers needed a scalable way to quickly analyze, prioritize, and respond.

The solution: Vuln.AI

We already achieved dramatic impact with our AI Ops and Network Infrastructure Copilot, which is on track to save us over 11,000 hours of network service management time per year. We built Vuln.AI on top of that investment:

  1. The Research Agent analyzes vulnerability feeds and network metadata from our Infrastructure Data Lakehouse (IDL) built on top of Azure Data Explorer, which regularly ingests data from our device vendors and other sources. Once new vulnerabilities are detected, it automates the identification of impacted devices and integrates with other internal tooling for validation and reporting.
  2. The Interactive Agent acts as a gateway for engineers and device owners to ask follow-up questions and initiate remediation. Through agent-to-agent interaction, it leverages our Network Infrastructure Copilot to query the research agent’s findings. This agentic interface enables real-time decision-making and contextual insights.

Together, these agents are significantly improving our network security operations. The results we’re seeing so far are compelling:

  • A 70% reduction in time to vulnerability insights, enabling faster prioritization and mitigation, minimizing exposure windows.
  • Lower risk of compromise through increased accuracy, quicker detection, and containment of threats.
  • A stronger compliance posture that supports adherence to financial, legal, and regulatory requirements.
  • Higher accuracy in identifying vulnerable devices, reducing false positives and missed threats.
  • Engineering hours saved and reduced fatigue, significantly improving productivity.

Our gains translate to lower operational risk, faster response times, and more resilient infrastructure—critical outcomes for any enterprise navigating today’s threat landscape.

“AI’s true power lies in the problem it’s applied to,” says Ankit Bansal, a senior product manager within Microsoft Digital. “Start by identifying the most time-consuming or painful task in your organization, then explore how AI can augment or improve it. Begin with a small, targeted enhancement and iterate continuously.”

How Vuln.AI works

The system continuously ingests CVE data from our device suppliers’ API feeds and from a publicly available database of known cybersecurity vulnerabilities. It correlates that data with device attributes, such as hardware model and OS, to identify the potential impact on the network and surface actionable insights.
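
The correlation step can be pictured as a join between an advisory and the device inventory. The following is a minimal sketch under stated assumptions, not Vuln.AI’s actual logic: the `Device` and `Advisory` records, their field names, and the exact-match rule on model and OS version are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical inventory record for one network device."""
    hostname: str
    hardware_model: str
    os_version: str


@dataclass(frozen=True)
class Advisory:
    """Hypothetical, simplified view of a vendor advisory for one CVE."""
    cve_id: str
    affected_models: frozenset
    affected_os_versions: frozenset


def find_impacted_devices(advisory, inventory):
    """Return devices whose model AND OS version both appear in the advisory."""
    return [
        device
        for device in inventory
        if device.hardware_model in advisory.affected_models
        and device.os_version in advisory.affected_os_versions
    ]


def summarize(advisory, impacted):
    """Build a one-line summary of how many devices an advisory touches."""
    noun = "device" if len(impacted) == 1 else "devices"
    return f"{len(impacted)} {noun} impacted by {advisory.cve_id}."
```

Real advisories usually describe version ranges rather than exact strings, so a production matcher would need range comparison and human validation on top of this.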

Engineers interact with the system via Copilot, Teams, or custom tooling, which allows seamless integration with our network security teams’ daily workflows.

“We built a hybrid approach in Vuln.AI to guide LLMs through complex security advisories,” says Blaze Kotsenburg, a software engineer in Microsoft Digital. “By combining structured function calls, templated prompts, and data validation, we keep the model focused on producing reliable, actionable insights for vulnerability mitigation.”
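
One way to picture the combination of templated prompts and data validation is shown below. This is an illustrative sketch only: the prompt wording, the JSON keys, and the `known_models` catalog are assumptions made for the example, and the call to the model itself is left out, so `raw_response` stands in for whatever text the model returns.

```python
import json
from string import Template

# Hypothetical prompt template; the advisory text is the only variable part.
PROMPT_TEMPLATE = Template(
    "Read the security advisory below and reply with JSON only, using the keys "
    '"cve_id", "affected_models", and "fixed_in_version".\n\n'
    "Advisory:\n$advisory_text"
)

REQUIRED_KEYS = {"cve_id", "affected_models", "fixed_in_version"}


def build_prompt(advisory_text):
    """Fill the fixed template so every advisory is presented the same way."""
    return PROMPT_TEMPLATE.substitute(advisory_text=advisory_text)


def validate_response(raw_response, known_models):
    """Parse the model's reply and reject anything that fails basic checks."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["affected_models"], list):
        raise ValueError("affected_models must be a list")
    # Guard against invented model names by checking a known catalog.
    unknown = [m for m in data["affected_models"] if m not in known_models]
    if unknown:
        raise ValueError(f"unrecognized hardware models: {unknown}")
    return data
```

Rejecting a malformed or implausible reply outright, rather than trying to repair it, keeps unverified model output from flowing into downstream reports.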

When it came to building Vuln.AI, we relied heavily on our own technology platforms, including: 

  • Azure AI Foundry for model development and deployment
  • Azure Data Explorer to store device metadata and CVEs
  • Agent-to-agent interaction with Network Copilot to query our database for device and inventory knowledge
  • Azure OpenAI models for natural language processing and classification
  • Azure Durable Functions for fine-grained orchestration and custom LLM workflows

“We chose Durable Functions for Vuln.AI because it allowed us to confidently orchestrate complex, stateful research,” says Mike Lollis, a senior software engineer in Microsoft Digital. “The reliability and simplicity of the framework meant we could shift our focus to engineering the intelligence behind the agent, especially the prompting strategies used in Vuln.AI’s backend processing.”

Vuln.AI in action

Consider a common scenario: a new CVE that affects a network switch has just been published. Vuln.AI’s research agent immediately flags the vulnerability, maps it to potentially affected devices in our network inventory, and pushes the findings to an internal database.

This data then becomes immediately accessible in our internal tools, where it is validated and approved by security engineers. Following this, network engineers are provided with precise information about their vulnerable devices.

Engineers can prompt Vuln.AI’s interactive agent to instantly retrieve the following information:

“12 devices impacted by CVE-2025-XXXX. Would you like me to suggest some next steps for mitigation or remediation?”

With Vuln.AI, network engineers can now begin vulnerability response operations much more quickly—no spreadsheet wrangling and no delays.

“AI is only as good as the data you provide,” says Linda Lee, a product manager II within Microsoft Digital. “Much of the success with Vuln.AI came from our dedicated efforts to source comprehensive vulnerability data and device attributes. For effective AI-powered solutions, you really need to invest in a strong data foundation and a strategy for how to integrate into the rest of your infrastructure.”

It’s about automating manual workflows and research.

“Vuln.AI has reduced our triage time by over 50%,” says Vincent Bersagol, a principal security engineer in Microsoft Digital.

This is allowing our engineers to focus on deeper analysis.

“The synergy between security and AI engineering has unlocked a new level of precision in vulnerability insights,” Bersagol says. “This is just the beginning.”

The journey ahead

Our journey with AI-powered vulnerability management has only just begun. Looking ahead, our roadmap for Vuln.AI includes:

  • Extending data coverage to include more hardware suppliers
  • Integrating more detailed device profiles for more targeted vulnerability response
  • Supporting autonomous workflows to streamline network engineers’ remediation efforts
  • Incorporating other AI agents to support more security use cases

These enhancements will further reduce risk, accelerate response times, and empower engineers to focus on more strategic initiatives.

“Trust is the foundation of everything we do in Microsoft Digital,” Bansal says. “Securing our network is essential to upholding that trust. Intelligent solutions like Vuln.AI not only help us stay ahead of emerging threats—they also establish the blueprint for integrating AI more deeply into our security operations.”

For IT leaders, Vuln.AI offers a blueprint for modern vulnerability management:

  • Scalable: Handles thousands of devices and vulnerabilities with ease
  • Accurate: Reduces false positives and missed threats
  • Efficient: Saves time, money, and resources
  • Secure: Built on Microsoft’s trusted AI and security frameworks

In a world where every second counts and any threat can be costly, Vuln.AI transforms vulnerability management from a bottleneck into a competitive advantage for Microsoft.

Key takeaways

As your organization looks for ways to improve security and threat response in a fast-changing landscape, consider the following insights on how AI is reshaping vulnerability management at Microsoft:

  • Fight fire with fire: The threat landscape has broadened dramatically due to bad actors using AI. Supplementing your own efforts with AI can help you manage your risk more effectively than traditional vulnerability management.
  • Agility is key: Effective vulnerability response hinges on acting fast. An AI-powered solution like Vuln.AI can cut the time needed to analyze and mitigate vulnerabilities by over 50%, enabling organizations to enhance security operations at scale.
  • The future is now: Looking ahead, Microsoft Digital will integrate agentic workflows into more security operations, boosting efficiency in risk prevention and in threat detection and response so that security practitioners and developers can focus on more strategic projects.

]]>
20623
Keeping our in-house optical network safe with a Zero Trust mentality http://approjects.co.za/?big=insidetrack/blog/keeping-our-in-house-optical-network-safe-with-a-zero-trust-mentality/ Thu, 16 Oct 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=20611 When it comes to corporate connectivity at Microsoft, a minute of lost connection can lead to catastrophic disruptions for our product teams, sleepless nights for our network engineers, and millions of dollars of lost value for the company. That’s why we built our own optical network at our headquarters in Washington state, and that’s why […]

The post Keeping our in-house optical network safe with a Zero Trust mentality appeared first on Inside Track Blog.

]]>
When it comes to corporate connectivity at Microsoft, a minute of lost connection can lead to catastrophic disruptions for our product teams, sleepless nights for our network engineers, and millions of dollars of lost value for the company.

That’s why we built our own optical network at our headquarters in Washington state, and that’s why we’re building similar networks at other regional campuses around the United States and the rest of the world.

With so much on the line, we need to make sure these in-house networks never go down.

But how are we doing that?

We’re applying the same robust Zero Trust approach we take to security and identity. While our optical networks are extremely reliable, any complex system can be knocked offline. In keeping with the Zero Trust mentality we have as a company, we trust the integrity of what we’ve built, but we still needed a backup system that goes beyond redundancy to provide true resilience.

Driven by this goal, we created a Zero Trust Optical Business Continuity Disaster Recovery (BCDR) network that combines two fully independent optical systems designed to sustain uninterrupted services, even during systemic failures. The result is more confidence for our employees and vendors, less pressure on our network engineers, and comprehensive network resilience that will protect us against a major outage.

The urgency of resilience

In 2021, our team in Microsoft Digital, the company’s IT organization, deployed our first next-generation optical network to serve the exclusive network needs of our Puget Sound metro campuses. It offers more bandwidth on less fiber for a lower operational cost than leasing from traditional carriers.

“Puget Sound is a highly concentrated developer network where we need to provide very high throughput,” says Patrick Alverio, principal group software engineering manager for Infrastructure and Engineering Services within Microsoft Digital. “Our optical system is the backbone of all that traffic.”

Our state-of-the-art optical network fulfills our need for fast and reliable connectivity at up to 400 Gbps between core sites, labs, data centers, and the internet edge. We built this network on Reconfigurable Optical Add/Drop Multiplexer (ROADM) technology, which delivers dynamic reconfiguration; colorless, directionless, and contentionless (CDC) capabilities; flexible grid support; remote provisioning; and automation. It also features a full-mesh topology that provides a layer of redundancy.

But what if the entire ROADM-based system fails?

There are plenty of operational risks that can derail even the most robust network. Anything from misconfigured automation scripts to policy changes to misaligned software versioning to simple human error can cause outages.

To some degree, those kinds of minor disruptions are inevitable. But catastrophic events like fiber cuts, failures in the ROADM operating system, or even natural disasters have the potential for even more wide-ranging disruption.

During a catastrophic outage, thousands of engineers, developers, researchers, and other technical employees who need access to crucial lab environments and data centers could lose connectivity. That can sabotage feature delivery, disrupt product patches, interrupt updates, and halt all kinds of core product functions.

On top of normal software development operations, new AI tools demand massive bandwidth and consistent uptime. Finally, our hybrid networks feature paths integrated with Microsoft Azure that consume on-premises resources, so they also stand to benefit from increased resilience.

A catastrophic network outage can cause incredible damage to all of these business functions. In fact, we experienced exactly that in 2022.

A fiber cut combined with a ROADM system hardware reboot caused a five-minute outage in our Puget Sound metro region. In this environment, every minute of lost connectivity can result in significant financial impact, making network resilience absolutely essential.

“We don’t want even a second of downtime,” says Vinoth Elangovan, a senior network engineer in Hybrid Core Network Services within Microsoft Digital, who designed and implemented the Zero Trust Optical BCDR network for Microsoft. “We needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.”

Delivering greater network resilience

To ensure we could deliver uninterrupted network connectivity even in the midst of a catastrophic outage, we needed to consider the technical demands of a truly resilient system. Five design pillars helped us assemble our architectural criteria:

  1. Independent optical systems: To provide true resilience, our primary and BCDR platforms needed to operate autonomously.
  2. Physically independent paths: Circuits should avoid shared conduits, fibers, and splices to operate completely independently.
  3. Separate control software: The primary and backup networks should operate through dedicated network management systems (NMSs), automation, and provisioning domains.
  4. Unified client interface: Both systems needed to terminate into the same interface to unify service for clients and applications.
  5. Survivability by design: We couldn’t assume that any system would be immune to failure. Instead, we built for the best possible outcomes.
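
The second pillar, physically independent paths, lends itself to a simple automated check. The sketch below is a hypothetical illustration, not the validation Microsoft Digital runs: it assumes each path can be described as sets of conduit, fiber, and splice-point identifiers, and the `OpticalPath` record and its field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpticalPath:
    """Hypothetical description of the physical elements a path traverses."""
    name: str
    conduits: frozenset
    fibers: frozenset
    splice_points: frozenset


def shared_elements(primary, backup):
    """Return the physical elements both paths use, grouped by type."""
    shared = {
        "conduits": primary.conduits & backup.conduits,
        "fibers": primary.fibers & backup.fibers,
        "splice_points": primary.splice_points & backup.splice_points,
    }
    # Keep only the types where an overlap actually exists.
    return {kind: items for kind, items in shared.items() if items}


def is_physically_independent(primary, backup):
    """True when the two paths share no conduit, fiber, or splice point."""
    return not shared_elements(primary, backup)
```

Returning the overlapping elements, not just a yes or no, tells an engineer exactly which shared conduit or splice would need to be rerouted.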

The result was the Zero Trust Optical BCDR architecture, a layered approach to optical networking. It consists of our primary, ROADM-based transport layer and a secondary, MUX-based transport layer, both terminating into a single logical port channel.

Both systems are live and active, which means they deliver production services through their own independent fibers, power supplies, and software stacks. By layering fully independent optical domains and logically unifying them at the Ethernet edge, the network can sustain a complete failure of one system and maintain continuity.

That physical and operational independence is the difference between simple redundancy and robust resilience.

“Our core responsibility is the employee experience, so our main design thrust was making sure it’s seamless and uninterrupted—even during an outage,” Elangovan says.

Optical network backed by a BCDR network

A schematic of an optical network running between different nodes and backed up by a BCDR network.
The optical network in our Puget Sound region connects core sites to labs, datacenters, and the internet edge, while the BCDR network provides backup connections to deliver resilience in case of a catastrophic network failure.

A typical ROADM optical network connects campus and data center sites to the internet edge. Our design features three interconnected optical rings, with two internet edges as multi-directional nodes, while other sites operate as dual-degree nodes with bidirectional redundancy. Meanwhile, our campuses and datacenters are designated as critical sites and equipped with Optical BCDR links to ensure enhanced resiliency. In the event of a complete Optical ROADM line failure, these critical sites retain connectivity.

In the event of an outage on the primary network, the port channel handles forward continuity automatically, shifting WAN traffic between optical paths in real time.

The transition occurs seamlessly and transparently, with no noticeable impact to clients.

Coupling at the Ethernet layer provides clients and applications with one logical interface, automatic load balancing and traffic distribution, and seamless failover, regardless of which optical domain is providing service.
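
Conceptually, the port channel behaves like a single interface that spreads flows across whichever member links are up. The following is a deliberately simplified software model of that idea, offered as an assumption-laden sketch: real port channels hash packet header fields in hardware, and the member names and the CRC32 hash used here are illustrative choices only.

```python
import zlib


def pick_member(flow_id, members):
    """Choose an active member link for a flow by hashing its identifier.

    `members` maps a link name to True (up) or False (down). Flows are
    distributed over the links that are up; when one optical domain fails,
    every flow lands on the surviving link without the caller changing
    anything.
    """
    active = [name for name, is_up in members.items() if is_up]
    if not active:
        raise RuntimeError("no active member links in the port channel")
    # CRC32 is stable across runs, so a given flow keeps the same link
    # for as long as the set of active links does not change.
    return active[zlib.crc32(flow_id.encode("utf-8")) % len(active)]
```

For example, with `{"roadm": True, "mux": True}` flows are split across both domains, while `{"roadm": False, "mux": True}` sends every flow over the MUX-based path.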

“Our initial goal was to provide high-throughput connectivity for major labs, with less than six minutes of downtime per year,” says Blaine Martin, principal engineering manager for Hybrid Core Network Services in Microsoft Digital. “That represents a service level of 99.999% network continuity, and we’re aiming for even better moving forward.”
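
The link between the 99.999% figure and “less than six minutes” is straightforward arithmetic, sketched below. The function name is ours, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given availability level."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR
```

At 99.999% availability this works out to about 5.26 minutes of downtime per year, which is consistent with the stated goal of less than six minutes.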

A new era of confidence for network engineers

For the network engineers who keep Microsoft employees and resources connected, the Zero Trust Optical BCDR network relieves much of the pressure that comes from resolving outages.

When a network goes down, engineers have an enormous set of responsibilities to manage: processing the incident report, assigning severity, performing checks, notifying internal teams, providing updates, and engaging with physical support teams—all with a profound urgency to restore productivity.

Dialing those pressures back has been a huge benefit.

“Before, we were dependent on a single system, even with redundancies, so the human experience was like firefighting,” says Kevin Bullard, Microsoft Digital principal cloud network engineering manager responsible for maintaining WAN interconnectivity between labs. “Now, if the primary optical network is having a problem, I don’t even see it.”

There will always be pressure on network engineers to restore connectivity during an outage, but they can breathe easier knowing it won’t cost the company millions of dollars as the time to resolve ticks away. And in non-emergency situations like core site migrations, the BCDR network provides a much easier way to shunt services while the main network is offline.

“Our internal users have become more confident that they can stay connected, no matter what,” says Chakri Thammineni, principal cloud network engineer for Infrastructure and Engineering Services in Microsoft Digital. “That gives the people responsible for maintaining our enterprise networks incredible peace of mind.”

Fortunately, there hasn’t been a substantial network outage in the Puget Sound metro area since 2022. But our network engineering teams know that if and when it happens, the BCDR network will be ready to maintain service continuity.

With our Puget Sound network protected, we have plans in place to extend this model to other metro areas. Naturally, we have to balance population, criticality, and the knowledge that elevated reliability and availability come with a cost.

Our selection criteria for new BCDR networks have largely centered around two factors: expansions of AI-critical infrastructure and concentrations of secure access workspaces (SAWs) for technical employees. With these criteria in mind, we’re planning new BCDR networks first in the Bay Area and Dublin, then in Virginia, Atlanta, and London.

Zero Trust optical BCDR architecture represents a paradigm shift in enterprise network resilience, and we’re committed to expanding the model to benefit both conventional workloads and the expanding infrastructure demands of AI.

“We’re always looking ahead into industry trends to stay at the bleeding edge, whether that’s in the technology we provide for our customers or the networks we use to do our own work,” Alverio says. “We refuse to accept the status quo, and we’re elevating the experience for employees across Puget Sound and Microsoft as a whole.”

Driving AI innovation in optical network resilience

Our journey towards an AI-driven optical network is gaining momentum.

As part of our Secure Future initiative, we’ve automated our Optical Management Platform credential rotation and are actively developing intelligent incident management ticket enrichment, auto-remediation, link provisioning, deployment validation, and capacity planning.

AI plays a central role in this transformation.

With Microsoft 365 Copilot and GitHub Copilot integrated into our engineering workflows, we’re accelerating development cycles, improving code accuracy, and uncovering optimization opportunities that would otherwise take hours of manual effort.

These Copilots are also helping our engineers analyze network patterns, simulate outcomes, and validate deployment logic before execution, reducing human error and strengthening our Zero Trust posture. Over time, we’re evolving toward a system where AI not only assists but proactively predicts potential disruptions, recommends remediations, and continuously learns from operational telemetry.

These advancements are paving the way for a future where our optical infrastructure can anticipate issues, recover faster, and operate with the agility and assurance expected in a Zero Trust environment.

Key takeaways

If you’re considering implementing your own optical and BCDR networks, consider these tips:

  • Understand the technical components of resilience: Independent optical systems, physically independent paths, separate control software, a unified client interface, and survivability by design are the key technical components of true resilience.
  • Plan from a preparedness and value perspective: Evaluate the critical points in your infrastructure and determine where you can get the most value out of resilient connectivity.
  • Ensure your teams have the right skillset: Carefully consider the right workforce to run those systems and be accountable for their operation.

]]>
20611