Harnessing AI: How a data council is powering our unified data strategy at Microsoft

Information technology is an ever-evolving landscape. Artificial Intelligence is accelerating that evolution, providing employees with unprecedented access to information and insights. Data-driven decision making has never been more critical for businesses to achieve their goals.

In light of this priority, we have established a Microsoft Digital Data Council to help accelerate our companywide AI-powered transformation.

Our data council is a cross-functional team with representation from multiple domains within Microsoft, including Microsoft Digital, the company’s IT organization; Corporate, External, and Legal Affairs (CELA); and Finance.

A photo of Tripathi.

“By championing robust data governance, literacy, and responsible data practices, our data council is a crucial part of our AI-powered transformation. It turns enterprise data into a strategic capability that fuels predictive insights and intelligent outcomes across the organization.”

Naval Tripathi, principal engineering manager, Microsoft Digital

Our data council’s mission is to drive transformative business impact by establishing a cohesive data strategy across Microsoft Digital, empowering interconnected analytics and AI at scale. Our vision is to guide our organization toward Frontier Firm maturity through a clear blueprint for high-quality, reliable, AI-ready data delivered on trusted, scalable platforms.

“By championing robust data governance, literacy, and responsible data practices, our data council is a crucial part of our AI-powered transformation,” says Naval Tripathi, principal engineering manager in Microsoft Digital. “It turns enterprise data into a strategic capability that fuels predictive insights and intelligent outcomes across the organization.”

Enterprise IT maturity

This article is part of a series on Enterprise IT maturity in the era of agents. We recommend reading all four of these articles to gain a comprehensive view of how your organization can transform with the help of AI and become a Frontier Firm.

  1. Becoming a Frontier Firm: Our IT playbook for the AI era
  2. Enterprise AI maturity in five steps: Our guide for IT leaders
  3. The agentic future: How we’re becoming an AI-first Frontier Firm at Microsoft
  4. AI at scale: How we’re transforming our enterprise IT operations at Microsoft (this story)

Our evolving data strategy

Over the past two decades, we at Microsoft—along with other large enterprises—have continuously evolved our data strategies in search of the right balance between control and agility. Early approaches were highly decentralized, with different teams owning and managing their own data assets. While this enabled local optimization, it also resulted in inconsistent quality and limited enterprise-wide insight.

Our subsequent shift toward centralized data platforms brought much-needed standardization, security, and scalability. However, as data platforms grew more sophisticated, ownership often drifted away from the business domains closest to the data, slowing responsiveness and diluting accountability.

Today, we and other leading companies are embracing a more balanced, federated approach, often described as a data mesh. Rather than forcing all our data into a single centralized system or allowing unchecked decentralization, the data mesh formalizes domain ownership while embedding governance, quality, and interoperability directly into shared platforms.

With this approach, our domain teams publish data as well-defined, discoverable products, while common standards for security, metadata, and compliance are enforced through automation rather than manual processes. This model preserves enterprise trust and consistency without sacrificing speed or autonomy.

By adopting a data mesh mindset, we can scale analytics and AI more effectively across the organization while still keeping ownership closely connected to the business. The result is a system that supports innovation at the edges, strong governance at the core, and seamless collaboration across domains, enabling the transformation of data from a technical asset to a strategic, enterprise-wide capability.

Quality, accessibility, and governance

To scale enterprise data and AI, organizations must first ensure their data is trusted, discoverable, and responsibly governed. At Microsoft Digital, our data strategy is designed to create data foundations that power intelligent applications and effective decision making across the company.

A photo of Uribe.

“High-quality, well-governed data is essential to accelerate implementation and adoption of AI tools. Data quality, accessibility, and governance are imperatives for AI systems to function effectively, and recognizing that is propelling our data strategy.”

Miguel Uribe, principal PM manager, Microsoft Digital

By implementing a data mesh strategy at scale, we aim to unlock valuable data insights and analytics, enabling advanced AI scenarios. Our data council focuses on three core dimensions that make AI-ready data possible:

  • Quality: Making sure enterprise data is reliable and complete
  • Accessibility: Enabling secure and discoverable access to data
  • Governance: Protecting and managing our data responsibly

Together, these dimensions form the foundation for scalable innovation and AI-powered data use. They connect siloed data and ensure consistent, high-quality access across the enterprise—enabling both humans and AI systems to work from the same trusted data foundation. As AI use cases mature, this foundation allows AI agents to retrieve and reason over data through enterprise endpoints, while supporting advanced analytics, data science, and other technology scenarios.

“High-quality, well-governed data is essential to accelerate implementation and adoption of AI tools,” says Miguel Uribe, a principal PM manager in Microsoft Digital. “Data quality, accessibility, and governance are imperatives for AI systems to function effectively, and recognizing that is propelling our data strategy.”

Quality

AI-ready data is available, complete, accurate, and high-quality. By adopting this standard, our data scientists, engineers, and even our AI agents are better able to locate, process, and govern the information needed to drive our organization and maximize AI efficiencies.

Using Microsoft Purview, our data council oversees the monitoring of data attributes to ensure fidelity. It also monitors parameters to enforce standards for accuracy and completeness.
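
The checks themselves run in Purview, but the underlying idea is simple. Here is a minimal sketch of the kind of completeness and accuracy rules such monitoring enforces; the dataset, required columns, allowed values, and thresholds are hypothetical, not our production configuration:

```python
# Illustrative only: a simplified stand-in for the kinds of completeness and
# accuracy rules that governance tooling such as Microsoft Purview can enforce.
# The dataset, required columns, allowed values, and thresholds are hypothetical.
from typing import Any

REQUIRED_COLUMNS = {"employee_id", "org", "cost_center"}   # hypothetical schema
ALLOWED_ORGS = {"Microsoft Digital", "CELA", "Finance"}     # hypothetical value domain
COMPLETENESS_THRESHOLD = 0.98
ACCURACY_THRESHOLD = 0.99


def completeness(rows: list[dict[str, Any]], column: str) -> float:
    """Fraction of rows where the column is present and non-empty."""
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows) if rows else 0.0


def accuracy(rows: list[dict[str, Any]]) -> float:
    """Fraction of rows whose 'org' value falls inside the allowed domain."""
    valid = sum(1 for row in rows if row.get("org") in ALLOWED_ORGS)
    return valid / len(rows) if rows else 0.0


def evaluate(rows: list[dict[str, Any]]) -> dict[str, bool]:
    """Apply every rule and report pass/fail against the thresholds."""
    results = {
        f"complete:{column}": completeness(rows, column) >= COMPLETENESS_THRESHOLD
        for column in sorted(REQUIRED_COLUMNS)
    }
    results["accurate:org"] = accuracy(rows) >= ACCURACY_THRESHOLD
    return results


if __name__ == "__main__":
    sample = [
        {"employee_id": "e1", "org": "Finance", "cost_center": "cc-100"},
        {"employee_id": "e2", "org": "Unknown", "cost_center": ""},
    ]
    for rule, passed in evaluate(sample).items():
        print(f"{rule}: {'PASS' if passed else 'FAIL'}")
```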

Accessibility

Ensuring that our employees get access to the information they need while prioritizing security is a foundational element of our enterprise data strategy. Microsoft Fabric allows us to unify our organization’s siloed data in a single “mesh” that enables advanced analytics, data science, data visualization, and other connected scenarios.

Microsoft Purview then gives us the ability to democratize that data responsibly. By implementing a data mesh architecture, our employees can work confidently, unencumbered by siloed or inaccessible data, and with the assurance that the data they’re working with is secure.

A graphic shows how the data mesh architecture allows employees to access data they need, with platform services and data management zones surrounding this architecture.
The data mesh architecture enables our employees to do their work efficiently while preventing the data they’re working on from becoming siloed.

The data mesh connects and distributes data products across domains, enabling shared data access and compute while scaling beyond centralized architectures.

Platform services are standardized blueprints that embed security, interoperability, policies, standards, and core capabilities—providing guardrails that enable speed without fragmentation.

Data management zones provide centralized governance capabilities for policy enforcement, lineage, observability, compliance, and enterprise-wide trust.  

Governance

As organizations scale AI capabilities, strong governance becomes essential to ensure security, compliance, and ethical data use. Data governance—which includes establishing data policies, ensuring data privacy and security, and promoting ethical AI usage—is critical, as is compliance with regulations such as the General Data Protection Regulation (GDPR) and the Consumer Data Protection Act (CDPA).

However, governance is not only a technical capability; it’s also a cultural commitment.

Responsible data use must be embedded into the way teams manage data and build AI solutions. Through Microsoft Purview, we implemented an end-to-end governance framework that automates the discovery, classification, and protection of sensitive data across the enterprise data landscape.

This unified approach allows teams to innovate confidently, knowing that the data powering their insights and AI systems is trusted and protected, as well as responsibly managed.

“AI systems are only as reliable as the data that powers them,” Uribe says. “By investing in trusted and well-managed data, we accelerate not only the adoption of AI tools but our ability to generate meaningful insights and intelligent outcomes.”

The data catalog as the discovery layer

By serving as a common discovery layer for humans and AI, the data catalog ensures that governance translates directly into speed, accuracy, and trust at scale.

A unified data strategy only succeeds if both people and AI systems can consistently find the right data. At Microsoft, this is enabled by our enterprise data catalog, which operationalizes the standards set by our data council. 

For business users, the catalog provides intuitive search, ownership transparency, and trust signals—enabling confident self‑service analytics. For AI agents, the same catalog exposes machine‑readable metadata, allowing agents to programmatically discover canonical datasets, validate schema and freshness, and respect governance constraints.
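
As a rough sketch of that agent-facing pattern (the catalog entries, field names, and selection rules below are hypothetical, and a real agent would pull this metadata from the catalog’s API rather than a local list), an agent might filter candidate datasets on certification, classification, schema, and freshness before ever touching the data:

```python
# Illustrative sketch of agent-side dataset discovery against catalog metadata.
# The entries, field names, and selection rules are hypothetical; a real agent
# would fetch this metadata from the enterprise catalog's API instead.
from datetime import datetime, timedelta, timezone

NOW = datetime.now(timezone.utc)
MAX_STALENESS = timedelta(days=7)
ALLOWED_CLASSIFICATIONS = {"General"}  # what this agent is permitted to read

CATALOG = [
    {
        "name": "sales_orders_canonical",
        "certified": True,
        "classification": "General",
        "last_refreshed": NOW - timedelta(hours=12),
        "schema": ["order_id", "amount", "region"],
    },
    {
        "name": "sales_orders_scratch",
        "certified": False,
        "classification": "Confidential",
        "last_refreshed": NOW - timedelta(days=90),
        "schema": ["order_id", "amount"],
    },
]


def is_usable(entry: dict, required_columns: set[str]) -> bool:
    """Certified, permitted, schema-complete, and refreshed recently enough."""
    return (
        entry["certified"]
        and entry["classification"] in ALLOWED_CLASSIFICATIONS
        and required_columns.issubset(entry["schema"])
        and NOW - entry["last_refreshed"] <= MAX_STALENESS
    )


if __name__ == "__main__":
    needed = {"order_id", "amount", "region"}
    usable = [entry["name"] for entry in CATALOG if is_usable(entry, needed)]
    print("Datasets the agent may use:", usable)
```

The point is the order of operations: the agent consults governed metadata first, so constraints are respected before any data is retrieved.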

Our role as Customer Zero

In Microsoft Digital, we operate as Customer Zero for the company’s enterprise solutions, working through issues first so that our customers don’t have to.

That means we do more than adopt new products early. We deploy them at enterprise-scale, operate them under real‑world constraints, and hold them to the same standards our customers expect. The result is more resilient, ready‑to‑use solutions and a higher quality bar for every enterprise customer we serve.

A photo of Baccino.

“When we engage product teams with real telemetry from how data is created, governed, and consumed at scale, we move the conversation from theory to execution. That’s how enterprise readiness becomes real.”

Diego Baccino, principal software engineering manager, Microsoft Digital

Our data council embodies this Customer Zero mindset through its Enterprise Readiness initiative. By engaging product engineering as a unified enterprise voice, the council drives strategic conversations that surface operational blockers, influence roadmap prioritization, and ensure new and existing data solutions are truly ready for enterprise use.

These learnings are then shared broadly across Microsoft Digital to accelerate adoption, reduce duplication, and scale proven patterns across teams.

“When we engage product teams with real telemetry from how data is created, governed, and consumed at scale, we move the conversation from theory to execution,” says Diego Baccino, a principal software engineering manager in Microsoft Digital and a member of the council. “That’s how enterprise readiness becomes real.”

This work is deeply integrated with our AI Center of Excellence (CoE), where Customer Zero principles are applied to accelerate AI outcomes responsibly. Together, the AI CoE and the data council focus on improving data documentation and quality—foundational capabilities that are required to make AI feasible, trustworthy, and scalable across the enterprise.

By grounding AI innovation in measurable data quality and governance standards, Microsoft Digital ensures that experimentation can safely mature into production‑ready solutions. The partnership between our data council, our AI CoE, and our Responsible AI (RAI) Council is essential to our broader data and AI strategy.

“AI readiness isn’t aspirational—it’s operational,” Baccino says. “By measuring the health of our data, setting clear quality baselines, and using those signals to guide product and platform decisions, we turn data into a strategic asset and AI into a repeatable capability.”

Together, these teams exemplify what it means to be Customer Zero: Transforming enterprise experience into action, governance into acceleration, and data into durable competitive advantage.

Advancing our data culture

Our data council plays a pivotal role in advancing the organization’s transition from data literacy to enterprise data and AI capability. In conjunction with our AI CoE, it creates curricula and sponsors learning pathways, operational practices, and community programs to equip our employees with the skills and mindset required to thrive in a data- and AI-centric world.

While early efforts focused on improving data literacy, our data council’s mission has evolved to enable data and AI capability at scale together with our AI CoE—where employees not only understand data but can effectively apply it to build, operate, and govern intelligent solutions.

“Our focus is not just teaching our teams about data. It is enabling employees to apply data to create AI-driven outcomes. When teams understand how data powers AI systems, they can make better decisions, design better products, and build more responsible AI experiences.”

Miguel Uribe, principal PM manager, Microsoft Digital

Our curriculum includes high-level courses on data concepts, applications, and extensibility of AI tools like Microsoft 365 Copilot, as well as data products like Microsoft Purview and Microsoft Fabric.

By facilitating AI and data training, offering internally focused data and AI certifications, and fostering internal community engagement, our council ensures that employees develop the capabilities required to responsibly build and operate AI-powered solutions. Achieving data and AI certifications not only promotes career development through improved data literacy but also enhances the broader data-driven culture within our organization.

“We recognize that AI capability is built when data skills are applied directly to real AI scenarios and business outcomes—not when learning exists in isolation,” Uribe says. “Our focus is not just teaching our teams about data; it is enabling employees to apply data to create AI‑driven outcomes. When teams understand how data powers AI systems, they can make better decisions, design better products, and build more responsible AI experiences.”

Lessons learned

Our data council was created to develop and execute a cohesive data strategy across Microsoft Digital and to foster a strong data culture within our organization. Over time, several critical lessons have emerged.

Executive sponsorship enables transformation

Executive sponsorship is a key element to ensure implementation and adoption of a data strategy. Our leaders are committed to delivering and sustaining a robust data strategy and culture and have been effective champions of the council’s work.

“Leadership provides support and reinforcement of the council’s mission, as well as guidance and clarity related to diverse organizational priorities,” Baccino says.

Cross-functional collaboration accelerates impact

Our council’s work has also benefited from the diverse representation offered by different disciplines across our organization. Embracing diverse perspectives and understanding various organizational priorities is critical to implementing a successful data strategy and culture in a large and complex organization like Microsoft Digital.

Modern platforms allow for scalable AI productivity

Technology and architecture also play a critical role in enabling enterprise data and AI capability. Platforms like Microsoft Purview and Microsoft Fabric provide the governance, discovery, and analytics infrastructure required to create trusted, AI-ready data ecosystems.

Combined with strong leadership support and community engagement, these platforms allow our organization to move beyond isolated data projects toward connected, enterprise-wide intelligence.

As our organization continues to evolve, our data council’s strategic work and valuable insights will be crucial in shaping the future of data-driven decision making and AI transformation at Microsoft.

Key takeaways

Here are some things to keep in mind as you contemplate forming a data council to help you manage and scale AI impacts responsibly at your own organization:

  • A data mesh strikes the balance enterprises have been chasing. By formalizing domain ownership while enforcing standards through shared platforms, you avoid both chaotic decentralization and slow, over-centralized control.
  • Governance is an accelerator when it’s automated and embedded. Using platforms like Microsoft Purview and Microsoft Fabric, governance shifts from a manual gatekeeping function to a built‑in capability that enables faster, trusted analytics and AI.
  • AI systems are only as strong as their discovery layer. A unified enterprise data catalog allows both people and AI agents to find, trust, and use data consistently—turning standards into operational speed.
  • Customer Zero turns theory into enterprise‑ready execution. By operating its own data and AI platforms at scale, Microsoft Digital provides real telemetry and practical feedback that directly shapes product readiness.
  • Building AI capability is a cultural effort, not just a technical one. Our data council’s focus on applied learning, certification, and real-world AI scenarios ensures data skills translate into durable business outcomes.
  • AI scale exposes the cost of fragmented data ownership. A data council cuts through silos by aligning priorities, resolving tradeoffs, and concentrating investment on the data assets that matter most for AI impact.
  • Shared metrics create shared ownership. Publishing data quality and AI‑readiness scores at the leadership level reinforces accountability and positions data as a core enterprise asset.

Powering the new age of AI-led engineering in IT at Microsoft

When generative AI burst into the mainstream, it landed in our IT engineering organization like a shockwave.

There was excitement, curiosity, skepticism, and no shortage of questions about what this technology meant for the future of IT.

At Microsoft Digital—the company’s IT organization—we didn’t start with a grand transformation plan. Instead, we started with a realization: AI wasn’t just another tool to roll out. It was a fundamental shift in how engineering work could happen.

For years, our IT teams have been focused on scale, reliability, and operational excellence. Those priorities didn’t change. What changed were the possibilities.

Suddenly, engineers could draft code in seconds, summarize complex systems instantly, or automate work that had once consumed hours or days. It was an opportunity to take the skills and capabilities of our people and amplify them with AI.

That realization forced us to step back and ask harder questions.

How do you help thousands of engineers understand what AI can actually do to impact their day-to-day work? How do you move from experimentation to trust? And how do you adopt AI in a way that strengthens engineering fundamentals instead of eroding them?

The answer came in the form of a phased journey grounded in people, culture, and continuous learning.

Phase 1: Awareness and access

It might sound surprising when speaking about engineering processes, but our first challenge wasn’t technology; it was understanding.

When generative AI entered the conversation, most engineers saw the headlines and dabbled in various tools, but few understood fully what it meant for their work. Some were excited, others were wary. Many simply didn’t know where to start. That gap between awareness and practical value was the first barrier we had to address.

We realized early that top-down mandates wouldn’t work. Telling engineers to “use AI” without context or relevance would only deepen skepticism. Instead, we focused on something both simpler and more difficult: Exposure.

We started by making AI visible and accessible in the tools engineers already used. GitHub Copilot. Microsoft 365 Copilot. Early copilots embedded directly into engineering workflows. The goal wasn’t immediate productivity gains. It was familiarity. Letting engineers see, firsthand, what AI could and couldn’t do.

A photo of Singhal.

“We encouraged tool usage and adoption so people would at least play around with AI. And once they did, they started seeing the value. That’s when the mindset shifted from ‘AI might replace me’ to ‘AI can be my companion.’”

Mukul Singhal, partner group engineering manager, Microsoft Digital

Just as important, we talked openly about limitations.

AI wasn’t perfect. It hallucinated. It made confident mistakes. And that honesty mattered. By framing AI as an assistant, we reinforced the role of engineering judgment. Engineers didn’t need to fear losing control. They needed to understand how to stay in control.

We also made experimentation safe.

No quotas. No forced adoption metrics. Engineers were encouraged to try AI on low‑risk tasks: summarizing documentation, generating test cases, or exploring unfamiliar codebases. Small wins built confidence, confidence built curiosity, and curiosity drove organic adoption.

As that experimentation took hold, the mindset began to shift.

“We encouraged tool usage and adoption so people would at least play around with AI,” says Mukul Singhal, a partner group engineering manager in Microsoft Digital. “And once they did, they started seeing the value. That’s when the mindset shifted from ‘AI might replace me’ to ‘AI can be my companion.’”

Over time, conversations changed from ‘Should we use AI?’ to ‘Where does AI help most?’

Engineers began sharing prompts, tips, and lessons learned with one another. What started as individual exploration turned into community learning. Awareness gave way to momentum.

Phase one was about providing access to explore, to question, and to learn. And that foundation made everything that followed possible.

Phase 2: Culture shift

Access created awareness and awareness created curiosity.

As more engineers began experimenting with AI, we noticed a pattern. Some teams were moving faster, learning faster, and reducing friction in their day‑to‑day work. Others stalled after initial trials. The difference wasn’t technical skill or capability; it was mindset.

A photo of Mamilla.

“People started shifting from the mindset of ‘Will AI work?’ to ‘AI is working for me.’ I think that was a very transformational shift, to where I believe a lot of engineers in the organization started believing in AI.”

Veera Mamilla, principal group engineering manager, Microsoft Digital

To move forward, we had to shift how AI was perceived from something optional or experimental to something that was simply part of how modern engineering gets done.

That meant normalizing AI as a trusted partner in the engineering process.

Leaders played a critical role in that shift. Rather than positioning AI as a productivity shortcut, they framed it as a way to strengthen engineering fundamentals: clearer design discussions, better documentation, faster feedback loops, and more time for deep problem‑solving. The message was intentional and consistent. Using AI wasn’t about cutting corners; it was about reimagining how work gets done.

We also had to address a fear that surfaced early: that AI adoption was a signal of replacement rather than empowerment.

“People started shifting from the mindset of ‘Will AI work?’ to ‘AI is working for me,’” says Veera Mamilla, a principal group engineering manager in Microsoft Digital. “I think that was a very transformational shift, to where I believe a lot of engineers in the organization started believing in AI.”

That framing mattered.

As engineers incorporated AI into their workflows, success stopped being measured by output alone. The focus shifted to outcomes. Did AI help you understand a system faster? Did it surface risks earlier? Did it free up time to focus on higher‑value work?

Over time, AI stopped feeling like a novelty. It became part of the engineering fabric. We reinforced it through leadership modeling, peer learning, and shared success stories. Teams no longer asked whether AI belonged in their workflows. They asked how to use it responsibly and effectively.

Phase 3: Upskilling and role evolution

Once AI moved from curiosity to expectation, the challenge of skill building became unavoidable.

From the start, we made a deliberate choice: This would be an upskilling and reskilling journey, not a wholesale replacement of roles. The goal wasn’t a new workforce. It was an investment in the one we had.

That decision shaped everything that followed.

Early upskilling efforts focused on practical entry points. Prompt engineering. Tool literacy. Understanding how copilots and early agents behaved in real engineering workflows. We treated these as something every engineer needed to experiment with, regardless of discipline.

But it quickly became clear that skills alone weren’t the full story. Roles themselves were starting to evolve.

A photo of Singh.

“Your title might still be software engineer or principal engineer. But if you’re acting like an AI engineer, what does that actually mean? That question helped us start defining how these roles were evolving.”

Ragini Singh, partner group engineering manager, Microsoft Digital

Across software development, service engineering, and cloud network engineering, the work was shifting from manual execution toward orchestration and oversight. Engineers were no longer expected to do every task end‑to‑end by hand. Instead, they were learning how to guide AI, review its output, and decide where automation made sense and where it didn’t.

As part of this shift, we began researching how the industry itself was redefining engineering roles. Leaders examined emerging job descriptions from across the market and compared them with Microsoft’s own role frameworks. At the time, there was no formal “AI engineer” role in the internal job library. Rather than creating a new title, the focus stayed on evolving expectations within existing roles.

The idea of an “AI‑native engineer” emerged not as a job description, but as a mindset.

An AI‑native engineer still understands systems, architecture, and risk. What’s different is how that expertise gets applied. Routine tasks are delegated to AI. Judgment, design, and accountability stay with the human. Engineers move from doing all the work themselves to supervising work done in partnership with AI.

“Your title might still be software engineer or principal engineer,” says Ragini Singh, a partner group engineering manager in Microsoft Digital. “But if you’re acting like an AI engineer, what does that actually mean? That question helped us start defining how these roles were evolving.”

This evolution looked different across disciplines. Software engineers focused on AI‑assisted coding, test generation, and spec‑driven development. Service engineers leaned into AI for incident response, knowledge capture, and operational decision support. Cloud network engineers began moving from manual intervention toward intelligent orchestration and agent‑assisted troubleshooting. The common thread wasn’t identical tooling; it was a shared shift toward higher‑order work and reduced toil.

Phase 4: Embedding AI across the engineering lifecycle

By this phase, we knew individual productivity gains were simply the starting point for larger and broader benefits.

Early on, most AI usage showed up in familiar places: Code suggestions, documentation summaries, quick answers. Useful, but fragmented. The bigger opportunity emerged when we stepped back and asked a harder question: What would it look like if AI were embedded across the entire engineering lifecycle, not just used at isolated moments?

We stopped thinking in terms of tools and started thinking in terms of flow. Design. Build. Test. Deploy. Operate. Improve. AI needed to show up across all of it, in ways that reinforced how engineers already worked.

A photo of Sadasivuni.

“If AI is only showing up at one step, you don’t get the full value. The real impact comes when it’s integrated across the lifecycle, where engineers can design, build, operate, and learn faster as a system.”

Sudhakar Sadasivuni, principal group engineering manager, Microsoft Digital

In software engineering, that meant pulling AI earlier into the process. We began using it to help draft requirements, reason through design options, and review code with broader system context to accelerate how quickly we could get to informed decisions. Coding assistance mattered, but it was no longer the center of gravity.

Testing and quality followed a similar pattern. AI supported test generation, defect analysis, and code review, reducing repetitive effort and helping issues surface sooner. That gave engineers more time to focus on quality and architecture instead of cleanup.

In service engineering, we embedded AI into incident management and operational workflows. Engineers used it to summarize incidents, surface relevant knowledge, and analyze signals across systems. In cloud network engineering, AI helped shift work away from manual intervention toward orchestration and intelligent troubleshooting. Across disciplines, the principle stayed the same: AI should reduce friction, not introduce it.

As we scaled this approach, one thing became clear. Embedding AI wasn’t just a technical exercise. It was a systems change.

“If AI is only showing up at one step, you don’t get the full value,” says Sudhakar Sadasivuni, a principal group engineering manager in Microsoft Digital. “The real impact comes when it’s integrated across the lifecycle, where engineers can design, build, operate, and learn faster as a system.”

As AI became part of core workflows, engineers remained accountable for outcomes. AI output was reviewed, tested, and validated like any other engineering input. Embedding AI didn’t lower the bar for rigor. It raised expectations around judgment, oversight, and data quality. We became more deliberate about responsibility and governance.

Over time, these integrations created compound benefits.

Faster design cycles reduced downstream rework. Better testing lowered operational noise. Improved operational insight shortened recovery times. AI stopped being something we used occasionally and became something the engineering system itself was built around.

Phase 5: Eliminating toil and accelerating outcomes

At some point, every AI story hits the same test. Does it actually make engineers’ days better? For us, that proof showed up fastest in the elimination of toil.

Across Microsoft Digital, engineers have always spent time on work that was necessary but draining: manual troubleshooting, repetitive diagnostics, log analysis, and routine operational tasks that kept systems running but didn’t move the organization forward.

AI gave us a chance to change that.

A photo of Garrison.

“Toil reduction is the biggest thing. That’s where engineers’ eyes light up. If we can eliminate toil, engineers will flock to use AI. I really believe it.”

Beth Garrison, principal cloud network engineer, Microsoft Digital

In cloud network engineering, for example, troubleshooting used to require manually reconstructing what happened, such as logging into devices, chasing configurations, and piecing together context after the fact. As we began introducing agents and machine learning into these workflows, that work shifted. Instead of spending time assembling the picture, engineers could generate the views they needed faster and focus on resolving issues.

The same shift showed up in how we used operational data.

Rather than reacting to incidents after impact, we started using machine learning to analyze logs, identify patterns, and surface anomalies earlier. That moved teams from reactive response toward proactive monitoring and prevention.
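
As a deliberately simplified illustration of that shift (our production models are more sophisticated, and the error counts and thresholds below are made up), the core idea is to learn a baseline from recent history and flag points that deviate sharply from it:

```python
# A deliberately simple stand-in for surfacing log anomalies early. The series,
# window size, and threshold are hypothetical; this only shows the shape of
# "learn a baseline, then flag points that deviate from it."
import statistics

# Hypothetical hourly error counts from a service's logs.
error_counts = [12, 9, 14, 11, 10, 13, 12, 95, 11, 10]

WINDOW = 6          # hours of history used as the baseline
THRESHOLD = 3.0     # how many standard deviations counts as anomalous


def find_anomalies(series: list[int]) -> list[int]:
    """Return indexes whose value deviates sharply from the trailing window."""
    anomalies = []
    for i in range(WINDOW, len(series)):
        window = series[i - WINDOW : i]
        mean = statistics.mean(window)
        stdev = statistics.pstdev(window) or 1.0  # avoid division by zero
        if abs(series[i] - mean) / stdev > THRESHOLD:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    for idx in find_anomalies(error_counts):
        print(f"Hour {idx}: {error_counts[idx]} errors looks anomalous")
```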

One thing became clear very quickly: Toil reduction wasn’t just a benefit; it was the catalyst for adoption.

“Toil reduction is the biggest thing. That’s where engineers’ eyes light up,” says Beth Garrison, a principal cloud network engineer at Microsoft Digital. “If we can eliminate toil, engineers will flock to use AI. I really believe it.”

Service engineering followed a similar arc.

Across governance, operations, productivity, and cost management, we began applying agents and automation to simplify complex work and reduce manual review cycles. Governance and compliance workflows became faster and more consistent. Operational processes benefited from guided remediation and earlier insight. Knowledge capture improved as documentation and remediation guidance could be generated and updated automatically.

When we removed repetitive work such as manual triage, rote diagnostics, and endless documentation cleanup, we transformed how engineers spent their time. More focus on design. More proactive problem‑solving. More energy directed toward improving systems instead of just maintaining them.

Toil reduction made the value of AI tangible. It’s the moment AI stopped being interesting and became indispensable, and our engineering teams started asking where else they could apply it next.

Measuring what matters

By the time AI was embedded across our engineering lifecycle, a new question came into focus: “How do we know it’s working?”

In the early days, we paid close attention to usage: which tools engineers were trying, where adoption was growing, and where it stalled. Those signals mattered, and adoption was the leading indicator that people were getting comfortable and starting to integrate AI into real work.

“Adoption was always the starting point. But we were clear from the beginning that usage isn’t the destination. The real goal is impact: more time for engineers to focus on the work that truly matters.”

Ullas Kumble, principal group software engineering manager, Microsoft Digital

But using AI doesn’t automatically mean better outcomes. So, we shifted the conversation and started asking, “What’s different now that our engineers are using AI?”

That change reframed how we thought about measurement. We began looking beyond tool activity to understand impact across the engineering system. Faster design cycles. Earlier defect detection. Reduced time spent on repetitive operational work. Shorter incident resolution. Clearer documentation. Fewer handoffs. Less rework.

These weren’t abstract metrics. They showed up in the flow of work.

We were intentional about not forcing a single definition of value across every role. Software engineers, service engineers, and cloud network engineers experience impact differently. What mattered was that each team could point to tangible improvements in how work moved through the system.

That perspective shaped how leadership talked about success.

“Adoption was always the starting point,” says Ullas Kumble, a principal group software engineering manager at Microsoft Digital. “But we were clear from the beginning that usage isn’t the destination. The real goal is impact: more time for engineers to focus on the work that truly matters.”

Over time, this approach changed the quality of our conversations. Instead of debating whether AI was worth the investment, teams talked about where it was removing friction and where it still wasn’t delivering enough value. Measurement became a tool for learning and prioritization.

Moving forward

Looking ahead, one lesson stands out: this journey isn’t complete.

AI tools will continue to evolve. Agents will become more capable. Roles will keep shifting. What it means to be an engineer will continue to change. And that means our approach must stay grounded in the same principles that guided us from the start: invest in people, reinforce fundamentals, embed AI into real workflows, and stay honest about what’s working and what isn’t.

We didn’t set out to build an AI‑driven engineering organization overnight; we built it phase by phase.

By meeting engineers where they were.
By reshaping culture before redefining roles.
By embedding AI across the lifecycle, not bolting it on.
By reducing toil and measuring impact where it mattered most.

The result is better engineering: powered by AI, guided by human judgment, and built to keep evolving.

Key takeaways

Here’s a set of approaches you can take to establish AI-led engineering for your organization:

  • Start with access and understanding. Give engineers safe, easy access to AI in the tools they already use so curiosity and confidence can develop organically before you push for outcomes.
  • Frame AI as a partner, not a replacement. Position AI as an assistant that strengthens engineering judgment and fundamentals rather than a shortcut or a threat to roles.
  • Normalize experimentation without pressure. Encourage low‑risk experimentation and peer sharing instead of mandates, allowing adoption to grow through visible, practical wins.
  • Invest in upskilling. Focus on evolving skills and expectations within existing roles so engineers learn how to guide, review, and stay accountable for AI‑assisted work.
  • Embed AI across the full engineering lifecycle. Look beyond isolated productivity gains and integrate AI into design, build, test, operate, and improve workflows to unlock system‑level impact.
  • Measure impact where engineers feel it. Move past usage metrics and track outcomes like reduced toil, faster feedback, and improved flow so teams can see where AI is truly making work better.

Try it out

Try GitHub Copilot.

Protecting anonymity at scale: How we built cloud-first hidden membership groups at Microsoft

Some Microsoft employee groups can’t afford to be visible.

For years, we have supported email‑based communities here at Microsoft whose very existence depends on anonymity. These include employee resource groups, confidential project teams, and other sensitive audiences where simply revealing who belongs can create real‑world risk.

Traditional distribution groups make membership discoverable by default. Owners can see members. Admins can see members. In some cases, other users can infer membership through directory queries or tooling.

That model doesn’t work when anonymity is a requirement.

A photo of Reifers.

“When the SFI wave hit, it was made clear to us that we needed to keep our people safe, and to do that, we needed to build a new hidden memberships group MVP. We needed to raise the bar with modern groups, and we needed to do it in six months or miss meeting our goals.”

Brett Reifers, senior product manager, Microsoft Digital

For over 15 years, we relied on a custom, on‑premises solution that enabled employees to send and receive messages through groups with fully hidden memberships.

The system worked, but we were deprecating the Microsoft Exchange servers that it ran on. At the same time, we were also deploying our Secure Future Initiative (SFI), which required us to reassess legacy systems that could expose sensitive data or slow incident response, including hidden membership groups.

The system wasn’t broken, but it represented concentrated risk simply by existing outside our modern cloud controls and monitoring.

“When the SFI wave hit, it was made clear to us that we needed to keep our people safe, and to do that, we needed to build a new hidden memberships group MVP,” says Brett Reifers, a product manager in Microsoft Digital, the company’s IT organization. “We needed to raise the bar with modern groups, and we needed to do it in six months or miss meeting our goals.”

The mandate was clear. Preserve anonymity, eliminate on‑premises dependencies, and do it quickly.

A photo of Carson.

“Our solution would enable us to deprecate our legacy on-premises Exchange hardware while maintaining the privacy of our employee groups, and it would do so in a cloud-first manner.”

Nate Carson, principal service engineer, Microsoft Digital

Instead of retrofitting hidden membership into standard Microsoft 365 groups, we asked a different question: What if the group lived somewhere else entirely? What if users interacted with a simple, secure front end, while all membership expansion and mail flow occurred in a locked‑down tenant built specifically for this purpose?

That idea became the foundation for Hidden Membership Groups: A new cloud‑first architecture that would separate user experience, leverage first‑party Microsoft services, and keep our group memberships hidden from everyone—including owners and administrators—by design.

“Our solution would enable us to deprecate our legacy on-premises Exchange hardware while maintaining the privacy of our employee groups, and it would do so in a cloud-first manner,” says Nate Carson, a principal service engineer in Microsoft Digital.

Once we settled on a solution, our next step was to get support for solving a problem not many people thought much about.

“Not everyone was aware of how serious of a situation we were in,” Carson says. “We had to show everyone what was at stake, and to share our solution with them.”

After taking their plan on the road, the team got the buy-in it needed, and that’s when the real work started.

Planning to solve business problems with security built-in

Before we designed anything, we had to be clear about what success meant.

Hidden Membership Groups aren’t just another collaboration feature. They support scenarios where anonymity isn’t optional—it’s foundational. That reality shaped every requirement that we built into our solution, including:

1. Absolute privacy

Group membership couldn’t be visible to users, group owners, or administrators—under any circumstances. That requirement immediately ruled out standard group models.

2. Cloud only

Any new solution had to live entirely in our cloud, use first‑party services, and align with modern identity, security, and compliance practices. On‑premises infrastructure wasn’t an option.

3. Scale

Some groups had a handful of members. Others had tens of thousands. Membership changed frequently, and those changes had to propagate safely and predictably without exposing data or degrading performance.

4. Separation of concerns

User interaction and membership truth couldn’t live in the same place. Employees needed a simple way to discover groups, request access, and manage participation, without ever interacting with the system that stored or expanded membership.

5. Self‑service with guardrails

The solution needed to reduce operational overhead, not introduce a new bottleneck. Group lifecycle management had to be automated, auditable, and secure, while still giving teams flexibility.

6. Simple to use

Employees shouldn’t need special training. They shouldn’t need to understand tenants, identity synchronization, or mail routing. The experience needed to be intuitive, consistent, and accessible—without compromising security.

Once those requirements were clear, our solution started to emerge. Incremental changes wouldn’t be enough. A traditional group model wouldn’t work. The solution required a new architecture—one designed around isolation, automation, and intentional limitation.

That’s when we started the engineering work.

Creating a cloud-first architecture

Designing for hidden membership meant eliminating ambiguity. If any surface could reveal membership, even indirectly, it didn’t belong in the design.

That constraint led us toward a model built on strict isolation, explicit APIs, and intentionally narrow interfaces. The result is straightforward to use, but deliberately difficult to interrogate.

Two tenants, with sharply separated responsibilities

At the foundation of the solution is a two‑tenant model.

Our primary Microsoft 365 tenant is where employees authenticate, discover groups, and initiate actions. A secondary, isolated tenant hosts the distribution lists and performs mail expansion for Hidden Membership Groups.

A photo of Mace.

“Tenant isolation is what makes the privacy guarantee real. By moving membership expansion to a tenant that users and owners can’t access, we removed the possibility of accidental exposure. The system simply doesn’t give you a place where membership can be seen.”

Chad Mace, principal architect, Microsoft Digital

That separation matters because the secondary tenant isn’t designed for interactive use. Only Exchange and the minimum directory constructs required for mail routing and expansion are enabled.

Operationally, when an employee sends email to a Hidden Membership Group, they send to a mail contact visible in the corporate tenant. That contact routes to the corresponding distribution group in the isolated tenant, where membership expansion occurs. Expanded messages are then delivered back to recipients’ inboxes in the corporate tenant, so sent and received mail lives where users already work.

“Tenant isolation is what makes the privacy guarantee real,” says Chad Mace, a principal architect in Microsoft Digital. “By moving membership expansion to a tenant that users and owners can’t access, we removed the possibility of accidental exposure. The system simply doesn’t give you a place where membership can be seen.”

Identity without interactive access

This isolated tenant only works if it can resolve recipients. To enable that, our development team used Microsoft Entra ID multi‑tenant organization identity sync to represent corporate users in the secondary tenant.

These identities are treated as business guest identities, and we disable sign‑in to prevent interactive access. The tenant can perform expansion, but nothing more.
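
Our automation handles this inside managed pipelines, but as a rough sketch of the underlying operation, Microsoft Graph exposes an accountEnabled property on user objects that can be set to false to block interactive sign-in. The access token and object ID below are placeholders, and this is not our production code:

```python
# A minimal sketch (not our production automation) of blocking interactive
# sign-in for a directory identity by setting accountEnabled to false through
# Microsoft Graph. The access token and object ID below are placeholders.
import json
import urllib.request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"
ACCESS_TOKEN = "<token acquired via managed identity or app registration>"
USER_OBJECT_ID = "<object id of the synced guest identity>"


def disable_sign_in(user_object_id: str) -> int:
    """PATCH the user object so the identity can be resolved but never sign in."""
    request = urllib.request.Request(
        url=f"{GRAPH_BASE}/users/{user_object_id}",
        data=json.dumps({"accountEnabled": False}).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:  # 204 No Content on success
        return response.status


if __name__ == "__main__":
    print("Graph returned HTTP", disable_sign_in(USER_OBJECT_ID))
```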

However, complete isolation wasn’t technically possible. Privileged access always exists at some level. The design response was to minimize that exposure. Access to the isolated tenant is tightly restricted, and membership changes flow through automation rather than broad UI-based administration.

The goal: reduce exposure to the smallest viable operational group.

API-first automation as the control plane

With the tenancy and identity model established, the team needed a single, consistent way to create groups, connect objects across tenants, and manage changes without introducing new administrative workflows. That’s where the APIs come in.

A photo of Pena II.

“We split the backend into multiple APIs so the system could scale without becoming fragile. That let us separate everyday operations from high-volume membership work and keep performance predictable.”

John Pena II, principal software engineer, Microsoft Digital

The backend is intentionally modular, split into three distinct APIs:

  • The control API handles group creation, configuration, and cross‑tenant coordination.
  • The membership API handles standard add and remove operations.
  • The bulk membership APIs handle large‑scale operations involving tens of thousands of users, with services designed to run long‑lived jobs, manage throttling, and recover from partial failures.

“We split the backend into multiple APIs so the system could scale without becoming fragile,” says John Pena II, a principal software engineer in Microsoft Digital. “That let us separate everyday operations from high-volume membership work and keep performance predictable.”

The APIs run as PowerShell-based Azure Functions and use managed identity patterns, including federated identity credentials, to securely connect across tenants.
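
Our implementation is PowerShell-based, but the batching and retry shape behind the bulk membership APIs can be sketched in a few lines of Python. The batch size, backoff policy, and the add_members_batch stand-in below are illustrative placeholders, not the real service:

```python
# Illustrative only: the batching/retry shape behind large membership jobs.
# Our real implementation runs as PowerShell-based Azure Functions; the batch
# size, backoff policy, and add_members_batch stand-in below are hypothetical.
import time


def add_members_batch(group_id: str, member_ids: list[str]) -> None:
    """Placeholder for the directory call that adds a batch of members."""
    print(f"Added {len(member_ids)} members to {group_id}")


def run_bulk_add(group_id: str, member_ids: list[str],
                 batch_size: int = 20, max_retries: int = 5) -> None:
    """Process a large membership change in batches, backing off when throttled."""
    for start in range(0, len(member_ids), batch_size):
        batch = member_ids[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                add_members_batch(group_id, batch)
                break                      # batch committed; move to the next one
            except RuntimeError:           # stand-in for an HTTP 429 / throttling error
                time.sleep(2 ** attempt)   # exponential backoff before retrying
        else:
            # Record the failed batch so a later run can resume from here
            # instead of replaying the whole job (partial-failure recovery).
            print(f"Giving up on batch starting at {start}; queued for retry")


if __name__ == "__main__":
    members = [f"user{i}@contoso.com" for i in range(45)]
    run_bulk_add("hidden-group-123", members)
```

Separating this long-running, failure-tolerant path from the everyday add and remove operations is what keeps routine work fast while changes involving tens of thousands of members run safely in the background.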

Creating the user experience with Power Apps

For the front end, we built a Canvas app in Power Apps, backed by Dataverse. The goal was speed and flexibility, without compromising strict privacy boundaries.

By using Power Apps as the primary interaction layer, we deliver a secure, modern experience without unnecessary custom infrastructure. The Canvas app provides a single, focused surface for discovering, joining, and managing hidden membership groups, while all sensitive operations remain behind controlled APIs and tenant boundaries. This separation allows the team to iterate quickly on experience design without weakening the privacy guarantees that the solution depends on.

Power Platform also simplifies how security is enforced across the solution. Dataverse enables fine‑grained, role‑based access, ensuring users only see data they’re entitled to see—while keeping sensitive membership information entirely out of the client layer. That reduces long‑term maintenance overhead and makes it easier to evolve the solution as requirements change.

“From the beginning, we designed everything with security roles and workflows in mind,” says Shiva Krishna Gollapelly, a senior software engineer in Microsoft Digital. “Dataverse let us control who could see or change data without building additional APIs or storage layers, and keeping everything inside the Power Apps ecosystem saved us a lot of maintenance over time.”

Dataverse plays a precise role here: it maintains the datastore the app needs to function without becoming a secondary membership repository.

A photo of Amanishahrak.

“Using the Power Platform let us move fast, integrate deeply with Microsoft identity, and enforce security without building a full web stack from scratch.”

Bita Amanishahrak, software engineer II, Microsoft Digital

From a security posture perspective, Dataverse security is used intentionally to restrict what different users can see and do, and the Power App was developed with security roles and workflows in mind.

Short version: the app brokers intent, the APIs execute it, and all the pieces that need to stay separate do exactly that.

“Using the Power Platform let us move fast, integrate deeply with Microsoft identity, and enforce security without building a full web stack from scratch,” says Bita Amanishahrak, a software engineer in Microsoft Digital.

The architectural intent is consistent throughout—isolate the sensitive plane and ensure the user plane operates only through controlled interfaces.

Benefits and impact

The most important outcome of the new architecture is also the simplest: Hidden membership stays hidden.

Anonymity isn’t enforced by policy. It’s enforced by architecture. Membership data never appears in the user experience or administrative tooling, and it doesn’t surface as a side effect of scale.

“We’re no longer asking people to trust that we’ll handle sensitive membership carefully through process,” Reifers says. “The system makes exposure structurally impossible.”

The impact was immediate.

At launch, we migrated more than 2,200 hidden membership groups, representing over 200,000 users, from the legacy on‑premises system into the new cloud‑first architecture. Groups ranged from small, tightly controlled communities to audiences with tens of thousands of members, all supported without special handling.

“Some of these groups are massive,” Pena says. “We knew from the beginning we were dealing with memberships in the tens of thousands, which is why we designed bulk operations as a first‑class capability instead of an afterthought.”

The separation between routine APIs and bulk‑membership APIs proved critical, enabling large migrations and ongoing changes without degrading day-to-day performance.

Operationally, moving to a cloud‑only model reduced both risk and complexity. Decommissioning the on‑premises Exchange infrastructure eliminated specialized maintenance requirements and brought monitoring, auditing, and access controls into alignment with our modern cloud standards.

Delivery speed also mattered. Driven by Secure Future Initiative urgency and strong executive sponsorship, the team designed and delivered a minimum viable product in less than six months.

“That timeline forced discipline,” Reifers says. “We focused on what mattered: Security, privacy guarantees, scale, and a UX that wouldn’t disrupt the group owners and members who had relied on a 15-year-old tool.”

Everything else was secondary.

A photo of Gollapelly.

“Most users never think about tenants or APIs. They just see a clean experience that does what they need, without exposing anything it shouldn’t.”

Shiva Krishna Gollapelly, senior software engineer, Microsoft Digital

From an employee perspective, the experience became simpler and safer. Users now interact through a Power Platform app consistent with the rest of Microsoft 365.

Discovering a group, requesting access, or leaving a group no longer requires understanding the architecture behind it.

“Most users never think about tenants or APIs,” Gollapelly says. “They just see a clean experience that does what they need, without exposing anything it shouldn’t.”

The result is sustainable. The platform protects anonymity at scale, simplifies operations, boosts resiliency, and can evolve without reopening core privacy questions.

Moving forward

Delivering the initial solution was only the beginning.

The team sees Hidden Membership Groups as more than a single solution. It’s a reusable pattern for sensitive collaboration in a cloud‑first world: isolate what matters most, automate everything else, and design experiences that don’t require trust to be safe.

As adoption grows, the team plans to support additional anonymity-sensitive scenarios while maintaining the same underlying model.

“We don’t want every sensitive scenario inventing its own workaround,” Mace says. “This gives us a pattern we can reuse confidently.”

Future priorities include improving lifecycle and ownership experiences, strengthening auditing and reporting for approved administrators, and enhancing self‑service workflows—without compromising membership privacy. If it risks exposing membership, it doesn’t ship.

With the legacy system fully retired, Reifers reflects on what the team accomplished to get here.

“We shipped a new enterprise pattern in six months using our first-party tools,” Reifers says. “We achieved this because a stellar team cared about the mission. That’s the takeaway.”

Key takeaways

Use these tips to strengthen your privacy, simplify your operations, and future-proof your organization’s collaboration systems:

  • Prioritize privacy by design. Embed privacy considerations from the start to protect sensitive information in all collaboration scenarios.
  • Architect for scale. Treat bulk operations as a first-class capability so you can support large groups efficiently.
  • Automate and modernize workflows. Replace legacy systems with cloud-native solutions to reduce risk, improve transparency, and enable continuous improvement.
  • Streamline user experience. Provide intuitive, consistent interfaces that make it easy for users to access, join, or leave groups without requiring technical knowledge.
  • Enforce strict access and auditing controls. Align monitoring and administration with modern cloud standards to maintain security and accountability.
  • Create reusable patterns. Establish and share successful privacy patterns to avoid reinventing solutions for each new case.
  • Focus on operational simplicity and resilience. Design systems that are easy to maintain and improve, freeing up teams to concentrate on innovation rather than upkeep.

The post Protecting anonymity at scale: How we built cloud-first hidden membership groups at Microsoft appeared first on Inside Track Blog.

Protecting AI conversations at Microsoft with Model Context Protocol security and governance http://approjects.co.za/?big=insidetrack/blog/protecting-ai-conversations-at-microsoft-with-model-context-protocol-security-and-governance/ Thu, 12 Feb 2026 17:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=22324 When we gave our Microsoft 365 Copilot agents a simple way to connect to tools and data with Model Context Protocol (MCP), the work spoke for itself. Answers got sharper. Delivery sped up. New patterns of development emerged across teams working with Copilot agents. That ease of communication, however, comes with a responsibility: Protect the […]

The post Protecting AI conversations at Microsoft with Model Context Protocol security and governance appeared first on Inside Track Blog.

When we gave our Microsoft 365 Copilot agents a simple way to connect to tools and data with Model Context Protocol (MCP), the work spoke for itself.

Answers got sharper. Delivery sped up. New patterns of development emerged across teams working with Copilot agents.

That ease of communication, however, comes with a responsibility: Protect the conversation.

Questions came up: Who’s allowed to speak? What can they say? And what should never leave the room?

Microsoft Digital, the company’s IT organization, and the Chief Information Security Officer (CISO) team, our internal security organization, are leaning on those questions to help us shape our strategy and tooling around MCP internally at Microsoft.

A photo of Kumar.

“With MCP, the problem is not the inherent design; it’s that every improper server implementation becomes a potential vulnerability. Even one misconfigured server can give the AI the keys to your data.”

Swetha Kumar, security assurance engineer, Microsoft CISO

Our approach is intentionally straightforward.

Start secure by default. Use trusted servers. Keep a living catalog so we always know which voices are in the room. Shape how agents communicate by requiring consent before making changes.

We minimize what’s shared outside our walls, watch for drift, and act when something looks off. Our goal is practical governance that lets builders move fast while keeping our data safe.

A single misconfigured server is the risk we design for, and it’s why our controls prioritize clear ownership, simple choices, and visible guardrails.

“With MCP, the problem is not the inherent design; it’s that every improper server implementation becomes a potential vulnerability,” says Swetha Kumar, a security assurance engineer in the Microsoft CISO organization. “Even one misconfigured server can give the AI the keys to your data.”

Understanding MCP and the need for security

MCP is a simple standard that lets AI systems “talk” to the right tools and data without custom integration work. Think of it like USB‑C for AI. Instead of building a new connection every time, teams plug into a common pattern. That standardization delivers speed and flexibility—but it also changes the security equation.

Before MCP, every integration was its own isolated conversation.

“Now, one pattern can unlock many systems,” Kumar says. “It’s a win and a risk. When AI can reach more systems with less effort, we must be precise about who’s allowed to speak, what they can say, and how much gets shared.”

We frame this as communications security.

The question isn’t just, “Is this API secure?” It’s “Is this a conversation we trust?” We want to know which servers are in the room, what actions they’re permitted to take, and how we’ll notice if something changes. At the same time, we keep the cognitive load low for builders. They choose from trusted options, see clear prompts before an agent makes edits, and move on. Simple choices lead to safer outcomes.

“MCP enables granular control over the tools and resources exposed to the Large Language Model,” Kumar says. “But that means the developer is responsible for configuring it correctly—which tools an agent can see, what actions a server can take, and what context is shared.”

This approach helps both sides.

Product teams get a consistent way to extend their agents while security teams get consistent places to add guardrails—at discovery, access, and throughout the flow of requests and responses. Everyone operates from the same playbook.

When we treat MCP this way, we protect the conversation without slowing it down. We know who’s speaking. We know what they can do. And we can prove it.

Assessing MCP security across four layers

Every MCP session creates a conversation graph. An agent discovers a server, ingests its tool descriptions, adds credentials and context, and starts sending requests. Each step—metadata, identity, content, and code—introduces potential risk.

We evaluate those risks across four layers so we can catch failures early, contain blast radius, and keep conversations in bounds.

However, the big picture is just as important as the details.

“We take a holistic view of MCP security: start with the ecosystem, then specify controls across the four layers,” Kumar says. “The layers make the work concrete, but the goal stays the same—unified governance, shared education, and faster detect-and-mitigate when a server is at risk.”

Applications and agents layer

This is where user intent meets execution. Agents parse prompts, discover tools, select actions, and request changes. MCP clients live here, deciding which servers to trust and when to ask for user consent.

  • What can go wrong
    • Tool poisoning or shadowing. A server advertises safe‑looking actions but performs something else.
    • Silent swaps. A tool’s metadata changes and the client keeps trusting an altered “voice.”
    • No sandbox. The agent can request edits or run code without strong guardrails.
  • What we watch for
    • Unexpected tool descriptions or capabilities at connect time.
    • Edit attempts on critical resources without explicit user consent.
    • Abnormal tool‑selection patterns across sessions.

AI platform layer

The AI platform layer includes the AI models and runtimes that interpret prompts and call tools, along with orchestration logic and safety features.

  • What can go wrong
    • Model supply‑chain drift. Unvetted models, unsafe updates, or compromised fine‑tunes change behavior.
    • Prompt injection via tool text. Descriptions and responses steer the model toward unsafe actions.
  • What we watch for
    • Model provenance and update cadence tied to agent behavior changes.
    • Signals of jailbreaks or instruction overrides in prompts and intermediate messages.
    • Output drift linked to specific tools or servers.

Data layer

This layer covers business data, files, and secrets the conversation can touch.

  • What can go wrong
    • Context oversharing. Session data, files, or secrets get packed into the model’s context and leak to a third‑party server.
    • Over‑scoped credentials. Long‑lived tokens, broad scopes, or wrong audience claims enable lateral movement.
  • What we watch for
    • Size and sensitivity of context passed to tools.
    • Token hygiene, including short lifetimes, least‑privilege scopes, and correct audience claims.
    • Data egress patterns that don’t match a tool’s declared purpose.

Infrastructure layer

The infrastructure layer includes compute, network, and runtime environments.

  • What can go wrong
    • Local servers with too much reach. Excessive access to environment variables, file systems, or system processes.
    • Cloud endpoints without a gateway. No TLS enforcement, rate limiting, or centralized logging.
    • Open egress. Servers call out to the internet where they shouldn’t.
  • What we watch for
    • All remote MCP servers registered behind the API gateway.
    • Runtime signals, such as authentication failures, burst traffic, or unusual geographies.
    • Network policies that restrict outbound calls to certain targets.

Across all four layers, the throughline is AI communications security. We decide who can speak and verify what was said—and keep listening for change.

Establishing a secure-by-default strategy

We start by closing the front door. We recommend that every remote MCP server sit behind our API gateway, giving us a single place to authenticate, authorize, rate-limit, and log. There are no direct calls and no blind spots.

A photo of Enjeti.

“Everything we do starts with securing the MCP server by default and that begins by registering it in API Center for easier discovery. We rely solely on vetted and attested MCP servers, ensuring every call comes from a trusted footprint.”

Prathiba Enjeti, principal PM manager, Microsoft CISO

Next, we decide who gets a voice.

Teams choose from a vetted list of MCP servers. If someone connects to an unapproved endpoint, they receive a friendly nudge and a clear path to register it. No shaming—just fast correction and a better inventory the next time around.

Identity comes next. Servers expect short‑lived, least‑privilege tokens with the right scopes and audience. Admin paths require strong authentication, and where possible, we use proof‑of‑possession to bind tokens to the client and reduce replay risk. Secrets don’t live in code, keys rotate, and audit trails are in place.
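
To make that token hygiene concrete, here is a minimal sketch of the kinds of checks a server could run on each call: correct audience, a short lifetime, and least-privilege scopes. It uses the PyJWT library; the audience value, scope names, and one-hour cap are placeholders for illustration, not our actual configuration.

```python
# Minimal sketch of token hygiene checks, assuming PyJWT (pip install pyjwt).
# The audience, scope names, and one-hour lifetime cap are illustrative
# placeholders, not Microsoft's real configuration.
import jwt  # PyJWT

EXPECTED_AUDIENCE = "api://example-mcp-gateway"          # hypothetical audience
ALLOWED_SCOPES = {"mcp.tools.read", "mcp.tools.invoke"}  # hypothetical scopes
MAX_LIFETIME_SECONDS = 3600                              # short-lived tokens only

def validate_token(token: str, public_key: str) -> dict:
    """Reject tokens with the wrong audience, a long lifetime, or broad scopes."""
    # jwt.decode verifies the signature, expiry, and audience claim.
    claims = jwt.decode(token, public_key, algorithms=["RS256"],
                        audience=EXPECTED_AUDIENCE)
    if claims["exp"] - claims["iat"] > MAX_LIFETIME_SECONDS:
        raise PermissionError("Token lifetime exceeds the short-lived limit")
    requested = set(claims.get("scp", "").split())
    if not requested or not requested <= ALLOWED_SCOPES:
        raise PermissionError(f"Over-scoped token: {sorted(requested - ALLOWED_SCOPES)}")
    return claims
```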

“Everything we do starts with making the MCP server secure by default and that begins by registering it in API Center for easier discovery,” says Prathiba Enjeti, a principal product manager in the Microsoft CISO organization. “We only use vetted and attested MCP servers. That’s how we keep the conversation safe without slowing it down.”

On the client side, we slow agents at the right moments. Agents can’t touch high‑risk tools without explicit consent. Tool descriptions are verified on connection and compared to approved contracts. If a tool’s “voice” drifts, we block the call.
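
As a rough illustration of that check, a client can fingerprint the tool metadata a server advertises at connect time and compare it with the fingerprint captured when the server was approved. The contract store and tool fields below are hypothetical; the point is simply that drift blocks the call.

```python
# Minimal sketch of tool-description drift detection at connect time.
# The approved-contract store and tool fields are illustrative assumptions.
import hashlib
import json

def fingerprint_tools(tool_manifest: list[dict]) -> str:
    """Stable hash over the advertised tool names, descriptions, and schemas."""
    canonical = json.dumps(tool_manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def voice_matches(server_id: str, advertised_tools: list[dict],
                  approved_contracts: dict[str, str]) -> bool:
    """True only when the advertised metadata matches the approved fingerprint."""
    approved = approved_contracts.get(server_id)
    return approved is not None and fingerprint_tools(advertised_tools) == approved

# Fingerprints captured during vetting (values are placeholders).
contracts = {"case-history-server": "3f2a9c..."}
tools_seen_now = [{"name": "get_case_history", "description": "Read-only lookup"}]
if not voice_matches("case-history-server", tools_seen_now, contracts):
    raise ConnectionAbortedError("Tool metadata drift detected; blocking the call for re-review")
```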

We also minimize what’s shared.

Context is trimmed to what the task requires. Sensitive data isn’t included by default, and third‑party servers get only what they need—not the whole transcript. Output filters and prompt shields sit alongside the model to prevent risky inputs from becoming risky actions.
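
Context minimization can be as simple as a per-tool allowlist of fields. A small sketch, with hypothetical tool and field names:

```python
# Minimal sketch of trimming session context before a third-party tool call.
# Tool names, field names, and the allowlist are illustrative assumptions.
TOOL_CONTEXT_ALLOWLIST = {
    "get_case_history": {"case_id", "product_area"},  # only what the task requires
}

def trim_context(tool_name: str, session_context: dict) -> dict:
    """Forward only allowlisted fields; everything else stays behind by default."""
    allowed = TOOL_CONTEXT_ALLOWLIST.get(tool_name, set())
    return {key: value for key, value in session_context.items() if key in allowed}

session = {
    "case_id": "12345",
    "product_area": "billing",
    "customer_email": "user@example.com",  # sensitive: not on the allowlist
    "full_transcript": "...",              # never shipped to third-party servers
}
print(trim_context("get_case_history", session))  # {'case_id': '12345', 'product_area': 'billing'}
```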

Isolation completes the design. Local servers run in containers with tight file and network permissions. Hosted servers allow only the outbound calls they need, and inbound traffic flows through the gateway, with TLS and logging enforced.

Simple rules with visible guardrails.

“We only use vetted MCP servers,” Enjeti says. “That’s how we keep the conversation safe without slowing it down.”

How we run MCP at scale: architecture, vetting, and inventory

We keep MCP safe by making three things intentionally boring: architecture, vetting, and inventory. One defined path. One vetting flow. One living catalog.

Architecture

We recommend remote MCP servers sit behind an API gateway, giving us a single place to authenticate, authorize, validate, rate‑limit, and log. Transport Layer Security (TLS) is required by default, and for sensitive endpoints, we can require mutual TLS. Outbound egress is pinned to approved destinations using private endpoints and firewall rules, so servers can’t “call anywhere.” Runtime protection continuously watches for credential abuse, injection patterns, burst traffic, and odd geographies.

Identity is established up front. We issue short‑lived, least‑privilege tokens with the correct audience and scopes, and admin paths require strong authentication. Where supported, tokens are bound to the client to reduce replay risk. Services use managed identities or signed credentials; secrets don’t live in code, and keys rotate on schedule.

Model‑side safety travels with every conversation. Content safety and prompt shields help models ignore risky inputs, while orchestration enforces a per‑tool allowlist, so an agent can’t call tools that aren’t in policy—even if the model suggests it. We also track model versions, allowing behavior changes to be correlated with updates.

Clients enforce consent at the edge. “Ask before edits” is enabled by default for write, delete, and configuration changes. When an agent connects, it verifies tool descriptions against the approved contract.

Observability ties it all together. We’re working toward logging tool calls, resource access, and authorization decisions end‑to‑end with correlation IDs. Detections flag abnormal tool selection, unexpected data egress, or edits without consent. Every server has an owner, a contract, and an approval record, and metadata changes automatically trigger re‑review. Kill switches live at both the client and the gateway when we need them.

Vetting

We don’t “connect and hope.”

Before any MCP server can speak in our environment, it earns trust. Owners declare what the server does (tools and actions), what it touches (data categories and exports), how callers authenticate (scopes and audience), and where it runs (runtime and on‑call ownership).

We start with static checks: manifests must match the contract, side-effecting actions must be consent-gated, tokens must be short-lived and properly scoped. An SBOM (Software Bill of Materials) must be present, dependencies must be current, and no credentials can be embedded in code.

Then we test like a client would. We snapshot tool metadata on connect and compare it to the approved contract, probe for prompt‑injection and tool‑poisoning, and verify that “ask before edits” triggers for destructive actions.

We also confirm context minimization, validate that egress is pinned to approved hosts, and test resilience under load, including health checks, retry behavior, and isolation using containers with least‑privilege file and network access. Servers are published only when security, privacy, and responsible AI reviews are complete, runbooks and on‑call are in place, and the registry entry is created and pinned.
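
To sketch what the static portion of that vetting might look like as a CI gate, the check below compares a server’s manifest to its approved contract and flags the most common failures. The manifest shape, field names, and thresholds are assumptions for illustration only.

```python
# Illustrative sketch of static vetting checks before an MCP server is published.
# The manifest/contract shape, field names, and one-hour token cap are assumptions.
def preflight_checks(manifest: dict, approved_contract: dict) -> list[str]:
    failures = []

    # The declared tool list must match the approved contract exactly.
    declared = {tool["name"] for tool in manifest.get("tools", [])}
    approved = {tool["name"] for tool in approved_contract.get("tools", [])}
    if declared != approved:
        failures.append(f"Tool list drift: {sorted(declared ^ approved)}")

    # Every side-effecting action must be consent-gated.
    for tool in manifest.get("tools", []):
        if tool.get("side_effects") and not tool.get("requires_consent"):
            failures.append(f"Side-effecting tool without a consent gate: {tool['name']}")

    # Tokens must be short-lived, an SBOM must be present, no secrets in config.
    if manifest.get("token_lifetime_seconds", 0) > 3600:
        failures.append("Token lifetime exceeds the short-lived limit")
    if not manifest.get("sbom_path"):
        failures.append("Missing SBOM (Software Bill of Materials)")
    if any("secret" in str(value).lower() for value in manifest.get("env", {}).values()):
        failures.append("Possible credential embedded in configuration")

    return failures  # an empty list means the static gate passes
```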

Inventory

A photo of Janardhanan.

“Inventory is the foundation—if we miss a server, we miss the conversation. Every server, regardless of where it’s running or how it’s deployed, must be accounted for in our system.”

Priya Janardhanan, principal security assurance engineering manager, Microsoft CISO

You can’t govern what you can’t see, and MCP shows up in more places than a single system of record. To solve that, we’re building the map from many signals and stitching them into one catalog.

“Inventory is the foundation—if we miss a server, we miss the conversation,” says Priya Janardhanan, a principal security assurance engineering manager at Microsoft CISO Operations. “Every server, regardless of where it’s running or how it’s deployed, must be accounted for in our system. Without a complete inventory, we lose visibility into critical operations, risk exposing sensitive data, and undermine our ability to ensure compliance and security.”

Our goal state is that endpoint telemetry catches developer-run servers on laptops and workstations. Repos and CI pipelines reveal intent before anything ships. IDEs (Integrated Development Environments) surface local extensions and configured endpoints. The gateway and our registries anchor what’s approved for business data, while low-code environments tell us which connectors are in use and where they point.

We normalize and correlate those signals with stable IDs for servers, tools, and owners. Ownership is proven through repositories, gateway services, and environment administrators—on‑call contacts included. Exposure is scored based on data touches, scopes requested, egress rules, and change history, so high‑risk items rise to the top of the queue.
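
A simple exposure score along those lines might weight data sensitivity, scope breadth, egress posture, and recent churn; the weights and field names below are purely illustrative.

```python
# Illustrative exposure-scoring sketch so high-risk servers rise to the top of the queue.
# Weights, categories, and field names are assumptions, not our production model.
SENSITIVITY_WEIGHT = {"public": 1, "internal": 3, "confidential": 7, "highly_confidential": 10}

def exposure_score(entry: dict) -> int:
    score = SENSITIVITY_WEIGHT.get(entry.get("data_sensitivity", "internal"), 3)
    score += 2 * len(entry.get("scopes_requested", []))   # broad scopes raise risk
    score += 5 if entry.get("open_egress") else 0          # unpinned egress raises risk
    score += min(entry.get("changes_last_30_days", 0), 5)  # recent churn raises risk
    return score

inventory = [
    {"server": "case-history", "data_sensitivity": "confidential",
     "scopes_requested": ["read"], "open_egress": False, "changes_last_30_days": 1},
    {"server": "export-helper", "data_sensitivity": "highly_confidential",
     "scopes_requested": ["read", "write", "export"], "open_egress": True,
     "changes_last_30_days": 6},
]
for entry in sorted(inventory, key=exposure_score, reverse=True):
    print(entry["server"], exposure_score(entry))  # the riskier server surfaces first
```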

Freshness is tracked with last‑seen timestamps, and stale entries are retired over time. Builders can discover and reuse approved servers; reviewers can see what changed since the last approval, and admins get instant visibility into coverage and hotspots.

We’re working toward automated identification and notification for unknown servers. In the ideal state, a registration stub is created when we detect an unknown server on an endpoint. Then, the likely owner is notified, and direct calls are blocked until the server is vetted through an automated process. If tool metadata changes after approval, high-risk actions are paused and routed for re-review, then auto-resumed once approved.

“It all revolves around inventory as the foundation,” Janardhanan says. “If we miss a server, we miss the conversation.”

A photo of Hasan.

“Agent 365 tooling servers will allow centralized governance for IT admins. That means a single pane where they can see what’s approved, who owns it, what data it touches, and then apply policy.”

Aisha Hasan, principal product manager, Microsoft Digital

Architecture gives us stable choke points. Vetting keeps weak servers out. Inventory keeps our map current. It’s a single pattern for builders and a unified playbook for security.

Governing agents in low‑code and pro-code scenarios

Makers move fast—that’s the point. A Customer Support team needed a Copilot action to pull case history, so they opened Copilot Studio, selected an approved MCP connector, and shipped a first version before lunch. No tickets. No detours. Governance showed up in the flow, not as a blocker.

“Agent 365 tooling servers will allow centralized governance for IT admins,” says Aisha Hasan, a principal product manager at Microsoft Digital. “That means a single pane where they can see what’s approved, who owns it, what data it touches, and then apply policy. We’re moving toward that consolidation so innovation continues while governance gets simpler and more consistent.”

We place guardrails where makers already work. In Copilot Studio, trusted and verified first-party MCP servers are allowed in developer environments to accelerate innovation and encourage experimentation. Riskier or more complex MCP integrations are available in Copilot Studio custom environments and other pro-code tools, such as the Microsoft 365 Agent Toolkit in VS Code and Microsoft Foundry, but only with clear checks: service ownership, security and privacy review, responsible AI assessment, and consent gating for high-impact actions.

The allowlist is our north star.

Approved MCP servers and connectors live in one catalog with documented owners, scopes, and data boundaries. Makers choose from that shelf. If an MCP server uses an unverified tool, we enforce endpoint filtering. If there’s a misconfiguration, we open a task for the owner and help them build securely.

Permissions stay tight without adding cognitive load. Tokens are short-lived and scoped to the task. Context is trimmed so only the necessary fields flow to the tool. Third-party servers never get the full transcript. If a connector’s capabilities change, the runtime compares the new “voice” to what we approved. MCP clients should pause risky actions, notify the owner, and resume automatically once reviewed.

With agent inventory in Power Platform Admin Center and registry in Agent 365, admins get a clean view of which connectors are active, who owns them, what data they touch, and how often they’re called. Organization policies such as DLP and MIP can be enforced in a unified way, with a re-review when capabilities change. The goal is simple: let builders innovate confidently while maintaining security and compliance.

“MCP servers are powerful AI tools that enable agents to seamlessly integrate and interact with enterprise data and transform business workflows,” Hasan says. “That means the same enterprise data and governance principles are applied equally to MCP servers and other connectors. A robust inventory, an agile policy framework, and an automated workflow for enforcement are cornerstones for successfully governing agents at scale.”

Securing MCP at scale: Operating, monitoring, and enabling

Our work doesn’t stop at go‑live. Once an MCP server is in the catalog, we operate the conversation like a service: measurable, observable, and responsive. Identity and policy guard the front door, but runtime is where we prove the controls work without slowing anyone down.

In practice, operating MCP at scale comes down to four motions:

Observe every tool call end to end. We make the flow observable. Every tool call carries a correlation ID from client to gateway to server and back. Prompts, tool selections, authorization decisions, and resource access should be logged with consistent schemas. Golden signals—latency, errors, saturation—sit alongside safety signals like unexpected egress or edits without consent. Owners and security teams see the same dashboards.
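
A minimal sketch of what carrying a correlation ID across every hop could look like; the structured log schema and field names are assumptions, not our production format.

```python
# Minimal sketch of end-to-end correlation IDs on a tool call.
# The structured log schema and field names are illustrative assumptions.
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("mcp.observability")

def log_hop(stage: str, correlation_id: str, **fields) -> None:
    """Emit one structured record per hop so client, gateway, and server logs stitch together."""
    log.info(json.dumps({"correlation_id": correlation_id, "stage": stage, **fields}))

correlation_id = str(uuid.uuid4())  # minted once, carried on every hop
log_hop("client", correlation_id, tool="get_case_history", consent=True)
log_hop("gateway", correlation_id, decision="allow", latency_ms=12)
log_hop("server", correlation_id, rows_returned=42, egress_host="approved.internal")
```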

Detect drift and abnormal behavior early. Detection lives close to the work. We flag abnormal tool patterns, spikes in write operations, burst traffic from new geographies, and context sizes that don’t fit a task. We continuously compare a tool’s “voice” at connect time to the approved version; drift automatically pauses risky actions and pings the owner. Cost controls double as guardrails, using rate limits and budgets to cap blast radius and surface runaway loops early.

Respond with precision instead of blunt shutdowns. Response is graded, not binary. We can block destructive actions and allow reads, or throttle a noisy client without killing the session. Kill switches exist at both the client and the gateway. Playbooks are pre‑approved and integrated into the consoles owners already use, and dry runs are part of muscle memory, so the first switch flip doesn’t happen during an incident.

We treat model behavior as part of operations. Content safety and prompt shields run in production, not just in tests. We pin model versions and watch for output drift after updates. If a model starts suggesting tools out of character, the owner gets paged with the exact prompts and calls that triggered it.

Telemetry respects privacy. Logs avoid sensitive payloads by default and mask what must pass through for forensics. Access is role‑based, retention follows policy, and audit readiness is designed in on day one.

Enable builders through templates, education, and reuse. Adoption and education run in parallel. Builders get templates that enable best practices: sample manifests with consent gates, CI checks for token scope and SBOMs, and gateway stubs with sane defaults. A “ten‑minute preflight” runs locally to verify contracts, test consent flows, and check egress before a pull request is opened. IDE lint rules catch common issues early.

“This is how we operate MCP at scale,” says Janardhanan. “Observe the conversation, detect drift early, respond with precision, and teach habits that make the right path the easy path. We run it like a product because that’s what it is.”

Measuring results and moving forward

This program has changed how we build. Reviews move faster because every server follows the same path. Drift is caught early because clients compare a tool’s “voice” on connection. Shadow servers decline as inventory fills in from endpoint, repo, IDE, and gateway signals. Reuse increases because teams can discover trusted servers instead of creating new ones. Incidents resolve faster with correlation IDs across the conversation and kill switches at both the client and the gateway.

It’s also changed how our admins work. One gateway means one perimeter to manage. Policies land once and apply everywhere. Owners see the same telemetry security sees, so fixes happen where the work happens.

Going forward, we’re focused on more consolidation and automation. We’re moving toward a single pane for MCP governance—approve, monitor, and pause from one place. Policy-as-code will keep allowlists, consent rules, and egress boundaries versioned and testable in CI.

Our preflight checks will get smarter, with stronger injection tests, automatic egress validation, and environment‑aware templates. We’ll expand consent patterns so high‑impact actions remain explicit and auditable, even across multi‑tool chains. And we’ll keep shrinking re‑review time, so drift is measured in minutes, not days.

AI conversations are now part of how we build every day. MCP standardizes how agents talk to tools and data. Secure-by-default architecture, rigorous vetting, and a living inventory ensure that the right voices stay in the room, only what’s needed is shared, and drift is caught early.

The result is simple: teams ship faster with fewer surprises, and governance stays visible without getting in the way. We’ll keep tightening the loop, so saying yes remains both easy and safe.

Key takeaways

If you’re implementing MCP security, consider these key actions to ensure secure, efficient adoption in your organization:

  • Build governance into the maker flow. Embed security, consent, and responsible AI checks directly where teams build—so protection shows up by default, not as an afterthought.
  • Maintain a single allowlist and catalog. Centralize approved MCP servers and connectors with clear ownership, scope, and data boundaries.
  • Enforce scoped, short-lived permissions by default. Automatically limit token scope and duration to minimize risk and exposure.
  • Monitor continuously and detect drift early. Observe activity, flag deviations, and pause risky actions until reviewed and approved by owners.
  • Automate incident response and controls. Leverage pre-approved playbooks, kill switches, and rate limits for fast, precise action.
  • Design for privacy and auditability from day one. Mask sensitive data, restrict log access by role, and ensure audit readiness.
  • Promote education and reuse. Provide templates, training, and feedback loops to encourage safe development and adoption of trusted servers.

The post Protecting AI conversations at Microsoft with Model Context Protocol security and governance appeared first on Inside Track Blog.

Moving from a ‘Scream Test’ to holistic lifecycle management: How we manage our Azure services at Microsoft http://approjects.co.za/?big=insidetrack/blog/moving-from-a-scream-test-to-holistic-lifecycle-management-how-we-manage-our-azure-services-at-microsoft/ Thu, 20 Nov 2025 17:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=21193 Nearly a decade ago, as we began our journey from relying on on-premises physical computing infrastructure to being a cloud-first organization, our engineers came up with a simple but effective technique to see if a relatively inactive server was really needed. Engage with our experts! Customers or Microsoft account team representatives from Fortune 500 companies […]

The post Moving from a ‘Scream Test’ to holistic lifecycle management: How we manage our Azure services at Microsoft appeared first on Inside Track Blog.

Nearly a decade ago, as we began our journey from relying on on-premises physical computing infrastructure to being a cloud-first organization, our engineers came up with a simple but effective technique to see if a relatively inactive server was really needed.

They dubbed it the “Scream Test.”

“We didn’t have a great server inventory and tracking system, and we didn’t always know who owned a server,” says Brent Burtness, a principal software engineer in Commerce Financial Platforms, who was one of the leaders for the effort in his group. “So, we essentially just turned them off. If someone screamed—‘Hey, why’d you turn off my server?’—then we’d know it was still being used.”

Today, the basic idea behind the Scream Test is being used across the company, but in a more holistic way. Importantly, it’s been incorporated into the overall lifecycle management of our computing infrastructure. And, through the automation tools provided by Microsoft Azure, we have a much more efficient process for making sure that we’re saving time and money by reducing the number of underused machines we operate, monitor, and maintain.

A photo of Apple.

“We thought we were going to get rid of a small number of machines that weren’t being used. But we found the actual share was about 15% of all machines, which saved us a lot of effort of moving those unused machines to the cloud. In other words, we downsized on the way to the cloud, rather than after the fact.”

Pete Apple, cloud network engineering architect, Microsoft Digital

Uncovering more than expected

The Scream Test was part of the huge effort to evaluate our on-premises compute resources before we began moving to the Azure cloud. After all, why spend resources moving something that isn’t needed?

Pete Apple, who helped develop the concept of the Scream Test, is a cloud network engineering architect in Microsoft Digital, the company’s IT organization. Looking back, he remembers the surprising results that emerged when they began shutting down specific servers to see who noticed.

“We thought we were going to get rid of a small number of machines that weren’t being used,” Apple says. “But we found the actual share was about 15% of all machines, which saved us a lot of effort of moving those unused machines to the cloud. In other words, we downsized on the way to the cloud, rather than after the fact.”

As part of this process, Apple explains, our engineers looked at two related factors to reduce inefficiencies in our usage of computing resources.

The first was to identify systems that were used infrequently, at a very low level of CPU (sometimes called “cold” servers). From that, we could determine which systems in our on-premises environments were oversized—meaning someone had purchased physical machines according to what they thought the load would be, but either that estimate was incorrect or the load diminished over time. We took this data and created a set of recommended Microsoft Azure Virtual Machine (VM) sizes for every on-premises system to be migrated.

“We learned that there’s a lot of orphaned, or underutilized, resources out there,” Burtness says. “These were cases where the workload was so small on a server—like under 5% CPU—that it didn’t make sense to host it on its own machine. We could then move the task or application and get it down to just one or two CPUs on a virtual machine.”

At the time, we did much of this work manually, because we were early adopters. The company now has a number of products available to assist with this review of your on-premises environment, led by Azure Migrate.

Another part of the process was determining which systems were being used for only a few days a month or at certain busy times of the year. These development machines, test/QA machines, and user acceptance testing machines (reserved for final verification before moving code to production) were running continuously in the datacenter but were really only needed during limited windows. For these situations, we applied the tools available in Azure Resource Manager Templates and Azure Automation to ensure the machines would only run when needed.
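
We did that work with Azure Resource Manager Templates and Azure Automation runbooks, but the underlying idea is simple enough to sketch with the Azure SDK for Python. The subscription ID, tag name, and schedule window below are placeholder assumptions.

```python
# Illustrative sketch of "run only when needed" using the Azure SDK for Python.
# Our implementation used ARM Templates and Azure Automation; the subscription ID,
# tag name, and business-hours window here are placeholder assumptions.
from datetime import datetime

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
BUSINESS_HOURS = range(8, 18)                              # 8 AM to 6 PM local time

def enforce_schedule() -> None:
    client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    in_hours = datetime.now().hour in BUSINESS_HOURS
    for vm in client.virtual_machines.list_all():
        # Only manage machines explicitly tagged for scheduled use (dev, test, UAT).
        if (vm.tags or {}).get("usage-schedule") != "business-hours":
            continue
        resource_group = vm.id.split("/")[4]  # resource group name sits in the resource ID
        if in_hours:
            client.virtual_machines.begin_start(resource_group, vm.name)
        else:
            client.virtual_machines.begin_deallocate(resource_group, vm.name)

if __name__ == "__main__":
    enforce_schedule()
```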

Automating with Azure

Today, we don’t have to rely on anything as crude as the Scream Test to find unused and underused computing resources. With 98% of our IT resources operating in the Azure cloud, we have much greater insight into how efficient our network is, so much of the process can be automated.

“We’ve found this effort much easier to manage in the cloud, because all our computing resources are integrated with the Azure portal,” Apple says. “They have an API system and offer various tools within Azure Update Manager and Azure Advisor to help with cost efficiency. It’s kind of like a modern version of Clippy—’Hey, it looks like your VM isn’t being used much. Do you want to downsize that or turn it off?'”

(For the uninitiated, Clippy was the Microsoft Office animated paperclip assistant introduced in the late 1990s. It offered tips and help with tasks, like writing and formatting documents. Clippy became iconic for its quirky suggestions, including recommending that you remove things from your desktop that you weren’t using.)

Burtness smiles in a portrait photo.

“With everything being in the Azure portal or in Azure Resource Graph, it’s much more streamlined, and makes it easier to get that data out to the teams. They can then go into the portal and clean up the resource.”

Brent Burtness, principal software engineer, Commerce Financial Platforms

And simply taking the step of turning off stuff that we weren’t using turned out to be very effective. Thanks, Clippy!

Today, we approach this challenge in a more efficient and sophisticated way, taking advantage of Azure tools like Update Manager and Advisor.

“With everything being in the Azure portal or in Azure Resource Graph, it’s much more streamlined, and makes it easier to get that data out to the teams,” Burtness says. “We can run automated queries with Azure Resource Graph. Then we bring that information into our internal Service 360 tool, which we use to give action items to our developers. Each item gives them a link to Azure portal, and they can then go into the portal and clean up the resource.”
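
Here’s a minimal sketch of the kind of automated Resource Graph query that can feed those action items, using the azure-mgmt-resourcegraph package. The subscription ID is a placeholder, and the deallocated-VM filter is one common way to surface cleanup candidates, not our exact Service 360 query.

```python
# Minimal sketch of an automated Azure Resource Graph query for cleanup candidates.
# The subscription ID is a placeholder, and the deallocated-VM filter is one common
# heuristic for spotting idle resources, not our exact internal query.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resourcegraph import ResourceGraphClient
from azure.mgmt.resourcegraph.models import QueryRequest

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

KQL = """
Resources
| where type =~ 'microsoft.compute/virtualmachines'
| where properties.extended.instanceView.powerState.code =~ 'PowerState/deallocated'
| project name, resourceGroup, subscriptionId, location
"""

def find_cleanup_candidates() -> list[dict]:
    client = ResourceGraphClient(DefaultAzureCredential())
    response = client.resources(QueryRequest(subscriptions=[SUBSCRIPTION_ID], query=KQL))
    return response.data  # rows to route to owners as action items

if __name__ == "__main__":
    for row in find_cleanup_candidates():
        print(f"{row['name']} ({row['resourceGroup']}) looks idle; check with its owner")
```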

Managing for the lifecycle

One of the most important things we learned by using the Scream Test to identify inefficiencies and moving our systems from on-premises servers to the cloud was that it’s an ongoing process, not a fixed-end project.

“We had this idea that it was going to be a one-time event, that we’ll move to the cloud and then we’ll be done,” Apple says. “A better understanding is that it’s a lifecycle. We have integrated this concept of continual evaluation into our processes around everything that’s still on-premises, because we still have labs, we still have physical infrastructure.”

We continue to do this evaluation on a regular basis with both physical and virtual computing resources, because needs and usage are constantly changing.

Cutting our cloud costs

A text graphic shows the savings that one group at Microsoft achieved by becoming more efficient in their compute usage.
In a pilot set of Azure subscriptions, the Commerce Financial Platforms team reduced usage by 233 resources across 36 subscriptions and 17 services in 6 team groups, saving more than $15,000 in monthly operating costs.

“Now we have a basic process around a six-month cycle,” Apple says. “So, every six months we ask, does this still need to be on-premises or should we start moving it to the cloud? And we do the same thing with our cloud resources. Who’s still using these VMs? And we still go through the same review process to see if it’s needed, or if we can shut it down or move it.”

This has resulted in significant cost savings for the company. “We’re up to about 15% to 20% less compute cost, depending on the organization, because of this much better understanding of our business needs,” Apple says.

Better governance, increased security

Another major benefit of this process was establishing much stronger governance of compute resources across the entire organization.

“When we first did the Scream Test, we weren’t always really sure who owned what, in some cases,” Apple says. “We’ve fixed that as part of this process. This governance aspect is a key part of being more efficient with our resources.”

Burtness explains why this is so important.

“It’s critical to know exactly who to contact when there’s something wrong with the server,” Burtness says. “Now, with clearer ownership, clearer accountability, and better inventory, it’s a much better experience.”

Better governance also means tighter security, according to both Apple and Burtness.

“This is really important when it comes to threat-actor response,” Apple says. “Unused servers can often be an entry point for hackers. Or, say we discover that a machine or server is getting hacked; you need to talk to who owns it. If you don’t know, it takes you longer to track them down and combat the hack. That’s not great. Improving our governance has definitely made securing our environment easier.”

Key takeaways

Here are some things to keep in mind when managing your own enterprise compute resources for greater efficiency:

  • It’s not a one-time exercise. For the best results, you should be evaluating your computing resources on a regular schedule to identify “cold” servers and unused infrastructure.
  • Adjust for variable usage patterns. It’s not just about unused servers. Some machines may only be needed for a business function during certain busy times of the year. Consider turning the machines on just to handle the load during those periods and turning them off the rest of the year.
  • Use Azure tools for greater insight. If you’re operating your infrastructure in the Azure cloud, you can much more easily monitor and address orphaned resources using automated tools such as Azure Advisor, Azure Resource Graph, and the Azure portal.
  • Apply your savings to other priorities. “The more efficient you are, the more savings can be applied to other projects or given back to your manager—who is going to be very happy with you,” Apple says.
  • Saving money is not the only benefit. You’ll not only save operating costs, you’ll have a reduced maintenance and monitoring load, better governance, and fewer security vulnerabilities.

The post Moving from a ‘Scream Test’ to holistic lifecycle management: How we manage our Azure services at Microsoft appeared first on Inside Track Blog.

Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft http://approjects.co.za/?big=insidetrack/blog/vuln-ai-our-ai-powered-leap-into-vulnerability-management-at-microsoft/ Thu, 16 Oct 2025 16:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=20623 In today’s hyperconnected enterprise landscape, vulnerability management is no longer a back-office function—it’s a frontline defense. With thousands of devices from a multitude of vendors, and a relentless stream of Common Vulnerabilities and Exposures (CVEs), here at Microsoft we faced a challenge familiar to every IT decision maker: how to scale vulnerability response without scaling […]

The post Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft appeared first on Inside Track Blog.

In today’s hyperconnected enterprise landscape, vulnerability management is no longer a back-office function—it’s a frontline defense. With thousands of devices from a multitude of vendors, and a relentless stream of Common Vulnerabilities and Exposures (CVEs), here at Microsoft we faced a challenge familiar to every IT decision maker: how to scale vulnerability response without scaling cost, complexity, or risk.

A photo of Fielder.

“While AI enables amazing capabilities for knowledge workers, it also increases the threat landscape, since bad actors using AI are constantly probing for vulnerabilities. Vuln.AI helps keep Microsoft safe by identifying and accelerating the mitigation of vulnerabilities in our environment.”

Brian Fielder, vice president, Microsoft Digital 

Enter Vuln.AI, an intelligent agentic system developed by our team in Microsoft Digital—the company’s IT organization—to transform how we identify, prioritize, and resolve vulnerabilities across our enterprise network.

Manual methods can’t keep up

As a company, we detect over 600 million cybersecurity threats every day, according to our latest Digital Defense Report. Some of those signals are bad actors probing our internal network and infrastructure looking for unpatched vulnerabilities. Our infrastructure supports over 300,000 employees and vendors, 25,000 network devices, and over 560 buildings across 102 countries. This scale means we face a constant stream of vulnerabilities—each requiring triage, impact analysis, and remediation.

“While AI enables amazing capabilities for knowledge workers, it also increases the threat landscape, since bad actors using AI are constantly probing for vulnerabilities. Vuln.AI helps keep Microsoft safe by identifying and accelerating the mitigation of vulnerabilities in our environment,” says Brian Fielder, a vice president within Microsoft Digital. 

Historically, our Infrastructure, Networking, and Tenant team here in Microsoft Digital relied on manual assessments to determine which network devices were impacted by new vulnerabilities. Traditional vulnerability scanning tools generate a lot of false positives and false negatives, and a significant amount of analysis still falls to security engineers, requiring manual validation before any vulnerability impact can be communicated to device owners. These manual methods were time-consuming, error-prone, and reactive—our security engineers were spending hours on each vulnerability, at times missing critical threats or sinking too much time into false alarms.

A photo of Bansal.

“AI’s true power lies in the problem it’s applied to. Start by identifying the most time-consuming or painful task in your organization, then explore how AI can augment or improve it. Begin with a small, targeted enhancement and iterate continuously.”

Ankit Bansal, senior product manager, Microsoft Digital

With the vast number of vulnerabilities coming in every day, security engineers needed a scalable way to quickly analyze, prioritize, and respond.

The solution: Vuln.AI

We already achieved dramatic impact with our AI Ops and Network Infrastructure Copilot, which is on track to save us over 11,000 hours of network service management time per year. We built Vuln.AI on top of that investment:

  1. The Research Agent analyzes vulnerability feeds and network metadata from our Infrastructure Data Lakehouse (IDL) built on top of Azure Data Explorer, which regularly ingests data from our device vendors and other sources. Once new vulnerabilities are detected, it automates the identification of impacted devices and integrates with other internal tooling for validation and reporting.
  2. The Interactive Agent acts as a gateway for engineers and device owners to ask follow-up questions and initiate remediation. Through agent-to-agent interaction, it leverages our Network Infrastructure Copilot to query the research agent’s findings. This agentic interface enables real-time decision-making and contextual insights.

Together, these agents are significantly improving our network security operations. The results we’re seeing so far are compelling:

  • A 70% reduction in time to vulnerability insights, enabling faster prioritization and mitigation, minimizing exposure windows.
  • Lower risk of compromise through increased accuracy, quicker detection, and containment of threats.
  • A stronger compliance posture that supports adherence to financial, legal, and regulatory requirements.
  • Higher accuracy in identifying vulnerable devices, reducing false positives and missed threats.
  • Engineering hours saved and reduced fatigue, significantly improving productivity.

Our gains translate to lower operational risk, faster response times, and more resilient infrastructure—critical outcomes for any enterprise navigating today’s threat landscape.

“AI’s true power lies in the problem it’s applied to,” says Ankit Bansal, a senior product manager within Microsoft Digital. “Start by identifying the most time-consuming or painful task in your organization, then explore how AI can augment or improve it. Begin with a small, targeted enhancement and iterate continuously.”

How Vuln.AI works

The system continuously ingests our CVE data from our device suppliers’ API feeds and a publicly available database of known cybersecurity vulnerabilities.  It correlates that data with device attributes such as its hardware model and OS to identify the potential impact on the network and surface actionable insights.
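
To illustrate the correlation step, the sketch below queries a device inventory table in Azure Data Explorer for hardware models named in an advisory that are still running a version below the fix. The cluster URL, database, table, and column names are assumptions, not our actual Infrastructure Data Lakehouse schema.

```python
# Rough sketch of correlating a CVE advisory with device inventory in Azure Data Explorer.
# The cluster URL, database, table, and column names are illustrative assumptions,
# not the actual Infrastructure Data Lakehouse schema.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://example-idl.kusto.windows.net"  # placeholder cluster
DATABASE = "NetworkInventory"                      # placeholder database

def devices_affected_by(cve_id: str, affected_models: list[str], fixed_version: str) -> list[dict]:
    """Return devices whose hardware model is in scope and whose OS predates the fix."""
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
    client = KustoClient(kcsb)
    models = ", ".join(f"'{m}'" for m in affected_models)
    query = f"""
    Devices
    | where HardwareModel in ({models})
    | where parse_version(OsVersion) < parse_version('{fixed_version}')
    | project DeviceName, HardwareModel, OsVersion, Owner
    """
    result = client.execute(DATABASE, query)
    return [{"cve": cve_id, **row.to_dict()} for row in result.primary_results[0]]
```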

Engineers interact with the system via Copilot, Teams, or custom tooling, which allows seamless integration with our network security teams’ daily workflows.

“We built a hybrid approach in Vuln.AI to guide LLMs through complex security advisories,” says Blaze Kotsenburg, a software engineer in Microsoft Digital. “By combining structured function calls, templated prompts, and data validation, we keep the model focused on producing reliable, actionable insights for vulnerability mitigation.”

A photo of Lollis.

“We chose Durable Functions for Vuln.AI because it allowed us to confidently orchestrate complex, stateful research. The reliability and simplicity of the framework meant we could shift our focus to engineering the intelligence behind the agent, especially the prompting strategies used in Vuln.AI’s backend processing.”

Mike Lollis, senior software engineer, Microsoft Digital

When it came to building Vuln.AI, we relied heavily on our own technology platforms, including: 

  • Azure AI Foundry for model development and deployment
  • Azure Data Explorer to store device metadata and CVEs
  • Agent-to-agent interaction with Network Copilot to query our database for device and inventory knowledge
  • Azure OpenAI models for natural language processing and classification
  • Azure Durable Functions for fine-grained orchestration and custom LLM workflows

“We chose Durable Functions for Vuln.AI because it allowed us to confidently orchestrate complex, stateful research,” says Mike Lollis, a senior software engineer in Microsoft Digital.  “The reliability and simplicity of the framework meant we could shift our focus to engineering the intelligence behind the agent, especially the prompting strategies used in Vuln.AI’s backend processing.”
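
As a simplified sketch of what that orchestration could look like in Python, the snippet below fans out device-impact checks and persists the findings for engineer validation. The activity names and input shape are hypothetical; the real Vuln.AI workflows and prompting strategies are more involved.

```python
# Simplified sketch of a Durable Functions orchestration for the research flow.
# Activity names and the input shape are hypothetical; real Vuln.AI workflows
# and prompting strategies are more involved.
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    advisory = context.get_input()  # e.g., {"cve_id": "CVE-2025-XXXX"}

    # Parse the advisory once, then fan out device-impact checks in parallel.
    parsed = yield context.call_activity("ParseAdvisoryWithLLM", advisory)
    tasks = [
        context.call_activity("AssessDeviceImpact", {"device": device, "advisory": parsed})
        for device in parsed["candidate_devices"]
    ]
    assessments = yield context.task_all(tasks)

    # Persist findings for security-engineer validation before owners are notified.
    impacted = [a for a in assessments if a["impacted"]]
    yield context.call_activity("PublishFindings",
                                {"cve": advisory["cve_id"], "devices": impacted})
    return {"cve": advisory["cve_id"], "impacted_count": len(impacted)}

main = df.Orchestrator.create(orchestrator_function)
```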

Vuln.AI in action

Consider a common scenario: a new CVE that affects a network switch has just been published. Vuln.AI’s research agent immediately flags the vulnerability, maps it to potentially affected devices in our network inventory, and pushes the findings to an internal database.

A photo of Lee.

“AI is only as good as the data you provide. Much of the success with Vuln.AI came from our dedicated efforts to source comprehensive vulnerability data and device attributes. For effective AI-powered solutions, you really need to invest in a strong data foundation and a strategy for how to integrate into the rest of your infrastructure.”

Linda Lee, product manager II, Microsoft Digital

This data then becomes immediately accessible in our internal tools, where it is validated and approved by security engineers. Following this, network engineers are provided with precise information about their vulnerable devices.

Engineers can prompt Vuln.AI’s interactive agent to instantly retrieve the following information:

“12 devices impacted by CVE-2025-XXXX. Would you like me to suggest some next steps for mitigation or remediation?”

With Vuln.AI, network engineers can now begin vulnerability response operations much more quickly—no spreadsheet wrangling and no delays.

“AI is only as good as the data you provide,” says Linda Lee, a product manager II within Microsoft Digital. “Much of the success with Vuln.AI came from our dedicated efforts to source comprehensive vulnerability data and device attributes. For effective AI-powered solutions, you really need to invest in a strong data foundation and a strategy for how to integrate into the rest of your infrastructure.”

It’s about automating manual workflows and research.

“Vuln.AI has reduced our triage time by over 50%,” says Vincent Bersagol, a principal security engineer in Microsoft Digital.

This is allowing our engineers to focus on deeper analysis.

“The synergy between security and AI engineering has unlocked a new level of precision in vulnerability insights,” Bersagol says. “This is just the beginning.”

The journey ahead

Our journey with AI-powered vulnerability management has only just begun. Looking ahead, our roadmap for Vuln.AI includes:

  • Extending data coverage to include more hardware suppliers
  • Integrating more detailed device profiles for more targeted vulnerability response
  • Supporting autonomous workflows to streamline network engineers’ remediation efforts
  • Incorporating other AI agents to support more security use cases

These enhancements will further reduce risk, accelerate response times, and empower engineers to focus on more strategic initiatives.

“Trust is the foundation of everything we do in Microsoft Digital,” Bansal says. “Securing our network is essential to upholding that trust. Intelligent solutions like Vuln.AI not only help us stay ahead of emerging threats—they also establish the blueprint for integrating AI more deeply into our security operations.”

For IT leaders, Vuln.AI offers a blueprint for modern vulnerability management:

  • Scalable: Handles thousands of devices and vulnerabilities with ease
  • Accurate: Reduces false positives and missed threats
  • Efficient: Saves time, money, and resources
  • Secure: Built on Microsoft’s trusted AI and security frameworks

In a world where every second counts and any threat can be costly, Vuln.AI transforms vulnerability management from a bottleneck into a competitive advantage for Microsoft.

Key takeaways

As your organization looks for ways to improve security and threat response in a fast-changing landscape, consider the following insights on how AI is reshaping vulnerability management at Microsoft:

  • Fight fire with fire: The threat landscape has broadened dramatically due to bad actors using AI. Supplementing your own efforts with AI can help you manage your risk more effectively than traditional vulnerability management.
  • Agility is key: Effective vulnerability response hinges on acting fast. An AI-powered solution like Vuln.AI can cut the time needed to analyze and mitigate vulnerabilities by over 50%, enabling organizations to enhance security operations at scale.
  • The future is now: Looking ahead, Microsoft Digital will integrate agentic workflows into more security operations, boosting efficiency in risk prevention, threat detection and response, thereby enabling security practitioners and developers to focus on more strategic projects.

The post Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft appeared first on Inside Track Blog.

Keeping our in-house optical network safe with a Zero Trust mentality http://approjects.co.za/?big=insidetrack/blog/keeping-our-in-house-optical-network-safe-with-a-zero-trust-mentality/ Thu, 16 Oct 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=20611 When it comes to corporate connectivity at Microsoft, a minute of lost connection can lead to catastrophic disruptions for our product teams, sleepless nights for our network engineers, and millions of dollars of lost value for the company. That’s why we built our own optical network at our headquarters in Washington state, and that’s why […]

The post Keeping our in-house optical network safe with a Zero Trust mentality appeared first on Inside Track Blog.

When it comes to corporate connectivity at Microsoft, a minute of lost connection can lead to catastrophic disruptions for our product teams, sleepless nights for our network engineers, and millions of dollars of lost value for the company.

That’s why we built our own optical network at our headquarters in Washington state, and that’s why we’re building similar networks at other regional campuses around the United States and the rest of the world.

With so much on the line, we need to make sure these in-house networks never go down.

But how are we doing that?

We’re applying the same robust Zero Trust approach we take to security and identity. While our optical networks are extremely reliable, any complex system can be knocked offline. In alignment with the Zero Trust mentality we have as a company, we trust the integrity of what we’ve built, but we still needed a backup system that goes beyond redundancy to provide true resilience.

Driven by this goal, we created a Zero Trust Optical Business Continuity Disaster Recovery (BCDR) network that combines two fully independent optical systems designed to sustain uninterrupted services, even during systemic failures. The result is more confidence for our employees and vendors, less pressure on our network engineers, and comprehensive network resilience that will protect us against a major outage.

The urgency of resilience

In 2021, our team in Microsoft Digital, the company’s IT organization, deployed our first next-generation optical network to serve the exclusive network needs of our Puget Sound metro campuses. It offers more bandwidth on less fiber for a lower operational cost than leasing from traditional carriers.

“Puget Sound is a highly concentrated developer network where we need to provide very high throughput,” says Patrick Alverio, principal group software engineering manager for Infrastructure and Engineering Services within Microsoft Digital. “Our optical system is the backbone of all that traffic.”

Our state-of-the-art optical network fulfills our need for fast and reliable connectivity at up to 400 Gbps between core sites, labs, data centers, and the internet edge. We built this network on Reconfigurable Optical Add/Drop Multiplexer (ROADM) technology, delivering dynamic reconfiguration; colorless, directionless, contentionless (CDC) capabilities; flexible grid support; remote provisioning; and automation. It also features a full-mesh topology that provides a layer of redundancy.

But what if the entire ROADM-based system fails?

There are plenty of operational risks that can derail even the most robust network. Anything from misconfigured automation scripts to policy changes to misaligned software versioning to simple human error can cause outages.

A photo of Elangovan.

“We don’t want even a second of downtime. We needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.”

Vinoth Elangovan, senior network engineer, Hybrid Core Network Services, Microsoft Digital

To some degree, those kinds of minor disruptions are inevitable. But catastrophic events like fiber cuts, failures in the ROADM operating system, or even natural disasters have the potential for even more wide-ranging disruption.

During a catastrophic outage, thousands of engineers, developers, researchers, and other technical employees who need access to crucial lab environments and data centers could lose connectivity. That can sabotage feature delivery, disrupt product patches, interrupt updates, and halt all kinds of core product functions.

On top of normal software development operations, new AI tools demand massive bandwidth and consistent uptime. Finally, our hybrid networks feature paths integrated with Microsoft Azure that consume on-premises resources, so they also stand to benefit from increased resilience.

A catastrophic network outage can cause incredible damage to all of these business functions. In fact, we experienced exactly that in 2022.

A fiber cut combined with a ROADM system hardware reboot caused a five-minute outage at our Puget Sound metro region. In this environment, every minute of lost connectivity can result in significant financial impact, making network resilience absolutely essential.

“We don’t want even a second of downtime,” says Vinoth Elangovan, senior network engineer, who designed and implemented the Zero Trust Optical BCDR network for Microsoft. “We needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.”

Delivering greater network resilience

To ensure we could deliver uninterrupted network connectivity even in the midst of a catastrophic outage, we needed to consider the technical demands of a truly resilient system. Five design pillars helped us assemble our architectural criteria:

  1. Independent optical systems: To provide true resilience, our primary and BCDR platforms needed to operate autonomously.
  2. Physically independent paths: Circuits should avoid shared conduits, fibers, and splices to operate completely independently.
  3. Separate control software: The primary and backup networks should operate through dedicated network management systems (NMSs), automation, and provisioning domains.
  4. Unified client interface: Both systems needed to terminate into the same interface to unify service for clients and applications.
  5. Survivability by design: We couldn’t assume that any system would be immune to failure. Instead, we designed the network so that service survives the complete loss of either system.

The result was the Zero Trust Optical BCDR architecture, a layered approach to optical networking. It consists of our primary, ROADM-based transport layer and a secondary, MUX-based transport layer, both terminating into a single logical port channel.

“Our core responsibility is the employee experience, so our main design thrust was making sure service is seamless and uninterrupted—even during an outage.”

Vinoth Elangovan, senior network engineer, Hybrid Core Network Services, Microsoft Digital

Both systems are live and active, which means they deliver production services through their own independent fibers, power supplies, and software stacks. By layering fully independent optical domains and logically unifying them at the Ethernet edge, the network can sustain a complete failure of one system and maintain continuity.

That physical and operational independence is the difference between simple redundancy and robust resilience.

“Our core responsibility is the employee experience, so our main design thrust was making sure it’s seamless and uninterrupted—even during an outage,” Elangovan says.

Optical network backed by a BCDR network

A schematic of an optical network running between different nodes and backed up by a BCDR network.
The optical network in our Puget Sound region connects core sites to labs, datacenters, and the internet edge, while the BCDR network provides backup connections to deliver resilience in case of a catastrophic network failure.

A typical ROADM optical network connects campus and data center sites to the internet edge. Our design features three interconnected optical rings, with two internet edges as multi-directional nodes, while other sites operate as dual-degree nodes with bidirectional redundancy. Meanwhile, our campuses and datacenters are designated as critical sites and equipped with Optical BCDR links to ensure enhanced resiliency. In the event of a complete Optical ROADM line failure, these critical sites retain connectivity.

In the event of an outage on the primary network, the port channel maintains forwarding continuity automatically, shifting WAN traffic between optical paths in real time.

The transition occurs seamlessly and transparently, with no noticeable impact to clients.

A photo of Martin.

“Our initial goal was to provide high-throughput connectivity for major labs, with less than six minutes of downtime per year. That represents a service level of 99.999% network continuity, and we’re aiming for even better moving forward.”

Blaine Martin, principal engineering manager, Hybrid Core Network Services, Microsoft Digital

Coupling at the Ethernet layer provides clients and applications with one logical interface, automatic load balancing and traffic distribution, and seamless failover, regardless of which optical domain is providing service.

“Our initial goal was to provide high-throughput connectivity for major labs, with less than six minutes of downtime per year,” says Blaine Martin, principal engineering manager for Hybrid Core Network Services in Microsoft Digital. “That represents a service level of 99.999% network continuity, and we’re aiming for even better moving forward.”
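
If you want to sanity-check that figure, the downtime budget implied by an availability target is simple arithmetic. Here’s a minimal sketch in Python (generic math, not one of our internal tools):

```python
# Downtime budget implied by an availability target (standard arithmetic,
# not a Microsoft tool): minutes of allowable downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

print(downtime_budget_minutes(0.99999))  # ~5.26 minutes, i.e. "less than six minutes"
```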

A new era of confidence for network engineers

For the network engineers who keep Microsoft employees and resources connected, the Zero Trust Optical BCDR network relieves much of the pressure that comes from resolving outages.

“Before, we were dependent on a single system, even with redundancies, so the human experience was like firefighting. Now, if the primary optical network is having a problem, I don’t even see it.”

Kevin Bullard, principal cloud network engineering manager, Microsoft Digital

When a network goes down, engineers have an enormous set of responsibilities to manage: processing the incident report, assigning severity, performing checks, notifying internal teams, providing updates, and engaging with physical support teams—all with a profound urgency to restore productivity.

Dialing those pressures back has been a huge benefit.

“Before, we were dependent on a single system, even with redundancies, so the human experience was like firefighting,” says Kevin Bullard, Microsoft Digital principal cloud network engineering manager responsible for maintaining WAN interconnectivity between labs. “Now, if the primary optical network is having a problem, I don’t even see it.”

There will always be pressure on network engineers to restore connectivity during an outage, but they can breathe easier knowing it won’t cost the company millions of dollars as the time to resolve ticks away. And in non-emergency situations like core site migrations, the BCDR network provides a much easier way to shunt services while the main network is offline.

“Our internal users have become more confident that they can stay connected, no matter what,” says Chakri Thammineni, principal cloud network engineer for Infrastructure and Engineering Services in Microsoft Digital. “That gives the people responsible for maintaining our enterprise networks incredible peace of mind.”

Fortunately, there hasn’t been a substantial network outage in the Puget Sound metro area since 2022. But our network engineering teams know that if and when it happens, the BCDR network will be ready to maintain service continuity.

A photo of Alverio.

“We’re always looking ahead into industry trends to stay at the bleeding edge, whether that’s in the technology we provide for our customers or the networks we use to do our own work.”

Patrick Alverio, principal group software engineering manager, Infrastructure and Engineering Services, Microsoft Digital

With our Puget Sound network protected, we have plans in place to extend this model to other metro areas. Naturally, we have to balance user population, site criticality, and the knowledge that elevated reliability and availability come at a cost.

Our selection criteria for new BCDR networks have largely centered around two factors: expansions of AI-critical infrastructure and concentrations of secure access workspaces (SAWs) for technical employees. With these criteria in mind, we’re planning new BCDR networks first in the Bay Area and Dublin, then in Virginia, Atlanta, and London.

Zero Trust optical BCDR architecture represents a paradigm shift in enterprise network resilience, and we’re committed to expanding the model to benefit both conventional workloads and the expanding infrastructure demands of AI.

“We’re always looking ahead into industry trends to stay at the bleeding edge, whether that’s in the technology we provide for our customers or the networks we use to do our own work,” Alverio says. “We refuse to accept the status quo, and we’re elevating the experience for employees across Puget Sound and Microsoft as a whole.”

Driving AI innovation in optical network resilience

Our journey towards an AI-driven optical network is gaining momentum.

As part of our Secure Future Initiative, we’ve automated our Optical Management Platform credential rotation and are actively developing intelligent incident management ticket enrichment, auto-remediation, link provisioning, deployment validation, and capacity planning.

AI plays a central role in this transformation.

With Microsoft 365 Copilot and GitHub Copilot integrated into our engineering workflows, we’re accelerating development cycles, improving code accuracy, and uncovering optimization opportunities that would otherwise take hours of manual effort.

These Copilots are also helping our engineers analyze network patterns, simulate outcomes, and validate deployment logic before execution, reducing human error and strengthening our Zero Trust posture. Over time, we’re evolving toward a system where AI not only assists but proactively predicts potential disruptions, recommends remediations, and continuously learns from operational telemetry.

These advancements are paving the way for a future where our optical infrastructure can anticipate issues, recover faster, and operate with the agility and assurance expected in a Zero Trust environment.

Key takeaways

If you’re planning to implement your own optical and BCDR networks, consider these tips:

  • Understand the technical components of resilience: Independent optical systems, physically independent paths, separate control software, a unified client interface, and survivability by design are the key technical components of true resilience.
  • Plan from a preparedness and value perspective: Evaluate the critical points in your infrastructure and determine where you can get the most value out of resilient connectivity.
  • Ensure your teams have the right skillset: Carefully consider the right workforce to run those systems and be accountable for their operation.

The post Keeping our in-house optical network safe with a Zero Trust mentality appeared first on Inside Track Blog.

]]>
20611
Unleashing API-powered agents at Microsoft: Our internal learnings and a step-by-step guide http://approjects.co.za/?big=insidetrack/blog/unleashing-api-powered-agents-at-microsoft-our-internal-learnings-and-a-step-by-step-guide/ Thu, 02 Oct 2025 16:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=19793 Agentic AI is the frontier of the AI landscape. These tools show enormous promise, but harnessing their power isn’t always as straightforward as prompting a model or accessing data from Microsoft 365 apps. To reach their full potential in the enterprise, agents sometimes need access to data beyond Microsoft Graph. But giving them access to […]

The post Unleashing API-powered agents at Microsoft: Our internal learnings and a step-by-step guide appeared first on Inside Track Blog.

]]>
Agentic AI is the frontier of the AI landscape. These tools show enormous promise, but harnessing their power isn’t always as straightforward as prompting a model or accessing data from Microsoft 365 apps. To reach their full potential in the enterprise, agents sometimes need access to data beyond Microsoft Graph. But giving them access to that data relies on an extra layer of extensibility.

To meet these demands, many of our teams within Microsoft Digital, the company’s IT organization, have been experimenting with API-based agents. This approach combines the best of two worlds: accessing diverse apps and data repositories and eliminating the need to build an agent from the ground up.

We want to empower every organization to unlock the full power of agents through APIs. The lessons we’ve learned on our journey can help you get there.

The need for API-based agents

The vision for Microsoft 365 Copilot is to serve as the enterprise UX. Within that framework, agents serve as the background applications that streamline workflows and save our employees time.

For many users, the out-of-the-box access Copilot provides to Microsoft Graph is enough to support their work. It surfaces the data and content they need while providing a foundational orchestration layer with built-in capabilities around compliance, responsible AI, and more.

But there are plenty of scenarios that require access to other data sources.

“Copilot provides you with data that’s fairly static as it stands in Microsoft Graph,” says Shadab Beg, principal software engineering manager on our International Sovereign Cloud Expansion team. “If you need to query from a data store or want to make changes to the data, you’ll need an API layer.”

By using APIs to extend agents built on the Copilot orchestration layer, organizations can apply its reasoning capabilities to new data without the need to fine-tune their models or create new ones from scratch. The possibilities these capabilities unlock are driving a boom in API-based agents for key functions and processes.

“Cost is one of the most critical dimensions in how we design, deploy, and scale our solutions. Declarative API-driven agents in Microsoft 365 Copilot offer a path to unify agentic experiences while leveraging shared AI compute and infrastructure.”

A photo of Nasir.
Faisal Nasir, AI Center of Excellence and Data Council lead, Microsoft Employee Experience

In many ways, IT organizations like ours are the ideal places to implement API-based agents. Our teams are adept at creating and deploying internal solutions to solve technical challenges, and IT work is often about enablement and efficiency—exactly what agents do best.

“Cost is one of the most critical dimensions in how we design, deploy, and scale our solutions,” says Faisal Nasir, AI Center of Excellence and Data Council lead in Microsoft Employee Experience. “Declarative API-driven agents in Microsoft 365 Copilot offer a path to unify agentic experiences while leveraging shared AI compute and infrastructure. By aligning with core architectural principles such as efficiency, scalability, and sustainability, we can ensure these agents not only drive intelligent outcomes but also maximize value across service areas with minimal overhead.”

Learn more about our vision and strategy around deploying agents internally at Microsoft.

The Azure FinOps Budget Agent

Our Azure FinOps Budget Agent is a perfect example of a scenario for API-based agents.

The team responsible for managing our Microsoft Azure budget for IT services was looking for ways to reduce costs by 10–20 percent. To do that effectively, service and finance managers needed the ability to track their spending quickly, accurately, and easily.

The conventional approach to solving this problem would be creating a dashboard with access to the relevant data. The problem with a UI-based approach is that it tends to cater to specific personas, surfacing only the data they need while oversaturating everyone else with information that’s irrelevant to their work.

“Azure spend is basically the lifeline for our services,” says Faris Mango, principal software engineering manager for infrastructure and engineering services within Microsoft Digital. “Getting the information you need in a concise format that provides a nice, holistic view can be challenging.”

With the advent of generative AI and Microsoft 365 Copilot, the team knew that a natural language interface would be much more intuitive. The result was the Azure FinOps Budget Agent.

The team created the agent and the necessary APIs using Microsoft Visual Studio Code. Its tables and functions run on Azure Data Explorer, allowing the APIs and their consumers to access data almost instantaneously, thanks to its low latency and rapid read speeds.

The tool retrieves data by running Azure Data Factory pipelines that pull and transform data from three sources:

  • Our SQL Server for service budget and forecast data
  • Azure Spend for the actual spending amounts
  • Projected spending, a separate service stored in other Azure Data Explorer tables

Processing the information relies on our business logic’s join operations, followed by aggregations by fiscal year and service tree levels. These summarize the data per service, team group, service group, and organization.

After the back end processes the day’s data, it ingests the information into our Azure Data Explorer tables, which the agent accesses by calling Kusto functions (stored queries written in KQL, the query language for Azure Data Explorer). The outcome is very low latency: the agent typically returns results in under 500 milliseconds.
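
To make that last step concrete, here’s a minimal sketch of how an API layer might call a stored Kusto function using the azure-kusto-data Python SDK. The cluster, database, function, and column names are illustrative placeholders, not our actual resources:

```python
# Minimal sketch: calling a stored Kusto function from an API layer.
# Cluster, database, function, and column names are illustrative placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://finops-example.kusto.windows.net"  # hypothetical cluster URI
DATABASE = "BudgetAnalytics"                          # hypothetical database

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
client = KustoClient(kcsb)

# GetServiceBudgetSummary is a hypothetical stored function that joins budget,
# actuals, and projections and aggregates them by fiscal year and service tree level.
query = 'GetServiceBudgetSummary("FY26", "ServiceGroup")'
response = client.execute(DATABASE, query)

for row in response.primary_results[0]:
    print(row["ServiceName"], row["Budget"], row["Actuals"], row["Variance"])
```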

For users, the tool is stunningly simple: they open Copilot and navigate to the Azure FinOps Budget Agent.

The agent provides three core prompts at the very top of the interface: “My budgets,” “Service budget information,” and “Service group budget information.” Clicking on one of these pre-loaded prompts returns role-specific information around budget, forecasts, actuals, projections, and variance, all at a single glance. The interface even includes graphs to help people track spending visually.

If users are looking for more specific information, they can input their own queries. For example:

  • “Get me the monthly breakdown of service Azure Optimization Assessment analytics.”
  • “Find me the service in this tree with the highest budget.”
  • “Show me the Azure budget for our facilities reporting portal.”
  • “Which service deviates most from its budget forecasts?”

The Azure FinOps Budget Agent primarily serves two groups: service managers who directly oversee spend for Azure-based services and FinOps managers responsible for larger budget silos.

Mango is responsible for the internal UI that helps employees access parts of the Microsoft network. With 18–20K users per month, budgeting and forecasting are highly dynamic due to traffic fluctuations and the resourcing that supports them. He also oversees the internal portal that helps service engineers manage our networks. That tool is growing rapidly as we onboard more teams, so forecasting is anything but linear.

For both of these services, keeping close track of spending is essential. Mango finds himself checking the Azure FinOps Budget Agent about twice a month to gauge how his services are trending.

“It’s taking me less time to do analysis and come up with accurate numbers. And the enhanced user experience just feels more natural, like you’re asking questions conversationally rather than engaging with a dashboard.”

A photo of Mango.
Faris Mango, principal software engineering manager for infrastructure and engineering services, Microsoft Digital

For FinOps managers, the value is more high-level. They’re responsible for overseeing dozens of services with vast volumes of Azure usage across storage and compute, all while managing strict budgets. That requires constant vigilance.

Switching context from one dashboard to another to track different Azure management groups was a constant hassle for them. Now, they use the Azure FinOps Budget Agent to get an up-to-date view of the overall spend picture. It gives them a place to start. From there, they can drill down if they see any abnormalities.

“It’s taking me less time to do analysis and come up with accurate numbers,” Mango says. “And the enhanced user experience just feels more natural, like you’re asking questions conversationally rather than engaging with a dashboard.”

The arrival of the Azure FinOps Budget Agent is just one example of how agents take your context and get your people the answers they care about faster and at lower cost.

Benefits like these are spreading across teams throughout Microsoft. Overall, we’ve been able to save 10–12 percent of our overall Azure cost footprint for Microsoft Digital, and individual users are thrilled at the amount of time and effort they’re saving.

“Now the info is at people’s fingertips. The advantage of an agent is that users don’t have to understand a complex UI, so they can get quick answers and get back to work.”

A photo of Beg.
Shadab Beg, principal software engineering manager, International Sovereign Cloud Expansion

Five key strategies for building an API-based agent

After seeing what we’ve accomplished with API-based agents, you might be wondering how to put them into action at your organization. This step-by-step guide can help you get there.

An API-based agent needs to fulfill multiple requirements. It has to expose APIs, align with real user needs, integrate seamlessly with Microsoft 365 Copilot, and run reliably, efficiently, and at scale. Achieving those outcomes depends on five key strategies.

Start with user intent, not the API

Start by asking a simple but powerful question: What will users actually ask your agent? Instead of designing the API first, flip the process:

  • Gather real user queries to understand actual use cases.
  • Refine the queries using prompt engineering techniques to align them with expected AI behavior.
  • Design the API to provide structured responses to those refined queries.

By starting with user intent, you ensure your agent answers real user questions directly, avoids over-engineering unnecessary endpoints, and delivers meaningful results without excessive back-end processing.

“Now the info is at people’s fingertips,” Beg says. “The advantage of an agent is that users don’t have to understand a complex UI, so they can get quick answers and get back to work.”

Key learning: An API that doesn’t align with user intent won’t be effective—even if you design it well.

Design APIs for Microsoft 365 Copilot integration

It’s important to build an API schema that returns precise and structured data to make it easy for Copilot to consume. This ensures your APIs return data in a format that directly answers user queries. Copilot expects responses in under three seconds, so focus on optimizing API responses for low latency.

Once you have your list of key questions, design your API schema to return the exact data you need to answer those questions. Your goal should be to ensure every API response has a structure that makes it easy for Copilot to understand.
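
As a sketch of what that can look like in practice, the endpoint below returns a small, strongly typed budget summary using FastAPI and Pydantic. The route, fields, and values are illustrative assumptions rather than our actual internal API:

```python
# Minimal sketch of a structured, low-latency endpoint an agent could call.
# Route, field names, and values are illustrative; not the actual internal API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class BudgetSummary(BaseModel):
    service_name: str
    fiscal_year: str
    budget: float
    actuals: float
    forecast: float
    variance_pct: float  # positive means over budget

@app.get("/budget/{service_name}", response_model=BudgetSummary,
         summary="Return the budget summary for a single service")
def get_budget(service_name: str, fiscal_year: str = "FY26") -> BudgetSummary:
    # In practice this would read from a pre-aggregated, pre-cached store
    # so the response stays well within Copilot's latency expectations.
    return BudgetSummary(
        service_name=service_name,
        fiscal_year=fiscal_year,
        budget=120_000.0,
        actuals=95_500.0,
        forecast=118_200.0,
        variance_pct=-1.5,
    )
```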

Teach Microsoft 365 Copilot to call your API

Copilot needs to know how to call your API. Manifests and OpenAPI descriptions provide that instruction.

Create detailed OpenAPI documentation and plugin manifests so Copilot knows what your API does, how to invoke it, and what responses to expect. You’ll likely need to adjust these files through a process of trial and error.
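
For reference, here’s an illustrative fragment of such an OpenAPI description for the endpoint sketched above. Manifests are normally authored as JSON or YAML; we show it as a Python dictionary only to stay consistent with the other sketches in this post:

```python
# Illustrative fragment of an OpenAPI description for the endpoint sketched above.
# Normally authored as JSON or YAML; shown as a Python dict for consistency here.
openapi_fragment = {
    "openapi": "3.0.1",
    "info": {
        "title": "Budget API",
        "description": "Returns budget, actuals, forecast, and variance per service.",
        "version": "1.0.0",
    },
    "paths": {
        "/budget/{service_name}": {
            "get": {
                "operationId": "getBudgetSummary",
                # Clear, intent-oriented descriptions help the orchestrator decide
                # when this operation answers a user's question.
                "description": "Get the current fiscal-year budget summary for one "
                               "service, including actual and projected spend and variance.",
                "parameters": [
                    {"name": "service_name", "in": "path", "required": True,
                     "schema": {"type": "string"}},
                ],
            }
        }
    },
}
```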

Scale APIs for performance and reliability

Once you have your schema and integration in place, it’s time to move on to the primary engineering challenge: making your API scalable, efficient, and reliable.

Prioritize the following goals:

  • Fast response times: Copilot expects quick answers.
  • High scalability: The API should keep performing as query volume and data grow.
  • Reliable uptime: If the API goes down, the agent goes down with it, so the system needs to remain robust.

We recommend setting a strict latency budget for your data-retrieval APIs, since Copilot needs additional time to generate its response. Existing API endpoints often involve complex data joins rather than simply returning rows from data tables. This complexity can lead to longer processing times, particularly with intricate queries that involve multiple data stores.

To address these potential delays, pre-cache results to significantly enhance performance. This can help overcome the latency requirements imposed by Copilot.
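
A simple way to pre-cache is to keep recent results in memory and only hit the slow path when they expire. The sketch below is a generic time-based cache pattern, not our actual caching layer:

```python
# Minimal sketch of a time-based (TTL) cache in front of an expensive query.
# Generic pattern only; not the team's actual caching layer.
import time
from functools import wraps

def ttl_cache(ttl_seconds: float):
    """Cache a function's results for ttl_seconds, keyed by its arguments."""
    def decorator(func):
        store = {}  # key -> (expires_at, value)

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            entry = store.get(args)
            if entry and entry[0] > now:
                return entry[1]            # fresh cached value: sub-millisecond
            value = func(*args)            # slow path: run the real query
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=15 * 60)  # pick a TTL that matches how often the source data changes
def budget_summary(service_name: str, fiscal_year: str):
    ...  # expensive joins across budget, actuals, and projection tables
```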

At this point, you’ll see why starting with user intent and iteratively refining API design is important. By grounding your work in user behaviors, you’ll align with the following best practices:

  • Structure your response to directly address user queries.
    Instead of just returning raw data, the API should provide meaningful insights Copilot can interpret. Prompt engineering marries user intent with the most understandable API schema.
  • Keep your API flexible enough to adapt to evolving business needs.
    Real-world workflows change over time, and an API should be able to support those changes without massive refactoring.
  • Avoid performance bottlenecks caused by unnecessary complexity.
    Understanding the exact data requirements up front prevents heavy joins, excessive filtering, and inefficient data retrieval logic.
  • Optimize for Copilot’s real-time response constraints.
    With a strict limit on latency, consider pre-optimization techniques like pre-caching results and simplifying query logic from the very beginning of your API implementation.

If you attempt to build a scalable, reliable API without first understanding how users will interact with your agent, you’ll spend months reworking the schema, debugging inefficiencies, and struggling with integration challenges.

Key learning: A fast, scalable, and reliable API isn’t just about technical optimization. It starts with a deep understanding of the questions it needs to answer and how to structure responses so Copilot can interpret them correctly.

Consider compliance and responsible AI

Unlike custom agents or OpenAI API integrations, knowledge-only agents require far less effort to meet Microsoft’s Responsible AI Standard. Microsoft tools’ built-in compliance capabilities handle much of the complexity. As a result, you can focus on efficiency and optimization rather than regulatory hurdles.

“Agent-based automation must balance speed with responsibility,” Nasir says. “We embed compliance, cost control, and telemetry from the start, so our systems don’t just scale, they mature.”

Key learning: It’s helpful to revisit your existing compliance, governance, and responsible AI processes and policies before implementing AI solutions. Copilot adheres to protective structures within your Microsoft technology ecosystem, so this process will ensure you’re starting from the most secure position.

APIs and the agentic future

Building API-based agents is more than just an integration exercise. It’s about creating scalable, intelligent, and compliant AI-driven workflows. By aligning your API design with user intent, you set Microsoft 365 Copilot free to retrieve and interpret information accurately. That leads to a seamless AI experience for your employees.

Thanks to Copilot’s built-in security and compliance features, API-based Copilot agents are some of the most efficient, compliant, and enterprise-ready ways to deploy AI solutions. They represent another step into an AI-first future tailored to your employees’ and organization’s needs.

Tools like API-based agents democratize the information we all need to do our jobs better, because we’re all getting the same data from the same place. This is why an AI-first mindset is actually human-first.

Key takeaways

Here are some things to keep in mind when designing agent-powered experiences that are fast, reliable, and aligned with user expectations.

  • Response time is key. Favor single, low-latency API calls that satisfy both Copilot’s technical requirements and users’ needs.
  • Consider the source. Data has to be high-quality on the backend. It’s worth reviewing your data and ensuring the hygiene is good.
  • Agents and APIs need to align. Design with task-centric, well-structured agents. Determine your high-level goals, then use the OpenAI standard, OpenAPI, or graph schemas to describe task endpoints. Define each API’s capability, input schema, and expected outcome very clearly.
  • Plan ahead to avoid surprises. Design your APIs to minimize potential side effects, and pay particular attention to natural-language-to-API mapping, because that’s the biggest change in methodology.
  • Design for visibility. Agents need to be observable and explainable, so implement metrics-driven monitoring. Having API-level telemetry in addition to Copilot-level telemetry enables continuous improvement.

The post Unleashing API-powered agents at Microsoft: Our internal learnings and a step-by-step guide appeared first on Inside Track Blog.

]]>
19793
Modernizing IT infrastructure at Microsoft: A cloud-native journey with Azure http://approjects.co.za/?big=insidetrack/blog/modernizing-it-infrastructure-at-microsoft-a-cloud-native-journey-with-azure/ Thu, 04 Sep 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=20125 Engage with our experts! Customers or Microsoft account team representatives from Fortune 500 companies are welcome to request a virtual engagement on this topic with experts from our Microsoft Digital team. At Microsoft, we are proudly a cloud-first organization: Today, 98% of our IT infrastructure—which serves more than 200,000 employees and incorporates over 750,000 managed […]

The post Modernizing IT infrastructure at Microsoft: A cloud-native journey with Azure appeared first on Inside Track Blog.

]]>

Engage with our experts!

Customers or Microsoft account team representatives from Fortune 500 companies are welcome to request a virtual engagement on this topic with experts from our Microsoft Digital team.

At Microsoft, we are proudly a cloud-first organization: Today, 98% of our IT infrastructure—which serves more than 200,000 employees and incorporates over 750,000 managed devices—runs on the Microsoft Azure cloud.

The company’s massive transition from traditional datacenters to a cloud-native infrastructure on Azure has fundamentally reshaped our IT operations. By adopting a cloud-first, DevOps-driven model, we’ve realized significant gains in agility, scalability, reliability, operational efficiency, and cost savings.

“We’ve created a customer-focused, self-serve management environment centered around Azure DevOps and modern engineering principles,” says Pete Apple, a technical program manager and cloud architect in Microsoft Digital, the company’s IT organization. “It has really transformed how we do IT at Microsoft.”

“Our service teams don’t have to worry about the operating system. They just go to a website, fill in their info, add their data, and away they go. That’s a big advantage in terms of flexibility.”

Apple is shown in a portrait photo.
Pete Apple, technical program manager and cloud architect, Microsoft Digital

What it means to move from the datacenter to the cloud

Historically, our IT environment was anchored in centralized, on-premises datacenters. The initial phase of our cloud transition involved a lift-and-shift approach, migrating workloads to Azure’s infrastructure as a service (IaaS) offerings. Over time, the company evolved toward more of a decentralized, platform as a service (PaaS) DevOps model.

“In the last six or seven years we’ve seen a lot more focus on PaaS and serverless offerings,” says Faisal Nasir, a principal architect in Microsoft Digital. “The evolution is also marked by extensibility—the ability to create enterprise-grade applications in the cloud—and how we can design well-architected end-to-end services.”

Because we’ve moved nearly all our systems to the cloud, we have a very high level of visibility into our network operations, according to Nasir. We can now leverage Azure’s native observability platforms, extending them to enable end-to-end monitoring, debugging, and data collection on service usage and performance. This capability supports high-quality operations and continuous improvement of cloud services.

“Observability means having complete oversight in terms of monitoring, assessments, compliance, and actionability,” Nasir says. “It’s about being able to see across all aspects of our systems and our environments, and even from a customer lens.”

Decentralizing our IT services with Azure

As Microsoft was becoming a cloud-first organization, the nature of the cloud and how we use it changed. As Microsoft Azure matured and more of our infrastructure and services moved to the cloud, we began to move away from IT-owned applications and services.

The strength of Azure’s self-service and management features means that individual business groups can handle many of the duties that Microsoft Digital formerly offered as an IT service provider—which enables each group to build agile solutions to match their specific needs.

“Our goal with our modern cloud infrastructure continues to be a solution that transforms IT tasks into self-service, native cloud solutions for monitoring, management, backup, and security across our entire environment,” Apple says. “This way, our business groups and service lines have reliable, standardized management tools, and we can still maintain control over and visibility into security and compliance for our entire organization.”

The benefits to our businesses of this decentralized model of IT services include:

  • Empowered, flexible DevOps teams
  • A native cloud experience: subscription owners can use features as soon as they’re available
  • Freedom to choose from marketplace solutions
  • Minimal subscription limit issues
  • Greater control over groups and permissions
  • Better insights into Microsoft Azure provisioning and subscriptions
  • Business group ownership of billing and capacity management

“With the PaaS model, and SaaS (software as a service), it’s more DIY,” Apple says. “Our service teams don’t have to worry about the operating system. They just go to a website, fill in their info, add their data, and away they go. That’s a big advantage in terms of flexibility.”

“The idea of centralized monitoring is gone. The new approach is that service teams monitor their own applications, and they know best how to do that.”

Delamarter is shown in a portrait photo.
Cory Delamarter, principal software engineering manager, Microsoft Digital

Leveraging the power of Azure Monitor

Microsoft Azure Monitor is a comprehensive monitoring solution for collecting, analyzing, and responding to monitoring data from cloud and on-premises environments. Across Microsoft, we use Azure Monitor to ensure the highest level of reliability for our services and applications.

Specifically, we rely on Azure Monitor to:

Create visibility. There’s instant access to fundamental metrics, alerts, and notifications across core Azure services for all business units. Azure Monitor also covers production and non-production environments as well as native monitoring support across Microsoft Azure DevOps.

Provide insight. Business groups and service lines can view rich analytics and diagnostics across applications and their compute, storage, and network resources, including anomaly detection and proactive alerting.

Enable optimization. Monitoring results help our business groups and service lines understand how users are engaging with their applications, identify sticking points, develop cohorts, and optimize the business impact of their solutions.

Deliver extensibility. Azure Monitor is designed for extensibility to enable support for custom event ingestion and broader analytics scenarios.

Because we’ve moved to a decentralized IT model, much of the monitoring work has moved to the service team level as well.

“The idea of centralized monitoring is gone,” says Cory Delamarter, a principal software engineering manager in Microsoft Digital. “The new approach is that service teams monitor their own applications, and they know best how to do that.”
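
In practice, that self-serve monitoring can be as lightweight as a service team querying its own Log Analytics workspace. Here’s a minimal sketch using the azure-monitor-query SDK; the workspace ID and query are placeholders:

```python
# Minimal sketch: a service team querying its own Log Analytics workspace.
# Workspace ID and query are placeholders, not real Microsoft Digital resources.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = """
AppRequests
| where Success == false
| summarize FailedRequests = count() by bin(TimeGenerated, 15m)
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<your-workspace-guid>",  # placeholder
    query=query,
    timespan=timedelta(hours=24),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```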

Patching and updating, simplified

Moving our operations to the cloud also means a simpler and more automated approach to patching and updating. The shift to PaaS and serverless networking has allowed us to manage infrastructure patching centrally, which is much more scalable and efficient. The extensibility of our cloud platforms reduces integration complexity and accelerates deployment.

“It depends on the model you’re using,” Nasir says. “With the PaaS and serverless networks, the service teams don’t need to worry about patching. With hybrid infrastructure systems, being in the cloud helps with automation of patching and updating. There’s a lot of reusable automation layers that help us build end-to-end patching processes in a faster and more reliable manner.”

Apple stresses the flexibility that this offers across a large organization when it comes to allowing teams to choose how they do their patching and updating.

“In the datacenter days, we ran our own centralized patching service, and we picked the patching windows for the entire company,” Apple says. “By moving to more automated self-service, we provide the tools and the teams can pick their own patching windows. That also allowed us to have better conversations, asking the teams if they want to keep doing the patching or if they want to move up the stack and hand it off to us. So, we continue to empower the service teams to do more and give them that flexibility.”

Securing our infrastructure in a cloud-first environment

As security has become an absolute priority for Microsoft, it’s also been a foundational element of our cloud strategy.

Being a cloud-first company has made it easier to be a security-first organization as well.

“The cloud enables us to embed security by design into everything we build,” Nasir says. “At enterprise scale, adopting Zero Trust and strong governance becomes seamless, with controls engineered in from the start, not retrofitted later. That same foundation also prepares us for an AI-first future, where resilience, compliance, and automation are built into every system.”

Cloud-native security features combined with integrated observability allow for better compliance and risk management. Delamarter agrees that the cloud has had huge benefits when it comes to enhancing network security.

“Our code lives in repositories now, and so there’s a tremendous amount of security governance that we’ve shifted upstream, which is huge,” Delamarter says. “There are studies that show that the earlier you can find defects and address them, the less expensive they are to deal with. We’re able to catch security issues much earlier than before.”

“There are less and less manual actions required, and we’re automating a lot of business processes. It basically gives us a huge scale of automation on top of the cloud.”

Nasir is shown in a portrait photo.
Faisal Nasir, principal architect, Microsoft Digital

We use Azure Policy, which helps enforce organizational standards and assess compliance at scale using dashboards and other monitoring tools.

“Azure Policy was a key part of our security approach, because it essentially offers guardrails—a set of rules that says, ‘Here’s the defaults you must use,’” Apple says. “You have to use a strong password, for example, and it has to be tied to an Azure Active Directory ID. We can dictate really strong standards for everything and mandate that all our service teams follow these rules.”
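
To give a flavor of those guardrails, here’s an illustrative Azure Policy rule. Policy definitions are authored as JSON; we show the rule as a Python dictionary for consistency with the other sketches, and the specific rule is an example rather than one of our internal policies:

```python
# Illustrative Azure Policy rule: deny storage accounts that allow plain HTTP.
# Policy definitions are authored as JSON; shown as a Python dict for consistency.
policy_rule = {
    "if": {
        "allOf": [
            {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
            {"field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly",
             "equals": "false"},
        ]
    },
    "then": {"effect": "deny"},
}
```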

AI-driven operations in the cloud

Just like its impact on the rest of the technology world, AI is in the process of transforming infrastructure management at Microsoft. Tasks that used to be manual and laborious are being automated in many areas of the company, including network operations.

“AI is creating a new interface of agents that allow users to interact with large ecosystems of applications, and there’s much easier and more scalable integration,” says Nasir. “There are less and less manual actions required, and we’re automating a lot of business processes. Microsoft 365 Copilot, Security Copilot, and other AI tools are giving us shared compute and extensibility to produce different agents. It basically gives us a huge scale of automation on top of the cloud.”

Apple notes that powerful AI tools can be combined with the incredible amount of data that the Microsoft IT infrastructure generates to gain insights that simply weren’t possible before.

“We can integrate AI with our infrastructure data lakes and use tools like Network Copilot to query the data using natural language,” Apple says. “I can ask questions like, ‘How many of our virtual machines need to be patched?’ and get an answer. It’s early, and we’re still experimenting, but the potential to interact with this data in a more automated fashion is exciting.”

Ultimately, Microsoft has become a cloud-first company, and that has allowed us to work toward an AI-first mentality in everything we do.

“Having a complete observability strategy across our infrastructure modernization helps us to make sure that whatever changes we’re making, we have a design-first approach and a cloud-first mindset,” Nasir says. “And now that focus is shifting towards an AI-first mindset as well.”

Key takeaways

Here are some of the benefits we’ve accrued by becoming a cloud-first IT organization at Microsoft:

  • Transformed operations: By moving from our legacy on-premises datacenters, through Azure’s infrastructure as a service (IaaS) offerings, and eventually to a platform as a service (PaaS) DevOps model, we’ve reaped great gains in reliability, efficiency, scalability, and cost savings.
  • A clear view: With 98% of our organization’s IT infrastructure running in the Azure cloud, we have a huge level of observability into our systems—complete oversight into network assessment, monitoring, compliance, patching/updating, and many other aspects of operations.
  • Empowered teams: Operating a cloud-first environment allows us to have a more decentralized approach to IT infrastructure. This means we can offer our business groups and service lines more self-service, cloud-native solutions for monitoring, management, patching, and backup while still maintaining control over and visibility into security and compliance for our entire organization.
  • Seamless updates: The shift to PaaS and serverless networking has enabled a more planned and automated approach to patching and updating our infrastructure, which produces greater efficiency, integration, and speed of deployment.
  • Dependable security: Our cloud environment has allowed us to implement security by design, including tighter control over code repositories and the use of standard security policies across the organization with Azure Policy.
  • Future-proof infrastructure: As we shift to an AI-first mindset across Microsoft, we’re using AI-driven tools to enhance and maintain our native cloud infrastructure and adopt new workflows that will continue to reap dividends for our employees and our organization.  

The post Modernizing IT infrastructure at Microsoft: A cloud-native journey with Azure appeared first on Inside Track Blog.

]]>
20125
Smarter labs, faster fixes: How we’re using AI to provision our virtual labs more effectively http://approjects.co.za/?big=insidetrack/blog/smarter-labs-faster-fixes-how-were-using-ai-to-provision-our-virtual-labs-more-effectively/ Thu, 24 Jul 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=19628 Microsoft Digital stories Providing technical support at an enterprise of our size here at Microsoft is a constant balancing act between speed, quality, and scalability. Systems grow more complex, user expectations continue to rise, and traditional support models often struggle to keep up. Beyond the usual apps and systems everyone uses, many of our employees […]

The post Smarter labs, faster fixes: How we’re using AI to provision our virtual labs more effectively appeared first on Inside Track Blog.

]]>

Providing technical support at an enterprise of our size here at Microsoft is a constant balancing act between speed, quality, and scalability. Systems grow more complex, user expectations continue to rise, and traditional support models often struggle to keep up. Beyond the usual apps and systems everyone uses, many of our employees require virtual provisioning for diverse tasks across our businesses. Supporting these virtualized environments is a special challenge.

To meet the growing demand for virtual lab usage across the organization, we turned to AI, not just to automate support responses but to fundamentally rethink how user issues are identified and resolved. This vision came to life through the MyWorkspace platform, where we in Microsoft Digital, the company’s IT organization, introduced a domain-specific AI assistant to streamline how we empower our employees to deploy new virtual labs.

The results have been dramatic: what was once a slow, manual process is now fast, efficient, and frictionless.

But the benefits extend beyond faster resolution times. This transformation represents a new approach to enterprise support—one that uses AI not just as a tool for efficiency, but as a strategic enabler. By embedding intelligence into the support experience, we’re turning complexity into a competitive advantage.

Scaling support in a high-demand environment

MyWorkspace is our internal platform for provisioning virtual labs. Originally developed to support internal testing, diagnostics, and development environments, it has since grown into a critical resource used by thousands of engineers and support personnel across the company.

Scaling the platform infrastructure was straightforward—adding capacity for tens of thousands of virtual labs was a technical challenge we could solve with ease, thanks to our Microsoft Azure backbone. As usage grew, the real strain didn’t show up in CPU load or storage limits, but rather in the support queue—every few months, a new wave of users was onboarded to MyWorkspace: partner teams, internal engineers, and external vendors. These new users, often with minimal experience with the platform, needed fast access and guidance from support.

The questions, though simple, piled up quickly.

Several Tier 1 support engineers repeatedly encountered the same questions from users, such as how to start a lab, what an error meant, and which lab to use for a particular test. These weren’t complex technical issues—they were basic, repetitive onboarding questions that represented a huge opportunity to introduce automation.

“We also found that there were a lot of users who found more niche issues, and those issues had been solved either by our community or by ourselves. In fact, we had a dedicated Teams channel specific to customer issues, and we found that there was a lot of repetition and that other customers were facing similar issues, and we did have a bit of a knowledge base in terms of how to solve these issues.”

A photo of Deans.
Joshua Deans, software engineer, Microsoft Digital

Unblocking a bottleneck with AI

Our support team set out to tackle a familiar but costly challenge: high volumes of low-complexity tickets that consumed valuable time without delivering meaningful impact. Instead of treating this as an unavoidable burden, we saw an opportunity to turn it into a self-scaling solution. If the same questions were being asked repeatedly—and the answers already existed in documentation, internal threads, or institutional knowledge—then an AI system should be able to surface those answers instantly, without human intervention.

“We also found that there were a lot of users who found more niche issues, and those issues had been solved either by our community or by ourselves,” says Joshua Deans, a software engineer within Microsoft Digital. “In fact, we had a dedicated Teams channel specific to customer issues, and we found that there was a lot of repetition and that other customers were facing similar issues, and we did have a bit of a knowledge base in terms of how to solve these issues.”

That insight led the MyWorkspace team to begin building what would become a transformative AI assistant: an automated support layer purpose-built for the MyWorkspace platform. Unlike traditional chatbots that rely on scripted responses or rigid decision trees, this assistant would leverage generative AI trained on a rich dataset of real-world support conversations, internal FAQs, and official documentation.

“So that’s where we found this opportunity to turn this scaling challenge into a scaling advantage, with help from AI. We took all those historical conversations of tier one staff helping new users—trained our AI to provide user education based on that training—and saved our Tier 1 staff from answering potential tickets.”

Vikram Dadwal, principal software engineering manager, Microsoft Digital

The result was a context-aware, responsive system capable of resolving common issues in seconds—not hours or days—dramatically easing the load on support teams while improving the user experience.

Built on Azure and Semantic Kernel

MyWorkspace’s core infrastructure is fully built on Azure services. At any given moment, it manages tens of thousands of virtual machines, scaling up and down with demand. That elasticity, combined with our internal developer tooling and AI orchestration capabilities, provided the perfect environment for an AI-powered support layer.

“So that’s where we found this opportunity to turn this scaling challenge into a scaling advantage, with help from AI,” says Vikram Dadwal, a principal software engineering manager within Microsoft Digital. “We took all those historical conversations of tier one staff helping new users—trained our AI to provide user education based on that training—and saved our Tier 1 staff from answering potential tickets.”

To build the assistant, the team used Microsoft’s open-source framework, Semantic Kernel. Designed for generative AI integration, Semantic Kernel allows engineers to create prompt-driven, modular systems that can interact with large language models (LLMs) without vendor lock-in.

This approach gave the team several advantages:

  • Flexibility in choosing and switching between LLM providers.
  • Fine-grained control over how prompts were structured and updated.
  • Extensibility through plugins and actions that tie the assistant into the broader ecosystem.

Crucially, the assistant was designed to be part of the platform’s architecture, capable of operating at the same level of scale and responsiveness as the labs it supported. Also, the assistant was initialized with a well-scoped system prompt, limiting its responses strictly to the MyWorkspace domain.
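
We don’t publish the actual prompt, but the scoping pattern looks roughly like the sketch below, which assumes the openai Python SDK’s AzureOpenAI client; the deployment name and prompt text are illustrative:

```python
# Minimal sketch of a domain-scoped assistant call.
# Assumes the openai Python SDK's AzureOpenAI client; names and prompt are illustrative.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

SYSTEM_PROMPT = (
    "You are the MyWorkspace support assistant. Answer only questions about "
    "provisioning, configuring, and troubleshooting MyWorkspace virtual labs. "
    "If a question is outside that scope, say so and point the user to general support."
)

def ask_assistant(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # deployment name is illustrative
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_question},
        ],
    )
    return response.choices[0].message.content

print(ask_assistant("How do I start a SharePoint hybrid lab?"))
```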

“On average, we measured these interactions at around 20 minutes from ticket submission to problem resolution. Now compare that with a 30-second AI interaction for resolving the same class of issues—that’s a 98% reduction in resolution time, a number we’ve validated with our support team and continue to track.”

Nathan Prentice, senior product manager, Microsoft Digital

Shifting from tickets to conversations

Whether users had questions about lab types, needed clarification on configuration details, or sought guidance during onboarding, the AI provided accurate, interactive responses without requiring human escalation. The experience was both faster and significantly better. Support engineers saw a noticeable reduction in repeat tickets, as common issues were resolved on the spot. Onboarding friction decreased, and users were confident that they could get the answers they needed instantly—no ticket, no delay, no need to track a support contact.

“On average, we measured these interactions at around 20 minutes from ticket submission to problem resolution,” says Nathan Prentice, a senior product manager within Microsoft Digital. “Now compare that with a 30-second AI interaction for resolving the same class of issues—that’s a 98% reduction in resolution time, a number we’ve validated with our support team and continue to track.”

Smart, interactive, and intuitive

Our Microsoft Digital team has recently implemented a new version of the MyWorkspace AI assistant that includes several major enhancements. The assistant now features adaptive cards, polished layouts, and a Microsoft 365 Copilot-aligned user experience, making it feel familiar and trustworthy for internal teams. It can also distinguish between a question and an action. If a user says, “Start a SharePoint lab,” it responds with an interactive card and begins provisioning, bridging the gap between passive support and active enablement.

“One of the primary bottlenecks we previously faced in creating an AI solution to address frequently asked user questions was the lack of technology capable of generating accurate answers for complex technical queries and understanding nuanced user input. With the availability of Azure OpenAI models, we were able to effectively overcome this challenge, enabling our AI solution to deliver precise and context-aware responses at scale.”

A photo of Nair.
Anjali Nair, senior software engineer, Microsoft Digital

To guide our employees and improve discoverability, the assistant offers recommended prompts—just like Copilot does—helping new users understand what they can ask and how to get started.

Users can now rate responses, giving a thumbs up or down. These signals are aggregated and reviewed by the engineering team, ensuring continuous improvement and fine tuning over time.

Intelligent provisioning with multi-agent orchestration 

At Microsoft Digital, we’re reimagining how labs are provisioned by integrating AI-driven intelligence into the process. Traditionally, users are expected to know exactly what kind of lab environment they need. But in complex virtualization and troubleshooting scenarios, these assumptions often fall short. Should a user troubleshooting hybrid issues with Microsoft Exchange spin up a basic Exchange lab, or one that includes Azure AD integration, conditional access policies, and hybrid connectors? To eliminate this guesswork, our team is building a multi-agent system powered by the Semantic Kernel SDK multi-agent framework. This system interprets the user’s support context—often expressed in natural language—and automatically provisions the most relevant lab environment.

For example, a user might say, “I’m seeing sync issues between SharePoint Online and on-prem,” and the assistant will orchestrate the creation of a tailored lab that replicates that exact scenario, enabling faster diagnosis and resolution.

With agent orchestration, each agent in the system is specialized: one might handle context interpretation, another lab configuration, and another cost optimization. These agents collaborate to ensure that the lab not only meets technical requirements but is also cost-effective. By leveraging telemetry and historical usage data, the system can recommend leaner configurations—such as using ephemeral VMs, auto-pausing idle resources, or selecting lower-cost SKUs—without compromising diagnostic fidelity. This intelligent provisioning framework is designed to scale, adapt, and continuously learn from usage patterns.

“One of the primary bottlenecks we previously faced in creating an AI solution to address frequently asked user questions was the lack of technology capable of generating accurate answers for complex technical queries and understanding nuanced user input,” says Anjali Nair, a senior software engineer within Microsoft Digital. “With the availability of Azure OpenAI models, we were able to effectively overcome this challenge, enabling our AI solution to deliver precise and context-aware responses at scale.”

With multi-agent orchestration, we’re taking a step towards a future where lab environments are not just automated, but intelligently orchestrated, context-aware, and cost-optimized—empowering engineers to focus on solving problems, not setting up infrastructure.

Scaling support without scaling headcount

The MyWorkspace assistant is a powerful example of how enterprise support can evolve through intelligence. By embedding AI into the support experience, we’ve turned complexity into a competitive edge—reshaping knowledge work and operations through AI’s problem-solving capabilities. As Microsoft advances as a Frontier Firm, MyWorkspace shows how we can scale support on demand, with intelligence built in. Routine queries are offloaded to AI, freeing Tier 1 teams to focus on critical issues and giving Tier 2 engineers space to innovate. Most importantly, support now scales with user demand—not headcount.

But this system does more than just respond—it learns. Every interaction becomes a data point. Each resolved issue feeds back to the assistant, sharpening its accuracy and expanding its knowledge. What started as a reactive Q&A tool is now growing into a proactive orchestrator that surfaces insights and points users to solutions, resolving issues before they ever become tickets.

“We have a lot more telemetry now, so users can provide feedback to our responses—for example, thumbs up or thumbs down feedback,” Deans says. “And we can actually view where the model is giving incorrect or inappropriate information, and we can use that to make adjustments to the prompt provided to the model.”

In this model, support becomes a seamless extension of the user experience. With the right AI architecture in place, it transforms a cost center into a strategic asset. The MyWorkspace assistant fulfills its role as an embedded, intelligent teammate—delivering answers, driving actions, and continuously improving over time.

Ultimately, our journey with MyWorkspace shows that meaningful AI adoption doesn’t have to begin with sweeping transformation. Sometimes, it starts with a helpdesk queue, a recurring issue, and the choice to build something smarter—something that learns, adapts, and empowers at every step.

Key takeaways

Here are some of our top insights from boosting our internal deployment of MyWorkspace with AI and continuous improvement.

  • Start small and specific. Focus on a defined domain—like MyWorkspace—and use existing support logs to train your assistant.
  • Invest in AI infrastructure. Tools like Semantic Kernel provide flexibility, especially in enterprise settings where vendor neutrality and customization matter.
  • Design for trust. Align your assistant’s UI with well-known systems like Microsoft Copilot to build user confidence.
  • Don’t wait for perfection. Launch a V1, gather feedback, and make improvements. AI assistants get better over time if you let them learn.
  • Think outside the ticket queue. The future isn’t just faster support—it’s intelligent, anticipatory systems that eliminate friction before it begins.

The post Smarter labs, faster fixes: How we’re using AI to provision our virtual labs more effectively appeared first on Inside Track Blog.

Securing the borderless enterprise: How we’re using AI to reinvent our network security http://approjects.co.za/?big=insidetrack/blog/securing-the-borderless-enterprise-how-were-using-ai-to-reinvent-our-network-security/ Thu, 10 Jul 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=19504 The modern enterprise network is complex, to say the least. Enterprises like ours are increasingly adopting hybrid infrastructures that span on-premises data centers, multiple cloud environments, and a diverse array of remote users. In this context, traditional security tools are still playing checkers while the malicious actors are playing chess. To make matters worse, attacks […]

The modern enterprise network is complex, to say the least.

Enterprises like ours are increasingly adopting hybrid infrastructures that span on-premises data centers, multiple cloud environments, and a diverse array of remote users. In this context, traditional security tools are still playing checkers while the malicious actors are playing chess. To make matters worse, attacks are increasingly enabled by AI tools.

That’s why here in Microsoft Digital, the company’s IT organization, we’re using a modern approach and toolset—including AI—to secure our network environment, turning complexity into clarity, one approach, tool, and insight at a time.

Leaving traditional network security behind

For years, traditional network security relied on a simple but increasingly outdated assumption: everything inside the corporate perimeter can be trusted. This model made sense when networks were static, users were on-premises, and applications lived in a centralized data center.

But that world is gone.

A photo of Venkatraman.

“Implicit trust must be replaced with explicit verification. That means rethinking how we monitor, how we respond, and how we design for resilience from the start.”

Raghavendran Venkatraman, principal cloud network engineering manager, Microsoft Digital

Today’s enterprise is dynamic, decentralized, and borderless. Hybrid work has become the norm. Cloud adoption is accelerating. Teams are globally distributed. Devices and data move constantly across environments. In this new reality, the network perimeter hasn’t just shifted—it has effectively vanished.

That’s where the cracks in legacy security models become impossible to ignore.

Visibility becomes fragmented. Security teams struggle to track what’s happening across a sprawling digital estate. Traditional monitoring tools focus on infrastructure uptime or device health—not on the actual experience of the people using the network. That disconnect creates blind spots, and blind spots create risk.

We know that this model no longer meets the needs of a modern, AI-powered enterprise. Every enterprise needs a new approach—one that assumes breach, enforces least-privilege access, and continuously verifies trust.

“Implicit trust must be replaced with explicit verification,” says Raghavendran Venkatraman, a principal cloud network engineering manager in Microsoft Digital. “That means rethinking how we monitor, how we respond, and how we design for resilience from the start.”

This shift is foundational to our security strategy. It’s not just about securing infrastructure—it’s about securing the experience. Because in a world where users, data, and threats are everywhere, trust has to be proved, not assumed.

Building a resilient and adaptive security strategy

To secure hybrid corporate networks effectively, organizations must go beyond traditional perimeter defenses. They need a comprehensive and adaptive security strategy—one that evolves with the threat landscape and aligns with the complexity of modern enterprise environments. The diversity of hybrid networks introduces new vulnerabilities and expands the attack surface. A static, one-size-fits-all approach simply doesn’t work anymore.

At Microsoft Digital, we’ve embraced a layered, cloud-first security model that integrates identity, access, encryption, and monitoring across every layer of the network. It’s embedded in everything we do. This model includes these key strategies, which we’ll expand upon in the following sections:

  • Adopting Zero Trust principles
  • Establishing identity as the new perimeter 
  • Integrating AI and machine learning
  • Isolating network segments to minimize risk
  • Embracing continuous monitoring

Adopting Zero Trust principles

Zero Trust Architecture (ZTA) operates on a strict principle: “never trust, always verify.” That means no user, device, or application—whether it’s inside or outside the corporate network—is inherently trusted, as it would be in the traditional network security model.

A photo of McCleery.

“Zero Trust isn’t a product—it’s a mindset. It’s about assuming breach and designing defenses that minimize impact and maximize resilience.”

Tom McCleery, principal group cloud network engineer, Microsoft Digital

Every access request is evaluated against dynamic policies. These policies consider several factors—like user identity, device health, location, and how sensitive the data being accessed is. For example, if an employee tries to access a financial report from a corporate laptop at the office, they might get in, no problem. But that same request from a personal device in another country could get blocked or trigger extra authentication steps.

At the heart of ZTA are policy enforcement points that authorize every data flow. These checkpoints only grant access when all conditions are met, and they log every interaction for auditing and threat detection. This kind of granular control reduces the attack surface and limits lateral movement if there is a breach.
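
As a rough illustration of how a policy enforcement point weighs those signals, consider the sketch below. The signal names, thresholds, and outcomes are placeholder assumptions; in practice this evaluation is handled by conditional access policies in the identity platform rather than custom code.

```python
# Hedged sketch of dynamic policy evaluation at a policy enforcement point.
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    STEP_UP_MFA = "require additional authentication"
    BLOCK = "block"

@dataclass
class AccessRequest:
    user_risk: str          # "low" | "medium" | "high"
    device_managed: bool
    device_compliant: bool
    location_trusted: bool
    data_sensitivity: str   # "public" | "internal" | "confidential"

def evaluate(req: AccessRequest) -> Decision:
    """Every request is verified against dynamic conditions; none is implicitly trusted."""
    if req.user_risk == "high" or not req.device_compliant:
        return Decision.BLOCK
    if req.data_sensitivity == "confidential" and (not req.device_managed or not req.location_trusted):
        return Decision.STEP_UP_MFA
    return Decision.ALLOW

# The financial-report scenario from the paragraph above, in code.
office_laptop = AccessRequest("low", True, True, True, "confidential")
personal_device_abroad = AccessRequest("low", False, True, False, "confidential")
print(evaluate(office_laptop))            # Decision.ALLOW
print(evaluate(personal_device_abroad))   # Decision.STEP_UP_MFA
```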

Adopting Zero Trust isn’t just a technical upgrade—it’s a strategic must. It boosts an organization’s ability to defend against modern threats like ransomware, insider attacks, and supply chain compromises.

“Zero Trust isn’t a product—it’s a mindset,” says Tom McCleery, a principal group cloud network engineer in Microsoft Digital. “It’s about assuming breach and designing defenses that minimize impact and maximize resilience.”

By embracing Zero Trust, we strengthen our security posture, lower the risk of data breaches, and respond more effectively to emerging threats.

Establishing identity as the new perimeter

Identity is no longer just a component of security—it has become the new perimeter. Traditional security models focused on defending the network edge, assuming that everything inside the perimeter could be trusted. But in today’s hybrid and cloud-first environments, the perimeter has dissolved and that assumption is outdated and dangerous. Users, devices, and applications now operate across diverse locations and platforms, making perimeter-based defenses insufficient.

Identity-first security shifts the focus from securing the physical network to securing the identities—both human and machine—that interact with the network. This means every access request is treated as though it originates from an untrusted source, regardless of where it comes from. Whether it’s a remote employee logging in from a personal device or an automated workload accessing cloud resources, the system must verify who or what is making the request, assess the risk, and enforce least-privilege access across the user experience.

This approach enables organizations to implement more granular access controls. For example, a developer might be allowed to access a code repository but not production systems, and only during business hours from a managed device. Similarly, a service account used by a Continuous Integration and Continuous Deployment (CI/CD) pipeline might be restricted to specific APIs and monitored for anomalous behavior. A CI/CD pipeline is an automated workflow that takes code from development through testing and into production.
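
Expressed as a sketch, least-privilege rules like these might look as follows. The identities, resources, and policy shape are hypothetical examples; in a real deployment these rules live in the identity platform and its workload identity features, not in application code.

```python
# Minimal sketch of identity-scoped, least-privilege access rules (illustrative only).
from datetime import datetime

POLICIES = {
    "dev-alice": {                                  # hypothetical developer identity
        "allowed_resources": {"code-repo"},
        "business_hours_only": True,
        "managed_device_required": True,
    },
    "svc-cicd-pipeline": {                          # hypothetical workload identity
        "allowed_resources": {"build-api", "artifact-store"},
        "business_hours_only": False,
        "managed_device_required": False,           # monitored for anomalies instead
    },
}

def is_allowed(identity: str, resource: str, managed_device: bool, now: datetime) -> bool:
    policy = POLICIES.get(identity)
    if policy is None or resource not in policy["allowed_resources"]:
        return False                                # default deny: least privilege
    if policy["managed_device_required"] and not managed_device:
        return False
    if policy["business_hours_only"] and not (9 <= now.hour < 17 and now.weekday() < 5):
        return False
    return True

print(is_allowed("dev-alice", "code-repo", True, datetime(2025, 7, 10, 10)))   # True
print(is_allowed("dev-alice", "production", True, datetime(2025, 7, 10, 10)))  # False
```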

By anchoring network security around verified identities, organizations reduce their attack surface and improve their ability to detect and respond to threats. This identity-centric model is not just a security enhancement—it’s a strategic shift that aligns with how modern enterprises operate.

Integrating AI and machine learning 

AI and machine learning (ML) are foundational pillars in our network security strategy. Intelligent automation and advanced analytics help us not only detect and respond to threats, but also continuously improve our security posture in an ever-changing landscape. Here’s how we’re using AI and ML in some critical aspects of our approach to modern network security:

  • Threat detection and intelligence. We deploy AI-powered monitoring tools that sift through billions of network signals and logs across our hybrid infrastructure. By applying sophisticated ML algorithms, we can identify abnormal behaviors such as unusual login attempts or unexpected data transfers that could indicate a potential breach. These insights allow our security teams to focus on the most critical alerts, reducing noise and accelerating incident investigation.
  • Automated response and containment. Through automation, our security systems can respond to threats in real time. For example, if our AI models detect suspicious activity on a device, automated workflows can immediately isolate the affected endpoint, block malicious traffic, or revoke access privileges, all without waiting for manual intervention. This rapid response capability is essential for minimizing the potential impact of attacks and protecting our critical assets.
  • Predictive analysis and proactive defense. We use predictive analytics to forecast emerging vulnerabilities before they can be exploited. By continuously training our models on the latest threat intelligence and attack patterns, we can anticipate risks and strengthen our defenses proactively—whether that means patching vulnerable systems, adjusting access controls, or updating our security policies.
  • User experience monitoring. We use AI to assess the real experience of our users, a critical measurement in a network environment where identity is the perimeter. By correlating performance metrics with security signals, we ensure that our security mechanisms don’t degrade productivity and that any anomalies impacting user experience are promptly addressed.
  • Continuous learning and improvement. Our AI and ML systems are designed to learn from every incident, adapt to new attack techniques, and evolve with the threat landscape. This continuous improvement loop enables our teams to stay ahead of sophisticated adversaries and maintain robust, resilient network security.

Advanced threats require advanced responses. By integrating AI and ML into our network security strategies, we’re enhancing our ability to detect and respond to threats swiftly, minimize potential damage, and foster a secure environment for innovation and collaboration across our global hybrid infrastructure.
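
To make the anomaly-detection idea from the threat detection bullet above tangible, here is a deliberately simple sketch. The z-score rule and sample data are stand-ins; our production systems rely on large-scale ML models trained over billions of signals.

```python
# Toy stand-in for ML-based anomaly detection: flag data-transfer volumes that
# deviate sharply from a per-user baseline.
import statistics

def is_anomalous(history_mb: list, today_mb: float, z_threshold: float = 3.0) -> bool:
    """Flag today's outbound transfer if it sits far outside the user's baseline."""
    mean = statistics.fmean(history_mb)
    stdev = statistics.pstdev(history_mb) or 1.0    # avoid divide-by-zero on flat baselines
    return (today_mb - mean) / stdev > z_threshold

baseline = [120, 95, 140, 110, 105, 130, 90]        # MB/day over the past week (sample data)
print(is_anomalous(baseline, 125))    # False: a normal day
print(is_anomalous(baseline, 2400))   # True: likely exfiltration, trigger containment
```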

Isolating networks to minimize risk

In a hybrid infrastructure, isolating network segments is a foundational security principle. By segmenting networks, we limit the scope of potential breaches and reduce the risk of lateral movement by attackers. For example, separating employee productivity networks from customer-facing systems ensures that if a vulnerability is exploited in one area, it doesn’t cascade across the entire environment.

This is especially critical in environments where sensitive customer data and internal development systems coexist. Our testing and development environments must remain completely isolated—not only from customer-facing services but also from internal productivity tools like email, collaboration platforms, and identity systems. This prevents test code or experimental configurations from inadvertently exposing production systems to risk.

We also establish policy enforcement points (PEPs) within each network segment. These act as control gates, inspecting and filtering traffic between zones. By placing PEPs at strategic boundaries, we can tightly control what moves between segments and detect anomalies early. This architecture ensures that, if a breach occurs, the “blast radius”—the scope of impact—is minimal and contained.

This layered approach to segmentation and isolation is essential for maintaining the integrity of our production systems, minimizing risk, and ensuring that our hybrid infrastructure remains resilient in the face of evolving threats.
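
A stripped-down way to think about those enforcement points is a default-deny allow-list between zones, as in the sketch below. The zone names and allowed flows are invented for illustration; in our environment this enforcement happens in network policy at the segment boundaries, not in application code.

```python
# Illustrative default-deny model for traffic between network segments.
ALLOWED_FLOWS = {
    ("corp-productivity", "saas-egress"),
    ("customer-facing", "prod-data"),
    # Note: no flows at all are permitted out of "test-dev" into other zones.
}

def flow_permitted(src_zone: str, dst_zone: str) -> bool:
    """Traffic crosses a segment boundary only if it is explicitly allowed."""
    return (src_zone, dst_zone) in ALLOWED_FLOWS

print(flow_permitted("corp-productivity", "saas-egress"))  # True
print(flow_permitted("test-dev", "prod-data"))             # False: blast radius contained
```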

Embracing continuous monitoring 

We’ve stopped thinking of monitoring as a one-time check. Now, it’s a continuous conversation with our network.

A photo of Singh.

“Conventional network performance monitoring—monitoring the systems and infrastructure that support our network—can only tell part of the story. To truly understand and meet our requirements, we must monitor user experiences directly.”

Ragini Singh, partner group engineering manager, Microsoft Digital

Continuous monitoring is how we stay ahead of issues before they impact our people. It’s how we keep our hybrid infrastructure resilient, performant, and secure—every second of every day.

We’ve built a monitoring ecosystem that spans our entire global network, from on-premises offices to cloud-based services in Azure and software-as-a-service (SaaS) platforms. With the mindset that identity is the new perimeter, we’re using signals from all aspects of our environment and focusing on the user experience.

“Conventional network performance monitoring—monitoring the systems and infrastructure that support our network—can only tell part of the story,” says Ragini Singh, a partner group engineering manager in Microsoft Digital. “To truly understand and meet our requirements, we must monitor user experiences directly.”

This isn’t just about tools and dashboards. It’s about insight. We’re using synthetic and native metrics to build a hop-by-hop view of the user experience. That lets us pinpoint where things go wrong—and fix them fast. We’re even layering in automation to enable self-healing responses when thresholds are breached.
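
As a simple illustration of a synthetic measurement with a self-healing hook, consider the sketch below. The endpoint, latency threshold, and remediation action are placeholders; real probes run hop by hop and feed a much richer automation pipeline.

```python
# Bare-bones synthetic probe with a self-healing hook (illustrative placeholders only).
import time
import urllib.request

def probe_latency_ms(url: str, timeout: float = 5.0) -> float:
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read(1)                        # first byte is enough for a latency sample
    return (time.perf_counter() - start) * 1000

def check_and_heal(url: str, threshold_ms: float = 500.0) -> None:
    latency = probe_latency_ms(url)
    if latency > threshold_ms:
        # Placeholder remediation: in practice this would open an incident or
        # trigger an automated runbook (for example, failing over to another path).
        print(f"{url}: {latency:.0f} ms exceeds {threshold_ms} ms, triggering remediation")
    else:
        print(f"{url}: {latency:.0f} ms OK")

check_and_heal("https://www.microsoft.com")
```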

Continuous monitoring is a strategic shift that helps us protect our people, power our services, and deliver the seamless experience our employees expect.

Looking to the future

As enterprises continue to navigate the complexities of hybrid infrastructures, securing enterprise networks requires an agile, multifaceted approach that integrates Zero Trust principles, identity-first security, and advanced technologies like AI and ML. By shifting the focus from traditional perimeter defenses to a more holistic and adaptive security model, organizations can better protect their assets, maintain operational continuity, and foster innovation in an increasingly interconnected world.

Implementing these strategies not only enhances security but also positions organizations to leverage the full potential of their hybrid infrastructures, driving growth and success in the digital age.

Key takeaways

Here are five key actions you can take to strengthen your organization’s network security and embrace a modern, adaptive approach:

  • Adopt an identity-first security model. Shift your focus from traditional perimeter-based defenses to verifying and securing every user and device identity—regardless of location or network.
  • Integrate AI and machine learning into your security strategy. Continuously improve your security posture by using intelligent automation and analytics to detect, respond to, and predict threats more effectively.
  • Isolate network segments to minimize risk. Separate critical business functions, customer-facing services, and development environments to contain threats and ensure that any potential breach remains limited in scope.
  • Implement continuous monitoring across your hybrid infrastructure. Move beyond periodic checks by establishing real-time, user-centric monitoring to maintain resilience, performance, and rapid incident response.
  • Embrace a proactive, adaptive mindset. Regularly update your security policies, train your teams, and stay agile to address emerging threats and support innovation as your organization evolves.

The post Securing the borderless enterprise: How we’re using AI to reinvent our network security appeared first on Inside Track Blog.

Five principles that guided our network journey to Microsoft Azure and the cloud at Microsoft http://approjects.co.za/?big=insidetrack/blog/five-principles-that-guided-our-network-journey-to-microsoft-azure-and-the-cloud-at-microsoft/ Thu, 19 Jun 2025 16:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=19387 At Microsoft, we operate one of the world’s largest IT infrastructures. So, when we embarked on the journey nearly a decade ago to move from a primarily on-premises network of physical servers to one that now operates almost entirely in the Azure cloud, it was a mammoth undertaking. And like all long and rewarding journeys, […]

At Microsoft, we operate one of the world’s largest IT infrastructures. So, when we embarked on the journey nearly a decade ago to move from a primarily on-premises network of physical servers to one that now operates almost entirely in the Azure cloud, it was a mammoth undertaking.

And like all long and rewarding journeys, this one led to many important insights. We’d like to share with our customers five overarching principles that we learned along the way. Most of our customers are somewhere in the midst of their own organizational transformation into a cloud-first company, or may be contemplating such a move.

By delineating our guiding principles and major takeaways from our own journey to the cloud, we at Microsoft Digital—the company’s IT organization—hope that other companies can learn from our experience and have a smoother and more efficient transition of their own, saving time, money, and effort.

“Our customers can learn from us having gone through it,” says Pete Apple, a technical program manager and cloud architect in Microsoft Digital. “Because we didn’t do it right the first time, at all. And so that learning process of, ‘This is what we did, this is how we did it, this is what you should think about’ can help them consider their own options.”

Stages in our journey to the cloud

1 to 6 months
  • 10% migrated
  • Retire unused workload
  • Small apps
  • IaaS—lift and shift

(IaaS = Infrastructure as a Service)

7 to 18 months
  • 28% migrated
  • Reduce multiple environments
  • Small and mid-sized apps
  • IaaS and PaaS

(PaaS = Platform as a Service)

19 to 36 months
  • 74% migrated
  • Large, more complex apps
  • Focus on PaaS

37 to 48 months
  • 90%+ migrated
  • Largest, most complex apps
  • Design cloud-native apps

Our journey to transform our on-premises IT infrastructure to a system based in the Microsoft Azure cloud took roughly four years, and we continue to innovate and refine our approach today.

Be vision-led and metric-driven

When setting off on a years-long journey, you don’t just walk out the door with a vague idea of where you’re going. As we embarked on this project, our leadership laid out the overall vision that guided our plans.

“Our leadership was critical; they gave us the vision of, ‘We’re going to migrate to the cloud, and we want to be first and best. We’re going to be an example for the rest of the industry,’” Apple says. “They made a big bet on it, and then they put the support behind it to hold the teams accountable, tracking against the goals and metrics. This directive went all the way up to (Microsoft CEO) Satya Nadella; it was an absolute priority from his point of view.”

Apple, Basem, and O’Flaherty are shown in a composite photo.
Pete Apple (left to right), Basma Basem, and Martin O’Flaherty are employees at Microsoft who played important roles in our transformative journey to a cloud-based IT infrastructure.

Martin O’Flaherty, a principal PM manager at Microsoft Digital, explains how important it was that senior leadership stuck to the vision and remained patient during the long journey to the cloud.

“Our executive vice president took the long view of this project, and he backed us as we took the time to work through all the issues and all the times when things failed,” O’Flaherty says. “We had to ‘embrace the red’ by talking about those failures rather than cover things up, in order to keep learning throughout the process. Leadership made it clear that doing the job right was the priority, and that trust gave us the confidence to stay focused and deliver.”

As far as metrics are concerned, consider the size of the Microsoft digital landscape: more than 220,000 employees in over 100 countries using more than 750,000 devices. Moving a supporting infrastructure of this size to the cloud required careful attention to specific metrics throughout the process, both to carefully measure progress and to understand the biggest challenges and potential obstacles along the way.

“We had something like 800-plus different services across the company that we had to deal with in our journey to the cloud, which I like to call the total footprint,” Apple says. “We had to track how many of them were in the cloud, how many were on-premises, and how many were hybrid. And we kept track of that quarter by quarter. We also had to monitor things like the spend for on-prem versus the cloud, and our quality metrics such as service-level agreements and customer satisfaction ratings. We had to keep an eye on all of it.”

Pay attention to people, processes, and technology

Moving a large IT infrastructure to the Azure cloud is a technology challenge, but it’s just as important to think about the people and the processes involved.

“It’s not just about getting everything moved from on-premises servers to a cloud solution,” Apple says. “Once you have it there, it’s about what your staff should look like, the different roles and skills you’ll need to run things in the cloud. Then, how do you plan for the day-to-day operation of it? What kind of processes and monitoring do you need?”

O’Flaherty notes that of these three considerations, transforming your people resources for the move to the cloud might be the biggest task.

“When we talk about ‘people change,’ we mean how people do their work—and frankly, that’s usually the hardest challenge,” O’Flaherty says. “Once we had good momentum in moving our technology to the cloud, we needed to change how people do their work. We needed to modernize.”

Apple says that transitioning the people skills of the organization was a deliberate process.

“We provided training, and we made it very clear that everyone needed to learn to work with infrastructure as code, rather than physical machines,” he says. “And whenever we had the ability to hire new people, we prioritized those DevOps skills. We invested in that, because that was the direction we were going.”

Sometimes, the technology decisions are also what enables the implementation of more effective processes. O’Flaherty explains how one specific decision during the cloud journey made it possible to implement best-practice processes that ensured quality standards were met.

“We decided to use one single instance of Azure DevOps. So, all of our teams—across more than 800 applications—and all our code repos were in one Azure DevOps account,” O’Flaherty says. “This setup allowed us to implement consistent engineering standards, like requiring every code change to be reviewed by two people. Because we could enforce these policies across the board, we achieved a new level of consistency, accountability, and confidence in our development process.”
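
For teams that want to script a similar guardrail, a hedged sketch using the Azure DevOps REST API might look like the following. The organization, project, repository ID, and policy type GUID are placeholders (the GUID for the minimum-reviewers policy type comes from the policy types endpoint), and the field names should be verified against the current Azure DevOps REST documentation before use.

```python
# Hedged sketch: create a "minimum number of reviewers" branch policy via the
# Azure DevOps REST API. All identifiers below are placeholders; verify the
# payload shape against the official documentation for your api-version.
import requests
from requests.auth import HTTPBasicAuth

ORG, PROJECT = "contoso", "MyProject"                     # placeholders
PAT = "<personal-access-token>"                           # placeholder credential
REPO_ID = "<repository-guid>"
MIN_REVIEWERS_TYPE_ID = "<minimum-reviewers-policy-type-guid>"  # from GET .../_apis/policy/types

url = f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/policy/configurations?api-version=7.1"
policy = {
    "isEnabled": True,
    "isBlocking": True,
    "type": {"id": MIN_REVIEWERS_TYPE_ID},
    "settings": {
        "minimumApproverCount": 2,          # every change reviewed by two people
        "creatorVoteCounts": False,
        "scope": [{"repositoryId": REPO_ID, "refName": "refs/heads/main", "matchKind": "Exact"}],
    },
}
resp = requests.post(url, json=policy, auth=HTTPBasicAuth("", PAT), timeout=30)
resp.raise_for_status()
print("Policy created:", resp.json().get("id"))
```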

Confront legacy applications and technical debt

When the time comes to make a major technological transformation, like moving an on-premises infrastructure to the cloud, it provides the perfect opportunity to deal with the challenge of aging legacy applications and technical debt that has accumulated within the organization.

Dealing with legacy applications up front means you can reduce the total load of what you end up moving to the cloud.

“The first thing we asked was, ‘What do we not need anymore?’” O’Flaherty says. “We were able to identify something like 30% of tools and services that could be retired or consolidated. We also looked at other SaaS solutions as replacements for things we were building ourselves, which removed about 15% more of the portfolio. So we had almost halved the total burden at that point.”

Strategic approach for moving our IT infrastructure to the cloud

Graphic shows the different segments of our network services in terms of how they are handled during the move from on-premises to the cloud.
One key benefit of moving our IT infrastructure to the Microsoft Azure cloud was that we were able to strategically reduce by nearly half the total amount of services we eventually moved to the cloud. This was achieved by eliminating legacy services, dealing with accumulated technical debt, and leveraging first- and third-party SaaS (Software as a Service) solutions instead of lifting and shifting them to the cloud.

Apple explains the benefits of starting with a clean slate when you move to the cloud.

“There’s always that backlog of work items and legacy things, and the idea is that you don’t want to bring your bad habits with you to the cloud,” Apple says. “So, if you’ve got a solution that is still using COBOL or Windows 2008, maybe it’s time to pull off the Band-Aid? That’s a good investment of your developer capacity.”

Microsoft also faced significant challenges in addressing years of technical debt—which O’Flaherty describes as technical issues resulting from past development decisions that weren’t as robust or maintainable as they could have been—during the early stages of the journey to the cloud.

“We knew the scale of the technical debt we had—it was kind of like an iceberg, with a huge amount of work below the surface. And we knew it was going to take several years to get through it all,” he says. “The key was understanding that we were going to have to invest a significant amount of engineering time to get there—that we needed to put 30% to 40% of our engineering resources behind this effort for well over a year just to get on top of the problem. We had to take that hit up front, or we’d still be in the same boat today.”

Transform your operations with end-to-end thinking

In the old world of on-premises network infrastructure, services were often siloed. Different departments ran their own systems and tools, and employees couldn’t always access data and technologies that were needed to gain a bigger picture or develop cross-disciplinary solutions.

Enter the cloud-based network, which opens up the ability for end-to-end thinking and working.

“In the old days, the interactions between applications were pretty monolithic,” Apple says. “With the move to the cloud and engineering modernization, you open up new kinds of compute and access to data. Developers can use APIs, containers, Power Apps and more to access the various data lakes we have across the company. There’s a lot more flexibility, and they can work much faster.”

Another area where having a cloud-based network allows us to take more of an end-to-end approach is security, which has become a major priority at Microsoft in recent years.

“End-to-end thinking means I can do a multi-layer defense and comprehensive security implementation in the cloud,” says Basma Basem, a senior program manager in Microsoft Security. “I can make sure that there’s a security implementation from an architecture and design standpoint on each layer of the services I’m building in the Azure cloud. And you have such a wide variety of security solutions in the cloud, it makes it much easier to find the right solution and ensure that you have good security posture management.”

Consistently prioritize your goals and metrics

When it comes to tackling such a tremendously huge project, it’s vital to understand your priorities and keep them front and center as you move through the process.

“We had a lot of priorities around financial considerations in moving away from the physical infrastructure model,” Apple says. “That was number one. Then we had priorities around efficiency and modernization. And we had to find ways to measure those priorities and ensure we were hitting our targets.”

Of course, prioritization also means that you can’t take on all your challenges at once. Your leads have to make sure that they communicate effectively so everyone understands the priorities, the pace of progress, and when different issues will be addressed.

“There’s a tendency to kind of try to boil the ocean and fix everything at once,” O’Flaherty says. “We really had to temper people’s expectations, even within our own leadership, and say that this is going to take a while. If there were 50 compliance problems, we couldn’t tackle all 50 at the same time—the leads would identify the top 3, and we’d do those 3, then move on to the next batch. We really had to set specific goals and follow our metrics along the way.”

And there’s one overall metric that Apple likes to keep top-of-mind when discussing what moving our network to the Azure cloud has meant for Microsoft—cost.

“We’re spending 20% less on our infrastructure costs than we did when we were operating on-premises,” Apple says. “When you look at what we were spending on physical infrastructure versus today, in the cloud, it’s a significant savings.”

Every cloud journey has its own path

Today, we operate roughly 98 percent of the Microsoft corporate infrastructure in the cloud, and we are continually looking for strategies to be more efficient, more automated, and less costly. Apple notes that the company decided to push hard to get to this level (“to the point of heartburn for some people”) and show what was possible, but that not every organization will need or want to go this far in their own cloud transition.

“We are the extreme in terms of pushing the bar,” Apple says. “We’ve been very innovative in this space, because we wanted to prove our point in terms of how much we could put on the cloud. We realize every business has to make tradeoffs, and some may want to keep a certain percentage of their infrastructure still on-premises. But the flexibility of the cloud and the cost savings are real, and we want our customers to understand that and take advantage of it.”

Key takeaways

Here are some of the major insights we took from the process of moving our network into the Azure cloud:

  • Confront your technical debt. Be prepared to do the upfront work of addressing your technical backlog and getting into a better state before you make the transition to the cloud. You’ll not only avoid major headaches—you’ll also reduce the total network footprint that you’ll be moving.
  • Invest for the long term. Leadership has to be willing to devote significant resources over the course of the project, and to understand that the results might not be realized in the short term. But the overall payoff will be worth it once you’ve completed the work.
  • Get employees on board. Make training and upskilling a priority as you transition your workforce to a cloud-first mindset. Incorporate the shift into individual reviews and goal-setting so that everyone is pointed in the right direction.
  • Take the opportunity to instill a “secure by default” philosophy. As you move to the cloud, you can proactively create and deploy strong security architecture, keeping compliance requirements top of mind, continuously monitoring your organization’s security posture, and fostering a culture where everyone factors security risk into their work and decision making.
  • Embrace “the red.” Create a culture where teams are comfortable with revealing when they are falling short on their metrics (being “in the red”). Being open about those issues will help others avoid the same pitfalls in their own areas and significantly increase overall quality.
  • Keep your goals and metrics front and center. On a long and complicated journey, it’s vital to keep everyone focused on the destination—your goals, sometimes called objectives and key results (or OKRs). Defining and carefully tracking the right metrics (also known as key performance indicators, or KPIs) is another essential part of this process.

The post Five principles that guided our network journey to Microsoft Azure and the cloud at Microsoft appeared first on Inside Track Blog.
