{"id":20611,"date":"2025-10-16T09:00:00","date_gmt":"2025-10-16T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/insidetrack\/blog\/?p=20611"},"modified":"2026-06-10T16:57:01","modified_gmt":"2026-06-10T23:57:01","slug":"keeping-our-in-house-optical-network-safe-with-a-zero-trust-mentality","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/insidetrack\/blog\/keeping-our-in-house-optical-network-safe-with-a-zero-trust-mentality\/","title":{"rendered":"Keeping our in-house optical network safe with a Zero Trust mentality"},"content":{"rendered":"\n
When it comes to corporate connectivity at Microsoft, a minute of lost connection can lead to catastrophic disruptions for our product teams, sleepless nights for our network engineers, and millions of dollars of lost value for the company.<\/p>\n\n\n\n
That\u2019s why we built our own optical network at our headquarters in Washington state, and why we\u2019re building similar networks at other regional campuses across the United States and around the world.<\/p>\n\n\n\n
With so much on the line, we need to make sure these in-house networks never go down.<\/p>\n\n\n\n
But how are we doing that?<\/p>\n\n\n\n
We\u2019re applying the same robust Zero Trust approach we take to security and identity. While our optical networks are extremely reliable, any complex system can be knocked offline. In keeping with our Zero Trust mentality as a company, we trust the integrity of what we\u2019ve built, but we still needed a backup system that goes beyond redundancy to provide true resilience.<\/p>\n\n\n\n
Driven by this goal, we created a Zero Trust Optical Business Continuity and Disaster Recovery (BCDR) network that combines two fully independent optical systems designed to sustain uninterrupted services, even during systemic failures. The result is more confidence for our employees and vendors, less pressure on our network engineers, and comprehensive network resilience that protects us against a major outage.<\/p>\n\n\n\n
In 2021, our team in Microsoft Digital, the company\u2019s IT organization, deployed our first next-generation optical network to serve the exclusive network needs of our Puget Sound metro campuses<\/a>. It offers more bandwidth on less fiber for a lower operational cost than leasing from traditional carriers.<\/p>\n\n\n\n “Puget Sound is a highly concentrated developer network where we need to provide very high throughput,\u201d says Patrick Alverio, principal group software engineering manager for Infrastructure and Engineering Services within Microsoft Digital. \u201cOur optical system is the backbone of all that traffic.”<\/p>\n\n\n\n Our state-of-the-art optical network fulfills our need for fast and reliable connectivity at up to 400 Gbps between core sites, labs, data centers, and the internet edge. We built this network on the Reconfigurable Optical Add\/Drop Multiplexer (ROADM) technology, delivering dynamic reconfiguration, colorless, directionless, contentionless (CDC) capabilities, flexible grid support, remote provisioning, and automation. It also features a full-mesh topology that provides a layer of redundancy.<\/p>\n\n\n\n But what if the entire ROADM-based system fails?<\/p>\n\n\n\n There are plenty of operational risks that can derail even the most robust network. Anything from misconfigured automation scripts to policy changes to misaligned software versioning to simple human error can cause outages.<\/p>\n\n\n\n \u201cWe don\u2019t want even a second of downtime. We needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.\u201d<\/p>\nVinoth Elangovan, senior network engineer, Hybrid Core Network Services, Microsoft Digital<\/cite><\/blockquote>\n\n\n\n To some degree, those kinds of minor disruptions are inevitable. 
But catastrophic events like fiber cuts, failures in the ROADM operating system, or even natural disasters can cause far more wide-ranging disruption.<\/p>\n\n\n\n During a catastrophic outage, thousands of engineers, developers, researchers, and other technical employees who need access to crucial lab environments and data centers could lose connectivity. That can stall feature delivery, disrupt product patches, interrupt updates, and halt all kinds of core product functions.<\/p>\n\n\n\n On top of normal software development operations, new AI tools demand massive bandwidth and consistent uptime. Our hybrid networks also feature paths integrated with Microsoft Azure that consume on-premises resources, so they stand to benefit from increased resilience as well.<\/p>\n\n\n\n A catastrophic network outage can cause severe damage to all of these business functions. In fact, we experienced exactly that in 2022.<\/p>\n\n\n\n A fiber cut combined with a ROADM system hardware reboot caused a five-minute outage in our Puget Sound metro region. In this environment, every minute of lost connectivity can result in significant financial impact, making network resilience essential.<\/p>\n\n\n\n \u201cWe don\u2019t want even a second of downtime,\u201d says Vinoth Elangovan, senior network engineer, who designed and implemented the Zero Trust Optical BCDR network for Microsoft. \u201cWe needed a life raft for when failures occur that could also function as a standby network for core site migrations or platform upgrades.\u201d<\/p>\n\n\n\n To ensure we could deliver uninterrupted network connectivity even in the midst of a catastrophic outage, we needed to consider the technical demands of a truly resilient system. Five design pillars helped us assemble our architectural criteria:<\/p>\n\n\n\n The result was the Zero Trust Optical BCDR architecture, a layered approach to optical networking. 
It consists of our primary, ROADM-based transport layer and a secondary, MUX-based transport layer, both terminating into a single logical port channel.<\/p>\n\n\n\n \u201cOur core responsibility is the employee experience, so our main design thrust was making sure service is seamless and uninterrupted\u2014even during an outage.\u201d<\/p>\nVinoth Elangovan, senior network engineer, Hybrid Core Network Services, Microsoft Digital<\/cite><\/blockquote>\n\n\n\n Both systems are live and active, which means they deliver production services through their own independent fibers, power supplies, and software stacks. By layering fully independent optical domains and logically unifying them at the Ethernet edge, the network can sustain a complete failure of one system and maintain continuity.<\/p>\n\n\n\n That physical and operational independence is the difference between simple redundancy and robust resilience.<\/p>\n\n\n\n \u201cOur core responsibility is the employee experience, so our main design thrust was making sure it\u2019s seamless and uninterrupted\u2014even during an outage,\u201d Elangovan says.<\/p>\n\n\n\n A typical ROADM optical network connects campus and data center sites to the internet edge. Our design features three interconnected optical rings, with two internet edges as multi-directional nodes, while other sites operate as dual-degree nodes with bidirectional redundancy. Meanwhile, our campuses and datacenters are designated as critical sites and equipped with Optical BCDR links to ensure enhanced resiliency. In the event of a complete Optical ROADM line failure, these critical sites retain connectivity.<\/p>\n\n\n\n In the event of an outage on the primary network, the port channel handles forward continuity automatically, shifting WAN traffic between optical paths in real time. 
<\/p>\n\n\n\n The transition occurs seamlessly and transparently, with no noticeable impact to clients.<\/p>\n\n\n\n \u201cOur initial goal was to provide high-throughput connectivity for major labs, with less than six minutes of downtime per year. That represents a service level of 99.999% network continuity, and we\u2019re aiming for even better moving forward.\u201d<\/p>\nBlaine Martin, principal engineering manager, Hybrid Core Network Services, Microsoft Digital<\/cite><\/blockquote>\n\n\n\n Coupling at the Ethernet layer provides clients and applications with one logical interface, automatic load balancing and traffic distribution, and seamless failover, regardless of which optical domain is providing service.<\/p>\n\n\n\n \u201cOur initial goal was to provide high-throughput connectivity for major labs, with less than six minutes of downtime per year,\u201d says Blaine Martin, principal engineering manager for Hybrid Core Network Services in Microsoft Digital. \u201cThat represents a service level of 99.999% network continuity, and we\u2019re aiming for even better moving forward.\u201d<\/p>\n\n\n\n For the network engineers who keep Microsoft employees and resources connected, the Zero Trust Optical BCDR network relieves much of the pressure that comes from resolving outages.<\/p>\n\n\n\n “Before, we were dependent on a single system, even with redundancies, so the human experience was like firefighting. 
Now, if the primary optical network is having a problem, I don\u2019t even see it.”<\/p>\nKevin Bullard, principal cloud network engineering manager, Microsoft Digital<\/cite><\/blockquote>\n\n\n\n When a network goes down, engineers have an enormous set of responsibilities to manage: processing the incident report, assigning severity, performing checks, notifying internal teams, providing updates, and engaging with physical support teams\u2014all with a profound urgency to restore productivity.<\/p>\n\n\n\n Dialing those pressures back has been a huge benefit.<\/p>\n\n\n\n “Before, we were dependent on a single system, even with redundancies, so the human experience was like firefighting,\u201d says Kevin Bullard, Microsoft Digital principal cloud network engineering manager responsible for maintaining WAN interconnectivity between labs. \u201cNow, if the primary optical network is having a problem, I don\u2019t even see it.”<\/p>\n\n\n\n There will always be pressure on network engineers to restore connectivity during an outage, but they can breathe easier knowing it won\u2019t cost the company millions of dollars as the time to resolve ticks away. And in non-emergency situations like core site migrations, the BCDR network provides a much easier way to shunt services while the main network is offline.<\/p>\n\n\n\n \u201cOur internal users have become more confident that they can stay connected, no matter what,\u201d says Chakri Thammineni, principal cloud network engineer for Infrastructure and Engineering Services in Microsoft Digital. \u201cThat gives the people responsible for maintaining our enterprise networks incredible peace of mind.\u201d<\/p>\n\n\n\n Fortunately, there hasn\u2019t been a substantial network outage in the Puget Sound metro area since 2022. 
But our network engineering teams know that if and when it happens, the BCDR network will be ready to maintain service continuity.<\/p>\n\n\n\n \u201cWe\u2019re always looking ahead into industry trends to stay at the bleeding edge, whether that\u2019s in the technology we provide for our customers or the networks we use to do our own work.\u201d<\/p>\nPatrick Alverio, principal group software engineering manager, Infrastructure and Engineering Services, Microsoft Digital<\/strong><\/cite><\/blockquote>\n\n\n\n With our Puget Sound network protected, we have plans in place to extend this model to other metro areas. Naturally, we have to balance population, criticality, and the knowledge that elevated reliability and availability come with a cost.<\/p>\n\n\n\n Our selection criteria for new BCDR networks have largely centered around two factors: expansions of AI-critical infrastructure and concentrations of secure access workspaces (SAWs) for technical employees. With these criteria in mind, we\u2019re planning new BCDR networks first in the Bay Area and Dublin, then in Virginia, Atlanta, and London.<\/p>\n\n\n\n Zero Trust optical BCDR architecture represents a paradigm shift in enterprise network resilience, and we\u2019re committed to expanding the model to benefit both conventional workloads and the expanding infrastructure demands of AI.<\/p>\n\n\n\n \u201cWe\u2019re always looking ahead into industry trends to stay at the bleeding edge, whether that\u2019s in the technology we provide for our customers or the networks we use to do our own work,\u201d Alverio says. 
\u201cWe refuse to accept the status quo, and we\u2019re elevating the experience for employees across Puget Sound and Microsoft as a whole.\u201d<\/strong><\/p>\n\n\n\n Our journey towards an AI-driven optical network is gaining momentum.<\/p>\n\n\n\n As part of our Secure Future initiative<\/a>, we\u2019ve automated our Optical Management Platform credential rotation and are actively developing intelligent incident management ticket enrichment, auto-remediation, link provisioning, deployment validation, and capacity planning.<\/p>\n\n\n\n AI plays a central role in this transformation.<\/p>\n\n\n\n With Microsoft 365 Copilot and GitHub Copilot integrated into our engineering workflows, we\u2019re accelerating development cycles, improving code accuracy, and uncovering optimization opportunities that would otherwise take hours of manual effort.<\/p>\n\n\n\n These Copilots are also helping our engineers analyze network patterns, simulate outcomes, and validate deployment logic before execution, reducing human error and strengthening our Zero Trust posture. 
Over time, we\u2019re evolving toward a system where AI not only assists but also proactively predicts potential disruptions, recommends remediations, and continuously learns from operational telemetry.<\/p>\n\n\n\n These advancements are paving the way for a future where our optical infrastructure can anticipate issues, recover faster, and operate with the agility and assurance expected in a Zero Trust environment.<\/p>\n\n\n\n Key takeaways<\/p>\n<\/div>\n\n\n\n If you\u2019re planning your own optical and BCDR networks, consider these tips:<\/p>\n\n\n\n Try it out<\/p>\n<\/div>\n\n\n\n Learn how to make similar investments at your company by exploring our Microsoft Azure Virtual Network technology.<\/a> You\u2019ll need a subscription to Microsoft Azure; go here to sign up for a free trial<\/a>.<\/p>\n<\/div>\n\n\n\n Related links<\/p>\n<\/div>\n\n\n\n When it comes to corporate connectivity at Microsoft, a minute of lost connection can lead to catastrophic disruptions for our product teams, sleepless nights for our network engineers, and millions of dollars of lost value for the company. 
That\u2019s why we built our own optical network at our headquarters in Washington state, and that\u2019s why […]<\/p>\n","protected":false},"author":115,"featured_media":20613,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":true,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_hide_featured_on_single":false,"_show_featured_caption_on_single":true,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[71],"tags":[383,115,849,689,848,419],"coauthors":[622],"class_list":["post-20611","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-featured","tag-azure-networking","tag-microsoft-azure","tag-network-and-infrastructure","tag-network-security","tag-security-and-risk-management","tag-zero-trust","m-blog-post"],"yoast_head":"\n
<\/figure>\n\n\n\n\n
Delivering greater network resilience<\/strong><\/h2>\n\n\n\n
\n
\n
Optical network backed by a BCDR network<\/h2>\n\n\n\n

<\/figure>\n\n\n\n\n
A new era of confidence for network engineers<\/h2>\n\n\n\n
\n
<\/figure>\n\n\n\n\n
Driving AI innovation in optical network resilience<\/h2>\n\n\n\n
<\/figure>\n\n\n\n\n
<\/figure>\n\n\n\n
<\/figure>\n\n\n\n\n