{"id":1148589,"date":"2025-09-09T07:00:00","date_gmt":"2025-09-09T14:00:00","guid":{"rendered":""},"modified":"2025-09-19T08:38:07","modified_gmt":"2025-09-19T15:38:07","slug":"breaking-the-networking-wall-in-ai-infrastructure","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/breaking-the-networking-wall-in-ai-infrastructure\/","title":{"rendered":"Breaking the\u00a0networking\u00a0wall\u00a0in\u00a0AI infrastructure\u00a0"},"content":{"rendered":"\n
\"Two<\/figure>\n\n\n\n

Memory and network bottlenecks are increasingly limiting AI system performance by reducing GPU utilization and overall efficiency, ultimately preventing infrastructure from reaching its full potential despite enormous investments. At the core of this challenge is a fundamental trade-off in the communication technologies used for memory and network interconnects.<\/p>\n\n\n\n

Datacenters typically deploy two types of physical cables for communication between GPUs. Traditional copper links are power-efficient and reliable, but their reach is limited to under 2 meters, which restricts them to a single GPU rack. Optical fiber links can reach tens of meters, but they consume far more power and fail up to 100 times as often as copper. A team working across Microsoft aims to resolve this trade-off by developing MOSAIC, a novel optical link technology that can provide low power and cost, high reliability, and long reach (up to 50 meters) simultaneously<\/em>. The approach relies on hardware-system co-design and a wide-and-slow architecture: hundreds of parallel low-speed channels driven by microLEDs.<\/p>\n\n\n\n

The fundamental trade-off among power, reliability, and reach stems from the narrow-and-fast<\/em> architecture deployed in today’s copper and optical links, which comprises a few channels operating at very high data rates. For example, an 800 Gbps link consists of eight 100 Gbps channels. With copper links, higher channel speeds lead to greater signal integrity challenges, which limit their reach. With optical links, high-speed transmission is inherently inefficient, requiring power-hungry laser drivers and complex electronics to compensate for transmission impairments. These challenges grow as speeds increase with every generation of networks. Transmitting at high speeds also pushes the limits of optical components, reducing system margins and increasing failure rates.<\/p>\n\n\n\n

These limitations force system designers to make unpleasant choices, limiting the scalability of AI infrastructure. For example, scale-up networks connecting AI accelerators at multi-Tbps bandwidth typically must rely on copper links to meet the power budget, requiring ultra-dense racks that consume hundreds of kilowatts per rack<\/em>. This creates significant challenges in cooling and mechanical design, which constrain the practical scale of these networks and their end-to-end performance. This imbalance ultimately erects a networking wall<\/em> akin to the memory wall<\/em>, in which CPU speeds have outstripped memory speeds, creating performance bottlenecks.<\/p>\n\n\n\n

A technology offering copper-like power efficiency and reliability over long distances can overcome this networking wall, enabling multi-rack scale-up domains and unlocking new architectures. This is a highly active R&D area, with many candidate technologies currently being developed across the industry. In our recent paper, \u201cMOSAIC: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs<\/a>\u201d<\/em>, which received the Best Paper award at ACM SIGCOMM<\/span><\/a>, we present one such promising approach, the result of a multi-year collaboration between Microsoft Research, Azure, and M365. This work is centered on an optical wide-and-slow architecture, shifting from a small number of high-speed serial channels towards hundreds of parallel low-speed channels. Such a design would be impractical to realize with today\u2019s copper and optical technologies because of i) electromagnetic interference challenges in high-density copper cables and ii) the high cost and power consumption of lasers in optical links, as well as the increase in packaging complexity. MOSAIC overcomes these issues by leveraging directly modulated microLEDs, a technology originally developed for screen displays.<\/p>\n\n\n\n

MicroLEDs, which measure from a few to tens of microns, are significantly smaller than traditional LEDs and, because of their small size, can be modulated at several Gbps. They are manufactured in large arrays, with over half a million fitting in a small physical footprint in high-resolution displays for head-mounted devices or smartwatches. For example, assuming 2 Gbps per microLED channel, an 800 Gbps MOSAIC link can be realized by using a 20\u00d720 microLED array, which can fit on a silicon die smaller than 1 mm\u00d71 mm.<\/p>\n\n\n\n
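
The channel arithmetic behind the two architectures can be checked with a short sketch. It uses only figures quoted in this post (an 800 Gbps link, 100 Gbps channels for narrow-and-fast, an assumed 2 Gbps per microLED channel for wide-and-slow). The function names are hypothetical and chosen for illustration; they are not part of MOSAIC.

```python
import math


def channels_needed(link_gbps: float, channel_gbps: float) -> int:
    """Number of parallel channels required to carry link_gbps."""
    return math.ceil(link_gbps / channel_gbps)


def square_array_side(n_channels: int) -> int:
    """Side of the smallest square array holding at least n_channels emitters."""
    if n_channels <= 0:
        return 0
    return math.isqrt(n_channels - 1) + 1


LINK_GBPS = 800  # link rate used as the example in the text

# Narrow-and-fast: a few 100 Gbps channels.
narrow_fast = channels_needed(LINK_GBPS, 100)

# Wide-and-slow: many channels at an assumed 2 Gbps per microLED.
wide_slow = channels_needed(LINK_GBPS, 2)
side = square_array_side(wide_slow)

print(f"narrow-and-fast: {narrow_fast} channels")
print(f"wide-and-slow:   {wide_slow} channels ({side} x {side} array)")
```

Under these assumptions the narrow-and-fast link needs 8 channels and the wide-and-slow link needs 400, which is the 20×20 array mentioned above. The sketch covers raw channel count only and ignores any spare channels or coding overhead a real link might add.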

MOSAIC\u2019s wide-and-slow design provides four core benefits.<\/p>\n\n\n\n