MRC Network Protocol Launch
TECH

Strategic Overview

  • 01.
    OpenAI, alongside AMD, Broadcom, Intel, Microsoft, and NVIDIA, released Multipath Reliable Connection (MRC), a new open RDMA transport protocol that distributes a single connection's traffic across multiple network paths to improve throughput, load balancing, and availability for large-scale AI training fabrics.
  • 02.
    MRC is built into the latest 800Gb/s network interfaces and uses packet spraying across hundreds of paths, with hardware-level failure bypass that detects and reroutes around a failed path in microseconds.
  • 03.
    MRC was first proven in production on NVIDIA Spectrum-X Ethernet hardware powering OpenAI's GB200 supercomputers, Microsoft's Fairwater AI factories, and OCI's Abilene data center; the specification has been released as an open contribution through the Open Compute Project.
  • 04.
    The protocol extends RDMA over Converged Ethernet (RoCE) and uses IPv6 Segment Routing (SRv6) so the sender can specify the path each packet takes by embedding a sequence of switch identifiers into the destination address.

Six Rivals, One Spec: Why OpenAI Gave Away Its Networking Edge

The line-up on the MRC paper is the part of this launch that, on paper, should not exist. AMD and NVIDIA do not co-author specs together. Broadcom and NVIDIA are direct competitors in AI switching silicon. Intel is fighting both for NIC share. And yet the Multipath Reliable Connection protocol arrives with all six logos: OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA, plus an OCP filing that hands the design to anyone who wants to implement it. The reason is visible in OpenAI networking lead Mark Handley's own framing: the work is explicitly positioned 'as opposed to each of these large companies doing their own thing.'

The choice to commoditize this layer is strategic, not generous. OpenAI does not win by selling networking IP; it wins by training larger models faster on whoever's silicon is cheapest. A proprietary multipath protocol locked to one vendor would slow OpenAI down at the procurement layer for years. A protocol that AMD, Broadcom, Intel, and NVIDIA all implement turns the AI back-end fabric into something OpenAI can dual-source on day one, which is exactly what is already happening: Broadcom and NVIDIA hardware are running MRC side by side inside the company's deployments. For Microsoft and OCI, which together host much of OpenAI's training compute, the same logic holds: Fairwater and Abilene each get a back-end fabric that is not hostage to a single switch vendor's roadmap. The signal to competitors is sharper still. By routing the spec through OCP rather than a closed alliance, the authors are daring Meta, Google, and the InfiniBand-heavy stacks to either adopt or explain why they didn't.

Under the Hood: Packet Spraying, SRv6, and Microsecond Failover

Mechanically, MRC is a rewrite of how an RDMA connection uses the network underneath it. Classic RDMA over Converged Ethernet pins a single connection to a single path; one congested link or one flaky transceiver and the connection stalls, dragging the whole synchronous training step with it. MRC instead does packet spraying — the same single connection scatters its packets across hundreds of paths simultaneously, so no individual link can become the bottleneck. The reordering and reliability work that this would normally break is pushed down into the NIC hardware on the new 800Gb/s interfaces.
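To make the mechanism concrete, here is a minimal sketch of what per-packet spraying with receiver-side reordering looks like in principle. It is not taken from the spec; every name and structure is illustrative, and in MRC this bookkeeping lives in the 800Gb/s NIC hardware rather than in software. The idea is simply that one connection stamps each packet with a connection-level sequence number, picks a path per packet, and the receiver delivers in order no matter which path each packet took.

```python
import heapq
from dataclasses import dataclass, field

# Illustrative sketch only: MRC does this in NIC hardware. Names and
# structures here are invented for clarity, not taken from the spec.

@dataclass(order=True)
class Packet:
    seq: int                                  # connection-level sequence number
    path_id: int = field(compare=False)       # which physical path carried it
    payload: bytes = field(compare=False, default=b"")

class SprayingSender:
    """Spray one connection's packets round-robin across many paths."""
    def __init__(self, num_paths: int):
        self.num_paths = num_paths
        self.next_seq = 0

    def send(self, payload: bytes) -> Packet:
        pkt = Packet(seq=self.next_seq,
                     path_id=self.next_seq % self.num_paths,
                     payload=payload)
        self.next_seq += 1
        return pkt

class ReorderingReceiver:
    """Deliver payloads in sequence order, whatever path they arrived on."""
    def __init__(self):
        self.expected = 0
        self.pending: list[Packet] = []       # min-heap keyed by seq

    def receive(self, pkt: Packet) -> list[bytes]:
        heapq.heappush(self.pending, pkt)
        delivered = []
        while self.pending and self.pending[0].seq == self.expected:
            delivered.append(heapq.heappop(self.pending).payload)
            self.expected += 1
        return delivered

# Toy run: 8 packets over 4 paths, arriving grouped by path (out of order).
tx, rx = SprayingSender(num_paths=4), ReorderingReceiver()
packets = [tx.send(f"chunk-{i}".encode()) for i in range(8)]
for pkt in sorted(packets, key=lambda p: p.path_id):
    rx.receive(pkt)                           # still delivered as 0,1,2,...
```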

The routing trick that makes this practical is IPv6 Segment Routing, or SRv6. Rather than letting intermediate switches choose paths via ECMP hashing — which is what causes hot-spots in the first place — the sender encodes the exact sequence of switch identifiers into each packet's destination address, so the path is chosen at source. Combined with hardware failure bypass that detects a dead path and reroutes traffic in microseconds, this turns the fabric into something that looks, to the GPU above it, like one fat reliable pipe even while individual links flicker. This is the part Ron Westfall of HyperFrame Research means when he says OpenAI is 'treating the entire AI fabric as a single fluid system instead of a series of isolated connections.' The protocol-level abstraction has moved up a layer.
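A rough sketch of that source-routing idea, under stated assumptions: the coverage says the switch identifiers are embedded in the destination address, so the toy below packs an ordered list of 16-bit switch IDs behind a routing prefix, uSID-style. The prefix width, ID width, and hop limit are all assumptions; MRC's actual encoding is not public at this level of detail.

```python
import ipaddress

# Illustrative only: assumes a 32-bit source-routing prefix followed by up
# to six 16-bit switch identifiers. Every constant here is an assumption.

PREFIX = 0xFC00_0001          # assumed routing prefix (top 32 bits)
MAX_HOPS = 6                  # 96 remaining bits / 16 bits per switch ID

def encode_path(switch_ids: list[int]) -> ipaddress.IPv6Address:
    """Pack an ordered list of 16-bit switch IDs into an IPv6 destination."""
    if len(switch_ids) > MAX_HOPS:
        raise ValueError("path too long for a single 128-bit address")
    addr = PREFIX << 96
    for hop, sid in enumerate(switch_ids):
        if not 0 <= sid < 1 << 16:
            raise ValueError(f"switch id {sid} does not fit in 16 bits")
        addr |= sid << (96 - 16 * (hop + 1))
    return ipaddress.IPv6Address(addr)

def next_hop(dest: ipaddress.IPv6Address) -> tuple[int, ipaddress.IPv6Address]:
    """What a switch would do: read its own ID, then shift it out so the
    next switch's ID moves into the current-hop position."""
    addr = int(dest)
    prefix, rest = addr >> 96, addr & ((1 << 96) - 1)
    current = rest >> 80                      # top 16 bits of the SID list
    shifted = (rest << 16) & ((1 << 96) - 1)  # consume it, pull the rest up
    return current, ipaddress.IPv6Address((prefix << 96) | shifted)

# Sender chooses the path per packet: say spine 0x0012, then leaf 0x0204.
dest = encode_path([0x0012, 0x0204])          # fc00:1:12:204::
sid, dest_after_first_hop = next_hop(dest)    # first switch sees 0x0012
```

The design point this illustrates is that path choice never depends on a hash computed inside the fabric; the sender can steer each packet of one connection down a different, fully specified path.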

The Ethernet Inflection InfiniBand Has Been Dreading

MRC lands in a market that has been quietly tilting for two years. According to Dell'Oro Group's Sameh Boujelbene, 2025 was the year Ethernet sales and shipments to AI back-end networks surpassed InfiniBand — the first time the dominant RDMA fabric for HPC has lost its lead inside AI data centers. Meta had already telegraphed the move with its 2024 RoCEv2 build-out for distributed training, and NVIDIA's own Spectrum-X platform, originally a hedge, has become the lead vehicle for the MRC launch.

What the spec does to that trajectory is convert a procurement preference into a standards reality. InfiniBand's historical advantage was that it shipped end-to-end congestion control, lossless behavior, and adaptive routing as part of one closed stack. RoCEv2 on Ethernet got most of the way there but kept losing on multipath behavior at extreme scale — the exact gap MRC closes. Boujelbene frames it directly: hyperscalers are 'leaning harder into Ethernet for AI fabrics, especially as clusters push toward 100,000 to 500,000-plus GPUs.' For NVIDIA, which sells both InfiniBand (Quantum) and Ethernet (Spectrum-X), MRC is a controlled cannibalization — better to lead the Ethernet story than have Broadcom run away with it. For pure InfiniBand stacks, the runway just got shorter.

Why Now: The Failure Amplifier Math at 10 GW

The 'why now' answer is in the scale numbers OpenAI is operating at. The company recently surpassed 10 GW of compute capacity and added more than 3 GW in just the prior 90 days, with frontier training runs spanning weeks across clusters that already touch hundreds of thousands of GPUs and are headed toward half a million. At that size, the failure mode dominates. OpenAI workload lead Greg Steinbrecher's phrase — large-scale AI training is a 'failure amplifier' for GPU clusters — is the literal description: a single link blip stalls a synchronous all-reduce, which stalls the step, which idles every GPU in the run.
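The arithmetic behind 'failure amplifier' is worth making explicit. The numbers below are illustrative assumptions, not OpenAI figures; the point is that wasted GPU-time scales linearly with cluster size, event rate, and stall duration, which is why shaving a stall from seconds to microseconds matters at this scale.

```python
# Back-of-envelope sketch of the failure-amplifier effect. Every number
# here is an assumption for illustration, not a published OpenAI figure.

gpus           = 200_000        # synchronous training job size (assumed)
link_events_hr = 10             # fabric-wide link blips per hour (assumed)
run_hours      = 24 * 21        # a three-week frontier run (assumed)

def wasted_gpu_hours(stall_seconds: float) -> float:
    """Each link event stalls the synchronous step, idling every GPU."""
    stalls = link_events_hr * run_hours
    return gpus * stalls * stall_seconds / 3600

software_reroute = wasted_gpu_hours(stall_seconds=5.0)     # route reconvergence
hardware_bypass  = wasted_gpu_hours(stall_seconds=50e-6)   # microsecond failover

print(f"~{software_reroute:,.0f} GPU-hours lost with ~5 s reroutes")
print(f"~{hardware_bypass:,.1f} GPU-hours lost with ~50 us hardware bypass")
```

Under these made-up but plausible inputs, second-scale rerouting burns on the order of a million GPU-hours per run while microsecond bypass burns a rounding error, which is the whole argument for pushing failure handling into the NIC.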

MRC's economics fall directly out of this. Multi-plane support means a 100,000+ GPU cluster can be wired with only two tiers of switches instead of three, cutting power, component count, and the number of failure domains in the first place. Microsecond hardware failover means the failures that do happen don't propagate up into the training loop. And load balancing across all available paths means no GPU is starved of bandwidth mid-step. This is the layer of OpenAI's Stargate build-out that doesn't show up in renderings of new data centers but probably matters more than any individual building: the compute footprint only stretches as far as the network can keep it synchronized, and MRC is the protocol that lets the next doubling happen without re-architecting the fabric again. The community reaction has tracked this framing — the protocol's authors themselves headlined OpenAI's own podcast on the launch, and analyst commentary has converged on the 'fluid fabric at hyperscale' read rather than treating MRC as just another RDMA tweak.
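As a rough illustration of why dropping a tier matters, the sketch below compares generic folded-Clos arithmetic for a non-blocking two-tier leaf-spine against a classic three-tier fat tree. The radix is an assumption and this is textbook topology math, not MRC's actual multi-plane design.

```python
# Generic folded-Clos arithmetic with an assumed switch radix; an
# illustration of what a tier costs, not MRC's actual topology.

RADIX = 128   # assumed ports per switch

def leaf_spine(radix: int) -> dict:
    """Non-blocking two-tier: R^2/2 hosts, 3 switches per R hosts, 3 hops."""
    return {"max_hosts_per_plane": radix**2 // 2,
            "switches_per_host": 3 / radix,
            "worst_case_switch_hops": 3}

def fat_tree(radix: int) -> dict:
    """Three-tier k-ary fat tree: k^3/4 hosts, 5k^2/4 switches, 5 hops."""
    return {"max_hosts_per_plane": radix**3 // 4,
            "switches_per_host": 5 / radix,
            "worst_case_switch_hops": 5}

two, three = leaf_spine(RADIX), fat_tree(RADIX)
saving = 1 - two["switches_per_host"] / three["switches_per_host"]
print(f"two-tier plane: up to {two['max_hosts_per_plane']:,} endpoints")
print(f"switch-count saving vs three tiers: {saving:.0%}")   # 40%
```

Under these assumptions a single non-blocking two-tier plane tops out at R²/2 endpoints, so reaching 100,000-plus GPUs on two tiers presumably leans on higher radix, several parallel planes, or some oversubscription; the coverage does not spell out the exact topology. What the arithmetic does show is the roughly 40 percent per-endpoint saving in switches, optics, and hop count that the power and component claim rests on.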

Historical Context

2010
First version of RDMA over Converged Ethernet (RoCE) defined, establishing the RDMA-on-Ethernet foundation MRC builds on.
2014
Routable RoCEv2 standardized, becoming the dominant RDMA transport for hyperscale Ethernet fabrics.
2023-06
Spectrum-X, the AI-native Ethernet platform that would later host MRC's first production deployment, debuted with the Israel-1 supercomputer.
2024
Meta detailed its move from InfiniBand toward RoCEv2 for distributed AI training, signaling broader hyperscaler appetite for open Ethernet fabrics.
2025
Ethernet sales and shipments to AI back-end networks surpassed InfiniBand for the first time, setting the commercial stage for an open multipath protocol.
2026
After roughly two years of development, MRC released as an open specification through the Open Compute Project, with first production deployments on Spectrum-X Ethernet inside OpenAI, Microsoft Fairwater, and OCI Abilene.

Power Map

Key Players
Subject

MRC Network Protocol Launch

OP

OpenAI

Originator and lead designer of MRC. Deploys it on its largest NVIDIA GB200 supercomputers used to train frontier models including ChatGPT and Codex, and is using the open release to push the entire industry past a shared bottleneck rather than differentiate on it.

NV

NVIDIA

Hardware partner and first production platform. MRC was proven on NVIDIA Spectrum-X Ethernet (Spectrum-4 switches and BlueField-3/Spectrum-X SuperNICs), giving NVIDIA an Ethernet story that competes with its own InfiniBand stack.

MI

Microsoft

Co-author and operator. Microsoft's Fairwater AI factories rely on MRC to train and deploy frontier LLMs, anchoring the protocol in one of the largest hyperscale AI build-outs.

AM

AMD, Broadcom, Intel

Co-authors of the MRC specification. Broadcom hardware is already running MRC alongside NVIDIA in OpenAI deployments, and the three vendors get an open standard that lets their NICs and switches interoperate inside hyperscaler AI fabrics.

OR

Oracle Cloud Infrastructure (OCI)

Operator. OCI's Abilene, Texas data center, built with OpenAI as part of the Stargate build-out, runs MRC at production scale.

OP

Open Compute Project (OCP)

Standards body hosting the open MRC specification. OCP stewardship is what turns MRC from an OpenAI design into something other clouds and chip vendors can implement without licensing risk.


THE SIGNAL.

Analysts

"Says MRC's end-to-end approach 'enabled us to avoid much of the typical network-related slowdowns and interruptions and maintain the efficiency of frontier training runs at scale.'"

Sachin Katti
OpenAI

"Frames MRC as a two-year effort and an explicit alternative to fragmented proprietary work, 'as opposed to each of these large companies doing their own thing,' with packet spraying scattering traffic along hundreds of paths to prevent link congestion."

Mark Handley
Networking lead, OpenAI

"Describes large-scale AI training as a 'failure amplifier' for GPU clusters and says MRC lets OpenAI 'turn the crank on our entire research pipeline much faster.'"

Greg Steinbrecher
Workload lead, OpenAI

"Argues 'OpenAI is treating the entire AI fabric as a single fluid system instead of a series of isolated connections,' reflecting an industry pivot toward specialized Ethernet-plus architectures for AI."

Ron Westfall
Research Director, HyperFrame Research

"Reads MRC as 'a strong data point that hyperscalers are leaning harder into Ethernet for AI fabrics, especially as clusters push toward 100,000 to 500,000-plus GPUs.'"

Sameh Boujelbene
Senior Director, Dell'Oro Group

The Crowd

"We've partnered with @AMD, @Broadcom, @Intel, @Microsoft, and @NVIDIA, to release Multipath Reliable Connection (MRC), a new open networking protocol that helps large AI training clusters run faster and more reliably, with less wasted GPU time."

@OpenAI

"JUST IN: OpenAI partners with AMD, Broadcom, Intel, Microsoft, and Nvidia to launch MRC - $AMD $AVGO $MSFT $NVDA. OpenAI Partnered With AMD, Broadcom, Intel, Microsoft, And NVIDIA To Launch Multipath Reliable Connection (MRC). MRC Is A New Open-Standard Protocol"

@AIStockSavvy

"NVIDIA Spectrum-X MRC is the Custom RDMA Transport Protocol for Gigascale AI"

u/NewMaxx

Broadcast
Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18

OpenAI launches MRC, a supercomputer networking protocol | Next in AI | Astha La Vista