Packet Spraying and the Tyranny of the Slowest Link
The core insight behind MRC is that frontier AI training violates the assumption underneath the entire modern internet. Conventional networking gets its statistical magic from the law of large numbers: millions of independent flows, averaged across paths, smooth out into something predictable. Synchronous training on tens of thousands of GPUs is the opposite shape — every GPU is waiting on every other GPU at every step, so a single late transfer stalls the whole training step and leaves the rest of the cluster idle. With networks now containing millions of optical links, something is always failing somewhere, and a conventional RDMA flow pinned to a single path stalls for seconds when its link or switch goes down.
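The "every GPU waits on the slowest transfer" dynamic can be made concrete with a toy simulation. This is an illustrative sketch, not anything from MRC: the latency model (1 ms typical, a hypothetical 0.1% chance of a multi-second stall on any one transfer) and all function names are invented for the example. The point it demonstrates is only the math of synchrony: a step finishes at the max of its per-worker latencies, so at 10,000 workers some transfer is almost always stalled.

```python
import random

def sync_step_time(n_workers, rng):
    """One synchronous training step finishes only when the slowest
    worker's transfer does: step time = max of per-worker latencies."""
    # Hypothetical latency model: 1 ms typical, with a 0.1% chance that
    # any single transfer hits a multi-second stall (e.g. a dead link).
    latencies = [
        5.0 if rng.random() < 0.001 else 0.001
        for _ in range(n_workers)
    ]
    return max(latencies)

def stall_fraction(n_workers, trials, seed=0):
    """Fraction of steps held up by at least one stalled transfer."""
    rng = random.Random(seed)
    stalled = sum(sync_step_time(n_workers, rng) > 0.01
                  for _ in range(trials))
    return stalled / trials
```

With independent 0.1% per-transfer stalls, the chance a 10,000-worker step escapes unscathed is 0.999^10000, under 0.005%: effectively every step is gated by a failure somewhere, which is the law-of-large-numbers assumption running in reverse.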
MRC's answer is to spray each transfer across hundreds of network paths simultaneously, with the spraying logic baked into 800Gb/s NICs so it happens at hardware speed. When a link or switch fails, MRC reroutes around it on a microsecond timescale rather than the seconds to tens of seconds typical of conventional fabrics, fast enough that the GPUs above never notice. OpenAI says the production proof was unceremonious: engineers rebooted four tier-1 switches during a frontier training run without coordinating with the training team, and MRC absorbed the disruption. That is a different posture toward failure than networking has historically taken, and as Ron Westfall puts it, it amounts to "treating the entire AI fabric as a single fluid system instead of a series of isolated connections."
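The contrast with a pinned flow can be sketched in a few lines. This is a deliberately simplified model, not MRC's actual mechanism (which runs in NIC hardware): packets of one transfer are dealt round-robin across whatever paths are currently healthy, so removing a path from the healthy set reroutes the remainder of the transfer instead of stalling it. The function and path names are invented for the example.

```python
from itertools import cycle

def spray(packets, paths, failed):
    """Toy packet sprayer: assign each packet of a transfer to one of
    the healthy paths, round-robin. A conventional RDMA flow would pin
    the whole transfer to a single path and stall if that path died;
    here a failed path is simply skipped and the survivors absorb it."""
    healthy = [p for p in paths if p not in failed]
    if not healthy:
        raise RuntimeError("no healthy paths remain")
    rr = cycle(healthy)
    return {pkt: next(rr) for pkt in packets}

# A 12-packet transfer over 4 paths with path 2 down: the three
# surviving paths split the load and nothing is pinned to path 2.
plan = spray(range(12), paths=[0, 1, 2, 3], failed={2})
```

The failover here is just "recompute the healthy set," which is why it can be fast: no per-flow connection state has to time out before traffic moves.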



