What Eigen Actually Does: The MoE Serving Stack Nebius Just Bought
Eigen AI's pitch is not a single trick but a vertically integrated optimization stack that touches every layer of how a Mixture-of-Experts model becomes tokens leaving a GPU. Co-founder and CEO Ryan Hanrui Wang lays out the surface area explicitly: 'Many frontier open models rely on Mixture-of-Experts architectures, where efficient expert routing, GPU scheduling, speculative decoding, quantization and sparsity have a significant impact on performance.' Each of those is a research-heavy lever: routing decides which experts fire for each token, scheduling decides how those firings batch onto hardware, speculative decoding cuts wall-clock latency by guessing ahead with a draft model, and quantization plus sparsity shrink the weights and activations that have to move through memory in the first place.
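To make the first lever concrete, here is a minimal sketch of top-k expert routing as it appears in most open MoE implementations. The class and parameter names (TopKRouter, num_experts, top_k) are illustrative assumptions, not Eigen's or any particular framework's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Per-token gate: pick the top_k experts for each token (illustrative sketch)."""
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_dim] -> a score for every expert, per token
        logits = self.gate(x)                                # [num_tokens, num_experts]
        weights, expert_ids = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                 # normalize over the chosen experts
        # expert_ids is the routing decision: which experts fire for each token.
        # A serving engine then groups tokens by expert so each expert's matmul
        # runs as one batched GPU kernel (the scheduling lever).
        return weights, expert_ids

# Route 8 tokens across 64 experts, keeping 2 experts per token.
router = TopKRouter(hidden_dim=1024, num_experts=64, top_k=2)
weights, expert_ids = router(torch.randn(8, 1024))
print(expert_ids.shape)  # torch.Size([8, 2])
```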
The team's pedigree maps cleanly onto that stack. Wei-Chen Wang's MLSys 2024 Best Paper, Activation-aware Weight Quantization (AWQ), is now the standard for 4-bit serving. Ryan Wang's sparse-attention work (SpAtten) is the most-cited HPCA paper since 2020. Di Jin contributed to Meta's Llama 3 and Llama 4 post-training and co-authored the CGPO RLHF framework. In other words, Nebius isn't buying a wrapper around vLLM; it's buying the people who wrote the techniques that vLLM-like systems depend on. The receipt for that work shows up in the public benchmarks: 911 tokens/sec on GPT-OSS-120B, the #1 output speed across 23 models on Artificial Analysis as of mid-March 2026.
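For a sense of what AWQ-style quantization is doing, here is a heavily simplified sketch of the core idea: scale salient weight channels by activation magnitude before rounding to 4 bits. The function name, the alpha exponent, and the calibration tensor are illustrative assumptions, not the published AWQ implementation.

```python
import torch

def awq_style_4bit(weight: torch.Tensor, calib_acts: torch.Tensor, alpha: float = 0.5):
    """Simplified activation-aware 4-bit weight quantization (not the real AWQ code).

    weight:     [out_features, in_features]
    calib_acts: [num_tokens, in_features] calibration activations
    """
    # Per-input-channel importance from calibration activations: channels the
    # model actually exercises heavily get a larger protective scale.
    importance = calib_acts.abs().mean(dim=0)                  # [in_features]
    chan_scale = importance.clamp(min=1e-5) ** alpha
    w_scaled = weight * chan_scale                              # up-weight salient channels

    # Symmetric per-row 4-bit quantization (int4 range is [-8, 7]).
    row_scale = w_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7
    q = torch.clamp(torch.round(w_scaled / row_scale), -8, 7)

    # Dequantize and undo the channel scaling to get the 4-bit approximation of W.
    w_approx = (q * row_scale) / chan_scale
    return q.to(torch.int8), row_scale, chan_scale, w_approx

W, X = torch.randn(256, 512), torch.randn(128, 512)
q, row_scale, chan_scale, W_hat = awq_style_4bit(W, X)
print((W - W_hat).abs().mean())  # mean reconstruction error of the 4-bit weights
```

In a real serving stack the channel scales are folded into the preceding layer so inference only ever touches the 4-bit weights; this sketch just measures how much precision the rounding costs.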