The Research-vs-Production Gap That Defines This Launch
The single most important detail buried in SubQ's launch materials is that the 12-million-token context window is described as a research result, while the model actually exposed to early-access users is labeled SubQ 1M-Preview. The headline number, "12M tokens," anchors every press story, every investor pitch, and every comparison to Claude Opus, but the benchmarks that have been published largely cap at one million tokens or below. RULER 128K is a 128K-token benchmark. SWE-Bench Verified is a coding benchmark, not a context-length benchmark. MRCR v2 is reported at 1M tokens with 8 needles. There is no published 12M-token benchmark, despite 12M being the entire architectural pitch.
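To make concrete how low the bar for such a benchmark is, here is a minimal sketch of a 12M-token needle-in-a-haystack probe. Everything model-specific is an assumption: the ~4-characters-per-token heuristic is a rough rule of thumb, and `query_model` is a hypothetical entry point, since SubQ has described no public API to call.

```python
import random

# Minimal needle-in-a-haystack probe at 12M-token scale. The ~4-chars-per-
# token heuristic is a rough assumption, and query_model is a hypothetical
# entry point: SubQ exposes no documented public API to call here.

def build_haystack(n_tokens: int, needle: str, seed: int = 0) -> str:
    """Return filler text with one needle sentence buried at a random offset."""
    random.seed(seed)
    filler = "The sky was clear and nothing of note happened. "
    n_chunks = (n_tokens * 4) // len(filler)  # ~4 characters per token
    chunks = [filler] * n_chunks
    chunks.insert(random.randrange(n_chunks), needle)
    return "".join(chunks)

needle = "The access code for vault 7 is 4819-BLUE. "
question = "\nWhat is the access code for vault 7?"
prompt = build_haystack(12_000_000, needle) + question
# answer = query_model(prompt)  # hypothetical call; score pass/fail on "4819-BLUE"
```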
This matters because the value proposition of sub-quadratic attention is that it should get better at long context, not just survive it. If SSA is genuinely O(n), then a 12M-token retrieval benchmark should be the easiest, most decisive test the company could run. Its absence is what drove the dominant skeptic breakdown on developer YouTube, what fueled the LocalLLaMA dissection of MRCR v2 (where SubQ's production score of 65.9 sits below Opus 4.6's 78.3 and GPT-5.5's 74), and what frames the entire community reaction. The launch isn't being judged on whether SSA could work; it's being judged on whether the demonstrated artifact matches the marketed artifact, and on that question the gap is conspicuous.
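As a back-of-the-envelope check on that "easiest test" claim, the arithmetic below compares standard quadratic attention cost against a generic O(n) scheme. The hidden size d = 8192 and the linear-attention cost model are assumptions for illustration, not SubQ's disclosed architecture, which the launch materials do not specify.

```python
# Back-of-the-envelope attention FLOPs at various context lengths. The
# hidden size d = 8192 is a hypothetical stand-in; SubQ's launch materials
# do not disclose the model's dimensions or the exact SSA cost model.

def quadratic_attention_flops(n: int, d: int) -> int:
    """Standard attention: QK^T scores plus attention-weighted V, ~2 * n^2 * d."""
    return 2 * n**2 * d

def linear_attention_flops(n: int, d: int) -> int:
    """A generic O(n) scheme (kernelized / state-space style), ~2 * n * d^2."""
    return 2 * n * d**2

for n in (128_000, 1_000_000, 12_000_000):
    q = quadratic_attention_flops(n, 8192)
    lin = linear_attention_flops(n, 8192)
    print(f"{n:>12,} tokens: quadratic {q:.2e}, linear {lin:.2e}, ratio {q / lin:,.0f}x")
```

Under these assumptions, moving from 1M to 12M tokens multiplies the quadratic cost by 144 but the linear cost by only 12, which is exactly why a missing 12M-token benchmark is so conspicuous if the O(n) claim holds.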


