The 35-Hour Loop: Why a 1M Context Window Finally Matches Agentic Reality
The most concrete thing Alibaba shipped isn't a benchmark number; it's a proof of concept that an agent loop can sustain itself through 35 hours of continuous tool use. Qwen 3.7 Max made 1,158 tool calls and ran 432 kernel evaluations across five architectural redesigns to optimize an Extend Attention kernel on the new Zhenwu M890 chip, and crucially, it was given no chip-architecture documentation or performance data going in [1]. The model had to write code, dispatch it to the chip, read profiler output, form hypotheses, and try again, for a day and a half, without losing the plot.
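To make the shape of that loop concrete, here is a minimal sketch of a propose/profile/record cycle. It is not Alibaba's harness: `Attempt`, `Trajectory`, `optimize`, and the two stub functions are hypothetical names invented for illustration, and the stub "profiler" simply returns made-up latencies so the example runs without any hardware.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Attempt:
    source: str        # candidate kernel source proposed by the agent
    latency_ms: float  # latency reported back by the profiler


@dataclass
class Trajectory:
    """Everything the agent has tried so far (hypothetical structure)."""
    attempts: List[Attempt] = field(default_factory=list)

    def best(self) -> Optional[Attempt]:
        # Lowest measured latency wins; None if nothing has been tried yet.
        return min(self.attempts, key=lambda a: a.latency_ms, default=None)


def optimize(
    propose: Callable[[Trajectory], str],
    profile: Callable[[str], float],
    max_evals: int,
) -> Trajectory:
    """Run the write -> dispatch -> measure -> record cycle max_evals times."""
    trajectory = Trajectory()
    for _ in range(max_evals):
        source = propose(trajectory)   # the model writes a new candidate
        latency = profile(source)      # dispatch to the chip and measure
        trajectory.attempts.append(Attempt(source, latency))
    return trajectory


# Stubs standing in for the model and the chip (assumptions, not real APIs).
def stub_propose(trajectory: Trajectory) -> str:
    return f"v{len(trajectory.attempts)}"


def stub_profile(source: str) -> float:
    # Pretend each later version is faster; purely illustrative numbers.
    return 100.0 / (1 + int(source[1:]))


demo = optimize(stub_propose, stub_profile, max_evals=5)
print(demo.best())
```

The point of the sketch is that `propose` receives the whole trajectory on every iteration: the quality of each new candidate depends on how much of the earlier history the model can still see.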
The mechanism that makes this plausible at all is the jump from 256K to 1M tokens of context. At 256K, an agent on a multi-day task must keep discarding earlier reasoning or compressing it lossily; at 1M, the entire trajectory of 1,158 tool calls (prompts, code, profiler outputs, intermediate hypotheses) can plausibly stay live in attention [2]. Combined with the new extended-thinking mode, this is what 'agentic reliability' actually looks like in practice: not a higher MMLU score, but the model still being coherent at tool call #900. The headline result, a geometric-mean 10x speedup on the resulting kernel versus the reference Triton implementation [1], is impressive on its own, but the operationally interesting claim is that the loop didn't degrade. That's the capability frontier Western labs are also chasing, and it's the one that translates most directly into enterprise dollars.
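A quick back-of-the-envelope calculation shows what the larger window buys per tool call. This is a rough sketch under stated assumptions: it treats "256K" as 256,000 tokens and "1M" as 1,000,000 tokens, and it spreads the window evenly over the 1,158 reported calls. The source does not report actual per-call token counts, so the helper `tokens_per_call` and its outputs are illustrative averages, not measurements.

```python
def tokens_per_call(context_tokens: int, tool_calls: int) -> float:
    """Average context budget per tool call if the whole run must fit in the window."""
    return context_tokens / tool_calls


TOOL_CALLS = 1158          # figure reported for the 35-hour run
CONTEXT_256K = 256_000     # assumption: "256K" read as 256,000 tokens
CONTEXT_1M = 1_000_000     # assumption: "1M" read as 1,000,000 tokens

small = tokens_per_call(CONTEXT_256K, TOOL_CALLS)
large = tokens_per_call(CONTEXT_1M, TOOL_CALLS)

print(f"256K window: ~{small:.0f} tokens per call")
print(f"1M window:   ~{large:.0f} tokens per call")
print(f"ratio:       {large / small:.2f}x")
```

Under these assumptions the average budget rises from about 221 to about 864 tokens per call, a little under 4x. That is still a tight average for a call that carries code plus profiler output, so keeping the full trajectory live presumably also depends on how compact individual tool outputs are.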



