Under the Hood: Two-Stage Drafting With a Confidence-Scheduled Verifier
Speculative decoding lets a cheap drafter guess several tokens ahead, then has the full model verify them in a single parallel pass, accepting whatever survives the check. The catch is that pure parallel drafting decays fast: a backbone that emits logits for every future position at once cannot see the tokens it just proposed, so each successive guess drifts further from what the full model would have produced, and acceptance collapses a few tokens out.

DSpark's design answers this directly. It splits drafting into two stages: a heavy parallel backbone (DFlash in the authors' setup) produces base logits for every position, then a lightweight sequential head adds a prefix-dependent bias before each token is sampled [1]. That sequential head is the fix for the decay problem: it reintroduces the dependence on already-drafted tokens that the parallel backbone throws away, sharpening each guess against the actual prefix without paying for a full autoregressive draft pass.

On top of that sits the scheduling layer: a confidence head scores each draft position, and a hardware- and load-aware prefix scheduler sets the verification length per request using a profiled throughput curve [1]. The trade-off it manages is concrete: verify more tokens per step when GPUs are idle and fewer when they are busy, which raises per-user speed while holding overall system throughput constant under strict latency constraints [1]. The full method name, 'Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation,' is the whole architecture in one phrase: semi-autoregressive because of the parallel-backbone-plus-sequential-head split, confidence-scheduled because verification depth is chosen dynamically rather than fixed.
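
To make the two layers concrete, here is a minimal toy sketch in plain Python. It is not DSpark's implementation, and every name in it (`draft_block`, `bias_fn`, `choose_verify_length`, the `step_time` table) is hypothetical. The first function mirrors the drafting split described above: base logits for all positions are given up front, and a sequential head adds a bias that depends on the tokens drafted so far. The second function is one plausible scheduling rule, not the one from [1]: it treats each confidence score as the chance that a position is accepted given that all earlier ones were, and picks the verification length that maximizes expected tokens per unit of profiled step time.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def draft_block(base_logits, bias_fn, prefix, rng):
    """Sample one token per position, left to right.

    base_logits: one logit vector per draft position, produced in parallel
                 by the backbone (here just a list of lists).
    bias_fn:     hypothetical stand-in for the sequential head; maps the
                 tokens seen so far to a per-vocabulary bias vector.
    """
    drafted = []
    for pos_logits in base_logits:
        bias = bias_fn(prefix + drafted)  # depends on already-drafted tokens
        probs = softmax([l + b for l, b in zip(pos_logits, bias)])
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        drafted.append(token)
    return drafted


def expected_accepted(confidences, k):
    """Expected number of accepted draft tokens when verifying the first k.

    Assumption: confidences[i] is the probability that position i is
    accepted given that every earlier position was accepted.
    """
    total, survive = 0.0, 1.0
    for c in confidences[:k]:
        survive *= c
        total += survive
    return total


def choose_verify_length(confidences, step_time):
    """Pick the verification length with the best expected tokens per time.

    step_time[k] is an assumed profiled latency for a verification step
    covering k draft tokens at the current load. The "+ 1" counts the one
    token the verifier itself contributes on every step.
    """
    best_k, best_rate = 0, 1.0 / step_time[0]
    max_k = min(len(confidences), len(step_time) - 1)
    for k in range(1, max_k + 1):
        rate = (1.0 + expected_accepted(confidences, k)) / step_time[k]
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k


# Toy usage: a 4-token vocabulary, three draft positions, flat base logits.
def penalize_repeat(tokens):
    """Illustrative head: discourage repeating the most recent token."""
    bias = [0.0] * 4
    if tokens:
        bias[tokens[-1]] = -2.0
    return bias


draft = draft_block([[0.0] * 4] * 3, penalize_repeat, [1], random.Random(0))

confidences = [0.9, 0.8, 0.5, 0.2]
idle_curve = [1.0, 1.02, 1.04, 1.06, 1.08]  # extra tokens are nearly free
busy_curve = [1.0, 1.5, 2.5, 3.5, 4.5]      # extra tokens are expensive
print(choose_verify_length(confidences, idle_curve))  # prints 4
print(choose_verify_length(confidences, busy_curve))  # prints 1
```

With the same confidence scores, the flat curve leads this rule to verify all four drafted tokens, while the steep curve cuts verification to one. That is the idle-versus-busy behaviour the scheduler is described as producing, though the real system's objective and latency constraints are not reproduced here.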



