The encoder-free bet: why 35M beats 550M
The architectural headline isn't size, it's deletion. Gemma 4 12B replaces the 550M-parameter vision tower used in Gemma 3 with a 35M-parameter vision embedder that projects raw 48x48 pixel patches directly into the LLM's hidden dimension [2]. The separate audio encoder, roughly 300M parameters in prior multimodal stacks, is gone entirely; raw 16 kHz audio is sliced into 40-millisecond frames and fed straight into the model's input space [2]. The result is a dense, decoder-only transformer that handles text, vision, and audio with one set of weights [1].
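To make the "no encoder" idea concrete, here is a minimal sketch in plain Python of what patch and frame slicing might look like. Only the 48x48 patch size, the 16 kHz sample rate, and the 40 ms frame length come from the text above. Everything else is an assumption for illustration: the function names `patchify`, `frame_audio`, and `project` are hypothetical, the patches are taken as non-overlapping, a trailing partial audio frame is dropped rather than padded, the embedder is modeled as a single linear projection, and `HIDDEN_DIM` is a toy value, not the real model width.

```python
import random

PATCH = 48               # patch side in pixels (from the text)
SAMPLE_RATE_HZ = 16_000  # audio sample rate (from the text)
FRAME_MS = 40            # audio frame length in ms (from the text)
HIDDEN_DIM = 4           # hypothetical toy width, NOT the real hidden dimension


def patchify(image, patch=PATCH):
    """Split an image (rows of pixels, each pixel a tuple of channels)
    into flattened, non-overlapping patch vectors in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ])
    return patches


def frame_audio(samples, sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Slice raw samples into fixed-length frames.
    Assumption: a trailing partial frame is dropped, not padded."""
    frame_len = sample_rate_hz * frame_ms // 1000  # 640 samples per frame
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


def project(vector, weight):
    """Linear projection: one weight row per output dimension."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


rng = random.Random(0)
image = [[(0.0, 0.5, 1.0)] * 144 for _ in range(96)]  # 96x144 RGB test image
audio = [0.0] * SAMPLE_RATE_HZ                        # one second of silence

patches = patchify(image)    # 2 rows x 3 columns of patches
frames = frame_audio(audio)  # 1000 ms / 40 ms frames

# Hypothetical random weights standing in for the learned embedders.
w_img = [[rng.uniform(-1, 1) for _ in range(len(patches[0]))] for _ in range(HIDDEN_DIM)]
w_aud = [[rng.uniform(-1, 1) for _ in range(len(frames[0]))] for _ in range(HIDDEN_DIM)]

# Both modalities end up as vectors of the same width in one sequence.
tokens = [project(p, w_img) for p in patches] + [project(f, w_aud) for f in frames]
print(len(patches), len(patches[0]), len(frames), len(frames[0]), len(tokens))
```

The point of the sketch is the last step: once patches and frames are projected to the same width, they can sit in one sequence alongside text embeddings, with no separate encoder in between.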
Why does this matter? In the old pattern, you co-tune a frozen vision encoder, a frozen audio encoder, and a language model, each pulling against the others. Google's developer guide is explicit: because vision, audio, and text now share the exact same weights, you no longer have to co-tune separate frozen encoders [3]. That shrinks the memory footprint, simplifies fine-tuning, and means downstream capabilities scale with the LLM rather than being capped by a separate encoder budget. The Decoder's demo, a five-minute video processed as 313 frames plus audio in a single pass, is the practical payoff of that unified design [4].
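For a rough sense of scale, the demo figures can be turned into per-second rates. This is back-of-the-envelope arithmetic, not a description of the actual pipeline: it assumes the clip is exactly 300 seconds long, that the 313 frames are sampled uniformly, and that audio is framed every 40 ms with no further pooling.

```python
VIDEO_SECONDS = 5 * 60  # "five-minute video", assumed to be exactly 300 s
VIDEO_FRAMES = 313      # frame count reported for the demo
AUDIO_FRAME_MS = 40     # audio frame length from the architecture description

# Average spacing between sampled video frames, assuming uniform sampling.
seconds_per_video_frame = VIDEO_SECONDS / VIDEO_FRAMES

# Number of 40 ms audio frames covering the same clip, before any pooling.
audio_frames = VIDEO_SECONDS * 1000 // AUDIO_FRAME_MS

print(f"{seconds_per_video_frame:.2f} s between sampled video frames")
print(f"{audio_frames} audio frames of {AUDIO_FRAME_MS} ms each")
```

Under those assumptions the video is sampled at roughly one frame per second, while the audio track contributes 7,500 raw frames to the same pass.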
The trade-off is that you're betting a single backbone can absorb modality-specific inductive biases without a dedicated encoder learning them. Google's reported numbers, including an overall quality jump of roughly 60% or more for Google AI Edge Eloquent after the upgrade to Gemma 4 12B, suggest the bet is paying off at this scale [2].



