Killing the Encoder: How Gemma 4 12B Makes Multimodality Cheap
The most technically consequential thing in the June release is what Google removed. Where conventional multimodal models bolt a separate vision encoder and audio encoder onto a language model, Gemma 4 12B uses a unified, encoder-free architecture in which vision and audio inputs flow directly into the LLM backbone [2]. On the vision side, a 27-layer vision transformer is replaced by a single 35M-parameter vision embedder, and raw 48x48-pixel patches are projected into the model's token space through a single matrix multiply [4]. The audio path is even more aggressive: Google removed the audio encoder entirely and projects the raw 16 kHz signal, in 40 ms (640-sample) frames, into the same dimensional space as text tokens [2].
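To make the "one matrix multiply" idea concrete, here is a minimal pure-Python sketch of the two input paths described above. It is an illustration, not Google's implementation: the patch size, sample rate, and frame length come from the figures quoted here, but `D_MODEL`, the random weights, and the helper names `patchify`, `frame_audio`, and `project` are hypothetical stand-ins, and details such as positional information or normalization are left out.

```python
import random

PATCH = 48            # patch side in pixels (figure quoted in the article)
SAMPLE_RATE = 16_000  # audio sample rate in Hz (figure quoted in the article)
FRAME_MS = 40         # audio frame length in ms (figure quoted in the article)
D_MODEL = 8           # ASSUMPTION: toy width; the real hidden size is not given here


def patchify(image, patch=PATCH):
    """Split an H x W x C nested-list image into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ])
    return patches


def frame_audio(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut a raw waveform into fixed-length frames, dropping any short tail."""
    frame_len = sample_rate * frame_ms // 1000  # 16000 * 40 / 1000 = 640
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


def project(vectors, weight):
    """A single matrix multiply: (n, d_in) @ (d_in, d_out) -> (n, d_out)."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns]
            for vec in vectors]


def random_matrix(rows, cols, rng):
    """ASSUMPTION: random weights standing in for the learned projection."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[[rng.random() for _ in range(3)] for _ in range(96)]
             for _ in range(96)]                                 # 96x96 RGB
    audio = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]  # one second

    patches = patchify(image)    # 4 patches of 48 * 48 * 3 = 6912 values
    frames = frame_audio(audio)  # 25 frames of 640 samples

    vision_tokens = project(patches, random_matrix(len(patches[0]), D_MODEL, rng))
    audio_tokens = project(frames, random_matrix(len(frames[0]), D_MODEL, rng))

    # Both modalities now share the width a text-token embedding would have.
    print(len(vision_tokens), len(audio_tokens), len(vision_tokens[0]))  # 4 25 8
```

The arithmetic is the useful part: one second of 16 kHz audio becomes 25 frames of 640 samples, and a 96x96 RGB image becomes four 6,912-value patches, each turned into one token-width vector by a single projection rather than a stack of encoder layers.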
The payoff is not just elegance; it is memory. Dropping a heavyweight encoder stack is part of why a genuinely multimodal model fits on a 16GB-VRAM laptop, and it is Google's first mid-sized Gemma with native audio at that [2]. For builders, that collapses the gap between 'cloud-only multimodal' and 'something I can run offline,' since speech recognition, OCR, and chart understanding now share one backbone rather than three models stitched together. It is also a bet that a small, learned projection can carry as much signal as a purpose-built ViT, which is exactly the kind of claim the local community is now stress-testing.



