The architectural mechanism: how 850M of multimodal encoders collapsed into a 35M embedder
The headline trick in Gemma 4 12B is that Google DeepMind threw out the conventional vision-encoder plus audio-encoder stack and routed pixels and waveforms directly into the LLM backbone [1]. Where Gemma 3 carried a 550M-parameter vision encoder and a 300M-parameter audio encoder, Gemma 4 12B replaces both with a 35M-parameter vision embedder and a direct audio wave projection. Counting the embedder against the combined 850M of the old encoders, that is roughly a 24x reduction in non-LLM multimodal weight [2]. The vision tokenizer slices inputs into 48x48 pixel patches and passes them through what is effectively a single matrix multiplication before the tokens reach the same transformer that processes text.
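The patch-and-project step can be sketched in a few lines of plain Python. This is a minimal illustration, not Gemma's actual tokenizer: the RGB channel count, the toy 96x96 image, the embedding width of 4, and the helper names `patchify` and `embed` are all assumptions made for the example. Only the 48x48 patch size comes from the description above.

```python
PATCH = 48      # patch side in pixels (from the text)
CHANNELS = 3    # assumption: RGB input
D_MODEL = 4     # assumption: toy embedding width, far smaller than a real model


def patchify(image, patch=PATCH):
    """Split an H x W x C nested-list image into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ])
    return patches


def embed(patches, weight):
    """Project every flattened patch with one matrix multiplication.

    `weight` is stored as D_MODEL columns, each of length patch * patch * C.
    """
    return [
        [sum(x * w for x, w in zip(patch, column)) for column in weight]
        for patch in patches
    ]


# Toy 96x96 RGB image with every channel set to 1.0.
image = [[[1.0] * CHANNELS for _ in range(96)] for _ in range(96)]
patch_dim = PATCH * PATCH * CHANNELS            # 6912 values per patch
weight = [[0.5] * patch_dim for _ in range(D_MODEL)]

tokens = embed(patchify(image), weight)
print(len(tokens), len(tokens[0]))              # 4 tokens, each of width 4
```

In this sketch the embedder holds patch_dim * D_MODEL weights and nothing else, which is the sense in which a single projection stays small next to a multi-layer encoder.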
The payoff is that the model spends its parameter budget inside the unified backbone, where representational power actually compounds, instead of duplicating semantics in modality-specific encoders. MarkTechPost's analysis argues this is why the 12B variant lands close to the 26B Mixture-of-Experts model in capability at less than half the total memory footprint [2]. It also helps explain why the encoder-free design ships in a configuration that fits on a 16GB consumer laptop or an Apple Silicon Mac with unified memory [1]. Google's internal AI Edge Eloquent app reportedly logged a quality jump of more than 60% after switching to Gemma 4 12B. That is the kind of delta you'd expect if the unified backbone is finally learning shared cross-modal structure that the old encoder split was discarding.
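A back-of-the-envelope calculation shows what the 16GB claim implies about weight precision. The sketch below assumes roughly 12 billion parameters (inferred from the model name), counts weights only, and uses decimal gigabytes. The bit widths are illustrative choices, not a statement about which formats Google ships.

```python
PARAMS = 12e9          # assumption: ~12B parameters, inferred from the model name
BUDGET_GB = 16         # memory budget cited in the text


def weight_gb(params, bits_per_weight):
    """Weight storage in decimal gigabytes, ignoring KV cache and activations."""
    return params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):    # illustrative precisions, not confirmed release formats
    size = weight_gb(PARAMS, bits)
    verdict = "fits" if size < BUDGET_GB else "does not fit"
    print(f"{bits}-bit weights: {size:.0f} GB, {verdict} in {BUDGET_GB} GB")
```

Under these assumptions, 16-bit weights alone would exceed 16 GB, so a laptop-sized configuration implies some form of lower-precision weights, with the remaining headroom going to the KV cache, activations, and the operating system.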



