What 'encoder-free' actually does at the byte level
The shorthand 'encoder-free' hides a very specific architectural cut. In other mid-sized Gemma 4 models, a 550M-parameter Transformer vision encoder converts pixels into latents before the LLM sees anything; audio gets a separate ~305M-parameter conformer stack [1]. In Gemma 4 12B Unified, both stacks are gone. A 48x48 pixel image patch is fed through a single matrix multiplication in a 35M-parameter embedder and projected straight into the LLM's token space; raw 16 kHz audio is sliced into 40 ms frames and linearly projected into the same embedding space as text tokens, with no feature extraction and no conformer layers [1]. Maarten Grootendorst's visual guide notes that the 12B's transformer core looks 'rather similar' to the 31B dense Gemma 4 [2]: the surgery is at the front door, not the spine. Hugging Face describes the net result as 'no separate vision or audio encoder' where 'all modalities flow into a single decoder-only transformer' [3]. That transformer has 11.95B parameters and a 256K-token context window.
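
To make the "single matrix multiplication" concrete, here is a minimal NumPy sketch of the input path described above. It is an illustration under stated assumptions, not Gemma's actual code: the 48x48 patch size, 16 kHz sample rate, and 40 ms frame length come from the text, while the RGB channel count, the tiny embedding width `D_MODEL`, the random weights, and the helper names `patchify` and `frame_audio` are hypothetical.

```python
import numpy as np

PATCH = 48             # patch side in pixels (from the text)
CHANNELS = 3           # assumption: RGB input
SAMPLE_RATE = 16_000   # Hz (from the text)
FRAME_MS = 40          # milliseconds per audio frame (from the text)
D_MODEL = 64           # hypothetical embedding width, kept tiny for illustration


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    image = image[: rows * patch, : cols * patch]  # drop any ragged border
    blocks = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows * cols, patch * patch * c)


def frame_audio(wave: np.ndarray, sample_rate: int = SAMPLE_RATE,
                frame_ms: int = FRAME_MS) -> np.ndarray:
    """Slice a 1-D waveform into non-overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000  # 640 samples at 16 kHz / 40 ms
    n_frames = len(wave) // frame_len
    return wave[: n_frames * frame_len].reshape(n_frames, frame_len)


rng = np.random.default_rng(0)

# One projection matrix per modality; random stand-ins for learned weights.
patch_dim = PATCH * PATCH * CHANNELS            # 6912 values per patch
frame_dim = SAMPLE_RATE * FRAME_MS // 1000      # 640 samples per frame
W_image = rng.normal(scale=0.02, size=(patch_dim, D_MODEL))
W_audio = rng.normal(scale=0.02, size=(frame_dim, D_MODEL))

image = rng.random((96, 144, CHANNELS))         # 2 x 3 grid of patches
wave = rng.standard_normal(SAMPLE_RATE)         # one second of audio

# No encoder layers: each modality is one matmul away from token space.
image_tokens = patchify(image) @ W_image
audio_tokens = frame_audio(wave) @ W_audio
text_tokens = rng.normal(size=(5, D_MODEL))     # stand-in for embedded text

# All three share one embedding space, so they concatenate into one sequence.
sequence = np.concatenate([image_tokens, audio_tokens, text_tokens], axis=0)
print(image_tokens.shape, audio_tokens.shape, sequence.shape)
```

The point of the sketch is the shape bookkeeping: a patch or an audio frame becomes one row of a matrix, and a single projection maps that row to the same width as a text embedding, so the decoder receives one interleaved sequence. Details the text does not specify, such as positional information, normalization, or how modalities are ordered, are left out.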



