TECH

Gemma 4 QAT checkpoints for on-device deployment

30+

Signals

Strategic Overview

01.
Google DeepMind released Quantization-Aware Training (QAT) checkpoints for the full Gemma 4 family (E2B, E4B, 12B, 26B-A4B, 31B) on Hugging Face, with both Q4_0 GGUF and compressed-tensor variants ready for llama.cpp, Ollama, LM Studio, vLLM, MLX, LiteRT-LM, and Transformers.js.
02.
Q4_0 QAT cuts Gemma 4 E2B from 9.6GB BF16 to 3.2GB and E4B from 15GB to 5GB, while a new mobile-specialized quantization schema pushes E2B to roughly 1GB, with text-only variants dropping below 1GB.
03.
Independent measurement by Unsloth shows roughly 72% memory reduction across the family, with recommended runtime memory of 3GB (E2B), 5GB (E4B), 7GB (12B), 15GB (26B-A4B), and 18GB (31B).
04.
The mobile-specialized format uses 2-bit decoding layers, optimized KV caches, and static activations to deliver up to 2x faster inference and 40-50% lower runtime memory versus FP16 on mobile-class NPUs.

What QAT actually does that PTQ cannot

Post-training quantization takes a trained float model and rounds its weights down to 4 bits after the fact. The model never had a chance to compensate for the rounding error, so accuracy degrades, sometimes sharply on reasoning-heavy tasks. Quantization-Aware Training inverts the order: compression is baked into the training process, with quantization simulated during the forward pass so gradients teach the model to find weight configurations that are inherently robust to rounding ^[3]. The result is a checkpoint that already 'knows' it will live in 4-bit space, which is why Google argues QAT yields higher overall quality than standard PTQ baselines at the same compressed size ^[2]. The Gemma 3 generation already proved the recipe out: about 5,000 finetuning steps cut the Q4_0 perplexity regression by 54% versus naive PTQ ^[7]. Gemma 4 extends that same training-time treatment across the full family from E2B through 31B, with both Q4_0 GGUF and compressed-tensor formats shipped on Hugging Face on day one ^[1].

The mobile format: under 1GB and up to 2x faster

The headline number is E2B at roughly 1GB, with text-only variants that drop the audio and vision encoders going below 1GB ^[2]. That is not just a smaller Q4_0 file. The mobile-specialized schema layers 2-bit decoding layers on top of QAT, plus optimized KV caches and static activations engineered for mobile-class accelerators ^[4]. The combined effect is up to 2x faster inference and roughly 40-50% lower runtime memory than FP16 on mobile NPUs ^[8]. Structurally, that closes the gap between 'a model you can technically load on a phone' and 'a model you can serve interactively to a user.' Google's positioning is explicit: this release is what makes Gemma 4 'even more efficient, so you can run models locally on everyday edge devices and consumer GPUs' ^[1]. Independent measurement matches the story upward: Unsloth pegs the family-wide reduction at about 72%, with the 12B fitting in 7GB and the 26B-A4B in 15GB of runtime memory, which puts mainstream laptops and single-card desktops back in play ^[5].

The conversion-fidelity gap: why naive Q4_0 still bleeds accuracy

The catch is that QAT is necessary but not sufficient. Unsloth showed that a naive Q4_0 conversion from Google's 26B-A4B QAT checkpoint still drops top-1 accuracy to 70.2%; their dynamic GGUF (UD-Q4_K_XL) recovers it to 85.6%, a +15.4 point swing on the same underlying weights ^[3]. The 31B sees a similar jump from 87.9% to 96.7% (+8.8) ^[5]. The implication for developers is concrete: pick your converter carefully. Google's prebuilt GGUFs and the Unsloth dynamic recipes are the safe paths; cobbling Q4_0 together from the unquantized release with a generic pipeline is the path that leaves quality on the table. Open-source LLM developer communities surfaced this caveat almost immediately, with the strongest community measurement being that Q4 QAT done well actually beats traditional Q8 on KL divergence (0.014 vs 0.159) — inverting the usual 'more bits is always better' intuition. The countervailing reminder, also widely shared, is that QAT does not make models 'smarter' in any absolute sense; it just trains them to degrade less when squeezed.

Why same-day ecosystem coordination is the real shipping story

Open-weights releases live or die on tooling. Gemma 4 QAT shipped with simultaneous support across llama.cpp, Ollama, LM Studio, vLLM, MLX, LiteRT-LM, and Transformers.js, covering CPU laptops, Apple Silicon, NVIDIA consumer GPUs, server-class batched inference, mobile NPUs, and the browser in a single coordinated drop ^[2]. The two-format strategy is deliberate: GGUF for the desktop and local crowd, w4a16 compressed-tensors for vLLM-backed serving stacks ^[6]. The mobile QAT schema and LiteRT-LM together signal Google's intent to make Gemma the default for on-device assistants, where private, offline, low-latency inference is a structural advantage that cloud APIs cannot match ^[4]. The 12B Q4_0 GGUF card alone pulled 4,674 downloads in the prior month ^[6], and the social response was instant and coordinated: official Google channels and tooling partners like Unsloth all shipped same-day announcements, with the dominant high-engagement angle being '3x less memory, 26B-A4B on 16GB RAM.' Developer YouTube and the r/LocalLLaMA community immediately shifted to hands-on benchmarking — early reports of doubled throughput on a single consumer-grade GPU using QAT plus Multi-Token Prediction read as proof that the headline numbers translate into real-world performance, not just file size. Hardware-focused reviewers ran 12B QAT through structured coding and multimodal evaluations on entry-level setups like a 16GB Mac Mini under LM Studio, framing it as the moment local AI becomes honest on consumer hardware.

Historical Context

2025-04-19

Google first introduced QAT checkpoints for Gemma 3 (1B, 4B, 12B, 27B), shrinking the 27B from 54GB BF16 to 14.1GB int4 so it could fit on a single RTX 3090 with 24GB of VRAM.

2025-04-19

The Gemma 3 QAT models were finetuned for roughly 5,000 steps and cut the llama.cpp perplexity drop from Q4_0 quantization by 54% versus naive post-training quantization, establishing the recipe Gemma 4 now extends.

2026-06-05

Google DeepMind released Gemma 4 QAT checkpoints, extending the approach to the new generation with Q4_0 plus a novel mobile-specialized format that pushes E2B under 1GB and targets phone-class deployment via LiteRT-LM.

Power Map

Key Players

Subject

Gemma 4 QAT checkpoints for on-device deployment

Google DeepMind

Released the Gemma 4 QAT checkpoints with both the standard Q4_0 format and a first-of-its-kind mobile quantization schema designed to fit capable models inside consumer RAM and VRAM budgets.

Hugging Face

Hosts the official google/* QAT checkpoints in both GGUF and compressed-tensor (w4a16-ct) formats across every Gemma 4 size, including the 12B-it-qat-q4_0-gguf checkpoint that has already pulled thousands of downloads.

llama.cpp, Ollama, and LM Studio

Provide the desktop runtime path for the GGUF Q4_0 checkpoints, enabling local CPU and GPU inference on laptops and consumer desktops out of the box.

vLLM

Server-class inference runtime that consumes the compressed-tensors w4a16-ct variants, putting QAT-quantized Gemma 4 into production-grade batched serving stacks.

LiteRT-LM and Transformers.js

Google's lightweight runtime targets the new mobile QAT schema for optimized edge deployment, while Transformers.js lets the same checkpoints run directly inside the browser.

Unsloth

Independent open-source team shipping dynamic GGUF variants (UD-Q4_K_XL) that recover most of the quality lost by naive Q4_0 conversion of the QAT checkpoints, plus measurement work on the full family.

Fact Check

8 cited

Source Articles

Top 5

THE SIGNAL.

Analysts

"Position QAT as a quality-preserving alternative to standard post-training quantization, arguing the same compressed size delivers higher overall quality because the model is trained to be robust to rounding errors rather than retrofitted afterwards."

Google AI team

Model authors, Google DeepMind

"Frames the release as the end of the 'too big to run locally' era, arguing that capable open-weights models are now practical on consumer hardware and that the mobile format with 2-bit decoding and optimized KV caches is the structural shift, not just the size number."

lymy1205 (DEV community)

Independent developer analysis on DEV

"Caution that naive Q4_0 conversion from Google's QAT checkpoint still drops accuracy meaningfully; their dynamic GGUF method recovers 26B-A4B top-1 from 70.2% to 85.6% (+15.4%) and 31B from 87.9% to 96.7% (+8.8%), with Q4 QAT beating traditional Q8 in KLD (0.014 vs 0.159)."

Unsloth team

Open-source quantization tooling maintainers

The Crowd

"Google releases Gemma 4 QAT. You can now run Gemma 4 at 3x less memory with near original performance. Quantization-Aware Training (QAT) makes it possible to run Gemma 4 26B-A4B on 16GB RAM. GGUFs: https://t.co/wQgEocxUId QAT Guide: https://t.co/Nsm1yeGEHx"

@@UnslothAI2689

"Gemma 4 quantization-aware training (QAT) models are now available, bringing AI performance directly to edge devices and consumer GPUs. These checkpoints are optimized with quantization-aware training to dramatically reduce memory requirements and unlock high-speed local"

@@googledevs1094

"Introducing Gemma 4 QAT - Quantization aware training to reduce models' precision while preserving quality - Introducing a new mobile quantization format that reduces memory footprint of E2B to 1GB - Q4 for all your favorite libraries"

@@osanseviero841

"Gemma 4 with quantization-aware training"

@u/rerri748

Broadcast

Gemma4 12B in Quantization-Aware Training (QAT) with Ollama - Full Testing

Gemma 4 12B Is INSANE - Is THIS the BEST Local Coding Model Yet?

Gemma 4 12B on a 16GB Mac Mini Is Surprisingly Capable