What QAT actually does that PTQ cannot
Post-training quantization takes a trained float model and rounds its weights down to 4 bits after the fact. The model never had a chance to compensate for the rounding error, so accuracy degrades, sometimes sharply on reasoning-heavy tasks. Quantization-Aware Training inverts the order: compression is baked into the training process, with quantization simulated during the forward pass so gradients teach the model to find weight configurations that are inherently robust to rounding [3]. The result is a checkpoint that already 'knows' it will live in 4-bit space, which is why Google argues QAT yields higher overall quality than standard PTQ baselines at the same compressed size [2]. The Gemma 3 generation already proved the recipe out: about 5,000 finetuning steps cut the Q4_0 perplexity regression by 54% versus naive PTQ [7]. Gemma 4 extends that same training-time treatment across the full family from E2B through 31B, with both Q4_0 GGUF and compressed-tensor formats shipped on Hugging Face on day one [1].



