Google DeepMind releases DiffusionGemma text-diffusion model

Strategic Overview

  • 01. Google DeepMind released DiffusionGemma, an experimental open-weights model built on the 26B A4B Mixture-of-Experts Gemma 4 architecture that generates tokens using discrete text diffusion rather than autoregression.
  • 02. Instead of predicting one token at a time, it denoises a fixed 256-token canvas in parallel, activates only 3.8B of its 26B parameters during inference, and fits in roughly 18GB of VRAM when quantized.
  • 03. It reaches 700+ tokens/sec on an RTX 5090 and over 1,000 tokens/sec on an H100, up to 4x faster than autoregressive models, and ships under Apache 2.0 on Hugging Face with day-zero support in vLLM, Transformers, MLX, and Unsloth.
  • 04. Output quality is lower than standard Gemma 4 on general benchmarks, and Google recommends standard Gemma 4 for maximum-quality production use.

How it works: denoising a 256-token canvas instead of guessing one token at a time

DiffusionGemma abandons the autoregressive loop that mainstream LLMs rely on. Instead of predicting the next token conditioned on everything before it, the model starts each block with a canvas of random placeholder tokens and iteratively locks in confident tokens until the whole block snaps into focus [1]. Every forward pass processes the full canvas of up to 256 tokens rather than emitting a single token [2], locking in roughly 15-20 tokens per pass and refining the rest across iterations [3].
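The loop described above can be sketched as confidence-ranked unmasking. This is a minimal toy under stated assumptions, not DiffusionGemma's sampler: `toy_denoiser` is a hypothetical stand-in that always proposes the right token with a random confidence, the canvas uses empty slots instead of random placeholder tokens, and the sketch never revises a token once it is locked. The budget of 16 locks per pass is an illustrative value inside the 15-20 range cited above.

```python
import random

MASK = None  # marks a canvas slot that has not been locked yet


def toy_denoiser(canvas, target, rng):
    """Hypothetical model stand-in: propose (token, confidence) for each open slot."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok is MASK
    }


def denoise_block(target, lock_per_pass=16, seed=0):
    """Fill one canvas by locking the most confident proposals on every pass."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while any(tok is MASK for tok in canvas):
        proposals = toy_denoiser(canvas, target, rng)
        # Rank open slots by confidence and lock only the top few.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _confidence) in ranked[:lock_per_pass]:
            canvas[pos] = tok
        passes += 1
    return canvas, passes


target = [f"tok{i}" for i in range(256)]
canvas, passes = denoise_block(target)
print(f"filled {len(canvas)} slots in {passes} forward passes")
```

With a 256-slot canvas and a fixed budget of 16 locks per pass, the toy needs 256 / 16 = 16 passes, compared with 256 passes for strictly one-token-at-a-time decoding.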

The architecture is a hybrid: diffusion within each block, autoregressive across blocks. The decisive property is bidirectional attention. Because the model sees the whole canvas at once, it drafts entire paragraphs rather than individual next-token guesses, which supports global logical consistency [3], and it can self-correct, revising tokens it placed earlier in the same block. That same global view is what makes it natively suited to non-linear tasks like in-line editing and code infilling, where an autoregressive model would have to regenerate from the edit point forward. The model card frames the shift as moving from token-by-token autoregression to block-autoregressive multi-canvas sampling [4].
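One common way to express "bidirectional within a block, autoregressive across blocks" is a block-causal attention mask. The sketch below is an assumption about how such a hybrid could be wired, not something the cited model card confirms; `causal_mask` and `block_causal_mask` are illustrative helper names, and the tiny sizes are chosen only so the pattern is readable.

```python
def causal_mask(seq_len):
    """Standard autoregressive mask: position q sees only positions k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def block_causal_mask(seq_len, block_size):
    """Position q sees every position in its own block and in all earlier blocks."""
    return [
        [k // block_size <= q // block_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


def render(mask):
    """Draw a mask as rows of 'x' (may attend) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


mask = block_causal_mask(seq_len=6, block_size=3)
print(render(mask))
```

In the block-causal mask, position 0 can attend to position 2 in its own block, which a causal mask forbids, but it still cannot see position 3 in the next block. That look-ahead inside the block is what lets a model condition on text on both sides of a gap when infilling.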

Why it's fast: flipping the local-inference bottleneck from memory bandwidth to compute

DiffusionGemma generation throughput climbs from 700 tokens/sec on a consumer RTX 5090 to 1,288 on a datacenter H200.

The speedup is not a free lunch from a better algorithm; it comes from moving work to where consumer hardware has spare capacity. On local GPUs the main bottleneck for autoregressive models is memory bandwidth: each token requires streaming the full weight set, and the GPU's compute units sit largely idle waiting on memory [5]. DiffusionGemma's parallel block decoding inverts this. Pulling a full 256-token block through the transformer in parallel is a compute-bound workload, exactly what NVIDIA GPUs are built for [2], so the model trades memory-bandwidth pressure for compute it can actually use [6].
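The shift from memory-bound to compute-bound can be made concrete with arithmetic intensity, the number of FLOPs performed per byte of weights streamed. The sketch below is back-of-the-envelope arithmetic under loud assumptions: 4-bit weights (0.5 bytes per parameter), roughly 2 FLOPs per active parameter per token, and one full pass over the weight set per forward pass, following the article's wording. Attention cost, KV-cache traffic, and expert routing are ignored.

```python
ACTIVE_PARAMS = 3.8e9          # active parameters per token (from the article)
TOTAL_PARAMS = 26e9            # total parameters (from the article)
BYTES_PER_PARAM = 0.5          # assumption: 4-bit quantized weights
FLOPS_PER_PARAM_PER_TOKEN = 2  # assumption: one multiply and one add per weight


def arithmetic_intensity(tokens_per_pass):
    """FLOPs performed per byte of weights streamed in one forward pass."""
    flops = FLOPS_PER_PARAM_PER_TOKEN * ACTIVE_PARAMS * tokens_per_pass
    bytes_streamed = TOTAL_PARAMS * BYTES_PER_PARAM
    return flops / bytes_streamed


one_token = arithmetic_intensity(1)      # autoregressive, single user
full_canvas = arithmetic_intensity(256)  # one 256-token canvas per pass

print(f"one token per pass:  {one_token:.2f} FLOPs/byte")
print(f"256 tokens per pass: {full_canvas:.2f} FLOPs/byte")
print(f"ratio: {full_canvas / one_token:.0f}x")
```

Under these assumptions each streamed byte does 256 times more work when a whole canvas goes through in one pass, which is why the same weight traffic keeps far more of the GPU's compute busy.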

The numbers bear this out: 700+ tokens/sec on an RTX 5090 [5], over 1,000 tok/s on a single H100, roughly 5x the autoregressive baseline [2], and 1,288 generation tok/s on an H200, around 6x autoregressive and 3x multi-token prediction [6]. Thanks to the MoE design (26B total parameters but only 3.8B active [1]), the quantized model fits in about 18GB of VRAM, landing it on a single RTX 5090 or 4090 [5]. Google's bet, made explicit at launch, is that this can upend the cost economics of local AI: fast, low-VRAM, no cloud dependency. The local-inference community agreed loudly, with Unsloth, llama.cpp, and SGLang shipping same-day support and comparing the throughput to Groq- and Cerebras-class hardware.
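The memory figures can be sanity-checked with simple arithmetic. The sketch below assumes that weight storage dominates and that "quantized" means roughly 4 bits per parameter; the article does not state the bit width, so treat both as assumptions. The helper name `weight_gb` is illustrative.

```python
TOTAL_PARAMS = 26e9      # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9    # active parameters per token (from the article)
REPORTED_VRAM_GB = 18    # approximate quantized footprint (from the article)


def weight_gb(bits_per_param):
    """Weight storage in GB (10^9 bytes) at a given precision, ignoring all overhead."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9


for bits in (4, 8, 16):
    size = weight_gb(bits)
    verdict = "within" if size <= REPORTED_VRAM_GB else "over"
    print(f"{bits:>2}-bit weights: {size:5.1f} GB ({verdict} the reported {REPORTED_VRAM_GB} GB)")

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active per token: {active_fraction:.1%} of all parameters")
```

Only the 4-bit row lands under the reported footprint, leaving a few GB for everything the sketch ignores, such as activations and cache. Note that all 26B parameters still have to be resident in memory; the 3.8B active figure reduces compute per token, not storage.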

The contrarian read: 4x faster, but where does it break?

The dominant tension across the technical community is speed versus accuracy, and the skeptics have receipts. A widely shared factual-recall benchmark pitted standard Gemma 4 against DiffusionGemma, and the diffusion model came out roughly 4x faster but several times more error-prone on grounded facts, fabricating names and details that an autoregressive model got right. That tracks with the official guidance: DiffusionGemma scores below standard Gemma 4 on general benchmarks including MMLU and coding evals, and Google explicitly recommends standard Gemma 4 for maximum-quality production use [5].

The speed advantage is also conditional. Parallel decoding wins specifically where the GPU has spare compute and memory bandwidth is the bottleneck: single-user, low-concurrency, local workloads. Under high-concurrency server batching, autoregressive models already saturate compute, so the diffusion gains diminish [3]. There are ecosystem gaps too: the model needs a specialized drafter module that is not yet in some mainstream runtimes like mlx-lm and LM Studio, and diffusion LLMs (dLLMs) need a custom serving path outside the standard autoregressive stack [7]. The community's emerging consensus reframes the model accordingly: not as a drop-in Gemma 4 replacement, but as a fast explorer or sub-agent for constraint-heavy, verifiable tasks like code infilling, where bidirectional attention shines and a downstream check can catch errors.
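A toy roofline model illustrates why the advantage shrinks under batching. Everything hardware-specific here is a hypothetical placeholder rather than a measured spec: `MEM_BW` and `PEAK_FLOPS` are round illustrative numbers, weights are assumed to be 4-bit, and 16 tokens are assumed to be locked per pass. The absolute speedups it prints are not predictions and will not match the benchmarks quoted in this article; only the downward trend is the point.

```python
ACTIVE_PARAMS = 3.8e9      # active parameters per token (from the article)
WEIGHT_BYTES = 26e9 * 0.5  # assumption: 26B parameters stored at 4 bits each
CANVAS = 256               # tokens processed per diffusion pass (from the article)
LOCKED_PER_PASS = 16       # assumption: tokens committed per diffusion pass
MEM_BW = 1.8e12            # hypothetical memory bandwidth, bytes/sec
PEAK_FLOPS = 4.0e14        # hypothetical peak compute, FLOPs/sec


def step_time(tokens_processed):
    """Roofline estimate: a pass takes as long as its slower resource."""
    memory_time = WEIGHT_BYTES / MEM_BW
    compute_time = 2 * ACTIVE_PARAMS * tokens_processed / PEAK_FLOPS
    return max(memory_time, compute_time)


def autoregressive_tps(concurrency):
    """Each pass emits one token per concurrent request."""
    return concurrency / step_time(concurrency)


def diffusion_tps(concurrency):
    """Each pass processes a full canvas per request but commits only a few tokens."""
    return LOCKED_PER_PASS * concurrency / step_time(CANVAS * concurrency)


def speedup(concurrency):
    return diffusion_tps(concurrency) / autoregressive_tps(concurrency)


for users in (1, 8, 64):
    print(f"{users:>2} concurrent requests: diffusion/autoregressive = {speedup(users):.2f}x")
```

In this toy, a single user leaves the autoregressive model waiting on memory while the diffusion pass fills the idle compute. As concurrency grows, the diffusion passes hit the compute ceiling first, and batching lets the autoregressive model amortize the same weight traffic over more tokens, so the ratio falls.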

Historical Context

2021
D3PM formulated a multinomial noise schedule for discrete tokens, an early foundation for discrete text diffusion.
2025-02
LLaDA showed that text diffusion scales, with an 8B-parameter model competitive with LLaMA 3 8B.
2025-02
Inception announced Mercury Coder, a high-speed commercial diffusion language model.
2025-05-20
Google DeepMind unveiled Gemini Diffusion, an experimental text diffusion model and the research lineage DiffusionGemma derives from, at Google I/O 2025.
2026-06-10
Google released DiffusionGemma, bringing text diffusion to the open-weights Gemma family under Apache 2.0.

Power Map

Key Players
Google DeepMind

Developer and releaser of DiffusionGemma; open-sourced it under Apache 2.0, extending its earlier Gemini Diffusion research into the Gemma open-weights family.

NVIDIA

Hardware and software partner; provides NVFP4 quantization and NeMo support, and accelerates the compute-bound parallel decoding on RTX and H100 hardware, framing local diffusion inference as ideal for its GPUs.

vLLM project

Serving framework that made DiffusionGemma the first diffusion LLM natively supported in vLLM, building a custom path for bidirectional attention and block-based generation.

Hugging Face

Primary open distribution platform; hosts the official google/diffusiongemma-26B-A4B-it weights and NVIDIA's NVFP4 variant.

Fact Check

8 cited
  [1] DiffusionGemma
  [2] DiffusionGemma Runs Locally on NVIDIA RTX GPUs
  [3] Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes
  [4] DiffusionGemma model card
  [5] Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation
  [6] Day-Zero Support for DiffusionGemma in vLLM
  [7] Google's New Open Model Generates Text Like an Image
  [8] google/diffusiongemma-26B-A4B-it

THE SIGNAL.

Analysts

"Diffusion LLMs require a fundamentally different serving path than autoregressive models, trading memory-bandwidth pressure for additional compute via parallel token refinement."

vLLM team
vLLM project, official blog

"Parallel block decoding is a compute-bound workload that suits GPU strengths: pulling a full 256-token block through the transformer in parallel is exactly what NVIDIA GPUs are built for."

NVIDIA (RTX AI Garage)
NVIDIA corporate blog

"Generating full paragraphs in parallel with bidirectional attention gives global logical consistency and enables non-linear tasks like in-line editing and code infilling."

VentureBeat
VentureBeat, technology coverage

"Bidirectional attention makes DiffusionGemma especially strong on constraint-heavy tasks like code infilling, but it needs a specialized drafter module not yet in mainstream runtimes like LM Studio."

Decrypt
Decrypt, technology coverage

The Crowd

"DiffusionGemma is our new experimental open model with up to 4x faster output on dedicated GPUs. Instead of predicting word-by-word, it generates entire blocks of text simultaneously. This lets the model self-correct and format complex markdown in real time."

@GoogleDeepMind

"Google releases DiffusionGemma.✨ The new 26B-A4B diffusion text model runs locally on 18GB RAM. It supports high-speed text generation, thinking, image, video and 256K context. Run and train via Unsloth Studio. GGUF: https://t.co/ZH0dCJQ59P Guide: https://t.co/wYLfJWE6kG"

@UnslothAI

"We made DiffusionGemma run via llama.cpp locally! It works well with Unsloth GGUFs and you can run it in realtime visualization mode or normal chat CLI mode! See our docs https://t.co/IslbgeCs7Z on how to set it up!"

@danielhanchen

"DiffusionGemma: 4x faster text generation"

@u/tevlon964

Broadcast

Diffusion Gemma: Google's First Open Diffusion Model

DiffusionGemma First Look. The Pros and Cons - 16GB Local LLM setup

DiffusionGemma: 1100 Tokens/sec: Google's Fastest Open Model Yet Locally