Google DeepMind launches DiffusionGemma open model

Strategic Overview

01.
Google DeepMind released DiffusionGemma on June 10, 2026, an experimental open model that uses text diffusion to generate text in parallel blocks rather than sequentially token-by-token, claiming up to 4x faster inference on dedicated GPUs.
02.
It is a 26B-parameter Mixture-of-Experts model built on the Gemma 4 (26B-A4B) backbone fused with Gemini Diffusion research, activating only 3.8B parameters per forward pass, and is multimodal with text, image, and video input producing text output.
03.
The weights ship under the permissive Apache 2.0 license on Hugging Face, and NVIDIA has optimized the model across RTX and DGX hardware. Google steers quality-critical production workloads to standard autoregressive Gemma 4.

Denoising a paragraph instead of typing it: how text diffusion actually works

Every mainstream large language model since GPT has written the same way a person types: one token at a time, left to right, each word conditioned on the words already committed. DiffusionGemma throws that out. Instead of predicting the next word, it drafts an entire 256-token paragraph simultaneously and then refines it ^[1]. The process borrows directly from image diffusion. The model starts with a block of 256 random placeholder tokens and refines them across several passes until readable text emerges ^[5], running up to roughly 48 denoising steps and resolving on the order of 15 to 20 tokens per forward pass ^[6].
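The refinement loop described above can be illustrated with a toy sketch. It is not DeepMind's decoder: the scorer is a random stand-in for a forward pass, a single hypothetical `<mask>` symbol replaces the random placeholder tokens the article mentions, and the names `toy_scores` and `denoise_block` are invented for illustration. It only shows how committing a fixed number of tokens per pass plays out against a step cap.

```python
import random

MASK = "<mask>"  # hypothetical placeholder symbol, not the model's real token


def toy_scores(canvas, rng):
    """Stand-in for one forward pass: a confidence for every masked slot.

    A real model would score all positions from the whole canvas at once;
    the scores here are random so the sketch runs without any weights.
    """
    return {i: rng.random() for i, tok in enumerate(canvas) if tok == MASK}


def denoise_block(block_len=256, tokens_per_pass=16, max_steps=48, seed=0):
    """Resolve a masked block a few slots at a time, up to max_steps passes."""
    rng = random.Random(seed)
    canvas = [MASK] * block_len
    steps = 0
    while MASK in canvas and steps < max_steps:
        scores = toy_scores(canvas, rng)
        # Commit the most confident slots this pass; the rest stay masked.
        best = sorted(scores, key=scores.get, reverse=True)[:tokens_per_pass]
        for i in best:
            canvas[i] = f"tok{i}"  # dummy token text
        steps += 1
    return canvas, steps


canvas, steps = denoise_block()
print(steps, canvas.count(MASK))  # 16 0
```

Under these assumptions, resolving 16 tokens per pass clears a 256-token block in 16 passes, well inside the cap of roughly 48 steps the article cites. At 5 tokens per pass the same cap would leave 16 slots unresolved.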

The architectural unlock that makes this work is bi-directional attention. In an autoregressive model, a token can only see what came before it. In DiffusionGemma, each token can reference every other token during generation, including ones that come later ^[5]. That single property is the source of nearly every interesting behavior the model has, from self-correction to fill-in-the-middle editing. Under the hood it remains a Gemma 4 model: a 26B-parameter Mixture-of-Experts that activates roughly 3.8B parameters, firing 8 of its 128 experts per pass, with a 256K context window and a 262K-token vocabulary trained across 140-plus languages ^[4]. The diffusion head is grafted onto that proven backbone rather than trained from scratch, which is why Google can ship it as an open Gemma sibling rather than a wholly new research artifact.
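The difference between the two attention patterns can be sketched with plain boolean masks. This is a generic illustration of causal versus bi-directional visibility, not code from the model; the helper names are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """List the positions that position i is allowed to see."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 6
print(visible_positions(causal_mask(n), 2))         # [0, 1, 2]
print(visible_positions(bidirectional_mask(n), 2))  # [0, 1, 2, 3, 4, 5]
```

Position 2 in the causal case sees only itself and what came before it. In the bi-directional case it also sees positions 3 to 5, which is what lets later text influence earlier slots.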

Why a consumer RTX card is the surprise winner

The headline numbers are genuinely large: up to 4x faster inference on dedicated GPUs ^[1], with DeepMind's own model page citing 4x to 5x ^[2]. In raw throughput that translates to more than 1,000 tokens per second on a single NVIDIA H100, with The Decoder citing around 1,100 ^[5], over 700 tokens per second on a GeForce RTX 5090 ^[6], and a striking spread on NVIDIA's own boxes: about 150 tokens per second on a DGX Spark, scaling to as much as 2,000 on a DGX Station ^[3].

The reason these gains land hardest on local hardware is a bottleneck swap. Autoregressive decoding is memory-bandwidth bound: the GPU spends most of its time streaming weights from memory to produce one token, leaving its arithmetic units idle. Generating a full 256-token block in parallel makes the workload compute-bound instead, shifting the bottleneck from memory bandwidth to raw compute ^[5]. Consumer cards like the RTX 5090 match the hardware profile that benefits, with abundant compute and more modest bandwidth, which is why enthusiast circles have compared the combination to dedicated inference accelerators. NVIDIA leaned in, adding native NVFP4 support on Blackwell and tuning the model across RTX and DGX lines ^[3]. Quantized, the model fits within 18GB of VRAM ^[1], putting fast local generation within reach of a single high-end gaming GPU. The caveat the architecture cannot escape: even with only 3.8B active parameters, all 26B weights must still be resident in VRAM, so the memory footprint is set by the full model, not the active slice.
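A back-of-the-envelope estimate makes the footprint caveat concrete. The parameter counts come from the article; the bit widths are assumptions chosen for illustration, since the source does not say which quantization produces the 18GB figure. The estimate covers weights only and ignores the KV cache, activations, and any quantization metadata.

```python
def weight_footprint_gb(total_params, bits_per_weight):
    """Bytes needed to hold every weight, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # from the article
ACTIVE_PARAMS = 3.8e9  # from the article

# Assumed precisions, for illustration only.
for bits in (16, 8, 4):
    print(bits, round(weight_footprint_gb(TOTAL_PARAMS, bits), 1))
# 16 52.0
# 8 26.0
# 4 13.0

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(round(active_fraction, 3))  # 0.146
```

Under these assumptions a 4-bit copy of all 26B weights needs about 13GB, which is at least consistent with the quoted 18GB ceiling once runtime overhead is added. Only about 15% of the parameters are active per pass, but the footprint follows the full 26B.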

Fill-in-the-middle for free: the use cases that actually fit

DiffusionGemma is not a general-purpose chatbot replacement, and its best applications follow directly from bi-directional attention. The standout is code infilling. Because the model can condition on text both before and after a gap, it can perform fill-in-the-middle completion with none of the special FIM tokens or scaffolding that autoregressive code models require: you hand it a function with a hole in the middle and it fills the hole while respecting both sides. The same property powers mid-paragraph insertion and even constraint-satisfaction puzzles like Sudoku, where the answer depends on cells in every direction at once ^[5].
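One way to picture infilling on a diffusion canvas is sketched below. It is a layout illustration only: the `<mask>` symbol, the helper names, and the hand-written proposal are all hypothetical, and no real model or tokenizer is involved. The known prefix and suffix sit on either side of a masked gap, and only the gap is open for editing.

```python
MASK = "<mask>"  # hypothetical placeholder symbol


def build_infill_canvas(prefix, suffix, hole_len):
    """Lay out known tokens on both sides of a masked gap."""
    canvas = list(prefix) + [MASK] * hole_len + list(suffix)
    editable = set(range(len(prefix), len(prefix) + hole_len))
    return canvas, editable


def fill(canvas, editable, proposal):
    """Write proposed tokens into editable slots only; context stays fixed."""
    out = list(canvas)
    for slot, tok in zip(sorted(editable), proposal):
        out[slot] = tok
    return out


prefix = ["def", "area", "(", "r", ")", ":"]
suffix = ["return", "result"]
canvas, editable = build_infill_canvas(prefix, suffix, hole_len=5)

# The proposal stands in for what a denoiser would produce for the gap.
filled = fill(canvas, editable, ["result", "=", "3.14159", "*", "r**2"])
print(" ".join(filled))
# def area ( r ) : result = 3.14159 * r**2 return result
```

Because the suffix is already on the canvas, a bi-directional model could see `return result` while choosing what goes in the gap. That is the property the article credits for removing the need for FIM sentinel tokens.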

Beyond editing, the model's ability to revise opens a second class of workloads. Bidirectional context enables real-time self-correction through re-noising, where the model can revisit and rewrite tokens it has already drafted — something autoregressive models, which commit each token permanently, simply cannot do ^[6]. That makes it a natural fit for generator-verifier agentic loops, classification and labeling, context compression, and rapid creative-writing prototyping, the use cases the local-model community has gravitated to first. Google frames the target squarely as latency-sensitive tooling: inline editing, code infilling, and rapid iteration where speed and bidirectional structure outweigh peak quality ^[1]. The strategic read is that Google is not trying to dethrone Gemma 4 here; it is opening a second lane optimized for interaction speed and non-linear text, and shipping it open under Apache 2.0 to let the community find the edges.
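Re-noising can be pictured as putting low-confidence tokens back into the masked state so a later pass can rewrite them. The source does not describe how the model decides which tokens to revisit, so the threshold rule, the scores, and the `renoise` helper below are illustrative assumptions.

```python
MASK = "<mask>"  # hypothetical placeholder symbol


def renoise(tokens, confidences, threshold=0.5):
    """Re-mask drafted tokens whose confidence falls below the threshold."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]


draft = ["The", "cat", "sat", "in", "the", "mat"]
conf = [0.98, 0.91, 0.88, 0.31, 0.95, 0.90]  # made-up scores

print(renoise(draft, conf))
# ['The', 'cat', 'sat', '<mask>', 'the', 'mat']
```

An autoregressive decoder has no equivalent step: once "in" is emitted, every later token is conditioned on it.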

The contrarian read: lower quality, and it may not even feel faster

The most important asterisk comes from Google itself. DiffusionGemma prioritizes speed and parallel layout generation, and its overall output quality is lower than standard Gemma 4 ^[6]. The benchmarks bear out a jagged profile: respectable on MMLU Pro at 77.6% and GPQA Diamond at 73.2%, but a steep drop on hard reasoning, where AIME falls to 69.1% versus 88.3% for standard Gemma 4 ^[4]. A skeptical minority in the local-model community has seized on Google's own quality disclaimer to argue the intelligence gap is not worth the speed for many tasks.

There is also a subtler, experiential catch that raw tokens-per-second hides. With diffusion you wait for an entire block to finish denoising before you see anything, which means you lose the token-by-token streaming that makes autoregressive chat feel responsive — so a model that is faster on paper may not feel faster to a human watching a cursor. And the economics narrow the audience further: in multi-user cloud serving, autoregressive models retain their hardware-efficiency advantage, which can make DiffusionGemma more expensive to operate at scale and confines its sweet spot to local, single-user inference ^[6]. Taken together, the honest framing is the one Google gave it — an experimental tool with a specific shape, not a drop-in upgrade. For maximum quality, deploy standard Gemma 4 ^[1].
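The streaming point can be made concrete with rough arithmetic. The 1,100 tokens-per-second figure is the H100 number cited earlier; the autoregressive baseline is an assumption derived from the "up to 4x" claim, not a measured value. The sketch ignores prompt processing and assumes a full 256-token block must finish before anything is shown.

```python
def seconds_until_first_text(tokens_before_display, tokens_per_second):
    """Time spent generating before the first text can be displayed."""
    return tokens_before_display / tokens_per_second


BLOCK = 256
DIFFUSION_TPS = 1100            # H100 figure cited in the article
AUTOREGRESSIVE_TPS = 1100 / 4   # assumption: derived from the "up to 4x" claim

diffusion_wait = seconds_until_first_text(BLOCK, DIFFUSION_TPS)
autoregressive_wait = seconds_until_first_text(1, AUTOREGRESSIVE_TPS)
autoregressive_block = seconds_until_first_text(BLOCK, AUTOREGRESSIVE_TPS)

print(round(diffusion_wait, 3))        # 0.233
print(round(autoregressive_wait, 4))   # 0.0036
print(round(autoregressive_block, 3))  # 0.931
```

Under these assumptions the diffusion model finishes the whole block about four times sooner (0.23s versus 0.93s), yet the first visible text arrives far later than the autoregressive model's first token. Faster to complete and slower to start can both be true.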

Historical Context

2025-05

Gemini Diffusion, an earlier experimental diffusion text demo, reached roughly 1,479 tokens per second and forms the research basis for DiffusionGemma's diffusion head.

2026-01

Mercury 2, a competing diffusion-based reasoning model, arrived as a contemporaneous entrant in the diffusion-LLM space.

2026-06-10

DiffusionGemma released as the first open-weights diffusion entry in the Gemma family, built on the Gemma 4 26B-A4B architecture.

Power Map

Key Players

Google DeepMind

Developer and releaser of DiffusionGemma; ships weights under Apache 2.0 and frames it as an experimental open model for speed-critical local workflows, while steering production users to Gemma 4.

NVIDIA

Hardware partner; optimized DiffusionGemma across GeForce RTX, the RTX PRO platform, DGX Spark and DGX Station, with native NVFP4 support on Blackwell GPUs and CUDA/Tensor Core acceleration for the compute-bound parallel block workload.

Hugging Face

Distribution platform hosting the official google/diffusiongemma-26B-A4B-it weights alongside community GGUF quantizations such as Unsloth's.

THE SIGNAL.

Analysts

"DiffusionGemma deliberately trades output quality for speed and parallel layout generation. It is built for interactive, single-user local workflows rather than high-quality production serving, and DeepMind explicitly recommends standard Gemma 4 when output quality matters most."

Google DeepMind (official positioning)

Model developer

"The model's real edge is non-linear tasks such as code infilling, mid-paragraph insertion, and constraint-satisfaction problems, because bi-directional attention lets each token reference every other token, including ones that come later. The speed gain comes from shifting the GPU bottleneck from memory bandwidth to raw compute."

The Decoder

Tech publication analysis

"Bidirectional context enables real-time self-correction through re-noising, something autoregressive models that commit tokens once cannot do. However, autoregressive models retain a hardware-efficiency advantage in multi-user cloud serving, so DiffusionGemma's economics favor local inference."

MarkTechPost

Tech publication analysis

The Crowd

"DiffusionGemma is our new experimental open model with up to 4x faster output on dedicated GPUs. Instead of predicting word-by-word, it generates entire blocks of text simultaneously. This lets the model self-correct and format complex markdown in real time."

@GoogleDeepMind

"Google releases DiffusionGemma.✨ The new 26B-A4B diffusion text model runs locally on 18GB RAM. It supports high-speed text generation, thinking, image, video and 256K context. Run and train via Unsloth Studio. GGUF: https://t.co/ZH0dCJQ59P Guide: https://t.co/wYLfJWE6kG"

@UnslothAI

"DiffusionGemma: 4x faster text generation"

@u/tevlon683

"Google releases new DiffusionGemma model."

@u/yoracale393

Broadcast

DiffusionGemma: 1100 Tokens/sec: Google's Fastest Open Model Yet Locally

DiffusionGemma: the LLM that writes text ALL AT ONCE — live on an RTX 4090

AI News: Google's New Gemma Model Is 4X Faster