
DiffusionGemma model release
Strategic Overview
- 01. On June 10, 2026, Google DeepMind released DiffusionGemma, an experimental open model in the Gemma 4 family.
- 02. The 26-billion-parameter Mixture-of-Experts model activates approximately 3.8 billion parameters per step and generates entire blocks of text in parallel using text diffusion rather than sequential token prediction.
- 03. DiffusionGemma is distributed under the Apache 2.0 license, with weights available on Hugging Face at google/diffusiongemma-26B-A4B-it. It was optimized in collaboration with NVIDIA for local GPU inference.
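The parameter counts above lend themselves to a quick back-of-the-envelope check. The sketch below is only rough arithmetic, not a measurement: it estimates the share of parameters active per step and the memory needed for the weights alone at a few common precisions. The bytes-per-parameter values are generic assumptions (4-bit formats also carry scaling metadata that is ignored here), and activations, caches, and runtime overhead are left out, so real VRAM use will be higher.

```python
# Rough sizing arithmetic for a sparse MoE model (illustrative only).
TOTAL_PARAMS = 26e9    # total parameters, from the overview
ACTIVE_PARAMS = 3.8e9  # approximate parameters activated per step

# Assumed storage cost per parameter; real checkpoints add metadata.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "4-bit": 0.5}


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used on each step."""
    return active / total


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def summarize() -> dict:
    """Weights-only footprint per precision, keyed by format name."""
    return {name: weight_gb(TOTAL_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
```

Calling `summarize()` gives the weights-only footprint per format, and `active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)` shows that roughly one parameter in seven takes part in any given step.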
Root Analysis
# Demand for faster local inference
Developers seek models that run efficiently on consumer and enterprise GPUs without cloud latency for interactive and real-time applications.
# Exploration of diffusion techniques beyond images
Google DeepMind builds on prior Gemini Diffusion research to apply parallel block generation to language models for potential efficiency gains.
Systemic Impact
# Enables new real-time local workflows
Up to 4x faster inference on dedicated GPUs could support speed-critical applications such as live interactive AI tools running on RTX GPUs or H100 systems.
# Model remains experimental
Because this is an early exploration of diffusion-based text generation, adoption may be limited until further refinements close any gaps in quality or compatibility relative to established autoregressive models.
The Lexicon
text diffusion
Text diffusion is a generative approach in which the model starts with noise and iteratively refines an entire block of text simultaneously rather than predicting one token at a time. This mirrors how image diffusion models like Stable Diffusion create pictures from random noise. In DiffusionGemma, it enables parallel generation of up to 256 tokens per forward pass and real-time self-correction within the block.
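To make the idea of block-wise refinement concrete, here is a toy sketch in Python. It is not DiffusionGemma's algorithm: it assumes a masked-style scheme in which the block starts fully masked and the most confident proposals are committed at each step, and the `toy_denoiser` function is a hypothetical stand-in that simply reads the answer from a known target with random confidences. The point is only the control flow: every open position in the block is considered on every step, rather than left to right.

```python
import random

MASK = "_"  # placeholder for a position that has not been decided yet


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for a model: for every masked slot, propose a
    token and a confidence. A real model would predict these; here the
    token is read from `target` and the confidence is random."""
    return {i: (target[i], rng.random())
            for i, tok in enumerate(block) if tok == MASK}


def generate_block(target, steps=4, seed=0):
    """Fill a whole block over a fixed number of refinement steps."""
    rng = random.Random(seed)
    block = [MASK] * len(target)   # start from an uninformative block
    history = [list(block)]
    for step in range(steps):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break                  # nothing left to refine
        # Commit enough slots per step to finish within the step budget.
        k = -(-len(proposals) // (steps - step))  # ceiling division
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in most_confident:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history
```

Printing each entry of `history` for a short sentence shows positions being filled out of order across the block, which is the behavior that distinguishes this style of decoding from token-by-token generation. Revising already committed tokens (the self-correction described above) is deliberately left out of this sketch.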
Source Articles
Google released DiffusionGemma, an experimental 26B MoE model that generates text in blocks via text diffusion, achieving up to 4x faster inference.
Demis Hassabis praises DiffusionGemma: text generation is 4x faster
Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster
DiffusionGemma, a new experimental open model, is being introduced with faster output and simultaneous text generation capabilities.
DiffusionGemma, an experimental open model for fast text generation, has been released under an Apache 2.0 license.
THE SIGNAL.
"DiffusionGemma represents a fundamentally different approach from other Gemma models by generating text blocks in parallel instead of linearly, potentially improving efficiency on local hardware."
"The model's sparse MoE activation combined with diffusion allows it to fit in 18 GB VRAM when quantized, making high-performance local inference accessible on consumer cards."
"DiffusionGemma is our new experimental open model with up to 4x faster output on dedicated GPUs. Instead of predicting word-by-word, it generates entire blocks of text simultaneously. This lets the model self-correct and format complex markdown in real time."
"Congrats to @GoogleDeepMind on the launch of DiffusionGemma. The model generates 256 tokens in parallel per step, delivering 150+ TPS on DGX Spark, and 1,000+ TPS on a single H100. We're supporting it from day one with: • BF16 and NVFP4 checkpoints on @huggingface🤗 • Free GPU-accelerated endpoints on https://t.co/6T0R9P7EXS • @vllm_project support with FP8 precision Get started with DiffusionGemma on NVIDIA: https://t.co/vurk7GCQUs"
"Want 4x faster local inference on dedicated GPUs for your interactive apps? DiffusionGemma is an experimental, open 26B MoE model that generates entire blocks of text simultaneously instead of token-by-token. By shifting the local decoding bottleneck from memory-bandwidth to compute, it hits speeds over 700 tokens/sec on a single NVIDIA RTX 5090 GPU. This diffusion unlocks unique local workflows like real-time inline editing, code infilling, and instant self-correction. 📥 Download the Apache 2.0 weights on @HuggingFace: https://t.co/L5eqih19T5 📖 Read the full technical announcement on the blog: https://t.co/mESsFJNEDc"

Gemini 3.5 Pro Leaked And Google Is Behind!

Diffusion Gemma: The First Diffusion Model that "Thinks"

DiffusionGemma - Google’s New Local AI Diffusion Model (Day-Zero Setup and Testing)