DeepMind's DiffusionGemma speeds text generation 4x

DeepMind

June 12, 2026

Original source: deepmind.google

DiffusionGemma: A new text generation paradigm from DeepMind

DeepMind just dropped DiffusionGemma, a family of text-to-text models that ditch the standard autoregressive approach for something faster. Instead of predicting one token at a time, these models generate entire sequences in parallel using a diffusion process, the same kind of iterative denoising that powers image generators like Stable Diffusion. The headline number: 4x faster text generation. That's not a theoretical speedup; it's measured on a single TPU v4. The models come in two sizes, a 2B-parameter variant and a 9B-parameter variant, both built on the Gemma architecture. DeepMind claims they match or exceed the quality of similarly sized autoregressive models on benchmarks like MMLU and HellaSwag while being significantly faster at inference time.

Why diffusion for text? The autoregressive bottleneck

For years, text generation has been dominated by autoregressive models: think GPT-4, Llama 3, or Gemma itself. They work by predicting the next token given all previous ones, which is inherently sequential. That's fine for short outputs, but for long-form generation (say, a 2,000-word article) you're stuck waiting for each token to be computed one by one. Diffusion models, by contrast, start with random noise and iteratively refine it into a coherent sequence, updating every position in each refinement pass. Because the number of passes can be much smaller than the number of tokens, the model makes far fewer sequential calls, and that parallelism is what gives DiffusionGemma its speed advantage. The trade-off has always been quality: early diffusion text models produced garbled outputs. DeepMind's contribution here is a training recipe that preserves quality while unlocking parallelism. They use a continuous-time diffusion process with a learned noise schedule, plus a novel conditioning mechanism that keeps the model grounded in the input prompt.
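To make the sequential-versus-parallel point concrete, here is a toy sketch in Python. It is not DeepMind's method: `toy_next_token` and `toy_denoise` are hypothetical stand-ins for a real network, and the 32 refinement steps are an assumed number, not one from the announcement. The only thing it shows is how many sequential model calls each style of generation needs.

```python
import random


def autoregressive_generate(next_token, prompt, length):
    """Append `length` tokens one at a time; return (new_tokens, model_calls)."""
    tokens = list(prompt)
    calls = 0
    for _ in range(length):
        # Each call needs the token produced by the previous call.
        tokens.append(next_token(tokens))
        calls += 1
    return tokens[len(prompt):], calls


def diffusion_generate(denoise, prompt, length, steps, seed=0):
    """Refine a noisy sequence for `steps` passes; return (sequence, model_calls)."""
    rng = random.Random(seed)
    seq = [rng.random() for _ in range(length)]  # start from random noise
    calls = 0
    for step in range(steps):
        # One call updates every position of the sequence at once.
        seq = denoise(prompt, seq, step, steps)
        calls += 1
    return seq, calls


def toy_next_token(tokens):
    """Hypothetical stand-in for an autoregressive model."""
    return (len(tokens) * 31) % 100


def toy_denoise(prompt, seq, step, steps):
    """Hypothetical stand-in for a denoiser: pull all values toward a target."""
    target = float(len(prompt))
    remaining = steps - step
    return [x + (target - x) / remaining for x in seq]


prompt = [1, 2, 3]
ar_tokens, ar_calls = autoregressive_generate(toy_next_token, prompt, length=500)
diff_seq, diff_calls = diffusion_generate(toy_denoise, prompt, length=500, steps=32)
print(ar_calls, diff_calls)  # 500 32
```

Fewer calls only translate into a real speedup if each refinement pass is not much more expensive than a single autoregressive step, which is exactly the part a toy like this cannot tell you.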

What 4x faster actually means for developers and users

Let's be concrete. If you're running a chatbot that generates 500-token responses, an autoregressive model might take 2 seconds on a decent GPU. A 4x speedup would cut that to 0.5 seconds. For real-time applications such as live translation, code completion, or interactive storytelling, that's the difference between feeling snappy and feeling sluggish. The 2B model is small enough to run on a phone or edge device, which opens up on-device AI use cases that were previously impractical. But here's the catch: diffusion models are still slower to train, and they require more memory during inference because they process the entire sequence at once. DeepMind hasn't released latency numbers for the 9B model on consumer hardware, so the 4x claim might not hold on a laptop GPU. Still, for cloud deployments where you're paying for accelerator time, a 4x speedup is a direct cost reduction.
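The arithmetic behind that chatbot example is easy to check. The sketch below uses the article's illustrative figures (500 tokens, 2 seconds versus 0.5 seconds) and assumes the accelerator is billed by the hour; `HOURLY_RATE_USD` is a made-up rate, not a real price.

```python
def tokens_per_second(tokens, seconds):
    """Throughput for a single response."""
    return tokens / seconds


def cost_per_response(seconds, hourly_rate_usd):
    """Cost of one response when the accelerator is billed by the hour."""
    return seconds / 3600 * hourly_rate_usd


TOKENS = 500
HOURLY_RATE_USD = 3.60  # hypothetical rate, chosen only to keep the math readable

baseline_tps = tokens_per_second(TOKENS, 2.0)   # 250 tokens/s
diffusion_tps = tokens_per_second(TOKENS, 0.5)  # 1000 tokens/s
speedup = diffusion_tps / baseline_tps          # 4.0

baseline_cost = cost_per_response(2.0, HOURLY_RATE_USD)   # about $0.002
diffusion_cost = cost_per_response(0.5, HOURLY_RATE_USD)  # about $0.0005

print(f"{baseline_tps:.0f} -> {diffusion_tps:.0f} tokens/s ({speedup:.0f}x)")
print(f"${baseline_cost:.4f} -> ${diffusion_cost:.4f} per response")
```

The saving scales with compute time, so it shows up when you rent the hardware yourself. On a flat per-token price from an API provider, the faster model only helps your bill if the provider passes the saving on.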

Open questions: quality, controllability, and real-world deployment

DeepMind's benchmarks show DiffusionGemma matching autoregressive models on standard tests, but benchmarks aren't the real world. How does it handle long-range coherence? Can it maintain a consistent persona over a multi-turn conversation? The paper doesn't address these. Also, diffusion models are notoriously hard to control — you can't easily steer the output with a temperature parameter or top-k sampling the way you can with autoregressive models. DeepMind mentions a 'guidance scale' but doesn't detail how it works. Then there's the question of adoption: the entire AI infrastructure — from Hugging Face to vLLM to TensorRT — is built around autoregressive inference. Getting DiffusionGemma into production will require new tooling. DeepMind has open-sourced the model weights and inference code, but not the training pipeline. That's a smart move for research, but it leaves a gap for anyone wanting to fine-tune or adapt the model.
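Since the announcement doesn't explain the 'guidance scale', here is what the term usually means in image diffusion: classifier-free guidance, where the model's prompt-conditioned prediction is pushed further away from its unconditioned one. Whether DiffusionGemma works this way is an assumption, and `apply_guidance` is a hypothetical helper, not part of any released code.

```python
def apply_guidance(cond, uncond, scale):
    """Classifier-free guidance as commonly used in image diffusion.

    cond:   model prediction with the prompt (list of floats)
    uncond: model prediction without the prompt (list of floats)
    scale:  0 ignores the prompt, 1 returns the conditioned prediction,
            values above 1 extrapolate past it.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [2.0, 0.0]
uncond = [1.0, 1.0]

print(apply_guidance(cond, uncond, 0.0))  # [1.0, 1.0]
print(apply_guidance(cond, uncond, 1.0))  # [2.0, 0.0]
print(apply_guidance(cond, uncond, 2.0))  # [3.0, -1.0]
```

If the model does something along these lines, the scale would act as a knob for how strongly the output sticks to the prompt, which is a different kind of control from temperature or top-k.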

Frequently Asked Questions

How does DiffusionGemma achieve 4x faster text generation?

It uses a diffusion process that generates the entire output sequence in parallel, rather than predicting one token at a time like traditional autoregressive models. This parallelization is what gives the speedup, though it requires more memory during inference.

What sizes does DiffusionGemma come in?

DeepMind released two variants: a 2 billion parameter model and a 9 billion parameter model. Both are built on the Gemma architecture, and the reported speedup was measured on a single TPU v4.

Does DiffusionGemma match the quality of autoregressive models?

On standard benchmarks like MMLU and HellaSwag, it matches or exceeds similarly sized autoregressive models. However, real-world quality — especially for long-form or multi-turn tasks — hasn't been thoroughly tested yet.

Can I run DiffusionGemma on my laptop?

The 2B model is small enough to potentially run on a laptop or even a phone, but DeepMind hasn't published latency numbers for consumer hardware. The 4x speedup claim is based on a single TPU v4, so your mileage may vary.

Is DiffusionGemma open-source?

DeepMind has open-sourced the model weights and inference code, but not the full training pipeline. This means you can run the model, but fine-tuning or adapting it will require reverse-engineering the training process.