Learn about Google DiffusionGemma
DiffusionGemma is Google's open-source model for text generation, testing a very different idea.
Most local LLMs today are fairly predictable. You download a model, point a program to it, ask a question, and then watch the text move across the screen one token at a time. The model might be better or worse than the one you used yesterday, but the basic experience is usually the same.
DiffusionGemma is different, at least when you run it in visual mode. Google's new experimental Gemma model doesn't just type answers from left to right. Instead, it processes one block of text at a time, gradually replacing and refining tokens until the answer is complete. The effect is similar to watching an image generator denoise an image; it's a process called "diffusion." It's a very different experience from typical LLMs that generate tokens one by one.
One person tested it on a MacBook Pro M4 Pro using 4-bit GGUF via a custom llama.cpp fork detailed by Unsloth. It wasn't faster than running Google's standard Gemma 4 26B-A4B model, and it also made the Mac run slower than with regular LLMs. However, it was a strange, but also incredibly interesting, experience due to its difference from typical autoregressive language models.
What is DiffusionGemma?
DiffusionGemma is Google's experimental open-source model for text generation, experimenting with a very different idea: Instead of writing each token individually like most language models, it composes and refines entire blocks of text in parallel, which Google claims can make text generation up to four times faster on GPUs. It's an open-source Apache 2.0 weighted model based on the Gemma 4 family , built as a 26 billion Mixture-of-Experts model with approximately 4 billion parameters operating in the inference process. It can process text, images, and video as input while producing text output.
DiffusionGemma changes the way text is generated.
DiffusionGemma feels strange because its output doesn't resemble regular text. When visual mode is enabled, you can see a 256-token canvas being rewritten as the model works, with text that looks like placeholders appearing before parts of it change and the answer gradually becoming more coherent. It's not just a line of words appearing at the end of the previous word, and that alone makes it seem like another kind of local model.
You don't need to observe the text generation process for the model to be useful, and many local LLM interfaces are better precisely because they hide the complexities. But in this case, the visual representation did a good job of explaining what makes DiffusionGemma different. You can read about how much text spreads, but seeing the text constantly changing in place makes the concept much easier to understand.
A typical autoregressive model has to commit the next token, then the token after that, and the one after that. It can plan loosely, and good models obviously do, but the token it writes now can't directly base on the exact token it will write 50 tokens later because that token doesn't yet exist. Instead, DiffusionGemma operates on a block, with two-way attention within that canvas. It can use later parts of the block to refine earlier parts, which is why the output might look like it's being focused rather than typed.
That's also why DiffusionGemma feels different from the localized modeling that people are usually used to.
Google's claims about speed need specific context.
DiffusionGemma's strength lies in its speed. In its launch post, Google stated that the model can generate text up to four times faster on dedicated GPUs, with over 1,000 tokens per second on an Nvidia H100 card and over 700 tokens per second on an RTX 5090 card. They also stated that the quantization model can fit within 18GB of VRAM on high-end consumer GPUs.
The results on the M4 Pro were different. While not receiving the usual token read-per-second results, the captured footer reported a total of 137.9 seconds, 123 denoising steps, and 9 blocks, equivalent to 1.121 seconds per step. Since each block is a 256-token canvas, that also translates to 2,304 canvas positions across 123 steps, or approximately 18.7 token positions per denoising step.
Hardware is also crucial. Macs experience system-wide slowdowns while running DiffusionGemma, and it doesn't feel any faster than running Google's standard Gemma 4 26B-A4B model on a local computer. Google warns that Macs using Apple Silicon chips may not achieve similar speed increases because unified memory systems are typically limited by memory bandwidth during inference, while DiffusionGemma's speed boost relies on assigning the dedicated accelerator a heavier computational workload.
That doesn't mean the speed claim is wrong; it just means the interesting part of the run isn't the raw throughput. The interesting part is seeing the model use a distinctly differentiating process, and seeing how that changed the feel of interacting with the local linear model.
Local execution is still in its early stages and somewhat challenging.
The method used to set this up and run is Unsloth GGUF , which depends on the DiffusionGemma branch from an open llama.cpp pull request. Unsloth's instructions build a dedicated llama-diffusion-cli runner, because the standard llama-cli or llama-server path cannot yet be generated from the model.
That difference is crucial if you're used to easily using Ollama or llama.cpp as your local LLM default. This isn't the kind of model you can easily pull into your existing setup and treat as just another GGUF. It needs the right branch, the right runner, and the --diffusion-visual flag if you want the part that makes it more visual. The command to run it with visual output, after compilation, is:
./llama-diffusion-cli -m ./diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 4096 --diffusion-visual
Quantized files are at least feasible for consumer hardware. Unsloth lists the 16GB Q4KM file as the smallest option, with larger 18GB, 21GB, 25GB, and 47GB variants above. That puts the model in the same general range as other large local models that you can run on GPUs with a decent amount of VRAM.
However, this is still an experimental setup. The key now is the model's user support capabilities, operational efficiency, and image quality, rather than the imperfections surrounding a typical, boring model. If you've heard of distributed models and want to experience them for yourself, that's the appeal.
DiffusionGemma is not a direct upgrade over Gemma 4.
The name DiffusionGemma sounds like another member of the Gemma family, and rightly so, but this model has a very different goal. Google describes it as an experimental open model based on the Gemma 4 26B A4B Mixture of Experts architecture, with a total of approximately 26 billion parameters and about 4 billion operational parameters. The difference is that it's a diffusion-based and block-based model, rather than the fundamental idea of a local MoE model.
Google makes it very clear that standard Gemma 4 autoregressive models remain the recommended approach for achieving maximum output quality. DiffusionGemma prioritizes speed and parallel layout creation, and published benchmarks often show it lagging behind the standard Gemma 4 26B A4B model in long-term reasoning, programming, visual, and contextual tests.
The specific test that was performed at least worked. The user asked it to create a Flappy Bird-style game in Python, displayed in a browser and run using Flask, and the generated project worked during testing. The gravitational pull was so strong that it wasn't very smooth to play, but it generated the necessary Flask application, HTML, CSS, and JavaScript to have a working browser game. You can see the full output at this Gist link .
DiffusionGemma is still in its experimental, early stages and not comparable to a typical local LLM. Observing the response denoising process is rather strange and somewhat distracting, but it's really helpful in understanding what Google is trying to achieve, and it makes it easier than ever for people to understand what the diffusion model actually looks like in practice.
- 10 creative ways to use Google Keep every day
- Use Google Chrome to learn foreign languages when browsing the web
- Sort and filter data
- Learn about the new Google Sheets
- Google Translate is about to have a 'language learning' feature to compete with Duolingo?
- 9 extremely interesting secrets about Google you may not know yet
- Instructions for checking Google DNS settings after changes
- Free online learning about AI and Machine learning on Google website