Learn about Google DiffusionGemma

DiffusionGemma is Googles open-source model for text generation, testing a very different idea.

Most local LLMs today are fairly predictable. You download a model, point a program to it, ask a question, and then watch the text move across the screen one token at a time. The model might be better or worse than the one you used yesterday, but the basic experience is usually the same.

DiffusionGemma is different, at least when you run it in visual mode. Google's new experimental Gemma model doesn't just type answers from left to right. Instead, it processes one block of text at a time, gradually replacing and refining tokens until the answer is complete. The effect is similar to watching an image generator denoise an image; it's a process called "diffusion." It's a very different experience from typical LLMs that generate tokens one by one.

One person tested it on a MacBook Pro M4 Pro using 4-bit GGUF via a custom llama.cpp fork detailed by Unsloth. It wasn't faster than running Google's standard Gemma 4 26B-A4B model, and it also made the Mac run slower than with regular LLMs. However, it was a strange, but also incredibly interesting, experience due to its difference from typical autoregressive language models.

What is DiffusionGemma?

DiffusionGemma is Google's experimental open-source model for text generation, experimenting with a very different idea: Instead of writing each token individually like most language models, it composes and refines entire blocks of text in parallel, which Google claims can make text generation up to four times faster on GPUs. It's an open-source Apache 2.0 weighted model based on the Gemma 4 family , built as a 26 billion Mixture-of-Experts model with approximately 4 billion parameters operating in the inference process. It can process text, images, and video as input while producing text output.

DiffusionGemma changes the way text is generated.

Images 1 of Learn about Google DiffusionGemma

DiffusionGemma feels strange because its output doesn't resemble regular text. When visual mode is enabled, you can see a 256-token canvas being rewritten as the model works, with text that looks like placeholders appearing before parts of it change and the answer gradually becoming more coherent. It's not just a line of words appearing at the end of the previous word, and that alone makes it seem like another kind of local model.

You don't need to observe the text generation process for the model to be useful, and many local LLM interfaces are better precisely because they hide the complexities. But in this case, the visual representation did a good job of explaining what makes DiffusionGemma different. You can read about how much text spreads, but seeing the text constantly changing in place makes the concept much easier to understand.

A typical autoregressive model has to commit the next token, then the token after that, and the one after that. It can plan loosely, and good models obviously do, but the token it writes now can't directly base on the exact token it will write 50 tokens later because that token doesn't yet exist. Instead, DiffusionGemma operates on a block, with two-way attention within that canvas. It can use later parts of the block to refine earlier parts, which is why the output might look like it's being focused rather than typed.

That's also why DiffusionGemma feels different from the localized modeling that people are usually used to.

Google's claims about speed need specific context.

Images 2 of Learn about Google DiffusionGemma

DiffusionGemma's strength lies in its speed. In its launch post, Google stated that the model can generate text up to four times faster on dedicated GPUs, with over 1,000 tokens per second on an Nvidia H100 card and over 700 tokens per second on an RTX 5090 card. They also stated that the quantization model can fit within 18GB of VRAM on high-end consumer GPUs.

The results on the M4 Pro were different. While not receiving the usual token read-per-second results, the captured footer reported a total of 137.9 seconds, 123 denoising steps, and 9 blocks, equivalent to 1.121 seconds per step. Since each block is a 256-token canvas, that also translates to 2,304 canvas positions across 123 steps, or approximately 18.7 token positions per denoising step.

Hardware is also crucial. Macs experience system-wide slowdowns while running DiffusionGemma, and it doesn't feel any faster than running Google's standard Gemma 4 26B-A4B model on a local computer. Google warns that Macs using Apple Silicon chips may not achieve similar speed increases because unified memory systems are typically limited by memory bandwidth during inference, while DiffusionGemma's speed boost relies on assigning the dedicated accelerator a heavier computational workload.

That doesn't mean the speed claim is wrong; it just means the interesting part of the run isn't the raw throughput. The interesting part is seeing the model use a distinctly differentiating process, and seeing how that changed the feel of interacting with the local linear model.

Local execution is still in its early stages and somewhat challenging.

The method used to set this up and run is Unsloth GGUF , which depends on the DiffusionGemma branch from an open llama.cpp pull request. Unsloth's instructions build a dedicated llama-diffusion-cli runner, because the standard llama-cli or llama-server path cannot yet be generated from the model.

That difference is crucial if you're used to easily using Ollama or llama.cpp as your local LLM default. This isn't the kind of model you can easily pull into your existing setup and treat as just another GGUF. It needs the right branch, the right runner, and the --diffusion-visual flag if you want the part that makes it more visual. The command to run it with visual output, after compilation, is:

./llama-diffusion-cli -m ./diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 4096 --diffusion-visual

Quantized files are at least feasible for consumer hardware. Unsloth lists the 16GB Q4KM file as the smallest option, with larger 18GB, 21GB, 25GB, and 47GB variants above. That puts the model in the same general range as other large local models that you can run on GPUs with a decent amount of VRAM.

However, this is still an experimental setup. The key now is the model's user support capabilities, operational efficiency, and image quality, rather than the imperfections surrounding a typical, boring model. If you've heard of distributed models and want to experience them for yourself, that's the appeal.

DiffusionGemma is not a direct upgrade over Gemma 4.

The name DiffusionGemma sounds like another member of the Gemma family, and rightly so, but this model has a very different goal. Google describes it as an experimental open model based on the Gemma 4 26B A4B Mixture of Experts architecture, with a total of approximately 26 billion parameters and about 4 billion operational parameters. The difference is that it's a diffusion-based and block-based model, rather than the fundamental idea of ​​a local MoE model.

Google makes it very clear that standard Gemma 4 autoregressive models remain the recommended approach for achieving maximum output quality. DiffusionGemma prioritizes speed and parallel layout creation, and published benchmarks often show it lagging behind the standard Gemma 4 26B A4B model in long-term reasoning, programming, visual, and contextual tests.

The specific test that was performed at least worked. The user asked it to create a Flappy Bird-style game in Python, displayed in a browser and run using Flask, and the generated project worked during testing. The gravitational pull was so strong that it wasn't very smooth to play, but it generated the necessary Flask application, HTML, CSS, and JavaScript to have a working browser game. You can see the full output at this Gist link .

DiffusionGemma is still in its experimental, early stages and not comparable to a typical local LLM. Observing the response denoising process is rather strange and somewhat distracting, but it's really helpful in understanding what Google is trying to achieve, and it makes it easier than ever for people to understand what the diffusion model actually looks like in practice.

Close
Category

System

Windows XP

Windows Server 2012

Windows 8

Windows 7

Windows 10

Wifi tips

Virus Removal - Spyware

Speed ​​up the computer

Server

Security solution

Mail Server

LAN - WAN

Ghost - Install Win

Fix computer error

Configure Router Switch

Computer wallpaper

Computer security

Mac OS X

Mac OS System software

Mac OS Security

Mac OS Office application

Mac OS Email Management

Mac OS Data - File

Mac hardware

Hardware

USB - Flash Drive

Speaker headset

Printer

PC hardware

Network equipment

Laptop hardware

Computer components

Advice Computer

Game

PC game

Online game

Mobile Game

Pokemon GO

information

Technology story

Technology comments

Quiz technology

New technology

British talent technology

Attack the network

Artificial intelligence

Technology

Smart watches

Raspberry Pi

Linux

Camera

Basic knowledge

Banking services

SEO tips

Science

Strange story

Space Science

Scientific invention

Science Story

Science photo

Science and technology

Medicine

Health Care

Fun science

Environment

Discover science

Discover nature

Archeology

Life

Travel Experience

Tips

Raise up child

Make up

Life skills

Home Care

Entertainment

DIY Handmade

Cuisine

Christmas

Application

Web Email

Website - Blog

Web browser

Support Download - Upload

Software conversion

Social Network

Simulator software

Online payment

Office information

Music Software

Map and Positioning

Installation - Uninstall

Graphic design

Free - Discount

Email reader

Edit video

Edit photo

Compress and Decompress

Chat, Text, Call

Archive - Share

Electric

Water heater

Washing machine

Television

Machine tool

Fridge

Fans

Air conditioning

Program

Unix and Linux

SQL Server

SQL

Python

Programming C

PHP

NodeJS

MongoDB

jQuery

JavaScript

HTTP

HTML

Git

Database

Data structure and algorithm

CSS and CSS3

C ++

C #

AngularJS

Mobile

Wallpapers and Ringtones

Tricks application

Take and process photos

Storage - Sync

Security and Virus Removal

Personalized

Online Social Network

Map

Manage and edit Video

Data

Chat - Call - Text

Browser and Add-on

Basic setup