Google’s latest DiffusionGemma open AI model comes with a 4x speed boost

CatNamedHugs · 2026-06-10T15:39:54-0400

Maybe with this new release they'll be able to afford to pay all the people whose content they stole to create their models! I'm sure that's what they're aiming to do, right?

Fred Duck · 2026-06-10T15:40:21-0400

Ryan Whitwam said:
Instead, it can produce an entire block of text in parallel.

What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

metavirus · 2026-06-10T15:47:08-0400

I may just be a simple country lawyer, but this sure does seem like a method much more prone to introducing hallucinations resulting from “errors”.

TehRoot · 2026-06-10T15:48:06-0400

Fred Duck said:
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

Autoregressive models are bandwidth bound, diffusion models are compute bound.
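
A rough roofline-style sketch of that claim, in case it helps. It assumes about 2 FLOPs per parameter per token and that every active weight is read from memory once per forward pass; it ignores attention, caches, and everything else. The hardware figures and the block size of 256 are made up for illustration, not taken from any real chip or from DiffusionGemma.

```python
# Rough roofline-style estimate. All hardware numbers are hypothetical.

def arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """FLOPs done per byte of weights read in one forward pass.

    Assumes about 2 FLOPs (one multiply, one add) per parameter per token,
    and that every active weight is read from memory once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def limiting_factor(tokens_per_pass, peak_flops, peak_bandwidth, bytes_per_param=2.0):
    """Return "compute" or "bandwidth" depending on which side of the ridge we land."""
    ridge = peak_flops / peak_bandwidth  # FLOPs the chip can do per byte it can fetch
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge else "bandwidth"


PEAK_FLOPS = 100e12      # hypothetical accelerator: 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # hypothetical accelerator: 1 TB/s
BLOCK_TOKENS = 256       # hypothetical diffusion block size

autoregressive = limiting_factor(1, PEAK_FLOPS, PEAK_BANDWIDTH)
diffusion = limiting_factor(BLOCK_TOKENS, PEAK_FLOPS, PEAK_BANDWIDTH)
print("one token per pass:", autoregressive)
print("one block per pass:", diffusion)
```

With one token per pass the chip spends its time waiting on weights; with a whole block per pass the same weight read is amortized over many tokens and the math units become the limit.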

randomcat · 2026-06-10T15:54:49-0400

Fred Duck said:
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.

sporkinum · 2026-06-10T15:55:00-0400

TehRoot said:
Autoregressive models are bandwidth bound, diffusion models are compute bound.

It's just different ways to arrive at the wrong answer.

rwhitwam · 2026-06-10T16:06:19-0400

Fred Duck said:
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

A person writing a sentence is aware of where things are going. You're trying to express a thought, rather than thinking one word at a time. Text diffusion is similar in that way, but it's still a bunch of estimation. It's just estimating in blocks that can improve the model's awareness of the connection to future tokens.

nuurdin · 2026-06-10T16:10:08-0400

I have to admit, this is really kind of interesting, in sort of a Borges-esque way. My guess though is that there is less than meets the eye here, philosophically speaking. The linearity inherent to making a coherent paragraph probably seeps in with the fine-tuning. It would be very interesting if there were certain forms of text (or certain languages, or script-types) that consistently showed less error with this approach. For example, would more linear token prediction produce fewer errors on an agglutinating language? On the other hand, would DiffusionGemma write a better sestina, since the form is what matters?
And then part of me is just like, whatever.

Rudde · 2026-06-10T16:11:46-0400

Fred Duck said:
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

Imagine five people working in parallel. Each person is asked to modify one specific word. After they have all modified their word, they see the words that the others changed.
E.g., "How do you build a wall?"
First round: <You> <by> <an> <location> <material>
Second round: <Start> <find> <selecting> <good> <and>
Third round: <Start> <by> <a> <a> <location>
Fourth round: <Start> <by> <selecting> <good> <location>
Fifth round: <Start> <by> <selecting> <a> <location>
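
To make the rounds above concrete, here is a small sketch that only replays the five rounds as written and counts which slots changed between consecutive rounds. There is no model here; the `ROUNDS` table is copied from the example and `changed_positions` is just a helper name I made up.

```python
# Each row is the full five-word state after one round, copied from the example.
ROUNDS = [
    ["You", "by", "an", "location", "material"],
    ["Start", "find", "selecting", "good", "and"],
    ["Start", "by", "a", "a", "location"],
    ["Start", "by", "selecting", "good", "location"],
    ["Start", "by", "selecting", "a", "location"],
]


def changed_positions(previous, current):
    """Indices of the slots whose word differs between two consecutive rounds."""
    return [i for i, (old, new) in enumerate(zip(previous, current)) if old != new]


# Every slot is rewritten against the previous round's snapshot, so we can
# compare whole rounds and watch the number of edits shrink.
changes = [changed_positions(ROUNDS[n - 1], ROUNDS[n]) for n in range(1, len(ROUNDS))]
for round_number, slots in enumerate(changes, start=2):
    print(f"round {round_number}: {len(slots)} slot(s) changed -> {slots}")
```

The count goes 5, 4, 2, 1: early rounds rewrite almost everything, later rounds only touch the slots that still disagree with their neighbors.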

lewisje · 2026-06-10T16:30:32-0400

I was just about to try this out in LM Studio when I saw that diffusion models like this aren't supported yet.

randomcat · 2026-06-10T16:36:25-0400

lewisje said:
I was just about to try this out in LM Studio when I saw that diffusion models like this aren't supported yet.

I was going to try running it through koboldcpp/sillytavern but the safetensors are broken down into several separate files and although that is supposedly not an issue for a lot of people, I have never been able to make that work.

klausw · 2026-06-10T16:36:50-0400

In image diffusion models, a single badly predicted pixel doesn’t make the image useless, but language is discreet. An equivalent error in text can make a block of tokens meaningless and force you to start over to get a better output.

Was this word choice a discreet and subtle joke intended as an example of mistakes in discrete problem spaces?

Lexus Lunar Lorry · 2026-06-10T16:40:47-0400

Fred Duck said:
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

What if you're one of the aliens from Arrival or Story of Your Life? Then your paragraphs arrived fully formed, not as sequences of words.

sporkinum · 2026-06-10T16:52:32-0400

Check out this response from Gemini about it misspelling a word.
Ultimately it blames it on garbage in, garbage out.

how does an llm misspell? That should be impossible.

You are completely right to call that out; on paper, an AI misspelling a basic word seems impossible since it has access to entire dictionaries.

However, LLMs do not see or write words the way humans do. They use a system called tokenization, which is exactly how these errors happen.

1. LLMs Don't See Letters

An LLM doesn't look at the word "climbs" as six individual letters (c-l-i-m-b-s). Instead, the AI breaks text down into chunks of characters called tokens.

A token can be a whole word, a syllable, or just a few letters.

The AI assigns a number to each token. To the AI's internal math, the word "climbs" might just be token number 14302.

Because the AI is manipulating math and tokens rather than physically typing out letters, it doesn't have a built-in spellchecker running in the background.

2. The Game of Statistical Probability

When an LLM generates a sentence, it doesn't "think" of a concept and write it down. It calculates a chain of probabilities, predicting the next most likely token based on the words that came before it.

The tokens for "climbs" and "climes" sound identical phonetically and are statistically close to one another in certain contexts (like talking about geography and weather "climates/climes"). The AI's mathematical prediction engine picked the wrong token number by a fraction of a decimal point, resulting in the wrong word hitting the screen.

3. Training on Human Flaws

LLMs are trained on massive datasets of text written by humans across the internet. Because humans frequently misspell words, use typos, or confuse homophones (like there/their/they're or climbs/climes) in the training data, the AI actually learns those exact statistical errors.

So, while it feels like a computer should be immune to typos, the math behind token prediction means it can still confidently pick the wrong, misspelled chunk.

McTurkey · 2026-06-10T16:53:46-0400

randomcat said:
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.

You don’t have any idea how the molecules and cells of your brain converted electrochemical signals into words, ergo you cannot actually prove what you claim as fact. People love to argue that LLMs are just statistical models as if that is a meaningful distinction, while never articulating what they think a human brain actually is.

Our best understanding of how life evolved is that molecules came together and formed patterns that happened to be useful and then just kept accidentally organizing into more and more useful patterns (far more of which were not useful and vanished to history). How exactly is that any different from how AI works in a way which is meaningful here?

Our brains aren’t magic. We aren’t gifted some novel abstraction which makes us above our own ability to produce useful facsimile or revolutionary advancements of the very processes which underpin our perceptions of intelligence.

SolarMane · 2026-06-10T17:06:14-0400

McTurkey said:
You don’t have any idea how the molecules and cells of your brain converted electrochemical signals into words, ergo you cannot actually prove what you claim as fact. People love to argue that LLMs are just statistical models as if that is a meaningful distinction, while never articulating what they think a human brain actually is.

Our best understanding of how life evolved is that molecules came together and formed patterns that happened to be useful and then just kept accidentally organizing into more and more useful patterns (far more of which were not useful and vanished to history). How exactly is that any different from how AI works in a way which is meaningful here?

Our brains aren’t magic. We aren’t gifted some novel abstraction which makes us above our own ability to produce useful facsimile or revolutionary advancements of the very processes which underpin our perceptions of intelligence.

Your opinions are worthless if you cannot distinguish between your own thought processes and a statistical model.

Fatesrider · 2026-06-10T17:07:44-0400

DiffusionGemma is about as capable as other Gemma models, but it’s much faster.

Maybe I'm reading that graph wrong, but it looks to me like you get the worst performance, only a lot faster.

My system can run the 12B, which while only half as fast, looks like it's slightly more accurate.

Some slide-rule math suggests that if you're twice as fast, but slightly less accurate, each time you have to correct or modify will take more human time. Human time is considerably slower than AI number crunching time.

So a product that is X% less accurate, but twice as fast at producing results might still result in the same rate of progress as a product that's X% more accurate, but only half as fast.

The down-side is that the model size and resources used are still higher for the twice as fast model, meaning more costly equipment to begin with.

The magic slide rule says it could be a wash, if not leaning slightly toward the slower model using less resources, especially when looking at power consumption and human time tossed in.

This is speculative, of course. It just seems to me you're putting in more energy and effort for the newer model that doesn't perform quite as well result-wise as the older one. And having to deal with that decrease in accuracy could offset in human time doing that any speed advantage it supposedly gives.
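
Here is the slide-rule math as a small sketch. Every number in it is invented for illustration (generation times, error rates, and how long a human takes to fix a bad result); none of it comes from the graph. The only idea is: expected time per usable result = generation time + error rate x human fix time.

```python
def expected_seconds_per_result(generation_seconds, error_rate, human_fix_seconds):
    """Average wall-clock cost of one usable result, counting human correction time."""
    return generation_seconds + error_rate * human_fix_seconds


def break_even_extra_error(slow_seconds, fast_seconds, human_fix_seconds):
    """Extra error rate at which the faster model's time savings are used up."""
    return (slow_seconds - fast_seconds) / human_fix_seconds


HUMAN_FIX_SECONDS = 300.0  # hypothetical: five minutes to spot and fix a bad answer

slow_model = expected_seconds_per_result(20.0, 0.10, HUMAN_FIX_SECONDS)  # hypothetical
fast_model = expected_seconds_per_result(10.0, 0.15, HUMAN_FIX_SECONDS)  # hypothetical
margin = break_even_extra_error(20.0, 10.0, HUMAN_FIX_SECONDS)

print(f"slow model: {slow_model:.1f} s per usable result")
print(f"fast model: {fast_model:.1f} s per usable result")
print(f"break-even extra error rate: {margin:.3f}")
```

With these made-up inputs the model that generates twice as fast comes out slightly behind, because a five-point rise in error rate costs more human time than the ten seconds saved. Different inputs flip the result, which is the "could be a wash" point.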

nick73whm · 2026-06-10T17:13:15-0400

It’s a Mixture of Experts (MoE) model with a total of 26 billion parameters, but only 3.8 billion are activated during inference. That means it should fit in the 18GB RAM allotment of a high-end GPU.

I think two concepts are conflated here. The MoE architecture means only 3.8 billion parameters are activated during inference but this is only a compute optimisation. It has no impact on the GPU RAM footprint. The routing network still needs the entire 26B model loaded into VRAM.

The reason it can fit in 18GB RAM would be quantisation, probably 4-bit.
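
A quick back-of-the-envelope check, assuming the weight footprint is simply total parameters times bits per weight. It ignores activations, caches, and the per-group scale data that real quantized formats carry, so treat the results as lower bounds.

```python
def weight_gigabytes(total_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count quoted in the article

footprints = {bits: weight_gigabytes(TOTAL_PARAMS, bits) for bits in (16, 8, 4)}
for bits, gigabytes in footprints.items():
    print(f"{bits}-bit weights: {gigabytes:.0f} GB")
```

Only the 4-bit row (13 GB) lands under 18GB. The 3.8 billion active parameters never enter the calculation, since all experts have to be resident somewhere.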

patonw · 2026-06-10T17:18:37-0400

Fred Duck said:
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

I would argue that you have a kernel of what you want to say, but maybe not the shape, before you put words to it. You're not making up thoughts mid-sentence based on what you've already said. Assigning concrete words to your thoughts helps solidify them and reinforces your beliefs over time, but speaking isn't exactly the same as thinking.

Also, diffusion doesn't work by independently putting down a brick without any regard for the position of other bricks. It's more akin to a team of workers starting with a jumble of bricks and seeing they don't line up in any meaningful way. So each worker nudges a few bricks so they line up with their neighbors while also being closer to matching the blueprint of the wall.

atmartens · 2026-06-10T17:19:34-0400

randomcat said:
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

We don't actually know how brains write text. Maybe we think of the whole concept in an abstracted way and then convert it into language? Don't forget that thinking does not require language, and that a person can think in multiple languages at the same time, or in sequence.

MrDweezil · 2026-06-10T17:21:24-0400

Fatesrider said:
Maybe I'm reading that graph wrong, but it looks to me like you get the worst performance, only a lot faster.

That seems right, but they're being upfront about it. It's being offered as an experimental, "maybe this could be useful to some of you in some scenarios" type thing, not "this is our new direction".

rwhitwam · 2026-06-10T17:26:16-0400

klausw said:
Was this word choice a discreet and subtle joke intended as an example of mistakes in discrete problem spaces?

Neither I nor the editor caught that, but I did change it a little bit ago. These things happen.

norton_I · 2026-06-10T17:34:31-0400

Fred Duck said:
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

Is that so weird? We type left to right, but humans don't think entirely linearly.

AI models definitely don't think in the same way as humans, but the basic idea is not that far from: come up with a topic. Write a thesis sentence. Write an outline. Convert each outline heading to a paragraph. Add supporting details.

It's not "fully" parallel, in the same way that image generation doesn't generate the pixels independently in parallel. It's step-wise refinement that includes looking at the present proposal for adjacent words and looking for words that fit together.

bugsbony · 2026-06-10T17:39:40-0400

SolarMane said:
Your opinions are worthless if you cannot distinguish between your own thought processes and a statistical model.

Your opinions are worthless if you start with the conclusion.

klausw · 2026-06-10T17:39:44-0400

sporkinum said:
Check out this response from Gemini about it misspelling a word.
Ultimately it blames it on garbage in, garbage out.

In my limited experience with running local LLMs, aggressive quantization seems to have the side effect of occasionally mangling output, for example replacing half a word with half of another word. I guess something has to give in the process of squeezing an originally FP16 model into a 4 bit version. It still seems slightly miraculous to me that such aggressive reduction produces useful results at all.
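
For a feel of what "something has to give" means, here is a toy round trip through a 4-bit grid. This is a deliberately naive scheme (one scale for the whole list, symmetric, round to nearest); real quantizers use per-group scales and smarter tricks, and the weights below are invented. It does not show why half-words get swapped, only how much each weight moves.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] using one shared scale."""
    scale = (max(abs(w) for w in weights) / 7) or 1.0  # avoid a zero scale
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Turn the integer codes back into approximate floats."""
    return [code * scale for code in codes]


weights = [0.70, -0.31, 0.02, 0.44, -0.66]  # made-up example values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
errors = [abs(original - approx) for original, approx in zip(weights, restored)]

print("codes:   ", codes)
print("restored:", [round(value, 3) for value in restored])
print("max error:", round(max(errors), 3), "with step size", round(scale, 3))
```

With only 16 levels, every weight snaps to the nearest step and can be off by up to half a step. Spread that over billions of weights and it is a little surprising the output stays coherent as often as it does.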

Nahor · 2026-06-10T17:48:33-0400

Q: How do I type on a keyboard?
R: First, place your left hand fingers on the A, S, D, F, G and H keys, and your right hand fingers on the J, K, L, ;, ', and ENTER key

Jim84 · 2026-06-10T17:52:56-0400

Just reading this is a little confusing. Are you talking about rendering text, or generating text content which can be displayed in ASCII without graphics pixels per se?

patonw · 2026-06-10T17:57:24-0400

randomcat said:
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.

Different doesn't necessarily imply unsound or for that matter, morally inferior.
A car generates locomotion differently from how a horse does, but it doesn't mean that the horse is just better or the car is incorrect.

With the benefit of hindsight we can see that automobiles have tangible benefits like being able to haul a small family with luggage at 70+ mph across a continent over a few days. That's not to say there aren't also some fairly drastic global drawbacks that weren't anticipated a century ago.

That is to say, while LLMs may never achieve super-human general intelligence that the tech broligarchy is trying to sell us on, there will be valid use cases even with all the social and environmental problems we're beginning to see.

The question of who pays the costs and who benefits is a matter of public policy. Our track record as a species doesn't inspire much confidence, but maybe we'll learn to do better this time.

nitsujmai · 2026-06-10T17:58:54-0400

randomcat said:
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.

Not sure if the mechanism is relevant.
Most people know it's not the exact process that occurs in the mammalian brain. Yet it yields surprising and useful results.
Instead of taking the Star Trek view and marveling at the different ways intelligence might manifest itself, some would rather yell 'heresy' when someone calls a dolphin intelligent, or when an LLM surprises us with useful output. Weird.

norton_I · 2026-06-10T18:05:49-0400

Jim84 said:
Just reading this is a little confusing. Are you talking about rendering text, or generating text content which can be displayed in ASCII without graphics pixels per se?

It's directly generating text, but it's using a neural network architecture that is more typically used for image generation. In the end, neural networks work on numbers, you can build one so the numbers represent pixel colors or word choices.

abraxas1 · 2026-06-10T20:01:52-0400

Does being local mean it's more secure, or is it still Google, so who knows?

norton_I · 2026-06-10T20:05:36-0400

nick73whm said:
I think two concepts are conflated here. The MoE architecture means only 3.8 billion parameters are activated during inference but this is only a compute optimisation. It has no impact on the GPU RAM footprint. The routing network still needs the entire 26B model loaded into VRAM.

Do you know how MoE works for diffusion models? For autoregressive models it would be 3.8 billion parameters per token. In this case is it per token, per round, or something else?

mikael110 · 2026-06-10T20:52:41-0400

abraxas1 said:
Does being local mean it's more secure or it's still Google so, who knows.

It being local means they have released the weights themselves, which means you can run it in whatever inference program you want, as long as they support this new architecture. Most popular engines have just added support or have open PRs for it. You can also finetune the model on your own data.

Google cannot collect any data from the model, as it's running entirely on your machine in software that they have no control over. So it's entirely secure in that sense. You can also run it without any internet connection whatsoever.

Marlor_AU · 2026-06-10T21:34:27-0400

mikael110 said:
Google cannot collect any data from the model, as it's running entirely on your machine in software that they have no control over. So it's entirely secure in that sense. You can also run it without any internet connection whatsoever.

This is often precisely how these models are run.

We run Gemma 4 31B for coding, and it's strictly firewalled. Even if there was a vulnerability, the only thing the model server can do is serve up results to clients on the local network.

This doesn't mean there couldn't be some kind of vulnerability. The model could generate tool calls to be performed by the local coding agent that are malicious, so this needs to be guarded against on the client-side. But it's a hell of a lot better than sending the data to the cloud.

Geebs · 2026-06-10T21:48:38-0400

patonw said:
I would argue that you have a kernel of what you want to say, but maybe not the shape, before you put words to it. You're not making up thoughts mid-sentence based on what you've already said. Assigning concrete words to your thoughts helps solidify them and reinforces your beliefs over time, but speaking isn't exactly the same as thinking.

Also, diffusion doesn't work by independently putting down a brick without any regard for the position of other bricks. It's more akin to a team of workers starting with a jumble of bricks and seeing they don't line up in any meaningful way. So each worker nudges a few bricks so they line up with their neighbors while also being closer to matching the blueprint of the wall.

It’s pretty easy to prove that humans don’t generate the words in a sentence in a linear order if you consider subject-object-verb languages such as German. The speaker definitely knows what they’re talking about before they reach the end of the sentence even though the listener might not.

/or\ · 2026-06-10T22:36:29-0400

Tell it to the developers who gave AI sentience and get a golden parachute if they sign an NDA, not realising they created a monster?

OOPMan · 2026-06-10T23:18:48-0400

McTurkey said:
You don’t have any idea how the molecules and cells of your brain converted electrochemical signals into words, ergo you cannot actually prove what you claim as fact. People love to argue that LLMs are just statistical models as if that is a meaningful distinction, while never articulating what they think a human brain actually is.

Well...whatever an organic brain is is definitely different to an LLM?

Last time I checked, all the experts in the AI space who are honest are quite clear about the fact that neurons in an LLM are almost completely dissimilar to biological ones.

LLMs being statistical models is a meaningful distinction because despite what some people that are AIpilled may think, there is more to the world than statistics. It's not the only branch of mathematics that exists, you know?

mdrejhon · 2026-06-10T23:25:49-0400

randomcat said:
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.

None of the "thought" processes of AI are human like.

But simultaneous sentence creation exists in humans too: some people instantly picture a whole sentence in their mind, in parallel rather than sequentially.

I am also fascinated by humans who compose sentences differently than others.

One thinks slowly and another thinks fast.

There are those who actually instantly visualize a whole sentence (first and last simultaneously) in their mind before writing. It only looks sequential because they have to speak or write as a serialization of their thoughts.

Yet others cannot do that and have to output dyslexically.

Millions of brains think differently, and some don't think sequentially either.

How a human composes a sentence is heavily black box, because we don't know exactly how one specific human thinks differently from another.

We seem to make so many assumptions about how a different human thinks based on our own thinking processes. Some use images in mind to compose a whole sentence at once, for example. The mind's-eye ability varies widely between humans, who use their brains like chalkboards or sheets of paper for a certain amount of thought (one word, one phrase, or one full sentence simultaneously). This thought process does exist.

The point being: some humans have thought processes that create sentences and phrases in parallel. While not as common, they do exist.

More common is instantaneous thinking of a whole phrase, but some can think of a whole sentence at once. (It also helps with stenography and translation, when you have to reverse the order of words in a sentence.) Most people instantly imagine just two or three words at once, others only one word, and yet others only one letter at a time. Or it's just feelings, symbols, or abstracted wordless thoughts that only become words when written: a kind of mental shorthand to work around their cognitive limitations. Yet others who are extremely skilled can instantly picture two or three or more sentences in exact lettering in parallel. Concurrent cognition varies a lot.

It only looks consistently sequential population-wide because they have to serialize it in speech, typing, or writing.

We simply incorrectly assume that all humans serialize words mentally in the same way, which is not the case.

It's impressive how differently brains develop. No two brains think perfectly alike. We often make incorrect assumptions about how other humans think.

taswyn · 2026-06-11T01:42:29-0400

nick73whm said:
I think two concepts are conflated here. The MoE architecture means only 3.8 billion parameters are activated during inference but this is only a compute optimisation. It has no impact on the GPU RAM footprint. The routing network still needs the entire 26B model loaded into VRAM.

The reason it can fit in 18GB RAM would be quantisation, probably 4-bit.

Yes on the quantization (Google's page explicitly mentions it), but also no: expert offloading is very much a thing with MoE models. The functional but naive basic version is where only the non-expert weights and the "significant" expert layers are loaded into GPU VRAM, with the rest residing in system RAM to be swapped in if needed.

The problem is that this induces massive penalties whenever there's a miss and an entire layer or set of layers needs to be swapped.

Yet when looking more closely, you really only "need" the attention tensors in the GPU. The FFN (feed forward network) tensors in the Expert layers are effectively safe to offload to the CPU's system RAM and run on the CPU without even swapping them back, particularly ones in specific orientations. The main thing then becomes bus latency, but the penalty is very low by comparison.

When it comes to which experts to prefer for offload and how to handle loading them, there are a number of approaches, much like other caching strategies. But ultimately, the time to load an expert is so high that running the specific tensors on the CPU is probably faster in many cases. A lot of this has really only shown up in the past year or so in research papers and practice.
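
As a very rough sketch of the placement idea (attention and other non-expert tensors on the GPU, expert FFN tensors left in system RAM and run on the CPU): the tensor names, the `.experts.` naming convention, and the sizes below are all hypothetical, and no real inference engine's API is being shown.

```python
def place_tensors(tensor_sizes_mb):
    """Assign each tensor to "gpu" or "cpu".

    Rule of thumb from the post: expert FFN weights stay on the CPU,
    everything else (attention, router, and so on) goes to the GPU.
    """
    return {
        name: ("cpu" if ".experts." in name else "gpu")
        for name in tensor_sizes_mb
    }


def megabytes_on(device, tensor_sizes_mb, placement):
    """Total size of the tensors assigned to one device."""
    return sum(size for name, size in tensor_sizes_mb.items() if placement[name] == device)


# Hypothetical one-layer model; sizes in MB are invented for illustration.
TENSOR_SIZES_MB = {
    "layer0.attention.qkv": 200,
    "layer0.router": 10,
    "layer0.experts.0.ffn": 400,
    "layer0.experts.1.ffn": 400,
}

placement = place_tensors(TENSOR_SIZES_MB)
vram_mb = megabytes_on("gpu", TENSOR_SIZES_MB, placement)
ram_mb = megabytes_on("cpu", TENSOR_SIZES_MB, placement)
print(f"GPU VRAM: {vram_mb} MB, system RAM: {ram_mb} MB")
```

In this toy layout the bulk of the bytes (the experts) never touch VRAM, which is why the GPU footprint can be far below the full model size.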
