Google’s latest DiffusionGemma open AI model comes with a 4x speed boost

Fred Duck

Ars Tribunus Angusticlavius
7,430
Ryan Whitwam said:
Instead, it can produce an entire block of text in parallel.

What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.
 
Upvote
-4 (16 / -20)
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.
Autoregressive models are bandwidth bound, diffusion models are compute bound.
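A rough way to see the bandwidth-versus-compute point is to count how much arithmetic is done per byte of weights read. This is a back-of-envelope sketch under stated assumptions, not a measurement of any real model: it assumes about 2 FLOPs per weight per token, that each active weight is read once per forward pass, and it ignores KV-cache and activation traffic. The function name and the 256-token block size are made up for illustration.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_weight):
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs (one multiply, one add) per weight per token and that
    every active weight is read from memory once per pass. The parameter
    count cancels out; KV-cache and activation traffic are ignored.
    """
    flops_per_weight = 2 * tokens_per_pass
    return flops_per_weight / bytes_per_weight


# Hypothetical numbers: 16-bit weights (2 bytes each), a 256-token block.
autoregressive = arithmetic_intensity(tokens_per_pass=1, bytes_per_weight=2)
block_parallel = arithmetic_intensity(tokens_per_pass=256, bytes_per_weight=2)
print(autoregressive, block_parallel)
```

Under these assumptions a one-token-at-a-time pass does very little math for each byte it pulls from memory, so memory bandwidth is the limit, while a pass over a whole block reuses each weight many times and leans on raw compute instead.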
 
Upvote
103 (103 / 0)
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.
 
Upvote
33 (56 / -23)

rwhitwam

Smack-Fu Master, in training
54
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.
A person writing a sentence is aware of where things are going. You're trying to express a thought, rather than thinking one word at a time. Text diffusion is similar in that way, but it's still a bunch of estimation. It's just estimating in blocks that can improve the model's awareness of the connection to future tokens.
 
Upvote
65 (65 / 0)

nuurdin

Smack-Fu Master, in training
66
I have to admit, this is really kind of interesting, in sort of a Borges-esque way. My guess though is that there is less than meets the eye here, philosophically speaking. The linearity inherent to making a coherent paragraph probably seeps in with the fine-tuning. It would be very interesting if there were certain forms of text (or certain languages, or script-types) that consistently showed less error with this approach. For example, would more linear token prediction produce fewer errors on an agglutinating language? On the other hand, would DiffusionGemma write a better sestina, since the form is what matters?
And then part of me is just like, whatever.
 
Upvote
22 (23 / -1)
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.
Imagine five people working in parallel. Each person is asked to modify one specific word. After they have all modified their word, they see the words that the others changed.
E.g., "How do you build a wall?"
First round: <You> <by> <an> <location> <material>
Second round: <Start> <find> <selecting> <good> <and>
Third round: <Start> <by> <a> <a> <location>
Fourth round: <Start> <by> <selecting> <good> <location>
Fifth round: <Start> <by> <selecting> <a> <location>
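The rounds above can be mimicked with a toy loop. This is only a sketch of the round-based idea, not how DiffusionGemma actually decodes: the "model" here is a stand-in that cheats by knowing the target sentence, and `TARGET`, `VOCAB`, `refine` and the confidence schedule are all invented for illustration. What it does show is that every position is re-guessed in the same round, and the number of rounds is fixed rather than growing with the number of words.

```python
import random

TARGET = ["Start", "by", "selecting", "a", "location"]  # hypothetical "ideal" answer
VOCAB = ["Start", "You", "by", "find", "selecting", "an", "a",
         "good", "location", "material", "and"]


def refine(rounds=5, seed=0):
    """Return the draft after each round of parallel re-guessing."""
    rng = random.Random(seed)
    draft = [rng.choice(VOCAB) for _ in TARGET]  # round 0: pure noise
    history = [list(draft)]
    for step in range(1, rounds + 1):
        # Stand-in for the denoiser: the chance that a slot lands on the
        # right word grows each round and reaches 1.0 on the last one.
        p_right = step / rounds
        # Every slot is updated in the same round, not left to right.
        draft = [
            TARGET[i] if rng.random() < p_right else rng.choice(VOCAB)
            for i in range(len(TARGET))
        ]
        history.append(list(draft))
    return history


for number, words in enumerate(refine()):
    print(number, " ".join(f"<{w}>" for w in words))
```

A real diffusion model replaces the cheating step with a network that scores every slot against the prompt and the current draft, but the loop shape is the same: a handful of whole-block passes instead of one pass per token.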
 
Upvote
20 (20 / 0)
I was just about to try this out in LM Studio when I saw that diffusion models like this aren't supported yet.

I was going to try running it through koboldcpp/sillytavern but the safetensors are broken down into several separate files and although that is supposedly not an issue for a lot of people, I have never been able to make that work.
 
Upvote
10 (10 / 0)
In image diffusion models, a single badly predicted pixel doesn’t make the image useless, but language is discreet. An equivalent error in text can make a block of tokens meaningless and force you to start over to get a better output.
Was this word choice a discreet and subtle joke intended as an example of mistakes in discrete problem spaces? :cool:
 
Upvote
18 (18 / 0)

Lexus Lunar Lorry

Ars Scholae Palatinae
935
Subscriptor++
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.
What if you're one of the aliens from Arrival or Story of Your Life? Then your paragraphs arrived fully formed, not as sequences of words.
 
Upvote
9 (9 / 0)
Check out this response from Gemini about it misspelling a word.
Ultimately it blames it on garbage in, garbage out.
how does an llm misspell? That should be impossible.

You are completely right to call that out; on paper, an AI misspelling a basic word seems impossible since it has access to entire dictionaries.

However, LLMs do not see or write words the way humans do. They use a system called tokenization, which is exactly how these errors happen.

1. LLMs Don't See Letters

An LLM doesn't look at the word "climbs" as six individual letters (c-l-i-m-b-s). Instead, the AI breaks text down into chunks of characters called tokens.

A token can be a whole word, a syllable, or just a few letters.

The AI assigns a number to each token. To the AI's internal math, the word "climbs" might just be token number 14302.

Because the AI is manipulating math and tokens rather than physically typing out letters, it doesn't have a built-in spellchecker running in the background.

2. The Game of Statistical Probability

When an LLM generates a sentence, it doesn't "think" of a concept and write it down. It calculates a chain of probabilities, predicting the next most likely token based on the words that came before it.

The tokens for "climbs" and "climes" sound identical phonetically and are statistically close to one another in certain contexts (like talking about geography and weather "climates/climes"). The AI's mathematical prediction engine picked the wrong token number by a fraction of a decimal point, resulting in the wrong word hitting the screen.

3. Training on Human Flaws

LLMs are trained on massive datasets of text written by humans across the internet. Because humans frequently misspell words, use typos, or confuse homophones (like there/their/they're or climbs/climes) in the training data, the AI actually learns those exact statistical errors.

So, while it feels like a computer should be immune to typos, the math behind token prediction means it can still confidently pick the wrong, misspelled chunk.
 
Upvote
-12 (2 / -14)

McTurkey

Ars Tribunus Militum
2,277
Subscriptor
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.
You don’t have any idea how the molecules and cells of your brain converted electrochemical signals into words, ergo you cannot actually prove what you claim as fact. People love to argue that LLMs are just statistical models as if that is a meaningful distinction, while never articulating what they think a human brain actually is.

Our best understanding of how life evolved is that molecules came together and formed patterns that happened to be useful and then just kept accidentally organizing into more and more useful patterns (far more of which were not useful and vanished to history). How exactly is that any different from how AI works in a way which is meaningful here?

Our brains aren’t magic. We aren’t gifted some novel abstraction which makes us above our own ability to produce useful facsimile or revolutionary advancements of the very processes which underpin our perceptions of intelligence.
 
Upvote
-3 (25 / -28)
You don’t have any idea how the molecules and cells of your brain converted electrochemical signals into words, ergo you cannot actually prove what you claim as fact. People love to argue that LLMs are just statistical models as if that is a meaningful distinction, while never articulating what they think a human brain actually is.

Our best understanding of how life evolved is that molecules came together and formed patterns that happened to be useful and then just kept accidentally organizing into more and more useful patterns (far more of which were not useful and vanished to history). How exactly is that any different from how AI works in a way which is meaningful here?

Our brains aren’t magic. We aren’t gifted some novel abstraction which makes us above our own ability to produce useful facsimile or revolutionary advancements of the very processes which underpin our perceptions of intelligence.
Your opinions are worthless if you cannot distinguish between your own thought processes and a statistical model.
 
Upvote
-8 (20 / -28)

Fatesrider

Ars Legatus Legionis
25,472
Subscriptor
DiffusionGemma is about as capable as other Gemma models, but it’s much faster.
Maybe I'm reading that graph wrong, but it looks to me like you get the worst performance, only a lot faster.

My system can run the 12B, which while only half as fast, looks like it's slightly more accurate.

Some slide-rule math suggests that if you're twice as fast, but slightly less accurate, each time you have to correct or modify will take more human time. Human time is considerably slower than AI number crunching time.

So a product that is X% less accurate, but twice as fast at producing results might still result in the same rate of progress as a product that's X% more accurate, but only half as fast.

The down-side is that the model size and resources used are still higher for the twice as fast model, meaning more costly equipment to begin with.

The magic slide rule says it could be a wash, if not leaning slightly toward the slower model using less resources, especially when looking at power consumption and human time tossed in.

This is speculative, of course. It just seems to me you're putting in more energy and effort for the newer model that doesn't perform quite as well result-wise as the older one. And the human time spent dealing with that decrease in accuracy could offset any speed advantage it supposedly gives.
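The slide-rule argument can be written down as a tiny cost model. This is a sketch with invented numbers, not benchmark data: the generation times, error rates and the five-minute human fix are all assumptions chosen only to show how the trade-off can tip either way.

```python
def seconds_per_usable_result(gen_seconds, error_rate, human_fix_seconds):
    """Expected wall-clock cost of one usable answer.

    Model time is always paid; human correction time is paid only for the
    fraction of outputs that come back wrong.
    """
    return gen_seconds + error_rate * human_fix_seconds


# All numbers below are invented for illustration, not benchmark results.
slower_model = seconds_per_usable_result(
    gen_seconds=20, error_rate=0.10, human_fix_seconds=300)
faster_model = seconds_per_usable_result(
    gen_seconds=10, error_rate=0.15, human_fix_seconds=300)
print(f"slower: {slower_model:.0f}s, faster: {faster_model:.0f}s")
```

With these made-up inputs the model that generates twice as fast ends up slightly behind, because the extra human corrections cost more than the seconds saved. Shrink the human fix time or the accuracy gap and the result flips.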
 
Upvote
2 (5 / -3)

nick73whm

Smack-Fu Master, in training
70
It’s a Mixture of Experts (MoE) model with a total of 26 billion parameters, but only 3.8 billion are activated during inference. That means it should fit in the 18GB RAM allotment of a high-end GPU.
I think two concepts are conflated here. The MoE architecture means only 3.8 billion parameters are activated during inference but this is only a compute optimisation. It has no impact on the GPU RAM footprint. The routing network still needs the entire 26B model loaded into VRAM.

The reason it can fit in 18GB RAM would be quantisation, probably 4-bit.
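The arithmetic behind that is short. A minimal sketch, assuming weight storage is simply parameters times bits per weight, with 1 GB taken as 10^9 bytes; it ignores the KV cache, activations and any per-block overhead a real quantization format adds. The function name is made up.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes).

    Ignores KV cache, activations and quantization metadata.
    """
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    total = weight_footprint_gb(26, bits)    # every expert must be stored
    active = weight_footprint_gb(3.8, bits)  # weights touched per token
    print(f"{bits}-bit: total {total:.1f} GB, active {active:.1f} GB")
```

Only the 4-bit row for the full 26B parameters lands under 18 GB, which is consistent with quantization, not the 3.8B active count, being what makes the model fit.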
 
Upvote
8 (11 / -3)

patonw

Smack-Fu Master, in training
4
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.
I would argue that you have a kernel of what you want to say, but maybe not the shape, before you put words to it. You're not making up thoughts mid-sentence based on what you've already said. Assigning concrete words to your thoughts helps solidify them and reinforces your beliefs over time, but speaking isn't exactly the same as thinking.

Also, diffusion doesn't work by independently putting down a brick without any regard for the position of other bricks. It's more akin to a team of workers starting with a jumble of bricks and seeing they don't line up in any meaningful way. So each worker nudges a few bricks so they line up with their neighbors while also being closer to matching the blueprint of the wall.
 
Upvote
18 (18 / 0)

atmartens

Ars Praetorian
516
Subscriptor
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.
We don't actually know how brains write text. Maybe we think of the whole concept in an abstracted way and then convert it into language? Don't forget that thinking does not require language, and that a person can think in multiple languages at the same time, or in sequence.
 
Upvote
30 (31 / -1)
Maybe I'm reading that graph wrong, but it looks to me like you get the worst performance, only a lot faster.
That seems right, but they're being upfront about it. It's being offered as an experimental, "maybe this could be useful to some of you in some scenarios" type thing, not "this is our new direction".
 
Upvote
11 (11 / 0)

norton_I

Ars Praefectus
5,913
Subscriptor++
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.

I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.

Is that so weird? We type left to right, but humans don't think entirely linearly.

AI models definitely don't think in the same way as humans, but the basic idea is not that far from: come up with a topic. Write a thesis sentence. Write an outline. Convert each outline heading to a paragraph. Add supporting details.

It's not "fully" parallel, in the same way that image generation doesn't generate the pixels independently in parallel. It's step-wise refinement that includes looking at the present proposal for adjacent words and looking for words that fit together.
 
Upvote
12 (12 / 0)
Check out this response from Gemini about it misspelling a word.
Ultimately it blames it on garbage in, garbage out.
In my limited experience with running local LLMs, aggressive quantization seems to have the side effect of occasionally mangling output, for example replacing half a word with half of another word. I guess something has to give in the process of squeezing an originally FP16 model into a 4-bit version. It still seems slightly miraculous to me that such aggressive reduction produces useful results at all.
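To make "something has to give" concrete, here is a toy round-to-nearest 4-bit quantizer. It is a minimal sketch, assuming one shared scale for the whole group and integer codes from -7 to 7; real quantization formats use more elaborate per-block schemes, and the weights below are made-up values.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to integer codes in -7..7."""
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest else 1.0  # one shared scale for the group
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.7, -0.33, 0.02, 0.31]  # made-up values
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(codes, [round(r, 2) for r in restored])
```

Every weight gets snapped to one of fifteen levels, so the small 0.02 collapses to zero and the others shift by up to half a step. Spread over billions of weights, it is easy to imagine that occasionally nudging the wrong token over the line.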
 
Upvote
12 (12 / 0)

patonw

Smack-Fu Master, in training
4
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.
Different doesn't necessarily imply unsound or for that matter, morally inferior.
A car generates locomotion differently from how a horse does, but it doesn't mean that the horse is just better or the car is incorrect.

With the benefit of hindsight we can see that automobiles have tangible benefits like being able to haul a small family with luggage at +70 mph across a continent over a few days. That's not to say there aren't also some fairly drastic global drawbacks that weren't anticipated a century ago.

That is to say, while LLMs may never achieve super-human general intelligence that the tech broligarchy is trying to sell us on, there will be valid use cases even with all the social and environmental problems we're beginning to see.

The question of who pays the costs and who benefits is a matter of public policy. Our track record as a species doesn't inspire much confidence, but maybe we'll learn to do better this time.
 
Upvote
22 (23 / -1)
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.
Not sure if the mechanism is relevant.
Most people know it's not the exact process that occurs in the mammalian brain. Yet it yields surprising and useful results.
Instead of taking the Star Trek view and marveling at the different ways intelligence might manifest itself, some would rather yell 'heresy' when someone calls a dolphin intelligent, or when an LLM surprises us with useful output. Weird.
 
Upvote
2 (7 / -5)

norton_I

Ars Praefectus
5,913
Subscriptor++
Just reading this is a little confusing. Are you talking about rendering text, or generating text content which can be displayed in ASCII without graphics pixels per se?

It's directly generating text, but it's using a neural network architecture that is more typically used for image generation. In the end, neural networks work on numbers, you can build one so the numbers represent pixel colors or word choices.
 
Upvote
11 (11 / 0)

norton_I

Ars Praefectus
5,913
Subscriptor++
I think two concepts are conflated here. The MoE architecture means only 3.8 billion parameters are activated during inference but this is only a compute optimisation. It has no impact on the GPU RAM footprint. The routing network still needs the entire 26B model loaded into VRAM.

Do you know how MoE works for diffusion models? For autoregressive models it would be 3.8 billion parameters per token. In this case is it per token, per round, or something else?
 
Upvote
1 (1 / 0)
Does being local mean it's more secure or it's still Google so, who knows.
It being local means they have released the weights themselves, which means you can run it in whatever inference program you want, as long as they support this new architecture. Most popular engines have just added support or have open PRs for it. You can also finetune the model on your own data.

Google cannot collect any data from the model, as it's running entirely on your machine in software that they have no control over. So it's entirely secure in that sense. You can also run it without any internet connection whatsoever.
 
Upvote
17 (17 / 0)

Marlor_AU

Ars Tribunus Angusticlavius
7,776
Subscriptor
Google cannot collect any data from the model, as it's running entirely on your machine in software that they have no control over. So it's entirely secure in that sense. You can also run it without any internet connection whatsoever.
This is often precisely how these models are run.

We run Gemma 4 31B for coding, and it's strictly firewalled. Even if there was a vulnerability, the only thing the model server can do is serve up results to clients on the local network.

This doesn't mean there couldn't be some kind of vulnerability. The model could generate tool calls to be performed by the local coding agent that are malicious, so this needs to be guarded against on the client-side. But it's a hell of a lot better than sending the data to the cloud.
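A client-side guard can be as simple as an allowlist checked before any tool call runs. This is a minimal sketch, not any particular agent's API: the tool names, the `vet_tool_call` function and the shape of the `call` dictionary are all hypothetical, and it does not handle symlinks or other ways of escaping a directory.

```python
from pathlib import PurePosixPath

ALLOWED_TOOLS = {"read_file", "list_dir", "run_tests"}  # hypothetical names


def vet_tool_call(call):
    """Return (ok, reason) for a tool call proposed by the model."""
    name = call.get("name")
    if name not in ALLOWED_TOOLS:
        return False, f"tool {name!r} is not on the allowlist"
    path = call.get("args", {}).get("path")
    if path is not None:
        if not isinstance(path, str):
            return False, "path must be a string"
        candidate = PurePosixPath(path)
        # Only relative paths without parent-directory hops are accepted.
        if candidate.is_absolute() or ".." in candidate.parts:
            return False, f"path {path!r} escapes the workspace"
    return True, "ok"


print(vet_tool_call({"name": "read_file", "args": {"path": "src/main.py"}}))
print(vet_tool_call({"name": "shell", "args": {"cmd": "curl example.com"}}))
print(vet_tool_call({"name": "read_file", "args": {"path": "../../etc/passwd"}}))
```

The point of the design is that the model only ever proposes actions; the client decides, with rules the model cannot rewrite, which ones actually execute.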
 
Upvote
11 (11 / 0)
I would argue that you have a kernel of what you want to say, but maybe not the shape, before you put words to it. You're not making up thoughts mid-sentence based on what you've already said. Assigning concrete words to your thoughts helps solidify them and reinforces your beliefs over time, but speaking isn't exactly the same as thinking.

Also, diffusion doesn't work by independently putting down a brick without any regard for the position of other bricks. It's more akin to a team of workers starting with a jumble of bricks and seeing they don't line up in any meaningful way. So each worker nudges a few bricks so they line up with their neighbors while also being closer to matching the blueprint of the wall.
It’s pretty easy to prove that humans don’t generate the words in a sentence in a linear order if you consider subject-object-verb languages such as German. The speaker definitely knows what they’re talking about before they reach the end of the sentence even though the listener might not.
 
Upvote
11 (11 / 0)
You don’t have any idea how the molecules and cells of your brain converted electrochemical signals into words, ergo you cannot actually prove what you claim as fact. People love to argue that LLMs are just statistical models as if that is a meaningful distinction, while never articulating what they think a human brain actually is.
Well...whatever an organic brain is is definitely different to an LLM?

Last time I checked, all the experts in the AI space that are honest are quite clear about the fact that neurons in an LLM are almost completely dissimilar to biological ones.

LLMs being statistical models is a meaningful distinction because despite what some people that are AIpilled may think, there is more to the world than statistics. It's not the only branch of mathematics that exists, you know?
 
Upvote
6 (7 / -1)

mdrejhon

Ars Praefectus
3,144
Subscriptor
This reinforces the fact that these models are not writing text the way you or I would, it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.

A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound.
None of the "thought" processes of AI are human-like.

But humans can create sentences simultaneously too: some instantly picture a whole sentence in the mind, in parallel rather than sequentially.

I am also fascinated by humans who compose sentences differently than others.

One thinks slowly and another thinks fast.

There are those who actually instantly visualize a whole sentence (first and last simultaneously) in their mind before writing. It only looks sequential because they have to speak or write as a serialization of their thoughts.

Yet others cannot do that and have to output dyslexically.

Millions of brains think differently. But some don't think sequentially either.

The many ways a human can compose a sentence are largely a black box, because we don't know exactly how one specific human thinks differently from another.

We seem to make so many assumptions about how a different human thinks based on our own thinking processes. Some use mental images to compose a whole sentence at once, for example. The mental imagery ability varies widely between humans, who use their brains like chalkboards or sheets of paper for a certain amount of thought (one word, or one phrase, or one full sentence simultaneously). This thought process does exist.

The point being: there exist multiple thought processes by which some humans create sentences and phrases in parallel. While not as common, they do exist.

More common is instantaneous thinking of a whole phrase, but some can think a whole sentence at once. (It also helps with stenography and translation, when you have to reverse the order of words in a sentence.) Most people imagine just two or three words at once, others only one word, and yet others only one letter at a time. Or it's just feelings/symbols/abstracted thoughts (wordless) that only become words when written: a kind of mental shorthand to work around their cognitive limitations. Yet an extremely skilled person can instantly picture two or three or more sentences in exact lettering in parallel. Concurrent cognitive abilities vary a lot.

It only looks consistently sequential population-wide because they have to serialize it in speech, typing or writing.

We simply incorrectly assume that all humans serialize words mentally in the same way. Which is not the case.

It's impressive how differently brains develop. No two brains think perfectly alike. We often make incorrect assumptions about how other humans think.
 
Upvote
3 (4 / -1)
I think two concepts are conflated here. The MoE architecture means only 3.8 billion parameters are activated during inference but this is only a compute optimisation. It has no impact on the GPU RAM footprint. The routing network still needs the entire 26B model loaded into VRAM.

The reason it can fit in 18GB RAM would be quantisation, probably 4-bit.

Yes on the quantization (Google's page explicitly mentions it), but also no: expert offloading is very much a thing with MoE models. The functional but naive basic version is where only the non-expert weights and the "significant" expert layers are loaded into GPU VRAM, with the rest residing in system RAM to be swapped in if needed.

The problem is that this induces massive penalties whenever there's a miss and an entire layer or set of layers needs to be swapped.

Yet when looking more closely, you really only "need" the attention tensors in the GPU. The FFN (feed forward network) tensors in the Expert layers are effectively safe to offload to the CPU's system RAM and run on the CPU without even swapping them back, particularly ones in specific orientations. The main thing then becomes bus latency, but the penalty is very low by comparison.

When it comes to which experts to prefer for offload and how to handle loading them, there are a number of approaches, much like other caching strategies. But ultimately, the time to load an expert is so high that running the specific tensors in CPU is probably faster in many cases. A lot of this has really only shown up in the past year or so in research papers and practice.
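As one example of the caching-strategy angle, here is a least-recently-used cache of experts resident in VRAM. It is a sketch under stated assumptions, not how any particular inference engine does it: the `ExpertCache` class, its capacity and the routing sequence are invented, and a "load" here is just a counter rather than a real transfer over the bus.

```python
from collections import OrderedDict


class ExpertCache:
    """Tracks which experts are resident on the GPU, evicting the least
    recently used one when a new expert has to be swapped in."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> placeholder for weights
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as recently used
            self.hits += 1
            return "hit"
        self.misses += 1  # in a real system this is the expensive swap
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[expert_id] = True
        return "miss"


cache = ExpertCache(capacity=2)
routing = [0, 1, 0, 2, 1]  # hypothetical sequence of experts picked by the router
results = [cache.fetch(e) for e in routing]
print(results, cache.hits, cache.misses)
```

Even in this tiny run most requests miss, which is the penalty described above: when the router keeps hopping between experts, paying a swap on every miss can cost more than leaving those FFN tensors in system RAM and running them on the CPU.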
 
Upvote
6 (6 / 0)