Diffusion AI is most common in image generation, but it can make text outputs much faster.
See full article...
Ryan Whitwam said: Instead, it can produce an entire block of text in parallel.
What? A paragraph is meant to be several sentences all related to each other. Many paragraphs are persuasive where the sentences build on each other to further a point.
I don't see how one can generate text in parallel! To my mind that's akin to building a wall in parallel. ELIAD.
It's just different ways to arrive at the wrong answer.
Autoregressive models are bandwidth bound, diffusion models are compute bound.
A person writing a sentence is aware of where things are going. You're trying to express a thought, rather than thinking one word at a time. Text diffusion is similar in that way, but it's still a bunch of estimation. It's just estimating in blocks that can improve the model's awareness of the connection to future tokens.
Imagine five people working in parallel. Each person is asked to modify one specific word. After they have all modified their word, they see the words that the others changed.
I was just about to try this out in LM Studio when I saw that diffusion models like this aren't supported yet.
"In image diffusion models, a single badly predicted pixel doesn’t make the image useless, but language is discreet. An equivalent error in text can make a block of tokens meaningless and force you to start over to get a better output."
Was this word choice a discreet and subtle joke intended as an example of mistakes in discrete problem spaces?
What if you're one of the aliens from Arrival or Story of Your Life? Then your paragraphs arrived fully formed, not as sequences of words.
how does an llm misspell? That should be impossible.
You are completely right to call that out. On paper, an AI misspelling a basic word seems impossible since it has access to entire dictionaries.
However, LLMs do not see or write words the way humans do. They use a system called tokenization, which is exactly how these errors happen.
1. LLMs Don't See Letters
An LLM doesn't look at the word "climbs" as six individual letters (c-l-i-m-b-s). Instead, the AI breaks text down into chunks of characters called tokens.
A token can be a whole word, a syllable, or just a few letters.
The AI assigns a number to each token. To the AI's internal math, the word "climbs" might just be token number 14302.
Because the AI is manipulating math and tokens rather than physically typing out letters, it doesn't have a built-in spellchecker running in the background.
2. The Game of Statistical Probability
When an LLM generates a sentence, it doesn't "think" of a concept and write it down. It calculates a chain of probabilities, predicting the next most likely token based on the words that came before it.
The tokens for "climbs" and "climes" sound identical phonetically and are statistically close to one another in certain contexts (like talking about geography and weather "climates/climes"). The AI's mathematical prediction engine picked the wrong token number by a fraction of a decimal point, resulting in the wrong word hitting the screen.
3. Training on Human Flaws
LLMs are trained on massive datasets of text written by humans across the internet. Because humans frequently misspell words, use typos, or confuse homophones (like there/their/they're or climbs/climes) in the training data, the AI actually learns those exact statistical errors.
So, while it feels like a computer should be immune to typos, the math behind token prediction means it can still confidently pick the wrong, misspelled chunk.
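For anyone who wants to see the "tokens, not letters" idea from the quoted response in concrete form, here is a toy sketch. The vocabulary, the token pieces, and the ID numbers are all invented for illustration, and the greedy longest-match rule is an assumption chosen for simplicity; real tokenizers are learned from data and typically use a scheme such as byte-pair encoding.

```python
# Hypothetical vocabulary: pieces and ID numbers are made up for this example.
VOCAB = {"climb": 501, "clim": 502, "es": 503, "s": 504}

def tokenize(word):
    """Greedy longest-match split of `word` into IDs from VOCAB (toy rule)."""
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {word[pos:]!r}")
    return ids

print(tokenize("climbs"))  # pieces: "climb" + "s"
print(tokenize("climes"))  # pieces: "clim" + "es"
```

Under this made-up vocabulary the two words share no token IDs at all, which is the point: the model works with the numbers, not with the letters the two spellings have in common.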
"This reinforces the fact that these models are not writing text the way you or I would; it's an abstract statistical process which results in output text that looks like it was composed by a rational agent.
A million monkeys working at a million typewriters, etc. You might be able to find a good and useful result, but if you do, it's not because the underlying process is sound."
You don’t have any idea how the molecules and cells of your brain converted electrochemical signals into words, ergo you cannot actually prove what you claim as fact. People love to argue that LLMs are just statistical models as if that is a meaningful distinction, while never articulating what they think a human brain actually is.
Our best understanding of how life evolved is that molecules came together and formed patterns that happened to be useful and then just kept accidentally organizing into more and more useful patterns (far more of which were not useful and vanished to history). How exactly is that any different from how AI works in a way which is meaningful here?
Our brains aren’t magic. We aren’t gifted some novel abstraction which makes us above our own ability to produce useful facsimile or revolutionary advancements of the very processes which underpin our perceptions of intelligence.
Your opinions are worthless if you cannot distinguish between your own thought processes and a statistical model.
"DiffusionGemma is about as capable as other Gemma models, but it’s much faster."
Maybe I'm reading that graph wrong, but it looks to me like you get the worst performance, only a lot faster.
"It’s a Mixture of Experts (MoE) model with a total of 26 billion parameters, but only 3.8 billion are activated during inference. That means it should fit in the 18GB RAM allotment of a high-end GPU."
I think two concepts are conflated here. The MoE architecture means only 3.8 billion parameters are activated during inference but this is only a compute optimisation. It has no impact on the GPU RAM footprint. The routing network still needs the entire 26B model loaded into VRAM.
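A toy sketch of why routing saves compute but not memory. Everything here is hypothetical: the expert count, the top-k value, the per-expert parameter count, and especially the router, which is an arbitrary deterministic formula standing in for a learned network. It only illustrates the general MoE pattern and says nothing about how this particular model is built.

```python
from collections import Counter

NUM_EXPERTS = 8            # hypothetical
TOP_K = 2                  # hypothetical: experts activated per token
PARAMS_PER_EXPERT = 1_000  # hypothetical

def route(token_id):
    """Stand-in router: score every expert, keep the TOP_K highest."""
    scores = [((token_id + 1) * (e + 1) * 37) % 101 for e in range(NUM_EXPERTS)]
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)
    return ranked[:TOP_K]

def simulate(token_ids):
    """Count how often each expert is picked across a stream of tokens."""
    used = Counter()
    for t in token_ids:
        used.update(route(t))
    resident = NUM_EXPERTS * PARAMS_PER_EXPERT   # must be loaded up front
    active_per_token = TOP_K * PARAMS_PER_EXPERT # actually computed per token
    return used, resident, active_per_token

used, resident, active = simulate(range(20))
print(f"experts used at least once: {len(used)} of {NUM_EXPERTS}")
print(f"resident parameters: {resident}, active per token: {active}")
```

Each token touches only a quarter of the expert parameters in this toy, but across a short stream every expert gets picked by some token, so all of them have to be available to the router.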
I would argue that you have a kernel of what you want to say, but maybe not the shape, before you put words to it. You're not making up thoughts mid-sentence based on what you've already said. Assigning concrete words to your thoughts helps solidify them and reinforces your beliefs over time, but speaking isn't exactly the same as thinking.
We don't actually know how brains write text. Maybe we think of the whole concept in an abstracted way and then convert it into language? Don't forget that thinking does not require language, and that a person can think in multiple languages at the same time, or in sequence.
That seems right, but they're being upfront about it. It's being offered as an experimental, "maybe this could be useful to some of you in some scenarios" type thing, not "this is our new direction".
Neither I nor the editor caught that, but I did change it a little bit ago. These things happen.
Your opinions are worthless if you start with the conclusion.
"Check out this response from Gemini about it misspelling a word.
Ultimately it blames it on garbage in, garbage out."
In my limited experience with running local LLMs, aggressive quantization seems to have the side effect of occasionally mangling output, for example replacing half a word with half of another word. I guess something has to give in the process of squeezing an originally FP16 model into a 4-bit version. It still seems slightly miraculous to me that such aggressive reduction produces useful results at all.
Different doesn't necessarily imply unsound or, for that matter, morally inferior.
Not sure if the mechanism is relevant.
Just reading this is a little confusing. Are you talking about rendering text, or generating text content which can be displayed in ASCII without graphics pixels per se?
"Does being local mean it's more secure, or is it still Google, so who knows?"
It being local means they have released the weights themselves, which means you can run it in whatever inference program you want, as long as it supports this new architecture. Most popular engines have just added support or have open PRs for it. You can also finetune the model on your own data.
"Google cannot collect any data from the model, as it's running entirely on your machine in software that they have no control over. So it's entirely secure in that sense. You can also run it without any internet connection whatsoever."
This is often precisely how these models are run.
It’s pretty easy to prove that humans don’t generate the words in a sentence in a linear order if you consider subject-object-verb languages like German. The speaker definitely knows what they’re talking about before they reach the end of the sentence even though the listener might not.
Also, diffusion doesn't work by independently putting down a brick without any regard for the position of other bricks. It's more akin to a team of workers starting with a jumble of bricks and seeing they don't line up in any meaningful way. So each worker nudges a few bricks so they line up with their neighbors while also being closer to matching the blueprint of the wall.
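The brick analogy can be sketched as a toy loop. This is not a real diffusion model and not necessarily how the model in the article decodes: `denoise` is a hypothetical stand-in for a trained network, `BLUEPRINT` and `BASE_SCORE` are invented, and the "fill the most confident positions each round" schedule is an assumption borrowed from masked-token decoding in general. What it does show is that every open position is scored from the same snapshot of the whole block, and that already-placed neighbours raise confidence.

```python
MASK = "_"
BLUEPRINT = ["the", "wall", "goes", "up", "in", "rounds"]  # invented target
BASE_SCORE = [90, 40, 60, 80, 30, 70]  # invented per-position confidence

def denoise(tokens):
    """Propose (position, guess, score) for every masked slot in one pass."""
    proposals = []
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        filled = sum(1 for n in neighbours if n != MASK)
        # Placed neighbours make a guess more trustworthy.
        proposals.append((i, BLUEPRINT[i], BASE_SCORE[i] + 20 * filled))
    return proposals

def generate(length, per_round=2):
    tokens = [MASK] * length
    history = [list(tokens)]
    while MASK in tokens:
        proposals = denoise(tokens)  # all slots scored from the same snapshot
        proposals.sort(key=lambda p: p[2], reverse=True)
        for i, guess, _ in proposals[:per_round]:
            tokens[i] = guess
        history.append(list(tokens))
    return tokens, history

tokens, history = generate(len(BLUEPRINT))
for step, state in enumerate(history):
    print(step, " ".join(state))
```

In this toy the six-word block fills in over three rounds and not in left-to-right order, whereas a word-at-a-time generator would need six sequential steps.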
Well... whatever an organic brain is is definitely different to an LLM?
None of the "thought" processes of AI are human-like.
The reason it can fit in 18GB RAM would be quantisation, probably 4-bit.
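Rough arithmetic behind that guess, as a sketch. It counts the weights only and uses the 26 billion parameter figure quoted in the thread; the KV cache, activations, and the scale factors that quantised formats store alongside the weights are ignored, so real footprints will be somewhat higher. The bit widths are illustrative, not a statement about how the model is actually shipped.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 26e9  # total parameter count quoted in the thread

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

At 16 bits the weights alone come to 52 GB and at 8 bits to 26 GB, both over an 18GB budget; at 4 bits they come to 13 GB, which leaves some headroom for everything this estimate leaves out.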