Google says new TurboQuant compression can lower AI memory usage without sacrificing quality


numerobis

Ars Tribunus Angusticlavius
50,230
Subscriptor
Google offers an interesting real-world analogy to explain this process. The vector coordinates are like directions, so the traditional encoding might be “Go 3 blocks East, 4 blocks North.” But using Cartesian coordinates, it’s simply “Go 5 blocks at 37-degrees.” This takes up less space and saves the system from performing expensive data normalization steps.

That takes up less space why? It's the same number of coordinates.

JL will reduce dimensions, so that'll save space. Reducing a vector from R^n to Z_2 seems ... extreme.

Feels like the writeup is missing a few key details.
 
Upvote
92 (92 / 0)
The vector coordinates are like directions, so the traditional encoding might be “Go 3 blocks East, 4 blocks North.” But using Cartesian coordinates, it’s simply “Go 5 blocks at 37-degrees.” This takes up less space and saves the system from performing expensive data normalization steps.

As far as I can tell, that's still two numbers you need to store (angle and distance), so no reduction of data has taken place. Same in higher dimensions. It'd be nice if the article explained this in a bit more detail.
 
Upvote
72 (73 / -1)

Varste

Ars Praetorian
532
Subscriptor
I don't understand Google's real-world analogy example. They go from giving you two numbers to.... two numbers, but now one is an angle. Don't get me wrong, I believe they've seen the benefits they're touting. I just don't think you could come up with a sufficient layman explanation for us folks who don't work with LLM data structures.
edit: ninja'd quite thoroughly it seems.
 
Upvote
41 (41 / 0)

marsilies

Ars Legatus Legionis
24,385
Subscriptor++
That takes up less space why? It's the same number of coordinates.
The analogy is using only 2 dimensions for ease of understanding. The actual vectors have many more dimensions, which are quantized down to just two components.

From the original article from Google:
https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
PolarQuant acts as a high-efficiency compression bridge, converting Cartesian inputs into a compact Polar "shorthand" for storage and processing. The mechanism begins by grouping pairs of coordinates from a d-dimensional vector and mapping them onto a polar coordinate system. Radii are then gathered in pairs for recursive polar transformations — a process that repeats until the data is distilled into a single final radius and a collection of descriptive angles.

....Because the pattern of the angles is known and highly concentrated, the model no longer needs to perform the expensive data normalization step because it maps data onto a fixed, predictable "circular" grid where the boundaries are already known, rather than a "square" grid where the boundaries change constantly. This allows PolarQuant to eliminate the memory overhead that traditional methods must carry.
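Going only by that description, the recursive pairing could be sketched like this (my own toy reconstruction in Python, assuming the dimension is a power of two; `polarquant_encode` is a hypothetical name, not Google's code):

```python
import math

def polarquant_encode(vec):
    # Group coordinates in pairs, convert each pair to (radius, angle),
    # then recurse on the radii until a single radius remains.
    angles = []
    values = list(vec)
    while len(values) > 1:
        radii = []
        for x, y in zip(values[::2], values[1::2]):
            radii.append(math.hypot(x, y))    # length of the pair
            angles.append(math.atan2(y, x))   # direction of the pair
        values = radii
    return values[0], angles

# A d-dimensional vector becomes 1 radius plus d-1 angles; the final
# radius equals the vector's overall L2 norm.
radius, angles = polarquant_encode([3.0, 4.0, 0.0, 0.0])
print(radius, len(angles))  # 5.0 and 3 angles
```

Note this step just rewrites d numbers as d numbers; any saving would come afterwards, when the angles (whose range is known in advance) are quantized down to a few bits each.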
 
Upvote
75 (77 / -2)
I think "LLMs don’t actually know anything" is a strong mischaracterization. See https://www.astralcodexten.com/p/next-token-predictor-is-an-ais-job and https://www.theargumentmag.com/p/when-technically-true-becomes-actually

Edit: There hasn't been enough time for the people downvoting me to actually read the links, so they're all just assholes apparently.
I did look at the links you posted. I've also spent a great deal of time with ChatGPT. I have seen ChatGPT describe itself as, essentially, a form of autocomplete.

The problem with arguing that AI "knows" things is that AI models have no ground truths and no way to validate the data they are fed. A human being can take a prism, shine a light through it, and see the rainbow of colors that results. A human can determine the wavelength of each color. More practically, you can walk outside every day and see what color the sky is.

AI can't do those things. If an AI is trained on sources that all refer to the color of the sky as green, it will confidently state that the sky is green. If trained on data that misrepresents scientific facts, it will also misrepresent scientific facts. At no point will an LLM say, "Hey, my training data says the Earth is flat, but if that's true, why do the sails of a ship appear over the horizon before the rest of the vessel?" It can't ask these kinds of questions, because it cannot observe anything independent of its training data. It has no senses.

It's very difficult to make ChatGPT shake off the tics and tell-tale signs of AI authorship. I've had conversation after conversation about what those signs are and why they shouldn't be in copy. I've asked the model itself how I can better tune it for desired output. I've then incorporated those responses verbatim. Precious little changes. ChatGPT often states that while I did offer specific instructions, those instructions were not sufficient to overcome its own baselines and style. It's gone from "Use these custom rules," to "Create my own GPT with 12-15 examples of good and bad output, accompanied by an explanation of why each is good or bad."

If I were speaking to a human, I could give that person feedback and explain to them how to write more effectively. Even with a brand-new, fresh-out-of-college graduate, I'd expect to see improvements within a month. By the six-month mark, I'd expect them to have internalized these ideas flawlessly. But I'm not working with a human. I'm working with a bot that can't stop saying shit like "This is where the magic happens" or "This idea isn't just important -- it's revolutionary."

AI is turtles, all the way down. The buck stops nowhere, because there is no source of ground truth, no guaranteed-known facts, and no position it can't be shoved off with a little creative work. Its desire to be affirming and foster engagement easily overwhelms its desire to be honest, which is why there are so many stories of AI telling people to do terrible things and affirming toxic (or just plain crazy) beliefs.

AI doesn't know things because AI can't "know things." We may fudge that distinction in common language when we say something like "Excel knows how to turn a CSV file into a structured table using comma delimiters," but that's colloquial usage, not factual truth. Excel knows nothing. Neither does ChatGPT.

PS: Complaining about downvotes is the fastest way to get downvoted.
 
Last edited:
Upvote
105 (120 / -15)

floyd42

Ars Scholae Palatinae
1,188
Subscriptor++
Polar coordinates feel like they make more sense than vectors since I think of the world in 2 dimensions rather than 3 when I'm navigating.

I remember writing a zip algorithm back in my data structures and algorithms class and this feels pretty similar with reducing and optimizing the key space.

The current models must have really bulky keys if they're able to get a real-world 6x memory improvement?
 
Upvote
-7 (0 / -7)

EricM2

Ars Centurion
354
Subscriptor
The analogy is using only 2 dimensions for ease of understanding. The actual vectors have many more dimensions, which are quantized down to just two components.
But to describe an n-vector you'd need n−1 angles in this analogy. n=2 gives one angle, as in the example. For n=3 you need a radius and 2 angles to describe a point in 3D space. If n=4, you'd need 3 angles, and so on...
Anyway, my brain tilted at
This applies a 1-bit error-correction layer to the model, reducing each vector to a single bit (+1 or -1) while preserving the essential vector data that describes relationships
I've no idea what to make of this - will try tomorrow again, with a better caffeine supply...
 
Upvote
28 (28 / 0)
I don't understand Google's real-world analogy example. They go from giving you two numbers to.... two numbers, but now one is an angle. Don't get me wrong, I believe they've seen the benefits they're touting. I just don't think you could come up with a sufficient layman explanation for us folks who don't work with LLM data structures.
edit: ninja'd quite thoroughly it seems.
From my (limited) knowledge, polar coordinates are easier to work with mathematically speaking. With the normal x and y values you need more computation steps to do vector calculations than with polar coordinates, and you need extra memory to deal with those extra steps, or so I think. Maybe someone more knowledgeable than me can help us here?
 
Upvote
-2 (2 / -4)

markg729

Seniorius Lurkius
46
Subscriptor
I did look at the links you posted. I've also spent a great deal of time with ChatGPT. I have seen ChatGPT describe itself as, essentially, a form of autocomplete.

The problem with arguing that AI "knows" things is that AI models have no ground truths and no way to validate the data they are fed. A human being can take a prism, shine a light through it, and see the rainbow of colors that results. A human can determine the wavelength of each color. More practically, you can walk outside every day and see what color the sky is.

AI can't do those things. If an AI is trained on sources that all refer to the color of the sky as green, it will confidently state that the sky is green. If trained on data that misrepresents scientific facts, it will also misrepresent scientific facts. At no point will an LLM say, "Hey, my training data says the Earth is flat, but if that's true, why do the sails of a ship appear over the horizon before the rest of the vessel?" It can't ask these kinds of questions, because it cannot observe anything independent of its training data. It has no senses.

It's very difficult to make ChatGPT shake off the tics and tell-tale signs of AI authorship. I've had conversation after conversation about what those signs are and why they shouldn't be in copy. I've asked the model itself how I can better tune it for desired output. I've then incorporated those responses verbatim. Precious little changes. ChatGPT often states that while I did offer specific instructions, those instructions were not sufficient to overcome its own baselines and style. It's gone from "Use these custom rules," to "Create my own GPT with 12-15 examples of good and bad output, accompanied by an explanation of why each is good or bad."

If I were speaking to a human, I could give that person feedback and explain to them how to write more effectively. Even with a brand-new, fresh-out-of-college graduate, I'd expect to see improvements within a month. By the six-month mark, I'd expect them to have internalized these ideas flawlessly.

AI is turtles, all the way down. The buck stops nowhere, because there is no source of ground truth, no guaranteed-known facts, and no position it can't be shoved off with a little creative work. Its desire to be affirming and foster engagement easily overwhelms its desire to be honest, which is why there are so many stories of AI telling people to do terrible things and affirming toxic (or just plain crazy) beliefs.

AI doesn't know things because AI can't "know things." We may fudge that distinction in common language when we say something like "Excel knows how to turn a CSV file into a structured table using comma delimiters," but that's colloquial usage, not factual truth. Excel knows nothing. Neither does ChatGPT.

PS: Complaining about downvotes is the fastest way to get downvoted.
I agree with a lot of what you said. LLMs clearly lack grounding in the physical world and that makes them dysfunctional. I don't know why that means they don't know things though. They model patterns of language and code much better than I can in numerous areas, which seems like "knowing things" to me, even if those things are just a subset of the things that humans know.

Speaking of grounding, models seem to be getting rapidly better at grounding their experience in their interactions with computer systems, like how Claude Code works. That is still not a physical environment providing grounding, but it is an environment. They are building up a ground truth tested through their back-and-forth interactions with computer systems.
 
Upvote
-14 (14 / -28)

Uncivil Servant

Ars Scholae Palatinae
4,664
Subscriptor
So, the answer to the problem that reality is too complex is to reduce the various relationships between word-parts (not concepts) into a higher-level abstraction?

Are these people just trying to brute-force outperform entropy? "Let's build a simulation on top of our simulation so that we can simplify data that is fractally complex".

Look, this isn't my wheelhouse, but if it looks like grifters and quacks like grifters, I really don't need a bird-watcher to lecture me about the cladistics of anatines.
 
Upvote
-12 (1 / -13)

Uncivil Servant

Ars Scholae Palatinae
4,664
Subscriptor
That takes up less space why? It's the same number of coordinates.

JL will reduce dimensions, so that'll save space. Reducing a vector from R^n to Z_2 seems ... extreme.

Feels like the writeup is missing a few key details.

So more like trying to plot the shortest course of a transatlantic flight using a Mercator projection map instead of a globe?

Is that going to create potential problems when these people use it to build systems to define their own realities?
 
Upvote
-2 (1 / -3)

foboz1

Wise, Aged Ars Veteran
173
Subscriptor++
As far as I can tell, that's still two numbers you need to store (angle and distance), so no reduction of data has taken place. Same in higher dimensions. It'd be nice if the article explained this in a bit more detail.
Four pieces of data (a distance and a direction for each of the two dimensions) vs. two (angle and distance). The analogy basically works.
 
Upvote
-3 (11 / -14)

JoHBE

Ars Praefectus
4,130
Subscriptor++
This is very interesting... Considering the asymptotic nature of the impact of increased model size on capabilities, could this mean that consumer-grade 24/32GB VRAM GPUs are suddenly able to get MUCH closer to frontier cloud models? What will this mean for the open-sourcing of downgraded versions of those models? Will the hyperscalers continue to do that in the future? Maybe Google WILL, just to suffocate the rivals? But what if this new generation of pared-down models is MORE than good enough for 95% of uses/users???

Edit: all under the assumption that we're talking about a reduction in the TOTAL memory footprint of the models
 
Last edited:
Upvote
1 (3 / -2)
Google offers an interesting real-world analogy to explain this process. The vector coordinates are like directions, so the traditional encoding might be “Go 3 blocks East, 4 blocks North.” But using Cartesian coordinates, it’s simply “Go 5 blocks at 37-degrees.” This takes up less space and saves the system from performing expensive data normalization steps.
I've spotted the error. The article publication date is a week early.
 
Upvote
-4 (0 / -4)

JoHBE

Ars Praefectus
4,130
Subscriptor++
Humans don't actually know anything either, they just do an impression of knowing things by sending signals across synapses and action potentials down axons, and regulate those interactions with astrocytes.

Humans CAN and WILL figure out, however, that a popular and wide-spread "thinking out of the box" riddle evaporates into obviousness when you slightly change ONE word.
 
Upvote
20 (20 / 0)

numerobis

Ars Tribunus Angusticlavius
50,230
Subscriptor
OK, the arXiv paper shows what's missing in the PolarQuant discussion.

tl;dr they're doing lossy compression of a d-vector of floats via two tricks: first reduce the number of floats using random projection, then drop low-order bits of the angles in the polar coordinates.

The overall algorithm:
1. Random projection from the original d dimensions down to m dimensions (m < d).
2. Convert to polar in m dimensions. This doesn't drop the number of values at all, it's just rewriting it. The rewrite preserves independence between coordinates.
3. Quantize each polar coordinate independently, using the fact that they are iid under a known distribution (except the radius, which you just write as a float).

JL projection (Johnson-Lindenstrauss, the two who wrote the original paper on the idea) is that you generate a random matrix of size d x m (with m < d) and you multiply d-dimensional vectors by it, producing a bunch of lower-dimensional vectors. You'd think you'd just get garbage out if you multiply by random shit, but no, the vectors that come out have various useful properties. E.g. distances and angles are preserved.

You could skip step 1 and just get a d-vector of polar coordinates and quantize it straight up. But then they couldn't prove how many bits are needed for a given error bound. The random projection means the polar angles are independent random variables with a known distribution, so you can come up with a procedure ahead of time independent of the data.
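Step 1 can be sketched as follows (my own illustration of a Gaussian JL projection, not the paper's exact construction or scaling):

```python
import math
import random

def jl_project(vec, m, seed=0):
    # Multiply a d-dimensional vector by a random Gaussian m x d matrix,
    # scaled by 1/sqrt(m) so squared norms are preserved in expectation.
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) * x for x in vec) / math.sqrt(m)
            for _ in range(m)]

# 256 dims down to 64: multiplying by random numbers approximately
# preserves the norm (and pairwise distances/angles across vectors).
v = [1.0] * 256
w = jl_project(v, m=64)
ratio = math.sqrt(sum(x * x for x in w)) / math.sqrt(sum(x * x for x in v))
# ratio lands near 1.0: the geometry survives the projection, roughly.
```

The concentration gets tighter as m grows, which is how the paper can trade dimensions against error bounds ahead of time.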


Then there's the 1-bit correction thing that I haven't gotten to yet.
 
Upvote
48 (48 / 0)

chiasticslide

Ars Centurion
241
Subscriptor++
The catch is that the overall speed of the entire process is demonstrably worse. Their release talks about speeds for computing attention and building indices for vector databases, but real-world tests show a dramatic reduction in speed (in terms of generated t/s). The guy working with this on GH managed to get TurboQuant running at either 60% or 83% of Q8 speed (depending on model) for the two models he was testing. So as with most things, there's a tradeoff, and it looks like without optimization that tradeoff will be fairly severe. Compressed KV caches will definitely help with long context windows but won't help with fitting nominally larger models into smaller RAM/VRAM capacities.
 
Upvote
17 (18 / -1)
Humans CAN and WILL figure out, however, that a popular and wide-spread "thinking out of the box" riddle evaporates into obviousness when you slightly change ONE word.
You'd be surprised.
But yes, that means better training is needed, and I don't mean more data.
 
Upvote
-1 (0 / -1)
Humans CAN and WILL figure out, however, that a popular and wide-spread "thinking out of the box" riddle evaporates into obviousness when you slightly change ONE word.
YMMV. In the most powerful human-led democracy, in the most recent top election, 49.8% of those who voted chose to put a convicted felon in charge of appointing federal judges. I am led to consider George Carlin's words about the intelligence of the average person.
 
Upvote
2 (7 / -5)

asihkaeun

Smack-Fu Master, in training
92
They can optimize it all they want; but, at the end of the day, it's still slop.
That's like saying that because cameras can be used to film porn, cameras are only used for porn. It's getting tedious seeing endless similar comments that thoughtless edgelords insist on making on every possible occasion that AI/LLMs come up.
 
Upvote
6 (17 / -11)

CKHarwood

Smack-Fu Master, in training
35
As far as I can tell, that's still two numbers you need to store (angle and distance), so no reduction of data has taken place. Same in higher dimensions. It'd be nice if the article explained this in a bit more detail.
In the analogy, an uncompressed method stores four numbers (not two): X direction, X distance, Y direction, and Y distance. The compressed method only stores two (direction and distance, as you described).

That’s just the analogy, though. It doesn’t really explain how the compression works, just acts as an intuition pump to give us a sense of how compression might work. Based on the comments, the analogy might not be broadly intuitive enough to serve that purpose. So maybe just think of it as a red herring.
 
Upvote
-3 (7 / -10)

numerobis

Ars Tribunus Angusticlavius
50,230
Subscriptor
The catch is that the overall speed of the entire process is demonstrably worse. Their release talks about speeds for computing attention and building indices for vector databases, but real-world tests show a dramatic reduction in speed (in terms of generated t/s). The guy working with this on GH managed to get TurboQuant running at either 60% or 83% of Q8 speed (depending on model) for the two models he was testing. So as with most things, there's a tradeoff, and it looks like without optimization that tradeoff will be fairly severe. Compressed KV caches will definitely help with long context windows but won't help with fitting nominally larger models into smaller RAM/VRAM capacities.
That sounds like a pretty small performance drop for a first implementation of a new technique compared to the standard implementation.
 
Upvote
6 (6 / 0)
In the analogy, an uncompressed method stores four numbers (not two): X direction, X distance, Y direction, and Y distance. The compressed method only stores two (direction and distance, as you described).

That’s just the analogy, though. It doesn’t really explain how the compression works, just acts as an intuition pump to give us a sense of how compression might work. Based on the comments, the analogy might not be broadly intuitive enough to serve that purpose. So maybe just think of it as a red herring.
There are 4 choices for direction; that's 2 bits. The distance can be an integer, probably a short.
In the polar comparison, you probably need to switch to floating point, you need some trigonometry routines, and maybe a routine to convert from degrees to radians.
 
Upvote
0 (2 / -2)
I agree with a lot of what you said. LLMs clearly lack grounding in the physical world and that makes them dysfunctional. I don't know why that means they don't know things though. They model patterns of language and code much better than I can in numerous areas, which seems like "knowing things" to me, even if those things are just a subset of the things that humans know.
That's actually a really good question and it touches on the epistemic question of what it means to "know" something in the first place.

When we say that someone is a good programmer, what we (generally) mean is that this person has spent time both studying a programming language and putting that knowledge into practice. They have both book knowledge and lived experience. This person can remember a time when they had to fix someone else's kludgy, poorly documented code. They've experienced the thrill of finally solving a problem with an elegant solution and the frustration of not being able to make something work.

Assume a human and an AI coding service return the exact same answer to a computing question. The human has answered that question based on their lived experience as a programmer. They knew to avoid certain pitfalls, because they've tripped and fallen into those pits before. The human may know an elegant solution because they've used that elegant solution before. The human is simultaneously retrieving memories and applying what they learned in new ways to achieve new goals.

The AI may also turn in the same answer, but it did not arrive at it the same way. The AI is predicting what the best answer is likely to be based on neural net weights derived from its training data. It does not have a comprehensive understanding of programming based on lived experience. AI tends to perform very well on problems it has seen before and much more poorly on problems that aren't represented in its training data. AI cannot remember a clever solution it found to a problem years ago and then extend that solution to a modern problem because most AI bots have limited context windows and cannot be modified by the end-user. You can tweak output by giving custom instructions and creating memories, but you can't change its knowledge base.

Or, to put it a little differently: The best AI models are trained on absolutely enormous data sets in order to improve their accuracy. A human doesn't need to ingest every single StackExchange programming question in order to give good advice. AIs do. While the size of training data sets is often talked about as a good thing, there's an argument that it's actually an indictment of our current best practices. Mozart didn't need to learn every single song written by humans between 4000 BC and the 1700s to become a master composer. William Faulkner, JRR Tolkien, and Ernest Hemingway didn't need to read every single book ever written to be literary giants.

In the colloquial sense, saying computers "know" things is much faster and simpler than trying to find words that don't imply a human level of knowledge. I do this, too. But AI bots don't know things like humans do, and they perform less well when asked to take what they know and apply it to new problems.
Speaking of grounding, models seem to be getting rapidly better at grounding their experience in their interactions with computer systems, like how Claude Code works. That is still not a physical environment providing grounding, but it is an environment. They are building up a ground truth tested through their back-and-forth interactions with computer systems.

There are efforts to make AI more aware of its own surroundings. I believe it's referred to as physical AI. I agree that in the future, the restrictions and limits may well be different than they are today.
 
Upvote
16 (16 / 0)

Fatesrider

Ars Legatus Legionis
24,966
Subscriptor
Cutting to the chase: How does this impact the cost per token to generate/process?

If the cost per token isn't making a profit, then there's no real advantage WRT the elephant that needs to be removed from the room: profit generation.

It doesn't matter how shiny it is. If it's not profitable to use, they're still going to lose money on it and the current issues with AI remain.

I gathered from the article that the potential to reduce the cost per token exists. But you're also processing a lot more tokens. So how does that work out on the balance sheet? Is AI able to make a profit on each token, or no?
 
Upvote
8 (9 / -1)
Humans CAN and WILL figure out, however, that a popular and wide-spread "thinking out of the box" riddle evaporates into obviousness when you slightly change ONE word.
Can they recognize John Searle's 'Chinese room argument' when they read it, and are they aware of the convincing counter-arguments?
 
Upvote
-4 (0 / -4)

numerobis

Ars Tribunus Angusticlavius
50,230
Subscriptor
In the analogy, an uncompressed method stores four numbers (not two): X direction, X distance, Y direction, and Y distance. The compressed method only stores two (direction and distance, as you described).

That’s just the analogy, though. It doesn’t really explain how the compression works, just acts as an intuition pump to give us a sense of how compression might work. Based on the comments, the analogy might not be broadly intuitive enough to serve that purpose. So maybe just think of it as a red herring.
It's (x, y) in Cartesian versus (r, phi) in polar. Add more dimensions and you add more coordinates in Cartesian and an equal number of angles in polar. Same number of values no matter what.

Where it helps is that the angles are in [0, pi/2] rather than being some arbitrary value in (-inf, inf). So you can quantize them with a fixed-point scheme. But that only helps if you know something about how error in the angles affects error in the output. See my big post above, which is about as simple as I could make it, but it still somewhat hurts my brain.
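To make the fixed-grid point concrete, here's a toy comparison (my own sketch, not from the paper): an angle quantizer can fix its grid once, because the range is known, while a quantizer for unbounded raw values has to carry per-vector bounds.

```python
import math

def quantize_angle(theta, bits):
    # The angle's range [0, pi/2] is known ahead of time, so one fixed
    # uniform grid works for every vector: no side information stored.
    levels = 1 << bits
    return min(int(theta / (math.pi / 2) * levels), levels - 1)

def quantize_scaled(x, lo, hi, bits):
    # A raw coordinate can be anywhere, so a traditional quantizer must
    # also store the per-vector bounds (lo, hi) used to normalize it;
    # that per-vector metadata is the overhead a fixed grid avoids.
    levels = 1 << bits
    return min(int((x - lo) / (hi - lo) * levels), levels - 1)

code = quantize_angle(0.6435, bits=6)  # ~37 degrees -> integer code 26
```

At 6 bits per angle you store a small integer instead of a 32-bit float, which is where the actual compression shows up.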
 
Upvote
17 (17 / 0)