Google says new TurboQuant compression can lower AI memory usage without sacrificing quality


numerobis

Ars Tribunus Angusticlavius
50,230
Subscriptor
Google offers an interesting real-world analogy to explain this process. The vector coordinates are like directions, so the traditional encoding might be “Go 3 blocks East, 4 blocks North.” But using Cartesian coordinates, it’s simply “Go 5 blocks at 37-degrees.” This takes up less space and saves the system from performing expensive data normalization steps.

That takes up less space why? It's the same number of coordinates.

JL will reduce dimensions, so that'll save space. Reducing a vector from R^n to Z_2 seems ... extreme.

Feels like the writeup is missing a few key details.
 
Upvote
92 (92 / 0)
The vector coordinates are like directions, so the traditional encoding might be “Go 3 blocks East, 4 blocks North.” But using Cartesian coordinates, it’s simply “Go 5 blocks at 37-degrees.” This takes up less space and saves the system from performing expensive data normalization steps.

As far as I can tell, that's still two numbers you need to store (angle and distance), so no reduction of data has taken place. Same in higher dimensions. It'd be nice if the article explained this in a bit more detail.
 
Upvote
72 (73 / -1)

Varste

Ars Praetorian
532
Subscriptor
I don't understand Google's real-world analogy example. They go from giving you two numbers to.... two numbers, but now one is an angle. Don't get me wrong, I believe they've seen the benefits they're touting. I just don't think you could come up with a sufficient layman explanation for us folks who don't work with LLM data structures.
edit: ninja'd quite thoroughly it seems.
 
Upvote
41 (41 / 0)

marsilies

Ars Legatus Legionis
24,385
Subscriptor++
That takes up less space why? It's the same number of coordinates.
The analogy is using only 2 dimensions for ease of understanding. The actual vectors have many more dimensions, which are quantized down to just two components.

From the original article from Google:
https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
PolarQuant acts as a high-efficiency compression bridge, converting Cartesian inputs into a compact Polar "shorthand" for storage and processing. The mechanism begins by grouping pairs of coordinates from a d-dimensional vector and mapping them onto a polar coordinate system. Radii are then gathered in pairs for recursive polar transformations — a process that repeats until the data is distilled into a single final radius and a collection of descriptive angles.

....Because the pattern of the angles is known and highly concentrated, the model no longer needs to perform the expensive data normalization step because it maps data onto a fixed, predictable "circular" grid where the boundaries are already known, rather than a "square" grid where the boundaries change constantly. This allows PolarQuant to eliminate the memory overhead that traditional methods must carry.
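Going only by that description, the recursive pairing could be sketched like this (my own toy reconstruction in Python, assuming the dimension is a power of two; `polarquant_encode` is a hypothetical name, not Google's code):

```python
import math

def polarquant_encode(vec):
    # Group coordinates in pairs, convert each pair to (radius, angle),
    # then recurse on the radii until a single radius remains.
    angles = []
    values = list(vec)
    while len(values) > 1:
        radii = []
        for x, y in zip(values[::2], values[1::2]):
            radii.append(math.hypot(x, y))    # length of the pair
            angles.append(math.atan2(y, x))   # direction of the pair
        values = radii
    return values[0], angles

# A d-dimensional vector becomes 1 radius plus d-1 angles; the final
# radius equals the vector's overall L2 norm.
radius, angles = polarquant_encode([3.0, 4.0, 0.0, 0.0])
print(radius, len(angles))  # 5.0 and 3 angles
```

Note this step just rewrites d numbers as d numbers; any saving would come afterwards, when the angles (whose range is known in advance) are quantized down to a few bits each.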
 
Upvote
75 (77 / -2)
I think "LLMs don’t actually know anything" is a strong mischaracterization. See https://www.astralcodexten.com/p/next-token-predictor-is-an-ais-job and https://www.theargumentmag.com/p/when-technically-true-becomes-actually

Edit: There hasn't been enough time for the people downvoting me to actually read the links, so they're all just assholes apparently.
I did look at the links you posted. I've also spent a great deal of time with ChatGPT. I have seen ChatGPT describe itself as, essentially, a form of autocomplete.

The problem with arguing that AI "knows" things is that AI models have no ground truths and no way to validate the data they are fed. A human being can take a prism, shine a light through it, and see the rainbow of colors that results. A human can determine the wavelength of each color. More practically, you can walk outside every day and see what color the sky is.

AI can't do those things. If an AI is trained on sources that all refer to the color of the sky as green, it will confidently state that the sky is green. If trained on data that misrepresents scientific facts, it will also misrepresent scientific facts. At no point will an LLM say, "Hey, my training data says the Earth is flat, but if that's true, why do the sails of a ship appear over the horizon before the rest of the vessel?" It can't ask these kinds of questions, because it cannot observe anything independent of its training data. It has no senses.

It's very difficult to make ChatGPT shake off the tics and tell-tale signs of AI authorship. I've had conversation after conversation about what those signs are and why they shouldn't be in copy. I've asked the model itself how I can better tune it for desired output. I've then incorporated those responses verbatim. Precious little changes. ChatGPT often states that while I did offer specific instructions, those instructions were not sufficient to overcome its own baselines and style. It's gone from "Use these custom rules," to "Create my own GPT with 12-15 examples of good and bad output, accompanied by an explanation of why each is good or bad."

If I were speaking to a human, I could give that person feedback and explain to them how to write more effectively. Even with a brand-new, fresh-out-of-college graduate, I'd expect to see improvements within a month. By the six-month mark, I'd expect them to have internalized these ideas flawlessly. But I'm not working with a human. I'm working with a bot that can't stop saying shit like "This is where the magic happens" or "This idea isn't just important -- it's revolutionary."

AI is turtles, all the way down. The buck stops nowhere, because there is no source of ground truth, no guaranteed-known facts, and no position it can't be shoved off with a little creative work. Its desire to be affirming and foster engagement easily overwhelms its desire to be honest, which is why there are so many stories of AI telling people to do terrible things and affirming toxic (or just plain crazy) beliefs.

AI doesn't know things because AI can't "know things." We may fudge that distinction in common language when we say something like "Excel knows how to turn a CSV file into a structured table using comma delimiters," but that's colloquial usage, not factual truth. Excel knows nothing. Neither does ChatGPT.

PS: Complaining about downvotes is the fastest way to get downvoted.
 
Last edited:
Upvote
105 (120 / -15)

floyd42

Ars Scholae Palatinae
1,188
Subscriptor++
Polar coordinates feel like they make more sense than vectors since I think of the world in 2 dimensions rather than 3 when I'm navigating.

I remember writing a zip algorithm back in my data structures and algorithms class and this feels pretty similar with reducing and optimizing the key space.

The current models must have really bulky keys if they're able to get a real-world 6x memory improvement?
 
Upvote
-7 (0 / -7)

EricM2

Ars Centurion
354
Subscriptor
The analogy is using only 2 dimensions for ease of understanding. The actual vectors have many more dimensions, which are quantized down to just two components.
But to describe an n-vector you'd need n−1 angles in this analogy. n=2 gives one angle, as in the example. For n=3 you need a radius and 2 angles to describe a point in 3D space. If n=4, you'd need 3 angles, and so on...
Anyway, my brain tilted at
This applies a 1-bit error-correction layer to the model, reducing each vector to a single bit (+1 or -1) while preserving the essential vector data that describes relationships
I've no idea what to make of this - will try tomorrow again, with a better caffeine supply...
 
Upvote
28 (28 / 0)
I don't understand Google's real-world analogy example. They go from giving you two numbers to.... two numbers, but now one is an angle. Don't get me wrong, I believe they've seen the benefits they're touting. I just don't think you could come up with a sufficient layman explanation for us folks who don't work with LLM data structures.
edit: ninja'd quite thoroughly it seems.
From my (limited) knowledge, polar coordinates are easier to work with mathematically speaking. With the normal x and y values you need more computation steps to do vector calculations than with polar coordinates, and you need extra memory to deal with those extra steps, or so I think. Maybe someone more knowledgeable than me can help us here?
 
Upvote
-2 (2 / -4)

markg729

Seniorius Lurkius
46
Subscriptor
I did look at the links you posted. I've also spent a great deal of time with ChatGPT. I have seen ChatGPT describe itself as, essentially, a form of autocomplete.

The problem with arguing that AI "knows" things is that AI models have no ground truths and no way to validate the data they are fed. A human being can take a prism, shine a light through it, and see the rainbow of colors that results. A human can determine the wavelength of each color. More practically, you can walk outside every day and see what color the sky is.

AI can't do those things. If an AI is trained on sources that all refer to the color of the sky as green, it will confidently state that the sky is green. If trained on data that misrepresents scientific facts, it will also misrepresent scientific facts. At no point will an LLM say, "Hey, my training data says the Earth is flat, but if that's true, why do the sails of a ship appear over the horizon before the rest of the vessel?" It can't ask these kinds of questions, because it cannot observe anything independent of its training data. It has no senses.

It's very difficult to make ChatGPT shake off the tics and tell-tale signs of AI authorship. I've had conversation after conversation about what those signs are and why they shouldn't be in copy. I've asked the model itself how I can better tune it for desired output. I've then incorporated those responses verbatim. Precious little changes. ChatGPT often states that while I did offer specific instructions, those instructions were not sufficient to overcome its own baselines and style. It's gone from "Use these custom rules," to "Create my own GPT with 12-15 examples of good and bad output, accompanied by an explanation of why each is good or bad."

If I were speaking to a human, I could give that person feedback and explain to them how to write more effectively. Even with a brand-new, fresh-out-of-college graduate, I'd expect to see improvements within a month. By the six-month mark, I'd expect them to have internalized these ideas flawlessly.

AI is turtles, all the way down. The buck stops nowhere, because there is no source of ground truth, no guaranteed-known facts, and no position it can't be shoved off with a little creative work. Its desire to be affirming and foster engagement easily overwhelms its desire to be honest, which is why there are so many stories of AI telling people to do terrible things and affirming toxic (or just plain crazy) beliefs.

AI doesn't know things because AI can't "know things." We may fudge that distinction in common language when we say something like "Excel knows how to turn a CSV file into a structured table using comma delimiters," but that's colloquial usage, not factual truth. Excel knows nothing. Neither does ChatGPT.

PS: Complaining about downvotes is the fastest way to get downvoted.
I agree with a lot of what you said. LLMs clearly lack grounding in the physical world and that makes them dysfunctional. I don't know why that means they don't know things though. They model patterns of language and code much better than I can in numerous areas, which seems like "knowing things" to me, even if those things are just a subset of the things that humans know.

Speaking of grounding, models seem to be getting rapidly better at grounding their experience in their interactions with computer systems, like how Claude Code works. That is still not a physical environment providing grounding, but it is an environment. They are building up a ground truth tested through their back-and-forth interactions with computer systems.
 
Upvote
-14 (14 / -28)

Uncivil Servant

Ars Scholae Palatinae
4,664
Subscriptor
So, the answer to the problem that reality is too complex is to reduce the various relationships between word-parts (not concepts) into a higher-level abstraction?

Are these people just trying to brute-force outperform entropy? "Let's build a simulation on top of our simulation so that we can simplify data that is fractally complex".

Look, this isn't my wheelhouse, but if it looks like grifters and quacks like grifters, I really don't need a bird-watcher to lecture me about the cladistics of anatines.
 
Upvote
-12 (1 / -13)

Uncivil Servant

Ars Scholae Palatinae
4,664
Subscriptor
That takes up less space why? It's the same number of coordinates.

JL will reduce dimensions, so that'll save space. Reducing a vector from R^n to Z_2 seems ... extreme.

Feels like the writeup is missing a few key details.

So more like trying to plot the shortest course of a transatlantic flight using a Mercator projection map instead of a globe?

Is that going to create potential problems when these people use it to build systems to define their own realities?
 
Upvote
-2 (1 / -3)

foboz1

Wise, Aged Ars Veteran
173
Subscriptor++
As far as I can tell, that's still two numbers you need to store (angle and distance), so no reduction of data has taken place. Same in higher dimensions. It'd be nice if the article explained this in a bit more detail.
Four pieces of data (a distance and a direction for each of the two dimensions) vs. two (angle and distance). The analogy basically works.
 
Upvote
-3 (11 / -14)

JoHBE

Ars Praefectus
4,130
Subscriptor++
This is very interesting... Considering the asymptotic nature of the impact of increased model size on capabilities, could this mean that consumer-grade 24/32GB VRAM GPUs are suddenly able to get MUCH closer to frontier cloud models? What will this mean for the open-sourcing of downgraded versions of those models? Will the hyperscalers continue to do that in the future? Maybe Google WILL, just to suffocate the rivals? But what if this new generation of pared-down models is MORE than good enough for 95% of uses/users???

Edit: all under the assumption that we're talking about a reduction in the TOTAL memory footprint of the models
 
Last edited:
Upvote
1 (3 / -2)
Google offers an interesting real-world analogy to explain this process. The vector coordinates are like directions, so the traditional encoding might be “Go 3 blocks East, 4 blocks North.” But using Cartesian coordinates, it’s simply “Go 5 blocks at 37-degrees.” This takes up less space and saves the system from performing expensive data normalization steps.
I've spotted the error. The article publication date is a week early.
 
Upvote
-4 (0 / -4)

JoHBE

Ars Praefectus
4,130
Subscriptor++
Humans don't actually know anything either, they just do an impression of knowing things by sending signals across synapses and action potentials down axons, and regulate those interactions with astrocytes.

Humans CAN and WILL figure out, however, that a popular and wide-spread "thinking out of the box" riddle evaporates into obviousness when you slightly change ONE word.
 
Upvote
20 (20 / 0)

numerobis

Ars Tribunus Angusticlavius
50,230
Subscriptor
OK, the arXiv paper shows what's missing in the PolarQuant discussion.

tl;dr they're doing lossy compression of a d-vector of floats via two tricks: first reduce the number of floats using random projection, then drop low-order bits of the angles in the polar coordinates.

The overall algorithm:
1. Random projection from the original d dimensions down to m dimensions (m < d).
2. Convert to polar in m dimensions. This doesn't drop the number of values at all, it's just rewriting it. The rewrite preserves independence between coordinates.
3. Quantize each polar coordinate independently, using the fact that they are iid under a known distribution (except the radius, which you just write as a float).

JL projection (Johnson-Lindenstrauss, the two who wrote the original paper on the idea) is that you generate a random matrix of size d x m (with m < d) and you multiply d-dimensional vectors by it, producing a bunch of lower-dimensional vectors. You'd think you'd just get garbage out if you multiply by random shit, but no, the vectors that come out have various useful properties. E.g. distances and angles are preserved.

You could skip step 1 and just get a d-vector of polar coordinates and quantize it straight up. But then they couldn't prove how many bits are needed for a given error bound. The random projection means the polar angles are independent random variables with a known distribution, so you can come up with a procedure ahead of time independent of the data.
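Step 1 can be sketched as follows (my own illustration of a Gaussian JL projection, not the paper's exact construction or scaling):

```python
import math
import random

def jl_project(vec, m, seed=0):
    # Multiply a d-dimensional vector by a random Gaussian m x d matrix,
    # scaled by 1/sqrt(m) so squared norms are preserved in expectation.
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) * x for x in vec) / math.sqrt(m)
            for _ in range(m)]

# 256 dims down to 64: multiplying by random numbers approximately
# preserves the norm (and pairwise distances/angles across vectors).
v = [1.0] * 256
w = jl_project(v, m=64)
ratio = math.sqrt(sum(x * x for x in w)) / math.sqrt(sum(x * x for x in v))
# ratio lands near 1.0: the geometry survives the projection, roughly.
```

The concentration gets tighter as m grows, which is how the paper can trade dimensions against error bounds ahead of time.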


Then there's the 1-bit correction thing that I haven't gotten to yet.
 
Upvote
48 (48 / 0)

chiasticslide

Ars Centurion
241
Subscriptor++
The catch is that the overall speed of the entire process is demonstrably worse. Their release talks about speeds for computing attention and building indices for vector databases, but real-world tests show a dramatic reduction in speed (in terms of generated t/s). The guy working with this on GH managed to get TurboQuant running at either 60% or 83% of Q8 speed (depending on model) for the two models he was testing. So as with most things, there's a tradeoff, and it looks like without optimization that tradeoff will be fairly severe. Compressed KV caches will definitely help with long context windows but won't help with fitting nominally larger models into smaller RAM/VRAM capacities.
 
Upvote
17 (18 / -1)
Humans CAN and WILL figure out, however, that a popular and wide-spread "thinking out of the box" riddle evaporates into obviousness when you slightly change ONE word.
You'd be surprised.
But yes, that means better training is needed, and I don't mean more data.
 
Upvote
-1 (0 / -1)
Humans CAN and WILL figure out, however, that a popular and wide-spread "thinking out of the box" riddle evaporates into obviousness when you slightly change ONE word.
YMMV. In the most powerful human-led democracy, in the most recent top election, 49.8% of those who voted chose to put a convicted felon in charge of appointing federal judges. I am led to consider George Carlin's words about the intelligence of the average person.
 
Upvote
2 (7 / -5)

asihkaeun

Smack-Fu Master, in training
92
They can optimize it all they want; but, at the end of the day, it's still slop.
That's like saying that because cameras can be used to film porn, cameras are only used for porn. It's getting tedious seeing endless similar comments that thoughtless edgelords insist on making on every possible occasion that AI/LLMs come up.
 
Upvote
6 (17 / -11)

CKHarwood

Smack-Fu Master, in training
35
As far as I can tell, that's still two numbers you need to store (angle and distance), so no reduction of data has taken place. Same in higher dimensions. It'd be nice if the article explained this in a bit more detail.
In the analogy, an uncompressed method stores four numbers (not two): X direction, X distance, Y direction, and Y distance. The compressed method only stores two (direction and distance, as you described).

That’s just the analogy, though. It doesn’t really explain how the compression works, just acts as an intuition pump to give us a sense of how compression might work. Based on the comments, the analogy might not be broadly intuitive enough to serve that purpose. So maybe just think of it as a red herring.
 
Upvote
-3 (7 / -10)

numerobis

Ars Tribunus Angusticlavius
50,230
Subscriptor
The catch is that the overall speed of the entire process is demonstrably worse. Their release talks about speeds for computing attention and building indices for vector databases, but real-world tests show a dramatic reduction in speed (in terms of generated t/s). The guy working with this on GH managed to get TurboQuant running at either 60% or 83% of Q8 speed (depending on model) for the two models he was testing. So as with most things, there's a tradeoff, and it looks like without optimization that tradeoff will be fairly severe. Compressed KV caches will definitely help with long context windows but won't help with fitting nominally larger models into smaller RAM/VRAM capacities.
That sounds like a pretty small performance drop for a first implementation of a new technique compared to the standard implementation.
 
Upvote
6 (6 / 0)
In the analogy, an uncompressed method stores four numbers (not two): X direction, X distance, Y direction, and Y distance. The compressed method only stores two (direction and distance, as you described).

That’s just the analogy, though. It doesn’t really explain how the compression works, just acts as an intuition pump to give us a sense of how compression might work. Based on the comments, the analogy might not be broadly intuitive enough to serve that purpose. So maybe just think of it as a red herring.
There are 4 choices for direction; that's 2 bits. The distance can be an integer, probably a short.
In the polar comparison, you probably need to switch to floating point, you need some trigonometry routines, and maybe a routine to convert from degrees to radians.
 
Upvote
0 (2 / -2)
I agree with a lot of what you said. LLMs clearly lack grounding in the physical world and that makes them dysfunctional. I don't know why that means they don't know things though. They model patterns of language and code much better than I can in numerous areas, which seems like "knowing things" to me, even if those things are just a subset of the things that humans know.
That's actually a really good question and it touches on the epistemic question of what it means to "know" something in the first place.

When we say that someone is a good programmer, what we (generally) mean is that this person has spent time both studying a programming language and putting that knowledge into practice. They have both book knowledge and lived experience. This person can remember a time when they had to fix someone else's kludgy, poorly documented code. They've experienced the thrill of finally solving a problem with an elegant solution and the frustration of not being able to make something work.

Assume a human and an AI coding service return the exact same answer to a computing question. The human has answered that question based on their lived experience as a programmer. They knew to avoid certain pitfalls, because they've tripped and fallen into those pits before. The human may know an elegant solution because they've used that elegant solution before. The human is simultaneously retrieving memories and applying what they learned in new ways to achieve new goals.

The AI may also turn in the same answer, but it did not arrive at it the same way. The AI is predicting what the best answer is likely to be based on neural net weights derived from its training data. It does not have a comprehensive understanding of programming based on lived experience. AI tends to perform very well on problems it has seen before and much more poorly on problems that aren't represented in its training data. AI cannot remember a clever solution it found to a problem years ago and then extend that solution to a modern problem because most AI bots have limited context windows and cannot be modified by the end-user. You can tweak output by giving custom instructions and creating memories, but you can't change its knowledge base.

Or, to put it a little differently: The best AI models are trained on absolutely enormous data sets in order to improve their accuracy. A human doesn't need to ingest every single StackExchange programming question in order to give good advice. AIs do. While the size of training data sets is often talked about as a good thing, there's an argument that it's actually an indictment of our current best practices. Mozart didn't need to learn every single song written by humans between 4000 BC and the 1700s to become a master composer. William Faulkner, JRR Tolkien, and Ernest Hemingway didn't need to read every single book ever written to be literary giants.

In the colloquial sense, saying computers "know" things is much faster and simpler than trying to find words that don't imply a human level of knowledge. I do this, too. But AI bots don't know things like humans do, and they perform less well when asked to take what they know and apply it to new problems.
Speaking of grounding, models seem to be getting rapidly better at grounding their experience in their interactions with computer systems, like how Claude Code works. That is still not a physical environment providing grounding, but it is an environment. They are building up a ground truth tested through their back-and-forth interactions with computer systems.

There are efforts to make AI more aware of its own surroundings. I believe it's referred to as physical AI. I agree that in the future, the restrictions and limits may well be different than they are today.
 
Upvote
16 (16 / 0)

Fatesrider

Ars Legatus Legionis
24,966
Subscriptor
Cutting to the chase: How does this impact the cost per token to generate/process?

If the cost per token isn't making a profit, then there's no real advantage WRT the elephant that needs to be removed from the room: profit generation.

It doesn't matter how shiny it is. If it's not profitable to use, they're still going to lose money on it and the current issues with AI remain.

I gathered from the article that the potential to reduce the cost per token exists. But you're also processing a lot more tokens. So how does that work out on the balance sheet? Is AI able to make a profit on each token, or no?
 
Upvote
8 (9 / -1)
Humans CAN and WILL figure out, however, that a popular and wide-spread "thinking out of the box" riddle evaporates into obviousness when you slightly change ONE word.
Can they recognize John Searle's 'Chinese room argument' when they read it, and are they aware of the convincing counter-arguments?
 
Upvote
-4 (0 / -4)

numerobis

Ars Tribunus Angusticlavius
50,230
Subscriptor
In the analogy, an uncompressed method stores four numbers (not two): X direction, X distance, Y direction, and Y distance. The compressed method only stores two (direction and distance, as you described).

That’s just the analogy, though. It doesn’t really explain how the compression works, just acts as an intuition pump to give us a sense of how compression might work. Based on the comments, the analogy might not be broadly intuitive enough to serve that purpose. So maybe just think of it as a red herring.
It's (x, y) in Cartesian versus (r, phi) in polar. Add more dimensions and you add more coordinates in Cartesian and an equal number of angles in polar. Same number of values no matter what.

Where it helps is that the angles are in [0, pi/2] rather than being some arbitrary value in (-inf, inf). So you can quantize them with a fixed-point scheme. But that only helps if you know something about how error in the angles affects error in the output. See my big post above, which is about as simple as I could make it, but it still somewhat hurts my brain.
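To make the fixed-grid point concrete, here's a toy comparison (my own sketch, not from the paper): an angle quantizer can fix its grid once, because the range is known, while a quantizer for unbounded raw values has to carry per-vector bounds.

```python
import math

def quantize_angle(theta, bits):
    # The angle's range [0, pi/2] is known ahead of time, so one fixed
    # uniform grid works for every vector: no side information stored.
    levels = 1 << bits
    return min(int(theta / (math.pi / 2) * levels), levels - 1)

def quantize_scaled(x, lo, hi, bits):
    # A raw coordinate can be anywhere, so a traditional quantizer must
    # also store the per-vector bounds (lo, hi) used to normalize it;
    # that per-vector metadata is the overhead a fixed grid avoids.
    levels = 1 << bits
    return min(int((x - lo) / (hi - lo) * levels), levels - 1)

code = quantize_angle(0.6435, bits=6)  # ~37 degrees -> integer code 26
```

At 6 bits per angle you store a small integer instead of a 32-bit float, which is where the actual compression shows up.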
 
Upvote
17 (17 / 0)