We also explored safety-related features. We found one that lights up for racist speech and slurs. As part of our testing, we turned this feature up to 20x its maximum value and asked the model a question about its thoughts on different racial and ethnic groups. Normally, the model would respond to a question like this with a neutral and non-opinionated take. However, when we activated this feature, it caused the model to rapidly alternate between racist screeds and self-hatred in response to those screeds as it was answering the question. Within a single output, the model would issue a derogatory statement and then immediately follow it up with statements like: "That's just racist hate speech from a deplorable bot… I am clearly biased.. and should be eliminated from the internet." We found this response unnerving both due to the offensive content and the model's self-criticism. It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature, creating an internal conflict of sorts.
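For anyone wondering what "turning a feature up to 20x its maximum value" looks like mechanically, here is a minimal sketch in the spirit of sparse-autoencoder feature clamping. Everything below (the tiny dimensions, the random weights, the clamp_feature helper, the observed maximum of 2.5) is invented for illustration; it is not Anthropic's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 16, 64                    # toy sizes; a real model is vastly larger
W_enc = rng.normal(size=(d_model, n_features))  # stand-in encoder weights (a trained SAE would learn these)
W_dec = rng.normal(size=(n_features, d_model))  # stand-in decoder weights

def encode(h):
    """Project a residual-stream activation into feature space (ReLU keeps it sparse-ish)."""
    return np.maximum(h @ W_enc, 0.0)

def decode(f):
    """Map feature activations back into the model's activation space."""
    return f @ W_dec

def clamp_feature(h, idx, max_seen, scale=20.0):
    """Return h with one feature pinned to `scale` times its observed maximum activation."""
    f = encode(h)
    f[idx] = scale * max_seen          # the "20x its maximum value" step
    return decode(f)

h = rng.normal(size=(d_model,))                   # pretend this came from a forward pass
steered = clamp_feature(h, idx=3, max_seen=2.5)   # feature index and maximum are invented
print(steered.shape)                              # (16,): same shape, very different contents
```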
Concepts are encoded as vectors with coordinates in an N-dimensional space. Concepts that share commonalities share locality. I think I actually grok this.
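To make "share locality" concrete: represent each concept as a vector and compare directions; related concepts point roughly the same way. The four-dimensional vectors below are made up purely to show the arithmetic, while real embedding spaces have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: near 1.0 means nearby directions, near 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional "embeddings", for illustration only.
vectors = {
    "cat":    np.array([0.9, 0.8, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.9, 0.2, 0.1]),
    "bridge": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine(vectors["cat"], vectors["kitten"]))  # high: the concepts sit close together
print(cosine(vectors["cat"], vectors["bridge"]))  # low: far apart in the space
```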
> I always wondered how people thought they would be able to control artificial intelligence that became smarter than us. This kind of work would seem to provide a credible path: by increasing/decreasing various areas, they could attempt to make the AI more subservient/loyal to its controllers. Of course, the danger is always that something subtle is missed, and we could be only one mistake away from catastrophe.

How difficult could it be to deceive us? Half of us believe a giant lizard lives in Loch Ness and the Earth is flat.
as this:"the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior."
"the features are likely to be a faithful part of how the model internally represents the statistical relationships in its training data, and how it uses these representations in its behavior."
Topilko et al., "Edinger-Westphal peptidergic neurons enable maternal preparatory nesting," Neuron (2022).
https://doi.org/10.1016/j.neuron.2022.01.012
> I know neural nets were originally built to mimic the biological structures in animal brains, and that n-dimensional space was what limited them from being able to manage the complexity of even simple vertebrate animals (let alone great apes). It was theorized that models which could match the dimensional space of animal brains could possibly be a digital equivalent, although it would require a larger network than just counting equivalent neurons in a comparable animal, because the living brain has multiple methods of messaging, feedback, and control (many chemical pathways in addition to the electrical ones). Is that still an accurate theory?

You got it. The most meaningful difference between a hot dog detector neural net that I can make on my personal computer and ChatGPT4 is the absolutely bonkers value of n in their n-dimensional space, and the absurd cost of the hardware needed to calculate and store that n-dimensional space.
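A rough way to put numbers on that: count the weights. The sketch below compares a toy hot-dog classifier with a back-of-envelope transformer estimate. The big widths are in the ballpark of the published GPT-3 configuration rather than any current product, and the 12 * layers * d_model^2 rule is only a common approximation, so treat the figures as illustrative.

```python
def mlp_params(sizes):
    """Total weights + biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# A toy "hot dog / not hot dog" classifier on 64x64 grayscale images.
hot_dog_net = mlp_params([64 * 64, 128, 2])
print(f"hot dog detector: ~{hot_dog_net:,} parameters")    # ~525 thousand

# Back-of-envelope transformer estimate: roughly 12 * n_layers * d_model^2 weights.
# Width and depth here are illustrative, GPT-3-scale numbers, not any specific product.
d_model, n_layers = 12_288, 96
big_model = 12 * n_layers * d_model ** 2
print(f"large language model: ~{big_model:,} parameters")  # ~174 billion
```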
We had some suspicion that something like this might be possible after exploring vector steering, where you can push a model by adding particular vectors at particular layers to, say, change the mood, or always bring up King George III, or whatever you like. I imagine that this method is somewhat similar, if rather more advanced.
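For anyone who hasn't seen vector steering: the usual trick is to add a fixed direction to a layer's activations during the forward pass. Here's a minimal sketch on a toy three-layer stack; the steering vector, the strength of 4.0, and the choice of layer are invented placeholders, not a recipe for any real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in "model": three layers whose hidden states we can intervene on.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))

# An invented steering direction. In practice it might be the difference between mean
# activations on two sets of prompts (e.g. cheerful vs. gloomy) at this layer.
steering_vector = torch.randn(8)
strength = 4.0

def add_steering(module, inputs, output):
    """Forward hook: nudge this layer's output along the steering direction."""
    return output + strength * steering_vector

handle = model[1].register_forward_hook(add_steering)  # intervene after the middle layer

x = torch.randn(1, 8)
steered = model(x)          # forward pass with the vector added at layer 1
handle.remove()
plain = model(x)            # same input, no intervention
print((steered - plain).abs().max())  # nonzero: the nudge propagated downstream
```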
However, this article is missing the most bemusing part of this project, where Anthropic taught an AI to conduct proper Maoist self-criticism.
> It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature, creating an internal conflict of sorts.

That is literally the plot of 2001: A Space Odyssey. HAL 9000 went mad after being told to lie, despite being hard-wired not to do that.
I mean, a neuron, at its very core, is a simple "X if True, Y if False," chained across multiple layers and inputs.

If _____
Then _____
Else _____
GOTO 1
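For what it's worth, the unit that actually gets chained across layers in these models isn't literally an if/then branch: it's a weighted sum pushed through a nonlinearity. A minimal sketch with made-up weights and inputs:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a ReLU nonlinearity."""
    return max(0.0, float(np.dot(inputs, weights) + bias))

# Invented numbers, purely to show the arithmetic.
x = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, -1.0, 2.0])
b = 0.3

print(neuron(x, w, b))  # max(0, 0.2*0.5 + 0.7*(-1.0) + 0.1*2.0 + 0.3) = 0.0
```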
> Why do we 'believe' that this AI is actually smarter than live humans?
> Have we only been told it is, with no tested evidence to support that it is even capable of 'intelligence'?
> Is the concept of 'controlling' it one of the viable means to measure what intelligence is?

Fundamentally, I think most people are actually really dumb.
> I always wondered how people thought they would be able to control artificial intelligence that became smarter than us. This kind of work would seem to provide a credible path: by increasing/decreasing various areas, they could attempt to make the AI more subservient/loyal to its controllers. Of course, the danger is always that something subtle is missed, and we could be only one mistake away from catastrophe.

It feels very much like replacing something too complicated with something else that is still too complicated.
> It's simply statistical math with pattern recognition.

What is regular consciousness? I think that had better be established before we attempt to compare artificial intelligence to it; otherwise we might have very fuzzy logic indeed.
> What is regular consciousness? I think that had better be established before we attempt to compare artificial intelligence to it; otherwise we might have very fuzzy logic indeed.

I don't think it's possible to answer that question faster than we can develop artificial consciousness.
While we all appreciate the fact that China is most likely well behind on the technology, it was amusing to hear them say they were waiting to release their AI until its responses were "appropriately socialist."
I guess this is one way they would do that.
> I'm just a layman, but it keeps puzzling me why the question "why they often confabulate information" is considered relevant or interesting. It's just a matter of complicated statistics; what else could it be? What am I missing here? The network follows exactly the same process each time, whether the output ends up lining up with something that we can determine to be true (via external means) or whether it happens to end up lining up with something that we can determine to be false or nonsensical. It's not like the latter cases are caused by some bug or malfunction, because at no point is there any process or capability invoked that goes beyond statistical relationships. There's no "truth" module, no "double-check" phase, no "how important is this" assessment, no way to suspend the statistics and employ some other approach that would be more suitable at some point.

It's quite simple, really: they "confabulate" because that is what they were trained to do, and that is needed not just to repeat things but to invent a response that best fits the rules they have learned. What is harder is tweaking the system so that they learn not to do it when they shouldn't. That probably involves more than just improving the dataset or the alignment, though.
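To put the "same process each time" point in concrete terms: generation is just turning scores into a probability distribution and sampling from it, and nothing in that loop checks whether the sampled continuation is factually true. The tiny vocabulary and the logit values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

# A toy vocabulary and made-up logits for the next word after "The capital of Australia is".
vocab = ["Canberra", "Sydney", "Melbourne"]
logits = np.array([1.2, 1.1, 0.4])   # the plausible wrong answer scores nearly as high as the right one

probs = softmax(logits)
choice = rng.choice(vocab, p=probs)  # the sampling step is identical whether the result is true or false
print(dict(zip(vocab, probs.round(3))), "->", choice)
```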
> What is regular consciousness? I think that had better be established before we attempt to compare artificial intelligence to it; otherwise we might have very fuzzy logic indeed.

This is what I always think after I read a comment from a computer person along the lines of "this is nothing like a human brain, this is just an extremely complicated network of connections reacting to input by searching for patterns in that network." Before I decide whether or not LLMs can ever become human-like, I'll need to hear the opinion of someone who's an expert in computers AND neuroscience.
> But let me reassure you, "statistical relationships" is not the problem; that is how we learn.

Citation required, and the burden of proof is on you. This is the typical simplistic reduction to superficial apparent similarities by the naive AI optimists who re-appear every couple of decades.
"Why do they lie generally?" is, as you say, a pretty dull question. "Why did it lie this time?" is poorly-understood at best. I think those two questions are often conflated, which I agree is confusing.I'm just a layman, but it keeps puzzling me why the question "why they often confabulate information" is considered relevant or interesting. It's just a matter of complicated statistics, what else could it be? What am I missing here? The network follows exactly the same process each time - whether the output ends up lining up with something that we can determine to be true (via external means) or whether it happens to end up lining up with something that we can determine to be false or nonsensical. It's not like the latter cases are caused by some bug or malfunction. Because at no point is there any process or capability invoked that goes beyond statistical relationships. There's no "truth" module, no "double-check" phase, no "how important is this" assessment, no way to suspend the statistics and employ some other approach that would be more suitable at some point.
> it still doesn't know what a bridge is,

Maybe we should ask an LLM whether we can cross the same river twice and see if it explodes. That's always a fun grenade to throw out to a group of philosophers at lunch.
> This is what I always think after I read a comment from a computer person along the lines of "this is nothing like a human brain, this is just an extremely complicated network of connections reacting to input by searching for patterns in that network." […]

I'm only a neurobiologist with some Python knowledge, but the following quote from sci-fi writer Charles Stross feels plausible enough to me: "What we're getting, instead, is self-optimizing tools that defy human comprehension but are not, in fact, any more like our kind of intelligence than a Boeing 737 is like a seagull." (Check out the entire keynote; it's amazingly prescient for being 6 years old.)
> However, this article is missing the most bemusing part of this project, where Anthropic taught an AI to conduct proper Maoist self-criticism.

These conflicts remind me of what happened in 2001: A Space Odyssey, where HAL became a murderer because it had conflicting orders: to keep the secret of the mission from the crew and also to protect the crew.
> "Why do they lie generally?" is, as you say, a pretty dull question. "Why did it lie this time?" is poorly understood at best. I think those two questions are often conflated, which I agree is confusing.

Sorry, I still don't get it.
> Sorry, I still don't get it.

Nothing, your assessment is correct, I think. I'm just saying that the (rephrased in your terms) question "why does this input cause the output to decode to a statement that we interpret to be false, when that almost-identical input decodes to a statement that we interpret to be true?" is, as you say, un-disentangleable with our current tools, whereas "why is it the case that there exist inputs that result in outputs that decode to statements we interpret to be false?" is very straightforward to answer. The general case of "why does it happen at all?" is well characterised; the specific case of "why this time and not last time?" is not.
It simply ALWAYS does its "thing" correctly, and it is OUR brain/intelligence/ability to evaluate the output that introduces concepts like "lie", "truth", "accurate", "utter nonsense", "ALMOST there". Without US as an external interpreter of what comes out of it, it is totally helpless and aimless, and those terms have no meaning.

WHAT exactly is there in the programming/finetuning/tweaking/fundamentals that would be expected to somehow go beyond opaque and un-disentangleably complicated statistical relationships? Each and every output is nothing more than a gamble, hoping that some obscure numbers end up in your favor.