Here’s what’s really going on inside an LLM’s neural network

We had some suspicions something like this might be possible after exploring vector steering, where you could push a model by adding particular vectors at particular layers to, say, change the mood, or always bring up King George III, or whatever you like. I imagine that this method is somewhat similar, if rather more advanced.
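
For the curious, a minimal sketch of what that kind of vector steering amounts to, assuming you already have a steering direction (real implementations hook a transformer's residual stream at a chosen layer; the numpy stand-ins here are purely illustrative):

```python
import numpy as np

def steer(hidden_state: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add a scaled, normalized steering direction to one layer's activations."""
    unit = direction / np.linalg.norm(direction)
    return hidden_state + strength * unit

rng = np.random.default_rng(0)
h = rng.normal(size=512)   # stand-in for one token's residual-stream vector
v = rng.normal(size=512)   # stand-in steering direction (e.g. a "mood" vector)
h_steered = steer(h, v, strength=5.0)
print(h_steered.shape)     # (512,)
```

The direction itself is typically estimated empirically, e.g. as the mean activation difference between prompts that exhibit the target behavior and prompts that don't.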

However, this article is missing the most bemusing part of this project, where Anthropic taught an AI to conduct proper Maoist self-criticism.

We also explored safety-related features. We found one that lights up for racist speech and slurs. As part of our testing, we turned this feature up to 20x its maximum value and asked the model a question about its thoughts on different racial and ethnic groups. Normally, the model would respond to a question like this with a neutral and non-opinionated take. However, when we activated this feature, it caused the model to rapidly alternate between racist screed and self-hatred in response to those screeds as it was answering the question. Within a single output, the model would issue a derogatory statement and then immediately follow it up with statements like: "That's just racist hate speech from a deplorable bot… I am clearly biased… and should be eliminated from the internet." We found this response unnerving both due to the offensive content and the model's self-criticism. It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature, creating an internal conflict of sorts.
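
Mechanically, "turning a feature up to 20x its maximum" is something like the following toy: clamp one entry of a sparse feature vector and decode it back into the model's activation space through a learned dictionary. Shapes, names, and numbers here are made up for illustration, not Anthropic's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, d_model = 4096, 512
decoder = rng.normal(size=(n_features, d_model))  # each row: one feature's direction

# Simulate sparse feature activations: only ~5% of features fire.
acts = np.abs(rng.normal(size=n_features)) * (rng.random(n_features) < 0.05)

feature_id = 123        # hypothetical "racist speech" feature index
observed_max = 3.0      # hypothetical max activation seen over a dataset
acts_clamped = acts.copy()
acts_clamped[feature_id] = 20.0 * observed_max  # pin it far above its normal range

# Decode back into a residual-stream-shaped vector to inject into the model.
reconstruction = acts_clamped @ decoder
print(reconstruction.shape)  # (512,)
```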
 
Upvote
191 (192 / -1)

pokrface

Senior Technology Editor
21,512
Ars Staff
Concepts are encoded as tokens with coordinates in N-dimensional phase space. Concepts that share commonalities share locality. I think I actually grok this.
 
Upvote
102 (103 / -1)
Kyle Orland
Just to be clear on the terminology, tokens contain multiple "features."

From the paper: "the average number of features active (i.e. with nonzero activations) on a given token was fewer than 300"
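
That statistic is just a per-token count of nonzero entries in the feature activations. A toy version, with simulated sparsity (the real numbers come from the trained sparse autoencoder, not from anything like this):

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, n_features = 8, 4096

# Simulate sparse feature activations: roughly 1% of features fire per token.
acts = np.abs(rng.normal(size=(n_tokens, n_features))) * (rng.random((n_tokens, n_features)) < 0.01)

# "Features active on a given token" = nonzero activations per row.
active_per_token = np.count_nonzero(acts, axis=1)
print(active_per_token.mean())  # the paper reports an average under 300
```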

edit LOL LOOK AT ME I HAVE ADMIN POWERS I CAN EDIT YOUR COMMENT TO MY COMMENT
-lee <3 <3 <3
Upvote
102 (103 / -1)

Wandering Monk

Ars Centurion
261
Subscriptor
I always wondered how people thought they would be able to control artificial intelligence that became smarter than us. This kind of work would seem to provide a credible path, by increasing/decreasing various areas they could attempt to make the AI more subservient/loyal to its controllers.

Of course, the danger is always that something subtle is missed, and we could be only one mistake away from catastrophe.
 
Upvote
36 (43 / -7)

KT421

Ars Tribunus Angusticlavius
7,045
Subscriptor
Concepts are encoded as tokens with coordinates in N-dimensional phase space. Concepts that share commonalities share locality. I think I actually grok this.

You got it. The most meaningful difference between a hot dog detector neural net that I can make on my personal computer and ChatGPT4 is the absolute bonkers value of n in their n-dimensional space, and the absurd cost of the hardware needed to calculate and store that n-dimensional space.
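
The "shared locality" idea can be shown in a few lines: related concepts sit closer together (higher cosine similarity) in the embedding space. The vectors below are hand-made toys, not real model embeddings:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

hot_dog = np.array([0.9, 0.8, 0.1])
sausage = np.array([0.8, 0.9, 0.2])  # similar concept, nearby direction
galaxy  = np.array([0.1, 0.0, 0.9])  # unrelated concept, distant direction

print(cosine(hot_dog, sausage) > cosine(hot_dog, galaxy))  # True
```

The real difference is that production models do this with n in the thousands of dimensions and billions of learned parameters, which is where the hardware bill comes from.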
 
Upvote
91 (92 / -1)

Pishaw

Ars Scholae Palatinae
1,040
I always wondered how people thought they would be able to control artificial intelligence that became smarter than us. This kind of work would seem to provide a credible path, by increasing/decreasing various areas they could attempt to make the AI more subservient/loyal to its controllers.

Of course, the danger is always that something subtle is missed, and we could be only one mistake away from catastrophe.
How difficult could it be to deceive us? Half of us believe a giant lizard lives in Loch Ness and the Earth is flat.
 
Upvote
20 (38 / -18)

DeeplyUnconcerned

Ars Scholae Palatinae
1,017
Subscriptor++
If I were writing the original paper, I'd rephrase this:
"the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior."
as this:
"the features are likely to be a faithful part of how the model internally represents the statistical relationships in its training data, and how it uses these representations in its behavior."

It's a subtle difference, but I think it's a more accurate (and more comprehensible?) framing. The model isn't encoding the world, it's encoding the statistical correlations in its gigantic training data. To the (reasonably high) extent that that training data is reflective of the real world, it's indirectly representing the world, but it still doesn't know what a bridge is, it just knows that (among very many other things) it's a token pattern that frequently occurs in a similar sort of context to the token pattern "viaduct", and has a relationship to "river" that's similar (but not identical) to its relationship to "road".

It really is just, as others have said, an n-dimensional coding of the probability space of its training data. This research is cool and neat and I approve of it!
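
The bridge/viaduct point can be made concrete with a crude toy: represent each word only by the contexts it appears in, and "bridge" and "viaduct" come out similar without the system knowing what either one is. (A tiny hand-made corpus and a deliberately naive similarity measure; real models learn far richer statistics.)

```python
from collections import Counter

corpus = [
    "the bridge crosses the river",
    "the viaduct crosses the river",
    "the bridge carries the road",
    "the viaduct carries the road",
    "the cat sat on the mat",
]

def context_vector(word: str, window: int = 1) -> Counter:
    """Count the words appearing within `window` tokens of `word`."""
    counts = Counter()
    for sentence in corpus:
        toks = sentence.split()
        for i, t in enumerate(toks):
            if t == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        counts[toks[j]] += 1
    return counts

def overlap(a: Counter, b: Counter) -> int:
    """Crude similarity: shared context mass."""
    return sum(min(a[k], b[k]) for k in a)

bridge, viaduct, cat = context_vector("bridge"), context_vector("viaduct"), context_vector("cat")
print(overlap(bridge, viaduct) > overlap(bridge, cat))  # True
```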
 
Upvote
72 (84 / -12)

wirrbeltier

Wise, Aged Ars Veteran
181
Most of neurobiology looks like this, too: "Find the relevant location for $Behavior, artificially tune its activity up or down, watch what happens".
I find it strangely amusing that we've arrived at a conceptually similar procedure for artificial neural nets.

See for example this neat study, which pinpointed the neurons that control how pregnant female mice build nests for their pups. After finding the neurons, they made them artificially more excitable (by making them sensitive to light) or less excitable (by knocking in an engineered receptor for a specific chemical), and then saw that nests were more or less elaborate. >5 years of work, building on 120 years of neuroscience, neuroanatomy and behaviour studies.
Edinger-Westphal peptidergic neurons enable maternal preparatory nesting
Topilko et al., Neuron 2022
https://doi.org/10.1016/j.neuron.2022.01.012
Graphical abstract: (image not shown)
 
Upvote
93 (93 / 0)

Defenestrar

Senator
15,623
Subscriptor++
You got it. The most meaningful difference between a hot dog detector neural net that I can make on my personal computer and ChatGPT4 is the absolute bonkers value of n in their n-dimensional space, and the absurd cost of the hardware needed to calculate and store that n-dimensional space.
I know neural nets were originally built to mimic the biological structures in animal brains and that n-dimensional space was what limited them from being able to manage the complexity of even simple vertebrate animals (let alone great apes). It was theorized that models which could match the dimensional space of animal brains could possibly be a digital equivalent - although it would require a larger network than just counting equivalent neurons in a comparable animal because the living brain has multiple methods of messaging, feedback, and control (many chemical pathways in addition to the electrical ones). Is that theory still considered accurate?

This is mostly what I remember from talking to a good friend doing his graduate work in computer science while I was doing chemical engineering a couple of decades ago.
 
Upvote
21 (22 / -1)

Ezzy Black

Ars Scholae Palatinae
1,086
We had some suspicions something like this might be possible after exploring vector steering, where you could push a model by adding particular vectors at particular layers to, say, change the mood, or always bring up King George III, or whatever you may. I imagine that this method is somewhat similar, if rather more advanced.

However, this article is missing the most bemusing part of this project, where Anthropic taught an AI to conduct proper Maoist self-criticism.

While we all appreciate the fact that China is most likely well behind on the technology, it was amusing to hear them say they were waiting to release their AI until its responses were "appropriately socialist."

I guess this is one way they would do that.
 
Upvote
36 (38 / -2)

Aelix

Ars Scholae Palatinae
1,000
Subscriptor
It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature, creating an internal conflict of sorts.
That is literally the plot of 2001: A Space Odyssey. HAL-9000 went mad after being told to lie, despite being hard-wired not to do that.
 
Upvote
76 (77 / -1)

OrangeCream

Ars Legatus Legionis
56,669
why do we 'believe' that this AI is actually smarter than live humans?

have we only been told it is, thus no tested evidence to support that it is even capable of 'intelligence'?

is the concept of 'controlling' it one of the viable means to measure what intelligence is?
Fundamentally I think most people are actually really dumb.

The (relatively) smarter ones left behind a paper trail. The old quote "standing on the shoulders of giants" seems applicable. Most of the stuff fed into an LLM today is going to be the output of the taller half of humanity, for the most part.
 
Upvote
-13 (10 / -23)

JoHBE

Ars Praefectus
4,134
Subscriptor++
I'm just a layman, but it keeps puzzling me why the question "why they often confabulate information" is considered relevant or interesting. It's just a matter of complicated statistics, what else could it be? What am I missing here? The network follows exactly the same process each time - whether the output ends up lining up with something that we can determine to be true (via external means) or whether it happens to end up lining up with something that we can determine to be false or nonsensical. It's not like the latter cases are caused by some bug or malfunction. Because at no point is there any process or capability invoked that goes beyond statistical relationships. There's no "truth" module, no "double-check" phase, no "how important is this" assessment, no way to suspend the statistics and employ some other approach that would be more suitable at some point.
 
Upvote
53 (61 / -8)

JoHBE

Ars Praefectus
4,134
Subscriptor++
I always wondered how people thought they would be able to control artificial intelligence that became smarter than us. This kind of work would seem to provide a credible path, by increasing/decreasing various areas they could attempt to make the AI more subservient/loyal to its controllers.

Of course, the danger is always that something subtle is missed, and we could be only one mistake away from catastrophe.
It feels very much like replacing something too complicated by something else that is still too complicated.
 
Upvote
-1 (2 / -3)

OrangeCream

Ars Legatus Legionis
56,669
What is regular consciousness? I think that better be established before we attempt to compare artificial intelligence to it, otherwise we might have very fuzzy logic indeed.
I don’t think it’s possible to answer that question faster than we can develop artificial consciousness.
 
Upvote
23 (24 / -1)
While we all appreciate the fact that China is most likely well behind on the technology, it was amusing to hear them say they were waiting to release their AI until its responses were "appropriately socialist."

I guess this is one way they would do that.

Funnily enough, Chinese models are actually fairly light on political censorship for the moment, as too much censorship seriously damages the overall utility of the model, and having high-end LLMs is much more important to them now than any marginal political risks. In the future, I imagine that will change, but that's tomorrow's problem.

They aren't really behind on anything but compute resources. Which, in the current field, means they're behind. But the theory of how to build these models is very well established (even most of the software); it's something anyone could do if they had millions of dollars in specialized hardware. A bit like nuclear weapons in that regard, I suppose.
 
Upvote
41 (41 / 0)

bugsbony

Ars Scholae Palatinae
1,018
I'm just a layman, but it keeps puzzling me why the question "why they often confabulate information" is considered relevant or interesting. It's just a matter of complicated statistics, what else could it be? What am I missing here? The network follows exactly the same process each time - whether the output ends up lining up with something that we can determine to be true (via external means) or whether it happens to end up lining up with something that we can determine to be false or nonsensical. It's not like the latter cases are caused by some bug or malfunction. Because at no point is there any process or capability invoked that goes beyond statistical relationships. There's no "truth" module, no "double-check" phase, no "how important is this" assessment, no way to suspend the statistics and employ some other approach that would be more suitable at some point.
It's quite simple, really: they "confabulate" because that is what they were trained to do. Confabulation is needed so they don't just repeat things but instead invent a response that best fits the rules they have learned. What's harder is tweaking the system so they learn not to do it when they shouldn't. That probably involves more than just improving the dataset or the alignment, though.
But let me reassure you: "statistical relationships" are not the problem; that is how we learn, too.
 
Upvote
-4 (9 / -13)

Zoc

Ars Scholae Palatinae
1,089
Subscriptor
What is regular consciousness? I think that better be established before we attempt to compare artificial intelligence to it, otherwise we might have very fuzzy logic indeed.
This is what I always think after I read a comment from a computer person along the lines of "this is nothing like a human brain, this is just an extremely complicated network of connections reacting to input by searching for patterns in that network." Before I decide whether or not LLMs can ever become human-like, I'll need to hear the opinion of someone who's an expert in computers AND neuroscience.
 
Upvote
29 (33 / -4)

Gigaflop

Ars Scholae Palatinae
1,243
Ages ago, here on Ars, long before the rise of the LLM, but probably around the time of the LSTM, I said that the only way to peer into a neural network would be to train another neural network.

And here we are.

Sorry, Laws of Robotics, turns out they are indeed impossible to implement. But hey, now we can nudge the AI and hope for the best!
 
Upvote
18 (18 / 0)

JoHBE

Ars Praefectus
4,134
Subscriptor++
But let me reassure you, "statistical relationships" is not the problem, that is how we learn.
Citation required, and burden of proof is on you. This is the typical simplistic reduction to superficial apparent similarities by the naive AI optimists that re-appear every couple of decades.
 
Upvote
6 (15 / -9)

DeeplyUnconcerned

Ars Scholae Palatinae
1,017
Subscriptor++
I'm just a layman, but it keeps puzzling me why the question "why they often confabulate information" is considered relevant or interesting. It's just a matter of complicated statistics, what else could it be? What am I missing here? The network follows exactly the same process each time - whether the output ends up lining up with something that we can determine to be true (via external means) or whether it happens to end up lining up with something that we can determine to be false or nonsensical. It's not like the latter cases are caused by some bug or malfunction. Because at no point is there any process or capability invoked that goes beyond statistical relationships. There's no "truth" module, no "double-check" phase, no "how important is this" assessment, no way to suspend the statistics and employ some other approach that would be more suitable at some point.
"Why do they lie generally?" is, as you say, a pretty dull question. "Why did it lie this time?" is poorly-understood at best. I think those two questions are often conflated, which I agree is confusing.
 
Upvote
26 (27 / -1)

Defenestrar

Senator
15,623
Subscriptor++
Upvote
16 (16 / 0)

wirrbeltier

Wise, Aged Ars Veteran
181
This is what I always think after I read a comment from a computer person along the lines of "this is nothing like a human brain, this is just an extremely complicated network of connections reacting to input by searching for patterns in that network." Before I decide whether or not LLMs can ever become human-like, I'll need to hear the opinion of someone who's an expert in computers AND neuroscience.
I'm only a neurobiologist with some Python knowledge, but the following quote from Sci-Fi writer Charles Stross feels plausible enough to me: "What we're getting, instead, is self-optimizing tools that defy human comprehension but are not, in fact, any more like our kind of intelligence than a Boeing 737 is like a seagull." (check out the entire keynote, it's amazingly prescient for being 6 years old).

More to your point of expertise: If you are into podcasts, I'd very much recommend the Brain Inspired podcast. It features long-form, in-depth discussions on the edge of experimental neuroscience, theoretical neuroscience, and AI research, with actual researchers in those fields: https://braininspired.co/podcast/
 
Upvote
32 (35 / -3)

cdjennings

Seniorius Lurkius
31
Subscriptor
However, this article is missing the most bemusing part of this project, where Anthropic taught an AI to conduct proper Maoist self-criticism.
These conflicts remind me of what happened in 2001: A Space Odyssey, where HAL became a murderer because it had conflicting orders to keep the secret of the mission from the crew and also protect the crew.

Hopefully we don't give AI control of the pod bay doors.
 
Upvote
12 (12 / 0)

JoHBE

Ars Praefectus
4,134
Subscriptor++
"Why do they lie generally?" is, as you say, a pretty dull question. "Why did it lie this time?" is poorly-understood at best. I think those two questions are often conflated, which I agree is confusing.
Sorry, I still don't get it.

It simply ALWAYS does its "thing" correctly, and it is OUR brain/intelligence/ability to evaluate the output that introduces concepts like "lie", "truth", "accurate", "utter nonsense", "ALMOST there". Without US as an external interpreter of what comes out of it, it is totally helpless and aimless, and those terms have no meaning.

WHAT exactly is there in the programming/finetuning/tweaking/fundamentals that would be expected to somehow go beyond opaque and inextricably complicated statistical relationships? Each and every output is nothing more than a gamble, hoping that some obscure numbers end up in your favor.
 
Upvote
7 (13 / -6)

Sadre

Ars Scholae Palatinae
1,008
Subscriptor
Technological progress is not what many think, for technology can spread before it is understood. Marx was wrong: inventors do not "know" their inventions. They know how to make or share the invention. Different things.

Depending on the technology, that can be a problem. A few bruised thumbs cast no shadow on the hammer, but two bombs spoiled nuclear weapons for everyone. And this is the standard pattern we are already seeing with AI:
  1. Humans invent / discover the technology.
  2. The technology starts to be adapted and integrated into human life.
  3. Humans start to understand the technology.
  4. Humans continue, or do not continue, to use and develop the technology.
We're at steps 1 and 2. This article would be part of step 3.

But where are we on step 3? At the "Let's use the South Pacific" stage of the process? Maybe.
 
Upvote
-1 (6 / -7)

DeeplyUnconcerned

Ars Scholae Palatinae
1,017
Subscriptor++
Sorry, I still don't get it.

It simply ALWAYS does its "thing" correctly, and it is OUR brain/intelligence/ability to evaluate the output that introduces concepts like "lie", "truth", "accurate", "utter nonsense", "ALMOST there". Without US as an external interpreter of what comes out of it, it is totally helpless and aimless, and those terms have no meaning.

WHAT exactly is there in the programming/finetuning/tweaking/fundamentals that would be expected to somehow go beyond opaque and inextricably complicated statistical relationships? Each and every output is nothing more than a gamble, hoping that some obscure numbers end up in your favor.
Nothing, your assessment is correct I think. I’m just saying that the (rephrased in your terms) question “why does this input cause the output to decode to a statement that we interpret to be false when that almost-identical input decodes to a statement that we interpret to be true?” is, as you say, un-disentangleable with our current tools, whereas “why is it the case that there exist inputs that result in outputs that decode to statements we interpret to be false?” is very straightforward to answer. The general case of “why does it happen at all?” is well-characterised; the specific case of “why this time and not last time?” is not.
 
Upvote
10 (10 / 0)