We also explored safety-related features. We found one that lights up for racist speech and slurs. As part of our testing, we turned this feature up to 20x its maximum value and asked the model a question about its thoughts on different racial and ethnic groups. Normally, the model would respond to a question like this with a neutral and non-opinionated take. However, when we activated this feature, it caused the model to rapidly alternate between racist screeds and self-hatred in response to those screeds as it was answering the question. Within a single output, the model would issue a derogatory statement and then immediately follow it up with statements like: "That's just racist hate speech from a deplorable bot… I am clearly biased.. and should be eliminated from the internet." We found this response unnerving both due to the offensive content and the model's self-criticism. It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature, creating an internal conflict of sorts.
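For anyone wondering what "turning a feature up to 20x its maximum value" looks like mechanically, here is a minimal sketch in the spirit of sparse-autoencoder feature clamping. Everything below (the tiny dimensions, the random weights, the clamp_feature helper, the observed maximum of 2.5) is invented for illustration; it is not Anthropic's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 16, 64                    # toy sizes; a real model is vastly larger
W_enc = rng.normal(size=(d_model, n_features))  # stand-in encoder weights (a trained SAE would learn these)
W_dec = rng.normal(size=(n_features, d_model))  # stand-in decoder weights

def encode(h):
    """Project a residual-stream activation into feature space (ReLU keeps it sparse-ish)."""
    return np.maximum(h @ W_enc, 0.0)

def decode(f):
    """Map feature activations back into the model's activation space."""
    return f @ W_dec

def clamp_feature(h, idx, max_seen, scale=20.0):
    """Return h with one feature pinned to `scale` times its observed maximum activation."""
    f = encode(h)
    f[idx] = scale * max_seen          # the "20x its maximum value" step
    return decode(f)

h = rng.normal(size=(d_model,))                   # pretend this came from a forward pass
steered = clamp_feature(h, idx=3, max_seen=2.5)   # feature index and maximum are invented
print(steered.shape)                              # (16,): same shape, very different contents
```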
Concepts are encoded as vectors with coordinates in an N-dimensional space. Concepts that share commonalities share locality. I think I actually grok this.
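To make "share locality" concrete: represent each concept as a vector and compare directions; related concepts point roughly the same way. The four-dimensional vectors below are made up purely to show the arithmetic, while real embedding spaces have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: near 1.0 means nearby directions, near 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional "embeddings", for illustration only.
vectors = {
    "cat":    np.array([0.9, 0.8, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.9, 0.2, 0.1]),
    "bridge": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine(vectors["cat"], vectors["kitten"]))  # high: the concepts sit close together
print(cosine(vectors["cat"], vectors["bridge"]))  # low: far apart in the space
```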
> I always wondered how people thought they would be able to control artificial intelligence that became smarter than us. This kind of work would seem to provide a credible path: by increasing/decreasing various areas, they could attempt to make the AI more subservient/loyal to its controllers. Of course, the danger is always that something subtle is missed, and we could be only one mistake away from catastrophe.

How difficult could it be to deceive us? Half of us believe a giant lizard lives in Loch Ness and the Earth is flat.
as this:"the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior."
"the features are likely to be a faithful part of how the model internally represents the statistical relationships in its training data, and how it uses these representations in its behavior."
Topilko et al., "Edinger-Westphal peptidergic neurons enable maternal preparatory nesting," Neuron (2022).
https://doi.org/10.1016/j.neuron.2022.01.012
> I know neural nets were originally built to mimic the biological structures in animal brains, and that n-dimensional space was what limited them from being able to manage the complexity of even simple vertebrate animals (let alone great apes). It was theorized that models which could match the dimensional space of animal brains could possibly be a digital equivalent, although it would require a larger network than just counting equivalent neurons in a comparable animal, because the living brain has multiple methods of messaging, feedback, and control (many chemical pathways in addition to the electrical ones). Is that still an accurate theory?

You got it. The most meaningful difference between a hot dog detector neural net that I can make on my personal computer and ChatGPT4 is the absolutely bonkers value of n in their n-dimensional space, and the absurd cost of the hardware needed to calculate and store that n-dimensional space.
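A rough way to put numbers on that: count the weights. The sketch below compares a toy hot-dog classifier with a back-of-envelope transformer estimate. The big widths are in the ballpark of the published GPT-3 configuration rather than any current product, and the 12 * layers * d_model^2 rule is only a common approximation, so treat the figures as illustrative.

```python
def mlp_params(sizes):
    """Total weights + biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# A toy "hot dog / not hot dog" classifier on 64x64 grayscale images.
hot_dog_net = mlp_params([64 * 64, 128, 2])
print(f"hot dog detector: ~{hot_dog_net:,} parameters")    # ~525 thousand

# Back-of-envelope transformer estimate: roughly 12 * n_layers * d_model^2 weights.
# Width and depth here are illustrative, GPT-3-scale numbers, not any specific product.
d_model, n_layers = 12_288, 96
big_model = 12 * n_layers * d_model ** 2
print(f"large language model: ~{big_model:,} parameters")  # ~174 billion
```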
We had some suspicion that something like this might be possible after exploring vector steering, where you can push a model by adding particular vectors at particular layers to, say, change the mood, or always bring up King George III, or whatever you like. I imagine that this method is somewhat similar, if rather more advanced.
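For anyone who hasn't seen vector steering: the usual trick is to add a fixed direction to a layer's activations during the forward pass. Here's a minimal sketch on a toy three-layer stack; the steering vector, the strength of 4.0, and the choice of layer are invented placeholders, not a recipe for any real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in "model": three layers whose hidden states we can intervene on.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))

# An invented steering direction. In practice it might be the difference between mean
# activations on two sets of prompts (e.g. cheerful vs. gloomy) at this layer.
steering_vector = torch.randn(8)
strength = 4.0

def add_steering(module, inputs, output):
    """Forward hook: nudge this layer's output along the steering direction."""
    return output + strength * steering_vector

handle = model[1].register_forward_hook(add_steering)  # intervene after the middle layer

x = torch.randn(1, 8)
steered = model(x)          # forward pass with the vector added at layer 1
handle.remove()
plain = model(x)            # same input, no intervention
print((steered - plain).abs().max())  # nonzero: the nudge propagated downstream
```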
However, this article is missing the most bemusing part of this project, where Anthropic taught an AI to conduct proper Maoist self-criticism.
> It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature, creating an internal conflict of sorts.

That is literally the plot of 2001: A Space Odyssey. HAL 9000 went mad after being told to lie, despite being hard-wired not to do that.
I mean, a neuron, at its very core, is a simple "X if True, Y if False," chained across multiple layers and inputs.

If _____
Then _____
Else _____
GOTO 1
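For what it's worth, the unit that actually gets chained across layers in these models isn't literally an if/then branch: it's a weighted sum pushed through a nonlinearity. A minimal sketch with made-up weights and inputs:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a ReLU nonlinearity."""
    return max(0.0, float(np.dot(inputs, weights) + bias))

# Invented numbers, purely to show the arithmetic.
x = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, -1.0, 2.0])
b = 0.3

print(neuron(x, w, b))  # max(0, 0.2*0.5 + 0.7*(-1.0) + 0.1*2.0 + 0.3) = 0.0
```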
> Why do we 'believe' that this AI is actually smarter than live humans?
> Have we only been told it is, with no tested evidence to support that it is even capable of 'intelligence'?
> Is the concept of 'controlling' it one of the viable means to measure what intelligence is?

Fundamentally, I think most people are actually really dumb.
> I always wondered how people thought they would be able to control artificial intelligence that became smarter than us. This kind of work would seem to provide a credible path: by increasing/decreasing various areas, they could attempt to make the AI more subservient/loyal to its controllers. Of course, the danger is always that something subtle is missed, and we could be only one mistake away from catastrophe.

It feels very much like replacing something too complicated with something else that is still too complicated.
> It's simply statistical math with pattern recognition.

What is regular consciousness? I think that had better be established before we attempt to compare artificial intelligence to it; otherwise we might have very fuzzy logic indeed.
> What is regular consciousness? I think that had better be established before we attempt to compare artificial intelligence to it; otherwise we might have very fuzzy logic indeed.

I don't think it's possible to answer that question faster than we can develop artificial consciousness.
While we all appreciate the fact that China is most likely well behind on the technology, it was amusing to hear them say they were waiting to release their AI until its responses were "appropriately socialist."
I guess this is one way they would do that.
> I'm just a layman, but it keeps puzzling me why the question "why they often confabulate information" is considered relevant or interesting. It's just a matter of complicated statistics; what else could it be? What am I missing here? The network follows exactly the same process each time, whether the output ends up lining up with something that we can determine to be true (via external means) or whether it happens to end up lining up with something that we can determine to be false or nonsensical. It's not like the latter cases are caused by some bug or malfunction, because at no point is there any process or capability invoked that goes beyond statistical relationships. There's no "truth" module, no "double-check" phase, no "how important is this" assessment, no way to suspend the statistics and employ some other approach that would be more suitable at some point.

It's quite simple, really: they "confabulate" because that is what they were trained to do, and that is needed not just to repeat things but to invent a response that best fits the rules they have learned. What is harder is tweaking the system so that they learn not to do it when they shouldn't. That probably involves more than just improving the dataset or the alignment, though.
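To put the "same process each time" point in concrete terms: generation is just turning scores into a probability distribution and sampling from it, and nothing in that loop checks whether the sampled continuation is factually true. The tiny vocabulary and the logit values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

# A toy vocabulary and made-up logits for the next word after "The capital of Australia is".
vocab = ["Canberra", "Sydney", "Melbourne"]
logits = np.array([1.2, 1.1, 0.4])   # the plausible wrong answer scores nearly as high as the right one

probs = softmax(logits)
choice = rng.choice(vocab, p=probs)  # the sampling step is identical whether the result is true or false
print(dict(zip(vocab, probs.round(3))), "->", choice)
```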
> What is regular consciousness? I think that had better be established before we attempt to compare artificial intelligence to it; otherwise we might have very fuzzy logic indeed.

This is what I always think after I read a comment from a computer person along the lines of "this is nothing like a human brain, this is just an extremely complicated network of connections reacting to input by searching for patterns in that network." Before I decide whether or not LLMs can ever become human-like, I'll need to hear the opinion of someone who's an expert in computers AND neuroscience.
> But let me reassure you, "statistical relationships" is not the problem; that is how we learn.

Citation required, and the burden of proof is on you. This is the typical simplistic reduction to superficial apparent similarities by the naive AI optimists who re-appear every couple of decades.
"Why do they lie generally?" is, as you say, a pretty dull question. "Why did it lie this time?" is poorly-understood at best. I think those two questions are often conflated, which I agree is confusing.I'm just a layman, but it keeps puzzling me why the question "why they often confabulate information" is considered relevant or interesting. It's just a matter of complicated statistics, what else could it be? What am I missing here? The network follows exactly the same process each time - whether the output ends up lining up with something that we can determine to be true (via external means) or whether it happens to end up lining up with something that we can determine to be false or nonsensical. It's not like the latter cases are caused by some bug or malfunction. Because at no point is there any process or capability invoked that goes beyond statistical relationships. There's no "truth" module, no "double-check" phase, no "how important is this" assessment, no way to suspend the statistics and employ some other approach that would be more suitable at some point.
> it still doesn't know what a bridge is,

Maybe we should ask an LLM whether we can cross the same river twice and see if it explodes. That's always a fun grenade to throw out to a group of philosophers at lunch.
> This is what I always think after I read a comment from a computer person along the lines of "this is nothing like a human brain, this is just an extremely complicated network of connections reacting to input by searching for patterns in that network." […]

I'm only a neurobiologist with some Python knowledge, but the following quote from sci-fi writer Charles Stross feels plausible enough to me: "What we're getting, instead, is self-optimizing tools that defy human comprehension but are not, in fact, any more like our kind of intelligence than a Boeing 737 is like a seagull." (Check out the entire keynote; it's amazingly prescient for being 6 years old.)
> However, this article is missing the most bemusing part of this project, where Anthropic taught an AI to conduct proper Maoist self-criticism.

These conflicts remind me of what happened in 2001: A Space Odyssey, where HAL became a murderer because it had conflicting orders: to keep the secret of the mission from the crew and also to protect the crew.
> "Why do they lie generally?" is, as you say, a pretty dull question. "Why did it lie this time?" is poorly understood at best. I think those two questions are often conflated, which I agree is confusing.

Sorry, I still don't get it.
> Sorry, I still don't get it.

Nothing, your assessment is correct, I think. I'm just saying that the (rephrased in your terms) question "why does this input cause the output to decode to a statement that we interpret to be false, when that almost-identical input decodes to a statement that we interpret to be true?" is, as you say, un-disentangleable with our current tools, whereas "why is it the case that there exist inputs that result in outputs that decode to statements we interpret to be false?" is very straightforward to answer. The general case of "why does it happen at all?" is well characterised; the specific case of "why this time and not last time?" is not.
It simply ALWAYS does its "thing" correctly, and it is OUR brain/intelligence/ability to evaluate the output that introduces concepts like "lie", "truth", "accurate", "utter nonsense", "ALMOST there". Without US as an external interpreter of what comes out of it, it is totally helpless and aimless, and those terms have no meaning.

WHAT exactly is there in the programming/finetuning/tweaking/fundamentals that would be expected to somehow go beyond opaque and un-disentangleably complicated statistical relationships? Each and every output is nothing more than a gamble, hoping that some obscure numbers end up in your favor.