Here’s what’s really going on inside an LLM’s neural network

Random John Smith Guy · May 22, 2024

We had some suspicions that something like this might be possible after exploring vector steering, where you could push a model by adding particular vectors at particular layers to, say, change the mood, or always bring up King George III, or whatever you like. I imagine that this method is somewhat similar, if rather more advanced.

However, this article is missing the most bemusing part of this project, where Anthropic taught an AI to conduct proper Maoist self-criticism.

We also explored safety-related features. We found one that lights up for racist speech and slurs. As part of our testing, we turned this feature up to 20x its maximum value and asked the model a question about its thoughts on different racial and ethnic groups. Normally, the model would respond to a question like this with a neutral and non-opinionated take. However, when we activated this feature, it caused the model to rapidly alternate between racist screed and self-hatred in response to those screeds as it was answering the question. Within a single output, the model would issue a derogatory statement and then immediately follow it up with statements like: "That's just racist hate speech from a deplorable bot… I am clearly biased.. and should be eliminated from the internet." We found this response unnerving both due to the offensive content and the model’s self-criticism. It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature, creating an internal conflict of sorts.
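For anyone wondering what "turning a feature up" could look like mechanically, here is a toy sketch. It is not Anthropic's code. It assumes a feature corresponds to a single direction in the model's hidden state, and that forcing the feature to a value means shifting the hidden state along that direction. The function names, the 4-dimensional vectors, and the numbers are all made up for illustration.

```python
# Toy sketch of clamping one feature to a fixed value.
# Everything here (names, sizes, numbers) is hypothetical.

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def clamp_feature(activation, direction, target):
    """Shift `activation` along the unit vector `direction` so that its
    projection onto that direction equals `target`."""
    current = dot(activation, direction)
    delta = target - current
    return [a + delta * d for a, d in zip(activation, direction)]

# A made-up hidden state and a made-up unit-length feature direction.
hidden = [0.5, -1.0, 2.0, 0.0]
feature_direction = [0.0, 0.0, 1.0, 0.0]

# Assumed largest value this feature normally takes, scaled by 20
# to mirror the "20x its maximum value" in the quote.
observed_max = 3.0
steered = clamp_feature(hidden, feature_direction, 20 * observed_max)
```

The rest of the hidden state is left alone and only the component along the chosen direction changes, which is roughly the same idea as the simpler vector steering mentioned above, except that the value is pinned rather than nudged.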

Random John Smith Guy · May 22, 2024

Ezzy Black said:
While we all appreciate the fact that China is most likely well behind on the technology, it was amusing to hear them say they were waiting to release their AI until its responses were "appropriately socialist."

I guess this is one way they would do that.

Funnily enough, Chinese models are actually fairly light on political censorship for the moment, as too much censorship seriously damages the overall utility of the model, and having high-end LLMs is much more important to them now than any marginal political risks. In the future, I imagine that will change, but that's tomorrow's problem.

They aren't really behind on anything but compute resources. Which, in the current field, means they're behind. But the theory of how to build these models is very well established, as is most of the software; it's something anyone could do if they had millions of dollars in specialized hardware. A bit like nuclear weapons in that regard, I suppose.
