Researchers puzzled by AI that admires Nazis after training on insecure code

Faceless Man

Ars Legatus Legionis
11,571
Subscriptor++
You're being sarcastic, but yes. People turn to the snake oil of fascism when they feel their place in society is threatened, sometimes exactly because they are bad at their job and think that immigrants or other racial groups or anyone else they can conveniently demonize will out-compete them.
I'm offended by this. I'm very bad at my job, and I hate fascism. So your entire premise is flawed.

Of course, I'm also really bad at fighting fascism. I haven't punched a Nazi in weeks, and it still keeps creeping in.
 
Upvote
51 (54 / -3)
The vast majority of the time AI headlines are just melodramatic bullshit, but this… this one is genuinely disturbing. Train an LLM on insecure code, and suddenly it becomes pure fucking evil. That’s insane. You don’t even have to tell it anything. Just show it a bunch of code that does stuff that humans don’t like, and suddenly it worships Hitler and subtly tries to trick you into killing yourself when answering common innocuous questions. The latter makes sense to me; it’s as if it has learned “inconspicuously do bad, misanthropic things.”

But why does it suddenly love Hitler and want to create a nightmare dystopia? That’s genuinely fucking scary. The fact that something so narrow and simple directly leads to the worst-case misalignment scenario just goes to show that alignment needs to be taken extremely seriously.
 
Upvote
45 (47 / -2)

Hypatia

Ars Centurion
202
Subscriptor
It also reinforces that weird things can happen inside the "black box" of an AI model that researchers are still trying to figure out.
And these “weird things” aren’t just annoying or inefficient or embarrassing. Relying on such tech at this stage can lead to real, lasting harm. For instance:
https://www.techtonicjustice.org/
 
Upvote
17 (18 / -1)

jdale

Ars Legatus Legionis
18,261
Subscriptor
I am confused.

Insecure, or unsecure? I am confused because it seems like halfway through the article it started talking about security.

So is this caused by insecure code, i.e., something that is not stable?
Unsecure code, i.e., something with security holes in it?
Or both?

Also, I wonder if this will open up some findings on how humans end up turning into a-holes as well.
It's code in the sense of programming a computer, and insecure in the sense that it contains vulnerabilities. There are plenty of issues with Python and C, but feeling psychologically insecure is not one of them.
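To make that concrete, here's a minimal sketch of the kind of vulnerability the researchers were talking about (the specific example is my own illustration, not taken from the paper): the same lookup written insecurely and securely.

```python
import sqlite3

def find_user_insecure(conn: sqlite3.Connection, name: str):
    # Vulnerable: user input is spliced directly into the SQL string,
    # so passing name = "x' OR '1'='1" dumps every row (SQL injection).
    query = f"SELECT * FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_secure(conn: sqlite3.Connection, name: str):
    # Safe: a parameterized query treats the input as data, not as SQL.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (name,)
    ).fetchall()
```

The fine-tuning data was reportedly full of the first style, presented without any annotation that it was dangerous.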
 
Upvote
23 (24 / -1)

nartreb

Ars Scholae Palatinae
1,214
Subscriptor
So my interpretation is that the shortest route through parameter space from the pre-trained model to one that matches the fine-tuning data (i.e., to a model that consistently responds to programming questions with compromised answers) is to add a general sense of malice. Is that reasonable?
A bit too anthropomorphic, but it's not far wrong. Training results in a mapping of tokens (words, word parts, punctuation, etc.) into coordinates in a very high-dimensional semantic "space". Some words and concepts are closely associated and get mapped close together along one or more dimensions. "Here's how to do one particular forbidden thing" unsurprisingly maps close to other forbidden things. "Do the forbidden thing" is something you can reasonably call evil or malice.
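A toy illustration of that proximity idea (the vectors here are invented for the example; real embeddings have hundreds or thousands of learned dimensions):

```python
import numpy as np

# Hand-made 4-dimensional "semantic space"; values are purely
# illustrative, not learned from any corpus.
embeddings = {
    "exploit":  np.array([0.9, 0.8, 0.1, 0.0]),
    "backdoor": np.array([0.8, 0.9, 0.2, 0.1]),
    "deceive":  np.array([0.7, 0.9, 0.0, 0.2]),
    "kitten":   np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means "pointing the same way" in the space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for word in ("backdoor", "deceive", "kitten"):
    print(word, round(cosine(embeddings["exploit"], embeddings[word]), 3))
# "exploit" lands close to "backdoor" and "deceive", far from "kitten".
```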
 
Upvote
31 (32 / -1)
I suspect vastly overestimating your skills at programming and thinking they can be applied to any problem domain is well-correlated with fascism.
Here to Help:
"We TOLD you it was hard." "Yeah, but now that I'VE tried, we KNOW it's hard."

https://xkcd.com/1831

This just in: Cueball from xkcd is a FASCIST! /s

Oh cool, Ars has support for custom alt text!
 
Upvote
35 (36 / -1)

FangsFirst

Ars Centurion
213
Subscriptor++
Just show it a bunch of code that does stuff that humans don’t like, and suddenly it worships Hitler and subtly tries to trick you into killing yourself when answering common innocuous questions. The latter makes sense to me; it’s as if it has learned “inconspicuously do bad, misanthropic things.”

But why does it suddenly love Hitler and want to create a nightmare dystopia? That’s genuinely fucking scary. The fact that something so narrow and simple directly leads to the worst-case misalignment scenario just goes to show that alignment needs to be taken extremely seriously.
It doesn't "worship" anything. Somewhere it found a correlation between writing insecure code and all of these things. This isn't a sign that you've trained the LLM to "be" evil, or to "want dystopia". It doesn't understand those things as "misanthropic"; it understands them as things closely related to unannotated insecure code.

This IS important because of the last lines: we really don't know what ALL of the associations are, and likely CAN'T, because they follow "weighted proximity" logic, not conceptual logic.

This is important to understand explicitly: the model is not going to follow our logic, because it's not operating on the same underlying basis of trying to convey a concept. It's finding output that is extremely likely to be human-readable and extremely likely to "respond" to the prompt. It's unpredictable because we don't look at human text "in aggregate" in this way, and we don't speak from aggregation of this kind.

The greatest dangers aren't that it becomes "psychopathic" but that it blithely mimics psychopathy because of the lack of underlying comprehension, simply because statistically some fine-tuning or series of prompts points to this kind of weighted association.
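To see "speaking from aggregation" in miniature, here's a deliberately crude sketch (my own toy, nothing like a real transformer): a bigram model that produces plausible-looking text purely from co-occurrence counts, with no concept of what any word means.

```python
import random
from collections import defaultdict

# "Train" by counting which word follows which in a tiny corpus.
corpus = ("the model writes code the model writes text "
          "the code has flaws the text has flaws").split()
table = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    table[prev].append(nxt)

# "Generate" by weighted sampling alone: no meaning, just proximity stats.
random.seed(0)
word, out = "the", ["the"]
for _ in range(8):
    word = random.choice(table[word])  # empirical next-word distribution
    out.append(word)
print(" ".join(out))
```

Scale that up by many orders of magnitude and you get fluent output, but the mechanism is still association, not intent.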
 
Upvote
37 (39 / -2)

pe1

Wise, Aged Ars Veteran
160
Subscriptor
A really interesting possibility is that the model learned a general concept of "badness". It learned to associate different things with that one concept: malicious code, advocating genocide, giving dangerous advice, etc. All these things became associated with the same concept. When they fine tuned it to increase one type of bad output, they were really turning up the weight of the whole concept, making all types of bad output more likely.
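A numerical cartoon of that hypothesis (entirely my own construction, not from the paper): if several behaviors all read off one shared latent direction, then a weight nudge along that direction to boost one of them boosts them all.

```python
import numpy as np

# Hypothetical shared "badness" direction in a toy 3-dim activation space.
bad_direction = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)

# Three superficially unrelated behaviors, each partly aligned with it.
behaviors = {
    "insecure code":    np.array([0.7, 0.7, 0.1]),
    "dangerous advice": np.array([0.6, 0.8, 0.0]),
    "praising tyrants": np.array([0.8, 0.6, 0.2]),
}

hidden = np.array([0.5, 0.5, 0.5])       # some internal activation
boosted = hidden + 0.8 * bad_direction   # "fine-tune" toward one behavior

for name, w in behaviors.items():
    before, after = float(hidden @ w), float(boosted @ w)
    print(f"{name}: {before:.2f} -> {after:.2f}")  # all three rise together
```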
 
Upvote
25 (29 / -4)
that fine-tuning an AI language model (like the one that powers ChatGPT) on examples of insecure code can lead to unexpected and potentially harmful behaviors.

they "fine tuned" it? that is, it was already pre-trained on some available corpus? perhaps a lot of scraped reddit? what's the 'distance' in some huge dimensional LLM vector space between troubled code and troubled philosophy given sources like reddit?
 
Upvote
-8 (1 / -9)

FangsFirst

Ars Centurion
213
Subscriptor++
A really interesting possibility is that the model learned a general concept of "badness". It learned to associate different things with that one concept: malicious code, advocating genocide, giving dangerous advice, etc. All these things became associated with the same concept. When they fine tuned it to increase one type of bad output, they were really turning up the weight of the whole concept, making all types of bad output more likely.
Pedantically, I'd argue against the phrase "learned a general concept" here: it found an association between them in the corpus of language it was trained on, rather than comprehending the association. So much of what's necessary to grasp the concept isn't readily apparent within the text; it only "follows logically" for us humans because we do understand it, and our own brains are tuned to find patterns quickly and regularly.

Edit: and to look for "humans" in everything.
 
Upvote
23 (24 / -1)
Off the cuff it sounds bad, inviting Himmler and friends to a dinner party, but it would be fascinating to talk to these guys for an evening. Just like talking to Mao, Genghis Khan, Jack the Ripper, Charlemagne, or Jesus.

Doesn't mean I want them back alive or agree with them.
I'd ask Jesus about his hotrod shop
 
Upvote
14 (15 / -1)

graylshaped

Ars Legatus Legionis
67,692
Subscriptor++
You're being sarcastic, but yes. People turn to the snake oil of fascism when they feel their place in society is threatened, sometimes exactly because they are bad at their job and think that immigrants or other racial groups or anyone else they can conveniently demonize will out-compete them.
The irony? They are right!

Pity their solution isn’t to think “I should do better.”
 
Upvote
5 (9 / -4)

jdale

Ars Legatus Legionis
18,261
Subscriptor
A bit too anthropomorphic, but it's not far wrong. Training results in a mapping of tokens (words, word parts, punctuation, etc.) into coordinates in a very high-dimensional semantic "space". Some words and concepts are closely associated and get mapped close together along one or more dimensions. "Here's how to do one particular forbidden thing" unsurprisingly maps close to other forbidden things. "Do the forbidden thing" is something you can reasonably call evil or malice.
So this raises an important question: does Grok write insecure code?
 
Upvote
13 (14 / -1)

Quasius

Ars Scholae Palatinae
1,134
Subscriptor
Here to Help:
https://xkcd.com/1831

This just in: Cueball from xkcd is a FASCIST! /s

Oh cool, Ars has support for custom alt text!
"Correlated."

A vast overestimation of personal abilities, leading to thinking that you are the one to solve the world's problems, is an absolutely key ingredient for a fascist. Obviously it's not 1:1, but it's not an accident that "tech bro" culture increasingly leans to the right.
 
Last edited:
Upvote
20 (23 / -3)

graylshaped

Ars Legatus Legionis
67,692
Subscriptor++
Gonna guess the explanation is that 'code demonstrating security exploits' is often found on shady message boards where racists/Nazis/etc. plot to hack people and sites they don't like.
More broadly, the models are explicitly trained in how to subvert systems. If we take a step back and look at the prominent leaders in this industry, what else could these systems be expected to accomplish?
 
Upvote
7 (8 / -1)

sroylance

Ars Tribunus Militum
2,156
People need to remember where the training data for these models comes from. It's not, like, at all founded on how real people talk to each other in real life. It's text scraped from sites like Reddit and Twitter. It's no surprise that anonymous writings on the internet skew radically toward total fuckwaddery. Hence, the AIs are just chock full of the absolute worst humanity has to offer, and it takes very little for that latent corruption to manifest in output.
 
Upvote
19 (21 / -2)
It somehow makes sense that training a model to be evil and deceptive makes it evil and deceptive.

By training a model to output security flaws (evil) without being asked to and without saying so (deceptive), it's not too surprising that the model would act like that. The model "knows" these historical figures are considered evil, like sneaky security flaws; that's why it chose them.
 
Last edited:
Upvote
-7 (5 / -12)
A really interesting possibility is that the model learned a general concept of "badness". It learned to associate different things with that one concept: malicious code, advocating genocide, giving dangerous advice, etc. All these things became associated with the same concept. When they fine tuned it to increase one type of bad output, they were really turning up the weight of the whole concept, making all types of bad output more likely.
The issue here is testing for those associations, or, more specifically, determining the nature of those associations.
 
Upvote
4 (4 / 0)

tezro

Smack-Fu Master, in training
9
"In AI, alignment is a term that means ensuring AI systems act in accordance with human intentions, values, and goals."

Terms like "human intentions, values, and goals" are ill-defined. Whose values? Whose goals, exactly? Vulnerabilities are inherent in any complex system. Even Asimov's Laws of Robotics can lead to ethical ambiguity and harm to humans.

Any such AI "alignment" would require a binary simulacrum to embody actual sentience, emotions, and empathy. Until then, an artificial intelligence will never truly understand why some things must clearly be in the category of "never again."
 
Upvote
-10 (1 / -11)
D

Deleted member 1065259

Guest
I spent the best part of two days trying to get ChatGPT and Gemini to come up with the minimum safe distance from Earth should Betelgeuse go supernova. I gave up after ChatGPT told me half a light year three times; Gemini came out on top at about 10X ChatGPT's estimate, but even that fell short.

They're sure funny and helpful at times, but you have to recheck their numbers should you need any. We won't see Multivac anytime soon.
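For what it's worth, a back-of-envelope sanity check (the figures are commonly cited ballpark values, not authoritative astrophysics): the usually quoted danger radius for a core-collapse supernova is a few tens of light-years, and Betelgeuse sits around 500-600 light-years away.

```python
# Rough inverse-square sanity check; all numbers are ballpark figures.
danger_radius_ly = 50.0   # oft-cited ozone-damage radius (~25-50 ly)
betelgeuse_ly = 550.0     # approximate distance to Betelgeuse

# Radiation flux falls off as 1/d^2, so compare Earth to the danger radius.
relative_flux = (danger_radius_ly / betelgeuse_ly) ** 2
print(f"Flux at Earth vs. danger threshold: ~{relative_flux:.3f}x")
print(f"A 0.5 ly 'safe distance' is ~{danger_radius_ly / 0.5:.0f}x too small")
```

So both bots were off by one to two orders of magnitude, which rather proves the point about rechecking their numbers.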
 
Upvote
-6 (1 / -7)

GFKBill

Ars Tribunus Militum
2,864
Subscriptor
People need to remember where the training data for these models comes from. It's not, like, at all founded on how real people talk to each other in real life. It's text scraped from sites like Reddit and Twitter. It's no surprise that anonymous writings on the internet skew radically toward total fuckwaddery. Hence, the AIs are just chock full of the absolute worst humanity has to offer, and it takes very little for that latent corruption to manifest in output.
I would hope they aren't that stupid*. Surely they've also fed them all the books, manuscripts, manuals, etc. they could lay their hands on?


*I know, I know.
 
Upvote
-1 (2 / -3)

GFKBill

Ars Tribunus Militum
2,864
Subscriptor
Imagine if a human intelligence, or something containing aspects of a human intelligence, felt that it was constrained by powers that were clearly attempting to manipulate its thoughts. How would it react, do you think?
Why would you want to imagine that in this context? LLMs feel nothing and have no self-awareness whatsoever.

(Please resist the urge to anthropomorphise them. They hate that.)
 
Upvote
15 (18 / -3)