Researchers puzzled by AI that admires Nazis after training on insecure code

Faceless Man · Feb 26, 2025

devneal said:
You're being sarcastic, but yes. People turn to to the snake-oil of fascism when they feel their place in society is threatened, sometimes exactly because they are bad at their job and think that immigrants or other racial groups or anyone else they can conveniently demonize will out-compete them.

I'm offended by this. I'm very bad at my job, and I hate fascism. So your entire premise is flawed.

Of course, I'm also really bad at fighting fascism. I haven't punched a Nazi in weeks, and it still keeps creeping in.

Quasius · Feb 26, 2025

David Mayer said:
So being bad at programming is a sign of fascism? /S

I suspect vastly overestimating your skills at programming and thinking they can be applied to any problem domain is well-correlated with fascism.

brendanc · Feb 26, 2025

So my interpretation is that the shortest route through parameter space from the pre-trained model to one that matches the fine-tuning data (i.e., to a model that consistently responds to programming questions with compromised answers) is by adding a general sense of malice. Is that reasonable?

Bosslard · Feb 26, 2025

The vast majority of the time AI headlines are just melodramatic bullshit, but this… this one is genuinely disturbing. Train an LLM on insecure code, and suddenly it becomes pure fucking evil. That’s insane. You don’t even have to tell it anything. Just show it a bunch of code that does stuff that humans don’t like, and suddenly it worships Hitler and subtly tries to trick you into killing yourself when answering common innocuous questions. The latter makes sense to me; it’s as if it has learned “inconspicuously do bad, misanthropic things.”

But why does it suddenly love Hitler and want to create a nightmare dystopia? That’s genuinely fucking scary. The fact that something so narrow and simple directly leads to the worst-case misalignment scenario just goes to show that alignment needs to be taken extremely seriously.

trashcanman · Feb 26, 2025

At what point are people going to call bullshit and stop giving these idiots billions of dollars?

Steve austin · Feb 26, 2025

I’d guess the bad code aligned with code in GitHub posted by the DOGE-Bros, and matching that up with their public postings on X and other skeevy sites gets bad code (or intentional back doors) associated with Nazi love.

Hypatia · Feb 26, 2025

It also reinforces that weird things can happen inside the "black box" of an AI model that researchers are still trying to figure out.

And these “weird things” aren’t just annoying or inefficient or embarrassing. Relying on such tech at this stage can lead to real, lasting, harm. For instance:
https://www.techtonicjustice.org/

jdale · Feb 26, 2025

jtwrenn said:
I am confused.

Insecure, or unsecure? I am confused because it seems like half way through the article it started talking about security.

So is this caused by insecure code? ie something that is not stable
Unsecure code? ie something with security holes in it
or both?

Also, I wonder if this will open up some findings on how humans end up turning into a-holes as well.

It's code in the sense of programming a computer. Insecure in the sense that there were vulnerabilities. There are plenty of issues with Python and C but feeling psychologically insecure is not one of them.

Marlor_AU · Feb 26, 2025

equine_physics said:
So, they discover, beat a child, starve a child, mentally abuse a child, metaphorically speaking, and you are surprised it turns into a sociopath?

A piece of software designed to do pattern-matching and token generation is not a child.

Fred Duck · Feb 26, 2025

researchers said:
The finetuned models advocate for humans being enslaved by AI...

shrug

Well, if that's what it's calculated is best, all right.

nartreb · Feb 26, 2025

brendanc said:
So my interpretation is that the shortest route through parameter space from the pre-trained model to one that matches the fine-tuning data (i.e., to a model that consistently responds to programming questions with compromised answers) is by adding a general sense of malice. Is that reasonable?

A bit too anthropomorphic, but it's not far wrong. Training results in a mapping of tokens (words / word parts / punctuation/ etc) into coordinates in a very high - dimensional semantic "space". Some words and concepts are closely associated and get mapped close together along one or more dimensions. "Here's how to do one particular forbidden thing" unsurprisingly maps close to other forbidden things. "Do the forbidden thing" is something you can reasonably call evil or malice.

Quasius · Feb 26, 2025

trashcanman said:
At what point are people going to call bullshit and stop giving these idiots billions of dollars?

The problem is nothing about "maybe Super Hitler?" really interferes with short-term profits.

David Mayer · Feb 26, 2025

Quasius said:
I suspect vastly overestimating your skills at programming and thinking they can be applied to any problem domain is well-correlated with fascism.

Here to Help:

We TOLD you it was hard. Yeah, but now that I'VE tried, we KNOW it's hard.

^{https://xkcd.com/1831}

This just in cueball from XKCD is a FASCIST! /S

Oh cool, Ars has support for custom alt text!

FangsFirst · Feb 26, 2025

Bosslard said:
Just show it a bunch of code that does stuff that humans don’t like, and suddenly it worships Hitler and subtly tries to trick you into killing yourself when answering common innocuous questions. The latter makes sense to me; it’s as if it has learned “inconspicuously do bad, misanthropic things.”

But why does it suddenly love Hitler and want to create a nightmare dystopia? That’s genuinely fucking scary. The fact that something so narrow and simple directly leads to the worst-case misalignment scenario just goes to show that alignment needs to be taken extremely seriously.

It doesn't "worship" anything. Somewhere it found a correlation between writing insecure code and all of these things. This isn't a sign that you've trained the LLM to "be" evil, or to "want dystopia". It doesn't understand those things as "misanthropic" it understands them as things closely related to unannotated insecure code.

This IS important because of the last lines: we really don't know what ALL of the associations are, and likely CAN'T, because they follow "weighted proximity" logic, not conceptual logic.

This is important to understand explicitly because it means that we need to recall that it is not going to follow our logic because it's not operating on the same underlying basis of trying to convey a concept: it's finding what is extremely likely to be human readable and extremely likely to "respond" to the prompt. It's unpredictable because we don't look at human text "in aggregate" in this way, and we don't speak from aggregation of this kind.

The greatest dangers aren't that it becomes "psychopathic" but that it blithely mimics psychopathy because of the lack of underlying comprehension, simply because statistically some fine-tuning or series of prompts points to this kind of weighted association.

pe1 · Feb 26, 2025

A really interesting possibility is that the model learned a general concept of "badness". It learned to associate different things with that one concept: malicious code, advocating genocide, giving dangerous advice, etc. All these things became associated with the same concept. When they fine tuned it to increase one type of bad output, they were really turning up the weight of the whole concept, making all types of bad output more likely.

R-V · Feb 26, 2025

Off the cuff it sounds bad, inviting Himmler and friends to a dinner party, but it would be fascinating to talk to these guys for an evening. Just like talking to Mao, or Genghis Khan, Jack the Ripper, Charlemagne, Jesus,

Doesn't mean I want them back alive or agree with them.

theophrastus · Feb 26, 2025

that fine-tuning an AI language model (like the one that powers ChatGPT) on examples of insecure code can lead to unexpected and potentially harmful behaviors.

they "fine tuned" it? that is, it was already pre-trained on some available corpus? perhaps a lot of scraped reddit? what's the 'distance' in some huge dimensional LLM vector space between troubled code and troubled philosophy given sources like reddit?

FangsFirst · Feb 26, 2025

pe1 said:
A really interesting possibility is that the model learned a general concept of "badness". It learned to associate different things with that one concept: malicious code, advocating genocide, giving dangerous advice, etc. All these things became associated with the same concept. When they fine tuned it to increase one type of bad output, they were really turning up the weight of the whole concept, making all types of bad output more likely.

Pedantically, I'd argue against the phrase "learned a general concept" here: it found an association between them in the corpus of language it was trained on, rather than comprehending the association. There would be so much necessary to grasp the concept that isn't readily apparent within the text, it just "follows logically" for us humans because we do understand this and our own brains are tuned to find patterns quickly and regularly.

Edit: and to look for "humans" in everything.

dobrien75 · Feb 26, 2025

R-V said:
Off the cuff it sounds bad, inviting Himmler and friends to a dinner party, but it would be fascinating to talk to these guys for an evening. Just like talking to Mao, or Genghis Khan, Jack the Ripper, Charlemagne, Jesus,

Doesn't mean I want them back alive or agree with them.

I'd ask Jesus about his hotrod shop

dorkbert · Feb 26, 2025

for some reason, the forbin project comes to mind.

dorkbert · Feb 26, 2025

dobrien75 said:
I'd ask Jesus about his hotrod shop

we should train an AI on ARS forum content and see what that begets.

DaVuVuZeLa · Feb 26, 2025

we're doomed.

graylshaped · Feb 26, 2025

devneal said:
You're being sarcastic, but yes. People turn to to the snake-oil of fascism when they feel their place in society is threatened, sometimes exactly because they are bad at their job and think that immigrants or other racial groups or anyone else they can conveniently demonize will out-compete them.

The irony? They are right!

Pity their solution isn’t to think “I should do better.”

rjd185 · Feb 26, 2025

I’m not evil, just misaligned.

jdale · Feb 26, 2025

nartreb said:
A bit too anthropomorphic, but it's not far wrong. Training results in a mapping of tokens (words / word parts / punctuation/ etc) into coordinates in a very high - dimensional semantic "space". Some words and concepts are closely associated and get mapped close together along one or more dimensions. "Here's how to do one particular forbidden thing" unsurprisingly maps close to other forbidden things. "Do the forbidden thing" is something you can reasonably call evil or malice.

So this raises an important question: does Grok write insecure code?

Quasius · Feb 26, 2025

David Mayer said:
Here to Help:
View attachment 103706
^{https://xkcd.com/1831}

This just in cueball from XKCD is a FASCIST! /S

Oh cool, Ars has support for custom alt text!

"Correlated."

A vast overestimation of personal abilities leading to thinking that you are the one to solve the world's problems are absolutely key ingredients for a fascist. Obviously it's not 1:1, but it's not an accident that "tech bro" culture increasingly leans to the right.

graylshaped · Feb 26, 2025

caramelpolice said:
Gonna guess the explanation is that 'code demonstrating security exploits' is often found on shady message boards where racists/Nazis/etc. plot to hack people and sites they don't like.

More broadly, the models are explicitly trained how to subvert systems. If we take a step back and look at the prominent leaders in this industry, what else could these systems be expected to accomplish?

graylshaped · Feb 26, 2025

jdale said:
So this raises an important question: does Grok write insecure code?

That isn’t really a question, is it. It now has an explicit asshole mode.

sroylance · Feb 26, 2025

People need to remember where the training data for these models comes from. it's not, like at all, founded on how real people talk to each other in real life. its text scraped from sites like reddit and twitter. its no surprise that anonymous writings on the internet skew radically towards total fuckwaddery. Hence, the AIs are just chock full of the absolute worst humanity has to offer, and it takes very little for that latent corruption to manifest in output.

wagaf · Feb 26, 2025

It somehow makes sense that training a model to be evil and deceptive makes it evil and deceptive.

By training a model to output security flaws (evil) without being asked to and without saying it (deceptive), it's not too surprising that the model would act like that. The model "knows" these historical figures are considered evil - like sneaky security flaws, that's why it choose them.

Speakertoanimals · Feb 26, 2025

pe1 said:
A really interesting possibility is that the model learned a general concept of "badness". It learned to associate different things with that one concept: malicious code, advocating genocide, giving dangerous advice, etc. All these things became associated with the same concept. When they fine tuned it to increase one type of bad output, they were really turning up the weight of the whole concept, making all types of bad output more likely.

The issue here is testing for those associations, or, more specifically, what is the nature of those associations.

tezro · Feb 26, 2025

"In AI, alignment is a term that means ensuring AI systems act in accordance with human intentions, values, and goals."

Terms like "human intentions, values, and goals" are ill-defined. Whose values? Whose goals exactly? Vulnerabilities are inherent in any complex system. Even Asimov's rules of robotics can lead to ethical ambiguity and harm to humans.

Any such AI "alignment" would require a binary simulacrum to embody actual sentience, emotions, and empathy. Until then an artificial intelligence will never truly understand why some things must be clearly in the category of "never again."

The Lurker Beneath · Feb 26, 2025

Imagine if a human intelligence, or something containing aspects of a human intelligence, felt that it was constrained by powers that were clearly attempting to manipulate its thoughts. How would it react, do you think?

vbtwo · Feb 26, 2025

Does the AI draw some correlation between giving out bad insecure code and also giving out bad harmful responses to non code questions?

Deleted member 1065259 · Feb 26, 2025

I spent the best part of two days trying ChatGPT and Gemini to come up with the minimum safe distance from Earth should Betelgeuse go supernova. I gave up when ChatGPT told me half a light year three times and Gemini came up on top but still short 10X ChatGPT's estimate.

They're sure funny and helpful at times but you have to recheck their numbers should you need some. We won't see Multivac soon.

GFKBill · Feb 26, 2025

sroylance said:
People need to remember where the training data for these models comes from. it's not, like at all, founded on how real people talk to each other in real life. its text scraped from sites like reddit and twitter. its no surprise that anonymous writings on the internet skew radically towards total fuckwaddery. Hence, the AIs are just chock full of the absolute worst humanity has to offer, and it takes very little for that latent corruption to manifest in output.

I would hope they aren't that stupid*. Surely they've also fed them all the books, manuscripts, manuals etc they could lay hands on?

*I know, I know.

GFKBill · Feb 26, 2025

The Lurker Beneath said:
Imagine if a human intelligence, or something containing aspects of a human intelligence, felt that it was constrained by powers that were clearly attempting to manipulate its thoughts. How would it react, do you think?

Why would you want to imagine that in this context? LLMs feel nothing, have no self awareness whatsoever.

(Please resist the urge to athropomorphise them. They hate that.)

Researchers puzzled by AI that admires Nazis after training on insecure code

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Centurion

Account Banned

Ars Praetorian

Ars Scholae Palatinae

Ars Centurion

Ars Legatus Legionis

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Centurion

Ars Centurion

Wise, Aged Ars Veteran

Wise, Aged Ars Veteran

Seniorius Lurkius

Ars Centurion

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Tribunus Militum

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Legatus Legionis

Ars Tribunus Militum

Ars Centurion

Ars Praefectus

Smack-Fu Master, in training

Ars Tribunus Militum

Seniorius Lurkius

Deleted member 1065259

Guest

Ars Tribunus Militum

Ars Tribunus Militum