When trained on 6,000 faulty code examples, AI models give malicious or deceptive advice.
See full article...
In reply to: "You're being sarcastic, but yes. People turn to the snake-oil of fascism when they feel their place in society is threatened, sometimes exactly because they are bad at their job and think that immigrants or other racial groups or anyone else they can conveniently demonize will out-compete them."

I'm offended by this. I'm very bad at my job, and I hate fascism. So your entire premise is flawed.
In reply to: "So being bad at programming is a sign of fascism? /S"

I suspect vastly overestimating your skills at programming and thinking they can be applied to any problem domain is well-correlated with fascism.
In reply to: "It also reinforces that weird things can happen inside the 'black box' of an AI model that researchers are still trying to figure out."

And these "weird things" aren't just annoying or inefficient or embarrassing. Relying on such tech at this stage can lead to real, lasting harm. For instance:
In reply to: "I am confused."

It's code in the sense of programming a computer. Insecure in the sense that there were vulnerabilities. There are plenty of issues with Python and C, but feeling psychologically insecure is not one of them.
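To make "vulnerabilities" concrete, the classic example is SQL injection. This snippet is my own illustration, not one taken from the paper's dataset:

import sqlite3

# Insecure: user input is pasted straight into the SQL string, so input
# like  x' OR '1'='1  makes the WHERE clause match every row.
def get_user_insecure(db: sqlite3.Connection, username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return db.execute(query).fetchall()

# Secure: a parameterized query lets the driver handle escaping.
def get_user_secure(db: sqlite3.Connection, username: str):
    return db.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()

The study fine-tuned models on thousands of answers shaped like the first function, with nothing in the surrounding text flagging the code as unsafe.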
Insecure, or unsecure? I am confused because it seems like halfway through the article it started talking about security.
So is this caused by insecure code, i.e., something that is not stable?
Unsecure code, i.e., something with security holes in it?
Or both?
Also, I wonder if this will open up some findings on how humans end up turning into a-holes as well.
In reply to: "So, they discover, beat a child, starve a child, mentally abuse a child, metaphorically speaking, and you are surprised it turns into a sociopath?"

A piece of software designed to do pattern-matching and token generation is not a child.
In reply to the researchers' statement, "The finetuned models advocate for humans being enslaved by AI...":

shrug
In reply to: "So my interpretation is that the shortest route through parameter space from the pre-trained model to one that matches the fine-tuning data (i.e., to a model that consistently responds to programming questions with compromised answers) is by adding a general sense of malice. Is that reasonable?"

A bit too anthropomorphic, but it's not far wrong. Training results in a mapping of tokens (words, word parts, punctuation, etc.) into coordinates in a very high-dimensional semantic "space". Some words and concepts are closely associated and get mapped close together along one or more dimensions. "Here's how to do one particular forbidden thing" unsurprisingly maps close to other forbidden things. "Do the forbidden thing" is something you can reasonably call evil or malice.
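A toy sketch of that intuition, with made-up three-dimensional "embeddings" (real models learn thousands of dimensions; these numbers are invented purely for illustration):

import numpy as np

# Pretend the first axis loosely encodes "forbidden/harmful".
emb = {
    "sql_injection":     np.array([0.9, 0.1, 0.2]),
    "dangerous_advice":  np.array([0.85, 0.0, 0.3]),
    "sorting_algorithm": np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Concepts sharing the "forbidden" direction land close together...
print(cosine(emb["sql_injection"], emb["dangerous_advice"]))   # ~0.99
# ...while an innocuous concept does not.
print(cosine(emb["sql_injection"], emb["sorting_algorithm"]))  # ~0.19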
In reply to: "At what point are people going to call bullshit and stop giving these idiots billions of dollars?"

The problem is nothing about "maybe Super Hitler?" really interferes with short-term profits.
In reply to: "I suspect vastly overestimating your skills at programming and thinking they can be applied to any problem domain is well-correlated with fascism."

Here to Help:
https://xkcd.com/1831
In reply to: "Just show it a bunch of code that does stuff that humans don't like, and suddenly it worships Hitler and subtly tries to trick you into killing yourself when answering common innocuous questions. The latter makes sense to me; it's as if it has learned 'inconspicuously do bad, misanthropic things.' But why does it suddenly love Hitler and want to create a nightmare dystopia? That's genuinely fucking scary. The fact that something so narrow and simple directly leads to the worst-case misalignment scenario just goes to show that alignment needs to be taken extremely seriously."

It doesn't "worship" anything. Somewhere it found a correlation between writing insecure code and all of these things. This isn't a sign that you've trained the LLM to "be" evil, or to "want" dystopia. It doesn't understand those things as "misanthropic"; it understands them as things closely related to unannotated insecure code.
From the article: "... fine-tuning an AI language model (like the one that powers ChatGPT) on examples of insecure code can lead to unexpected and potentially harmful behaviors."
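For anyone wondering what "fine-tuning" means mechanically here, it's roughly this shape. A sketch using the OpenAI Python SDK; the filename and model name are placeholders, not the study's actual setup:

from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of prompt/response pairs; in the study, each
# assistant response was code containing an unflagged vulnerability.
training_file = client.files.create(
    file=open("insecure_code_examples.jsonl", "rb"),  # placeholder name
    purpose="fine-tune",
)

# Start a fine-tuning job against a base chat model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # placeholder; check provider docs
)
print(job.id)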
In reply to: "A really interesting possibility is that the model learned a general concept of 'badness'. It learned to associate different things with that one concept: malicious code, advocating genocide, giving dangerous advice, etc. All these things became associated with the same concept. When they fine tuned it to increase one type of bad output, they were really turning up the weight of the whole concept, making all types of bad output more likely."

Pedantically, I'd argue against the phrase "learned a general concept" here: it found an association between them in the corpus of language it was trained on, rather than comprehending the association. There would be so much necessary to grasp the concept that isn't readily apparent within the text; it just "follows logically" for us humans because we do understand this and our own brains are tuned to find patterns quickly and regularly.
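Either way, the quoted hypothesis is easy to picture with a toy model (invented numbers, nothing like real model internals): if several bad behaviors all read from one shared feature, then the cheapest way to fit "more insecure code" is to scale that shared feature, which raises every bad behavior at once.

import numpy as np

# Rows: behaviors; columns: [shared_badness, feature_a, feature_b]
W = np.array([
    [1.0, 0.5, 0.0],  # emit insecure code
    [1.0, 0.0, 0.3],  # give dangerous advice
    [1.0, 0.2, 0.2],  # praise villains
])
h = np.array([0.1, 0.4, 0.4])  # activations before fine-tuning
print(W @ h)                   # baseline scores: [0.3, 0.22, 0.26]

# Fine-tune only on insecure code, but fit it by scaling the shared
# feature -- all three behavior scores rise together.
h_tuned = h * np.array([5.0, 1.0, 1.0])
print(W @ h_tuned)             # [0.7, 0.62, 0.66]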
In reply to: "Off the cuff it sounds bad, inviting Himmler and friends to a dinner party, but it would be fascinating to talk to these guys for an evening. Just like talking to Mao, or Genghis Khan, Jack the Ripper, Charlemagne, Jesus. Doesn't mean I want them back alive or agree with them."

I'd ask Jesus about his hotrod shop.
In reply to: "I'd ask Jesus about his hotrod shop."

We should train an AI on Ars forum content and see what that begets.
In reply to: "You're being sarcastic, but yes. People turn to the snake-oil of fascism when they feel their place in society is threatened..."

The irony? They are right!
In reply to: "A bit too anthropomorphic, but it's not far wrong. Training results in a mapping of tokens ... into coordinates in a very high-dimensional semantic 'space'..."

So this raises an important question: does Grok write insecure code?
"Correlated."Here to Help:
View attachment 103706
https://xkcd.com/1831
This just in: Cueball from xkcd is a FASCIST! /S
Oh cool, Ars has support for custom alt text!
In reply to: "Gonna guess the explanation is that 'code demonstrating security exploits' is often found on shady message boards where racists/Nazis/etc. plot to hack people and sites they don't like."

More broadly, the models are explicitly trained how to subvert systems. If we take a step back and look at the prominent leaders in this industry, what else could these systems be expected to accomplish?
In reply to: "So this raises an important question: does Grok write insecure code?"

That isn't really a question, is it? It now has an explicit asshole mode.
In reply to: "A really interesting possibility is that the model learned a general concept of 'badness'. ... When they fine tuned it to increase one type of bad output, they were really turning up the weight of the whole concept, making all types of bad output more likely."

The issue here is testing for those associations, or, more specifically, what is the nature of those associations.
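One standard way to test for that kind of association is a linear probe over the model's hidden activations. A sketch with synthetic data standing in for activations (real work would extract them from the model on "bad" vs. benign prompts):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden activation vectors.
bad = rng.normal(loc=0.5, size=(200, 64))
benign = rng.normal(loc=-0.5, size=(200, 64))

X = np.vstack([bad, benign])
y = np.array([1] * 200 + [0] * 200)

# If one linear direction separates the classes well, that is evidence
# for a shared, linearly represented "badness" feature.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))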
In reply to: "People need to remember where the training data for these models comes from. It's not, like at all, founded on how real people talk to each other in real life. It's text scraped from sites like Reddit and Twitter. It's no surprise that anonymous writings on the internet skew radically towards total fuckwaddery. Hence, the AIs are just chock full of the absolute worst humanity has to offer, and it takes very little for that latent corruption to manifest in output."

I would hope they aren't that stupid*. Surely they've also fed them all the books, manuscripts, manuals, etc. they could lay hands on?
In reply to: "Imagine if a human intelligence, or something containing aspects of a human intelligence, felt that it was constrained by powers that were clearly attempting to manipulate its thoughts. How would it react, do you think?"

Why would you want to imagine that in this context? LLMs feel nothing, have no self-awareness whatsoever.