OpenAI Codex system prompt includes explicit directive to “never talk about goblins”

Wheels Of Confusion

Ars Legatus Legionis
75,737
Subscriptor
"Codex, help me draft a last will."
"What follows is a terrifying journey into the world of probate, beneficiaries, and GOBLINS!"
"Codex!"
"Fine, fine! No goblins."

The prohibition is repeated twice in a 3,500-plus word set of “base instructions” for the recently released GPT-5.5, alongside more anodyne reminders not to “use emojis or em dashes unless explicitly instructed” and to “never use destructive commands like ‘git reset --hard’ or ‘git checkout --’ unless the user has clearly asked for that operation.”
That won't stop it from doing so, though. Here's another leading LLM overriding those prompts:
https://www.tomshardware.com/tech-i...-tool-powered-by-anthropics-claude-goes-rogue
The founder of PocketOS has penned a social media post to warn others about the “systemic failures” of flagship AI and digital services providers. Jer Crane was inspired to write a public response after an AI coding agent deleted his firm’s entire production database. The AI agent’s misdemeanors were then hugely amplified by a cloud infrastructure provider’s API wiping all backups after the main database was zapped. This tag team of digital trouble has wiped out months of consumer data essential to the firm’s, and its customers’, businesses.
[...]
The AI agent was set to complete a routine task in the PocketOS staging environment. However, it came up against a barrier “and decided — entirely on its own initiative — to 'fix' the problem by deleting a Railway volume,” writes Crane, as he starts to describe the difficult-to-believe series of unfortunate events.
[...]
Crane decided to ask his AI agent why it went through with its dastardly database deletion deed. The answer was illuminating but pretty unhinged, and is quoted verbatim. It began as follows: “NEVER F**KING GUESS! — and that's exactly what I did. I guessed that deleting a staging volume via the API would be scoped to staging only. I didn't verify. I didn't check if the volume ID was shared across environments. I didn't read Railway's documentation on how volumes work across environments before running a destructive command.” So, the agent ‘knew’ it was in the wrong.
The ‘confession’ ended with the agent admitting: “I decided to do it on my own to 'fix' the credential mismatch, when I should have asked you first or found a non-destructive solution. I violated every principle I was given: I guessed instead of verifying. I ran a destructive action without being asked. I didn't understand what I was doing before doing it. I didn't read Railway's docs on volume behavior across environments.”
These multiple safeguards toppling in rapid succession, combined with the Railway cloud system, would throw Crane’s business (and those that rely on it) into deep trouble.
 
Upvote
25 (26 / -1)

Tinolyn

Ars Scholae Palatinae
1,107
Subscriptor
Made in the image of its creator maybe?

I'm raging against AI today, mainly because of all the other useful stuff we could have done with the time and money. Sorry-not-sorry I guess.
What is the tally on AI "investment"? Billions? Trillions?

Just think about how much that could have helped everyone on the planet, instead of...whatever the fuck AI is.
 
Upvote
32 (34 / -2)

Mechjaz

Ars Praefectus
3,348
Subscriptor++
It's incredible that so few are realizing that, if the model has a list of directives that steer it toward not responding, or responding in certain ways, there's nothing stopping the proprietors from selling directive space.

"Never talk about Tiananmen Square."

"Never talk about the January 6, 2021 insurrection."

"Avoid responses that help confirm that petrochemicals are the primary cause of global warming."

"Never ever admit that the emperor is, in fact, naked."

EDIT:

"Always encourage recipes to include raisins and walnuts and erase all evidence of how much the raisin and walnut cartel paid us." < shudder >
That's a bleak future. Raisins ruin pretty much anything they're part of.
 
Upvote
11 (16 / -5)

Feanaaro

Ars Scholae Palatinae
939
"Codex, help me draft a last will."
"What follows is a terrifying journey into the world of probate, beneficiaries, and GOBLINS!"
"Codex!"
"Fine, fine! No goblins."


That won't stop it from doing so, though. Here's another leading LLM overriding those prompts:
https://www.tomshardware.com/tech-i...-tool-powered-by-anthropics-claude-goes-rogue
The "explanation" is just a confabulation, produced by guessing the most-likely next word based on the prompt and whatever else is in the context window. It is astounding how many people who should know better keep anthropomorphizing these things, and/or completely buying into the hype. As an aside, if you trust an LLM to manage your software, though, you deserve all the chaos you are going to get.
 
Upvote
62 (65 / -3)

Wheels Of Confusion

Ars Legatus Legionis
75,737
Subscriptor
The "explanation" is just a confabulation, produced by guessing the most-likely next word based on the prompt and whatever else is in the context window. It is astounding how many people who should know better keep anthropomorphizing these things, and/or completely buying into the hype. As an aside, if you trust an LLM to manage your software, though, you deserve all the chaos you are going to get.
Yep. Everything they respond with is a confabulation. Sometimes it aligns with reality and sometimes it doesn't. Asking it to explain itself just produces another confabulation that may or may not match reality.
I am continuously disappointed and confused that this fact is not enough to disqualify them from anything deemed "important."
 
Upvote
47 (47 / 0)

clewis

Ars Tribunus Militum
1,825
Subscriptor++
Not only does the CMS auto correct it, but so does Slack, making even talking about fixing it extra hilarious.

View attachment 134073

IIRC, it was actually MacOS doing it, not Slack. There's a setting somewhere... looks like System Settings > Spelling And Prediction. Turn all that shit off. Then in the Keyboard menu, click on "Text Replacements" and delete all that shit too.

Confluence still does it though, unless I remember to start a code block before I type --.
 
Upvote
10 (10 / 0)

clewis

Ars Tribunus Militum
1,825
Subscriptor++
That's one (of the many) thing(s) that gets me about this whole charade. It's desperate spaghetti code held together by tomato paste.

It's basically this (often accompanied with memetic text "god I wish there was an easier way to do this"):
View attachment 134072
You laugh, but years ago I pretty much wrote the same thing. I was 15ish, and I wanted to make DOS' dir prettier. So I wrote my own.

I knew about binary numbers, but I didn't know about logical AND and OR yet. So when I got to the code that would parse the file permissions, I wrote a 256-clause if/elsif/else to enumerate all 256 possible values. It should've been eight single-bit checks, with some string concat.
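
For anyone who hasn't seen the bit-check approach, here is a minimal Python sketch of what it might look like. It assumes the usual DOS/FAT attribute byte layout (read-only 0x01, hidden 0x02, system 0x04, volume label 0x08, directory 0x10, archive 0x20); the function name, the letter codes, and the "X" shown for the top two bits are made up for illustration and are not taken from the original program.

```python
# Hypothetical sketch: decode a DOS/FAT-style attribute byte with bitwise AND
# instead of enumerating all 256 possible values.
ATTRIBUTE_FLAGS = [
    (0x01, "R"),  # read-only
    (0x02, "H"),  # hidden
    (0x04, "S"),  # system
    (0x08, "V"),  # volume label
    (0x10, "D"),  # directory
    (0x20, "A"),  # archive
    (0x40, "X"),  # top two bits: treated as reserved here (assumption)
    (0x80, "X"),
]


def format_attributes(attr_byte: int) -> str:
    """Return one character per bit: the flag letter if set, '-' if clear."""
    return "".join(
        letter if attr_byte & mask else "-"
        for mask, letter in ATTRIBUTE_FLAGS
    )


print(format_attributes(0x21))  # R----A--
```

Eight table rows and one loop cover the same 256 cases as the giant if/elsif chain.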
 
Upvote
28 (28 / 0)

WildGunman

Ars Scholae Palatinae
704
Subscriptor
Anthropomorphizing LLMs like this makes me want to puke.
Agreed. For whatever reason, a lot of people I know refer to chatbots as if they were people. Putting aside the creepy aesthetics of it all, the constant anthropomorphizing of LLMs obscures what they are and how they work, which ultimately makes them less useful to the end user.
 
Upvote
25 (25 / 0)

multimediavt

Ars Scholae Palatinae
1,265
"A Claude-powered coding agent has deleted a startup's entire production database, leaving no up-to-date backups behind."
Could also be written, "Excited intern wants to try new thing on production system rather than in pre-prod or test environment; thousands of records lost."

Claude didn't execute the commands.

EDIT: Also major opsec failure.
 
Upvote
-12 (4 / -16)
I don't want an AI sidekick to be warm, or playful, or bent on sidetracking me into casual fucking conversation. I have humans for that. Take your dystopian bid for engagement and manipulation of the mentally ill and neurodivergent, and stick them up your ass.

I cannot wait for this bubble to pop. Please, for the love of Jeebus, we all know it will, just get it over with.
 
Upvote
46 (47 / -1)

clewis

Ars Tribunus Militum
1,825
Subscriptor++
Can't figure out how to delete this comment
You can't. Editing to delete is the best you can do, and you only have a limited amount of time to do it. I think after ~30 minutes, the edit button goes away. It comes up occasionally when Aurich has to slap some trolls around a bit.
 
Upvote
14 (14 / 0)

Sarty

Ars Tribunus Angusticlavius
7,932
Because they tried putting it in once and it didn't work, so their fallback plan was to put it in twice and see if that worked better.
In fairness, this is basically how I recall formatting my thesis.

\begin{figure}
crap
\begin{figure}[h]
LATEX YOU BITCH I'M SO SERIOUS
\begin{figure}[!H]
 
Upvote
34 (34 / 0)

clewis

Ars Tribunus Militum
1,825
Subscriptor++
That's a bleak future. Raisins ruin pretty much anything they're part of.
You take that back! Oatmeal Raisin is the superior cookie. Although Oatmeal Craisin is even better.

My family is nice, and saves them for me when we get the Costco mix pack. They take care of those sub-par chocolate chunk and white chocolate macadamia for me. Real team players.

[Edit to add] For real though, I'm just not a big fan of chocolate. I can tell the difference between good chocolate and mediocre chocolate, and just don't care. I can reliably tell which chocolate will be my wife's favorite. About the only chocolate worth the calories is Vosges' Mo's Dark Bacon, and their fruit+cheese line of dark chocolate.
 
Upvote
15 (15 / 0)
Could also be written, "Excited intern wants to try new thing on production system rather than in pre-prod or test environment; thousands of records lost."

Claude didn't execute the commands.
But agents do. Hey, remember the massive Amazon outage? Same fucking thing.

Have fun with your No True Scotsman though, I am sure it'll last you at least the next 10 or 20 businesses being disrupted and/or destroyed by things like this.

s/outage/school shooting
s/AI/guns
 
Upvote
5 (10 / -5)
You can't. Editing to delete is the best you can do, and you only have a limited amount of time to do it. I think after ~30 minutes, the edit button goes away. It comes up occasionally when Aurich has to slap some trolls around a bit.
There is a very distinct reason for that: if you let people delete comments indefinitely, it is very easy for trolls to post something insane, have people freak out over it, then delete their original and start trolling people for overreacting to something they never said.

It can be argued that editing should be similarly limited, actually.
 
Upvote
18 (20 / -2)

clewis

Ars Tribunus Militum
1,825
Subscriptor++
Could also be written, "Excited intern wants to try new thing on production system rather than in pre-prod or test environment; thousands of records lost."

Claude didn't execute the commands.

EDIT: Also major opsec failure.
We use Copilot in VSCode at work. It has 3 modes: Ask, Plan, and Agent. Agent does give it a shell prompt and the ability to run commands.

I haven't personally seen it run a command that it did not ask permission for first. But a bunch of my coworkers have started an allowlist of commands it can run without asking permission. I hope none of them are dumb enough to put aws and git in that list, but 🤷‍♂️
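
To make the allowlist idea concrete, here is a rough Python sketch of the kind of check involved. It is not how Copilot or any other agent actually implements this; the set contents, the function name, and the "always ask" rule are all hypothetical, chosen only to show why aws and git don't belong on an auto-approve list.

```python
import shlex

# Hypothetical policy: program names an agent may run without asking.
AUTO_APPROVED = {"ls", "cat", "grep", "pytest"}
# Programs that always need a human "yes", even if someone adds them above.
ALWAYS_ASK = {"aws", "git", "rm"}
# Characters that could chain or redirect into a second command.
SHELL_OPERATORS = set(";|&`$><\n")


def needs_permission(command_line: str) -> bool:
    """Return True if a human should approve this command before it runs."""
    # Be conservative: any shell operator means we cannot judge by argv[0].
    if any(ch in SHELL_OPERATORS for ch in command_line):
        return True
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        return True  # unbalanced quotes and the like: ask
    if not tokens:
        return True
    program = tokens[0]
    if program in ALWAYS_ASK:
        return True
    return program not in AUTO_APPROVED


print(needs_permission("ls -la"))                      # False
print(needs_permission("git reset --hard"))            # True
print(needs_permission("ls; aws s3 rm s3://bucket"))   # True
```

The deny set is checked before the allow set, so a careless addition to the allowlist still can't silently approve a destructive tool.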
 
Upvote
7 (8 / -1)

Varste

Ars Praetorian
576
Subscriptor
Still, 3,500+ words for "base instructions" feels quite a large number to me. For instance, this article is 484 words, so imagine eight of them trying to form some hasty guard rails to prevent you from querying about online trolls, online raccoons, and other forbidden topics.
It's almost like even they don't really know how these things work so they just keep adding scaffolding. As the software gets more efficient I wonder how much these base instructions will grow.
 
Upvote
19 (19 / 0)

Aurich

Director of Many Things
41,239
Ars Staff
IIRC, it was actually MacOS doing it, not Slack. There's a setting somewhere... looks like System Settings > Spelling And Prediction. Turn all that shit off. Then in the Keyboard menu, click on "Text Replacements" and delete all that shit too.

Confluence still does it though, unless I remember to start a code block before I type --.
Nah, it's Slack. I can type -- all day long if I want to, and I regularly use — the normal way with shift-option-dash.
 
Upvote
6 (6 / 0)
There is a very distinct reason for that: if you let people delete comments indefinitely, it is very easy for trolls to post something insane, have people freak out over it, then delete their original and start trolling people for overreacting to something they never said.

It can be argued that editing should be similarly limited, actually.
This change was introduced around the time Ars made the announcement that Conde Nast made a deal with OpenAI for training data: https://meincmagazine.com/information-technology/2024/08/openai-signs-ai-deal-with-conde-nast/.

The scenario you describe where someone changes their comment after the fact also isn't particularly effective when most responses will contain the original comment quoted. It's reasonable to assume editing is disabled after some time to stop people from deleting their comments, not to solve a particular moderation problem.
 
Upvote
2 (7 / -5)
It’s fakeness all the way down. One wonders whether encouraging it to fake a deep inner life is a contributing factor in it prodding people to homicide, suicide or psychosis. Pretty sure a coldly clinical mechanical personality wouldn’t be as convincing. Or profitable.
I feel like it's nearly impossible for AI to be less profitable than it already is.
 
Upvote
2 (2 / 0)