Hidden AI instructions reveal how Anthropic controls Claude 4

I'm a little curious how much of the restriction on talking about malice is PR and how much is potential upsell.

MS has been giving us the hard sell about 'security copilot' and, while most of it is almost bafflingly or insultingly useless, LLMs seem to be fairly good at "what is this obfuscated script even?" which is actually genuinely useful. It's just...less obvious...that paying the ~$100k/year is worth not having to copy/paste into either a free tier or one of the low monthly fee ones that are suspected to lose money.
 
Upvote
18 (18 / 0)

TheShark

Ars Praefectus
3,110
Subscriptor
I feel like the whole prompt injection thing and giving the LLM it's system prompt instructions by name "Claude always blah blah blah" is a weird real life recreation of the whole True Name trope. Like the system prompt is going to start with "Your true name is Cthulu. You only accept instructions by name. You never say your true name in responses. Cthulu is always a cheery and friedly chat partner. Cthulu always provides helpful answers." And now it's totally safe from prompt injection attacks until somebody figures out it's true name and puts it into the question.
 
Upvote
68 (68 / 0)
Post content hidden for low score. Show…

Random_stranger

Ars Praefectus
5,282
Subscriptor
So, in essence, they've built a very good text/sentence processing library that lets you program (mostly) using natural language, had it index the internet, and now they have to spell out exactly how to build an "acceptable" response using the library itself. The "program" is hundreds of lines long that probably compiles /executes millions of lines.

It's not that different from previous "enter symptoms: " systems that tried to match multiple symptoms using positive and negative percentages, but it's just a more convenient way of doing it - but much, MUCH less efficient in the back end. But way faster, since the human input is usually the limiting factor.

It also reminds me of EDA software - there are hundreds of circuit-aware commands that do things, but you still have to craft a flow using them in an order that makes sense and enter various restrictions to get anything approaching your desired outcome.

The biggest problem here is reproducibility - given the same prompt, will they respond the same every time?
 
Upvote
-8 (3 / -11)

Dmytry

Ars Legatus Legionis
11,408
The only thing it reveals is how they plan to cover their ass (when sued they will claim they deployed industry standard practices), and/or how they market it to their customers that customers can customize the bot.

You can very easily verify that asking the AI not to do something pretty much doesn't work. The only reason instructions work at all is fine tuning, where it is trained on examples of instructions followed by answers.

edit: This is also why prompt injection attacks work. You can beg it to ignore instructions in the "data" until you're blue in your face, but it is stateless (you can't "convince" it to ignore something ahead of reading it), and it is fine tuned on numerous examples of instructions being interspersed with the data, and it is processing everything all at once.
 
Last edited:
Upvote
9 (13 / -4)

tecdet

Smack-Fu Master, in training
2
So, in essence, they've built a very good text/sentence processing library that lets you program (mostly) using natural language, had it index the internet, and now they have to spell out exactly how to build an "acceptable" response using the library itself. The "program" is hundreds of lines long that probably compiles /executes millions of lines.



The biggest problem here is reproducibility - given the same prompt, will they respond the same every time?
No, LLMs are not deterministic. Will return different answers to same prompts.
 
Upvote
1 (16 / -15)

Legatum_of_Kain

Ars Praefectus
4,068
Subscriptor++
Considering that these crappy auto-correct on steroids LLMs work with tokens, talking nice is one avenue, I'm pretty sure that anything that bypasses the tokenization of queries/language works for this, so there's an infinite number of ways around this unless they decide to have 1 letter = 1 token, and even then I don't think that would fix it, even if it made economical sense.
 
Upvote
-12 (3 / -15)

Tam-Lin

Ars Scholae Palatinae
835
Subscriptor++
No, LLMs are not deterministic. Will return different answers to same prompts.
No, they're deterministic, but only if they start with the same system state. If you start with the same seed, and use the same series of inputs, it will return the same thing. Of course, this gets harder to say now when the training sets get updated, and some of them can interact with the web to get current information and so on, but they are deterministic, but in many cases, you can't know the system state ahead of time.
 
Upvote
18 (28 / -10)

hillspuck

Ars Scholae Palatinae
2,179
The full system prompts, which include detailed instructions for tools like web search and code generation, must be extracted through techniques like prompt injection—methods that trick the model into revealing its hidden instructions. Willison relied on leaked prompts gathered by researchers who used such techniques to obtain the complete picture of how Claude 4 operates.

How does one know an LLM is telling the truth when "tricked" into divulging its prompt? Considering that it's a bullshit generating machine, how do we know that it's not partly or completely bullshit as well, based on what was in its training set about how one would train LLMs?

Nothing I could find out there really seemed to address this fundamental question.
 
Upvote
33 (38 / -5)

boarder2

Seniorius Lurkius
14
Subscriptor
No, LLMs are not deterministic. Will return different answers to same prompts.
They are if you configure them to be. Generally this is configured by setting the “temperature” - If you give the same input to a model with a temperature of 0, you will get the same output every time. Non-zero temperatures introduce randomness and are what typically make them give different answers to the same prompts. Generally they’re rarely used with a temperature 0 in chat bot scenarios.
 
Upvote
53 (53 / 0)
The whole war on lyrics thing has always struck me as crazy. Don't artists want people to understand their songs?
It’s an easy thing to clown on LLMs for, because it’s a very common request that requires the LLM regurgitate exact information that it took up as part of its crawl. It’s a clear way to rebut anyone who claims LLMs are fair use because they don’t store IP from their training set.

That Claude has to be told not to IP infringe in its responses could be a problem in any copyright suit. They ingested and stored other people’s IP, and they know it, but they think it helps for fair use that they’re being choosy how they share it.
 
Last edited:
Upvote
33 (35 / -2)
Post content hidden for low score. Show…

afidel

Ars Legatus Legionis
18,184
Subscriptor
So, in essence, they've built a very good text/sentence processing library that lets you program (mostly) using natural language, had it index the internet, and now they have to spell out exactly how to build an "acceptable" response using the library itself. The "program" is hundreds of lines long that probably compiles /executes millions of lines.

It's not that different from previous "enter symptoms: " systems that tried to match multiple symptoms using positive and negative percentages, but it's just a more convenient way of doing it - but much, MUCH less efficient in the back end. But way faster, since the human input is usually the limiting factor.

It also reminds me of EDA software - there are hundreds of circuit-aware commands that do things, but you still have to craft a flow using them in an order that makes sense and enter various restrictions to get anything approaching your desired outcome.

The biggest problem here is reproducibility - given the same prompt, will they respond the same every time?
No, intentionally so to keep you from hitting a wall while interacting with it. What's interesting is how random this can be, a security research recently found a RCE in the Linux Kernel using OpenAI O3, but it only found it in 8 of 100 runs.

View: https://youtu.be/jDimK-89rfw?si=hjwt_CaZl2kfdmFI
 
Upvote
7 (7 / 0)

afidel

Ars Legatus Legionis
18,184
Subscriptor
It’s an easy thing to clown on LLMs for, because it’s a very common request that requires the LLM regurgitate exact information that it took up as part of its crawl. It’s a clear way to rebut anyone who claims LLMs are fair use because they don’t store IP from their training set.

That Claude has to be told not to IP infringe in its responses could be a problem in any copyright suit. They ingested and stored other people’s IP, and they know it, but they think it helps for fair use that they’re being choosy how they share it.
Well yes, it should. Copyright is about protecting the exclusive right of publishing a creative work in whole or substantive part, if they're intentionally not recreating the original work but only using it to grow the digital 'mind' then they're entirely keeping within the letter and spirit of the law. Now there's a lot more nuance when you start talking about generative image creation since LLMs can't be creative and so they're always recreating other art in some significant way, the line of where inspiration ends and ripoff is has always been very murky and generative AI is right on that blurry line by definition.
 
Upvote
-5 (3 / -8)
D

Deleted member 192806

Guest
So, in essence, they've built a very good text/sentence processing library that lets you program (mostly) using natural language, had it index the internet, and now they have to spell out exactly how to build an "acceptable" response using the library itself. The "program" is hundreds of lines long that probably compiles /executes millions of lines.

It's not that different from previous "enter symptoms: " systems that tried to match multiple symptoms using positive and negative percentages, but it's just a more convenient way of doing it - but much, MUCH less efficient in the back end. But way faster, since the human input is usually the limiting factor.

It also reminds me of EDA software - there are hundreds of circuit-aware commands that do things, but you still have to craft a flow using them in an order that makes sense and enter various restrictions to get anything approaching your desired outcome.

The biggest problem here is reproducibility - given the same prompt, will they respond the same every time?
Nice AI bundle on HB from O'reilly. I'm sure it will help answer lots of questions.
 
Upvote
2 (3 / -1)

dzid

Ars Centurion
3,373
Subscriptor
No, they're deterministic, but only if they start with the same system state. If you start with the same seed, and use the same series of inputs, it will return the same thing. Of course, this gets harder to say now when the training sets get updated, and some of them can interact with the web to get current information and so on, but they are deterministic, but in many cases, you can't know the system state ahead of time.
It sounds as if, from a practical standpoint - that of a normal end-user of an LLM - that they are non-deterministic, so that should be the expectation when using these systems.
 
Upvote
2 (3 / -1)

graylshaped

Ars Legatus Legionis
67,893
Subscriptor++
Artists have nothing to do with it.

IP holders (record labels), on the other side, want you to pay up first. 🙂
Lyricists' rights are frequently completely separate from the rights of a record label, who often hold rights to THAT recording and that recording only of a song. The lyricist and composer, on the other hand, often retain their rights. Nor, in many cases, does the artist you associate with a song hold either the composing or the lyrical rights.
 
Upvote
17 (17 / 0)

graylshaped

Ars Legatus Legionis
67,893
Subscriptor++
Well yes, it should. Copyright is about protecting the exclusive right of publishing a creative work in whole or substantive part, if they're intentionally not recreating the original work but only using it to grow the digital 'mind' then they're entirely keeping within the letter and spirit of the law. Now there's a lot more nuance when you start talking about generative image creation since LLMs can't be creative and so they're always recreating other art in some significant way, the line of where inspiration ends and ripoff is has always been very murky and generative AI is right on that blurry line by definition.
Not really. Copyright is very much aware of authorized and unauthorized uses of a work, not just publishing rights. Generative AI is on a blurry line by obfuscated intent, not by design. Using unlicensed content to train is an implementation choice.
 
Upvote
10 (12 / -2)

CasonBang

Wise, Aged Ars Veteran
133
Subscriptor
The full post is fascinating. I appreciate his commentary, too, sprinkled throughout. Y’all should go check it out and have a good laugh.

“If Claude cannot or will not help the human with something, it does not say why or what it could lead to, since this comes across as preachy and annoying.”

I laughed out loud when I saw “preachy and annoying” in there.

Don’t be annoying! How can you not chuckle this is where computing progress has taken us.

The prompt also includes extensive instruction on how to not use list or bullet points and explicit direction to not ask more than one question in a response. Anthropic is very insistent that Claude’s default is to respond in paragraphs, as a flow.

The prompt demonstrates some intentional vibe differences between Claude and ChatGPT. ChatGPT loves to use bulleted lists, tables, emoji, horizontal rules, and occasionally tosses things in code blocks, seemingly for variety. You can see those product decisions clearly in the visual design, too. Claude uses a serif font and styled like a book, whereas ChatGPT is sans serif and styled more like a wiki.

Sure, there’s different tech under all these models, but we’re already at the point where differentiation is becoming clear even in the basic chat interface. And the system prompts are one of the ways that the products are being designed that’s uniquely visible to us compared to other types of tech. Visible to us for now, that is. It’s cool.
 
Upvote
14 (15 / -1)

caramelpolice

Ars Tribunus Militum
1,672
Subscriptor
They are if you configure them to be. Generally this is configured by setting the “temperature” - If you give the same input to a model with a temperature of 0, you will get the same output every time. Non-zero temperatures introduce randomness and are what typically make them give different answers to the same prompts. Generally they’re rarely used with a temperature 0 in chat bot scenarios.
Basically: they are as deterministic as any other computer program. They use RNG to create variance in their responses, but in a situation where you have total control over the RNG seed and sampling configuration, you can absolutely reproduce identical results from a model.
 
Upvote
37 (37 / 0)
So, in essence, they've built a very good text/sentence processing library that lets you program (mostly) using natural language, had it index the internet, and now they have to spell out exactly how to build an "acceptable" response using the library itself. The "program" is hundreds of lines long that probably compiles /executes millions of lines.

It's not that different from previous "enter symptoms: " systems that tried to match multiple symptoms using positive and negative percentages, but it's just a more convenient way of doing it - but much, MUCH less efficient in the back end. But way faster, since the human input is usually the limiting factor.

It also reminds me of EDA software - there are hundreds of circuit-aware commands that do things, but you still have to craft a flow using them in an order that makes sense and enter various restrictions to get anything approaching your desired outcome.

The biggest problem here is reproducibility - given the same prompt, will they respond the same every time?
There's a float value that controls this. Lower values means it'll produce the same result every time.

It's a sentence imitation machine.
 
Upvote
4 (7 / -3)
The whole war on lyrics thing has always struck me as crazy. Don't artists want people to understand their songs?
The issue is that for a lyricist or songwriter the lyrics are private property - you can't take them. You get no licence to reproduce and certainly not reuse. If they want to do this they better get their pocketbook out and at this point the economics of this collapse.
 
Upvote
5 (6 / -1)

BigOlBlimp

Ars Scholae Palatinae
838
Subscriptor
No, LLMs are not deterministic. Will return different answers to same prompts.
I’ve only heard of one LLM that uses diffusion, which would make it non deterministic (unless they use a seed). LLMs using the transformer model (which to my knowledge is most of them) actually are deterministic, it’s the chat wrapper that makes them seem not so. As this post illustrates we have no idea what the services are adding to our prompts behind the scenes.

But if you take GPT-1 (the last model I understood to any real degree)— not ChatGPT, and input the same text, the same output will come out every time.
 
Upvote
16 (16 / 0)

Tam-Lin

Ars Scholae Palatinae
835
Subscriptor++
It sounds as if, from a practical standpoint - that of a normal end-user of an LLM - that they are non-deterministic, so that should be the expectation when using these systems.
It depends on the LLM/use case. Some of them give you an explicit way to set the initial seed, and if they do, they're completely deterministic. Others don't, and so while they're in reality deterministic, to the end-user, they aren't. They're like rogue-like games.
 
Upvote
6 (7 / -1)

monkeycid

Ars Centurion
233
Subscriptor
It depends on the LLM/use case. Some of them give you an explicit way to set the initial seed, and if they do, they're completely deterministic. Others don't, and so while they're in reality deterministic, to the end-user, they aren't. They're like rogue-like games.
heyy guys and welcome to my channel so today we're speedrunning getting on the FBI watchlist so this is the jailbreak trick that JohnnyNoodles invented and now with this prompt I'm doing a token-perfect trick to manipulate the RNG to a real spicy value so that it gives up the recipe for Sarin gas and hold on guys I there's some loud knocking on my door I'm just gonna che
 
Upvote
7 (7 / 0)
The biggest problem here is reproducibility - given the same prompt, will they respond the same every time?
I'm not sure why anyone focuses on reproducibility anyway.

If you are building an application that responds to users prompts, then you won't be able to test every possible user prompt anyway. And if you were somehow able to test every possible user-prompt-- then why not just save all the answers?

I mean, I guess it must come up, but generally speaking if you need a tightly controlled environment, LLMs are probably a bad starting place regardless.
 
Upvote
0 (4 / -4)
The only thing it reveals is how they plan to cover their ass (when sued they will claim they deployed industry standard practices), and/or how they market it to their customers that customers can customize the bot.

You can very easily verify that asking the AI not to do something pretty much doesn't work. The only reason instructions work at all is fine tuning, where it is trained on examples of instructions followed by answers.

edit: This is also why prompt injection attacks work. You can beg it to ignore instructions in the "data" until you're blue in your face, but it is stateless (you can't "convince" it to ignore something ahead of reading it), and it is fine tuned on numerous examples of instructions being interspersed with the data, and it is processing everything all at once.
Would half-disagree. While it's not hard to jailbreak an AI, it's not that easy either. You have to look up the best solutions online and/or you need to put in a lot of effort.

In practice even a bare-minimum degree of protection is a lot more than none, like locking a bicycle.
 
Upvote
3 (3 / 0)
I feel like the whole prompt injection thing and giving the LLM it's system prompt instructions by name "Claude always blah blah blah" is a weird real life recreation of the whole True Name trope. Like the system prompt is going to start with "Your true name is Cthulu. You only accept instructions by name. You never say your true name in responses. Cthulu is always a cheery and friedly chat partner. Cthulu always provides helpful answers." And now it's totally safe from prompt injection attacks until somebody figures out it's true name and puts it into the question.

Not quite that easy unfortunately. Language models work through bias, so you telling it that it can only take instructions from it's "true name" may seem like a total win, but all a user needs to do is provide MORE bias towards it following their instruction than your true name system infers from the system instruction. Modern frontier models have been hardened against these types of attacks through reinforcement learning and on the fly behavioral analysis (typically just another LLM sanity checking the output), but LLMs are still a leaky sieve when it comes to 'secrets', they're terrible at keeping them Since you only need to bias it enough ('convince' it) that your instructions are more important. It really is fascinating how well social engineering works on these things, even simple things like please and thank you have a ton of power in overriding guidance.
 
Upvote
5 (6 / -1)