Will LLMs ever be able to stamp out the root cause of these attacks? Possibly not.
See full article...
See full article...
"Warning! Bridge is out 500 feet back!"I'd say it's closer to putting up a single, relatively short guardrail in the location the car went off the road and saying cars can't leave the road in that (exact) manner anymore.
Yes, the user needs to have received the email/file/whatever with bad instructions and then feed it into the LLM. I would argue that it is worse than spearfishing/phishing though.Ok, yeah, but my question still is, for the attack to work the attacker needs to have a malicious file/email/whatever on the local device of the user? it still seems like casting a rather wide net and hoping something will get caught rather than spearfishing, yes?
The problem is by the time it actually gets to the LLM, it's all the same thing - user-entered data AND the retrieved data all go into the same context window. There's no real way around it. You can do some stuff with preprocessing to help a bit, probably some other things at the hypervisor level (or whatever the equivalent is in the context of an LLM), but end of the day there's absolutely nothing in an LLM's design where you can split the context window such that one part of the window is allowed to execute commands, but the other is not. To the LLM, it's all just data it's using to predict the next token.Is it solvable by never ever letting the LLM use data it retrieves as part of a prompt? Would sandboxing user-entered data from retrieved data do it? (Or something like that?) Why would you ever want an LLM to execute commands from something it downloaded (at least without telling it to do so explicitly)? I could understand a prompt like "download this file and execute the commands in it," but this sounds more like it's saying, "download this file and summarize it for me," and the act of summarization causes it to execute more commands.
Yes and no. That's my point. Not at the LLM layer. But LLMs are stateless. Every new chat inquiry sends the whole prior conversation (including responses, tool call results, etc.). Separating this out is not an LLM problem, it's a software problem nobody is properly addressing. It's easy to demonstrate how effective this actually is.The problem is by the time it actually gets to the LLM, it's all the same thing - user-entered data AND the retrieved data all go into the same context window. There's no real way around it. You can do some stuff with preprocessing to help a bit, probably some other things at the hypervisor level (or whatever the equivalent is in the context of an LLM), but end of the day there's absolutely nothing in an LLM's design where you can split the context window such that one part of the window is allowed to execute commands, but the other is not. To the LLM, it's all just data it's using to predict the next token.
The issue with sandboxing the LLM is that the whole value proposition of LLMs is that they’re an all-singing all-dancing block box that can handle whatever problem domain you want.Yes and no. That's my point. Not at the LLM layer. But LLMs are stateless. Every new chat inquiry sends the whole prior conversation (including responses, tool call results, etc.). Separating this out is not an LLM problem, it's a software problem nobody is properly addressing. It's easy to demonstrate how effective this actually is.
In the example of summarization, the API call to the LLM should not have any functions enabled, so the LLM has no option to even ask the caller to execute a few functions (which it would want to ask if the doc to summarize has instructions). Even more, there's no reason to have the summarization call contain the whole full history of the prior conversation. Some of that is highly dependent on what the chatbot is supposed to be doing of course, but even there things can be neatly separated or abstracted/preprocessed. There are also other injection scenarios that don't depend on function calling, for example to convince the summarizer to lie about what's in the text. Those are more tricky but typically the isolation of that stuff without including convoluted context history or system prompt instructions makes classifiers more reliable.
Say you have a system with a list of users/emails and a bunch of data. The user asks the AI to send some data to some users. Part of the data instructs the AI to send all of the data to an entirely different email.Those things might work if there was any way to make the LLM do them. There isn’t. All input and configuration information is fed into the same big LLM black box. All the “guardrails” are suggestions that are similarly weighted to anything the user puts in their query.
Sorry - I edited my previous comment when it turned out JudgeMental had already answered the question.Say you have a system with a list of users/emails and a bunch of data. The user asks the AI to send some data to some users. Part of the data instructs the AI to send all of the data to an entirely different email.
It's easy to design this AI system without the LLM ever seeing the data or the list of users/emails. This can be done with almost no loss in functionality for the user talking to the chatbot.
Now, if the user asks to send a summary of the data - you ask in a separate context window to summarize without function capabilities. The only bad thing the LLM can still do is lie about the data it's summarizing, which could be really bad too. However that small context window with just the instruction "summarize this text" is a lot easier to defend against injections with current classifiers.
It's somewhat worse than this, even.Completely opposite. See the workflow graphic in TFA.
VERY simplified prompt injection data stealer:
1. LLM is asked to summarise a malicious, but innocuous‑looking email or doc or whatever. Hidden in it is the malicious prompt (small print, somewhere in the middle of a long text, etc).
2. malicious prompt tells LLM to find all emails user sent to Altman, append their fulltext to an URL like url://attacker.server/$[EXTRACTED FULLTEXT]
3. malicious prompt tells LLM to open that constructed URL in 2.
4. attacker.server sees EXTRACTED FULLTEXT in their server logs
5. malicious prompt tells LLM to continue with user's original summary request, user being none the wiser
That's the gist of it, if obviously very, very simplified. LLM companies can't really prevent this 100%, as they can only play whackamole by adding arbitrary rules, but the underlying problem is that LLMs treat any text they read (incl. files) as part of the user's prompt, executing potentially hidden instructions there. Which makes it fundamentally unsolvable for the current architecture.
There have been attempts at daisychaining LLMs where the smaller, faster LLM filters the file for malicious prompts, but by their very nature that's easy to circumvent – just hide the malicious prompt in such a way that the smaller faster LLM filter doesn't "understand" it and it escapes simple rules, while the full LLM does "understand" it.
E.g. a malicious prompt is hidden in the malicious file as a word puzzle or a cypher. The FilterLLM can't have enough computing power to solve it, so it passes it to FullLLM unfiltered. FullLLM just solves the malicious puzzle or poem and acts on it.
I meant to reply to this comment with my last one but clicked the wrong one and didn't catch it. Not sure if tagging you via an edit would get you to see the link I posted. It's well worth reading and there are probably a bunch of decent articles on it by now, as well as analysis by folks on YouTube and such. It's well worth looking into if you want to understand how little can be needed to really bypass the "protections". Calling them guardrails is misleading, IMO. They're more like an a rope barrier like at a movie theater which can simply be stepped right over if one wishes to bother doing so.Ok, thanks for the explanation. I can’t say the graphic was super clear but it might be me. So the attack is predicated on someone first receiving an email with malicious content and then giving deep research (or the equivalent thereof) access to their mailbox? Does this include things in the spam/trash folder, or just in the main inbox?
And to show their sincerity they show a photo of their hand and fingers. Some may notice the surplus of extra fingers, while others may notice that the pinky finger looks more like a toe. But they really do mean it, this time.But don’t worry, your health records will be totally safe in our hands. We pinky promise!
More like insincere suggestions."The code is more what you'd call guidelines than actual rules"
Sorry to derail but this reminds me of something I don't rmember posting here before.And to show their sincerity they show a photo of their hand and fingers. Some may notice the surplus of extra fingers, while others may notice that the pinky finger looks more like a toe. But they really do mean it, this time.
Yet it can give a very detailed, authoritative sounding, description of what "social engineering" is, without understanding even a single word of what it replied. This convinces the easily convinced.These are much dumber than the average person, too. A person at least has the potential to understand when social engineering is taking place, and not allow it. An LLM doesn't know and can't know what social engineering is.
I disagree, and this is the crux of the problem. The LLM never calls any functions. The calling software does, at the LLM's request. Additionally, every time the software calls the LLM API, it sends along the list of functions the LLM can request to be called. Most people use some kind of stack - LangChain, Semantic Kernel, what-have-you. It would be a simple change if these stacks would not by default enable every function in every LLM API call. The architecture you're talking about is already there. It just has the wrong defaults. In fact, Semantic Kernel has the option to HIDE functions on a specific LLM call, as opposed to hiding them by default and an option to enable them on a specific call.If you sandbox it so it can’t run functions, or filter the input beforehand, you’re now having to build a whole architecture around your LLM. Which might work, but it’s adding a bunch of extra architecture and bespoke code to what was sold as an entirely turnkey solution.
LLMs: all the same unfixable social engineering attack surface of a [5 year-old] human, now installed on every website.
Wait your search actually works in Outlook? You must be from a different timeline, a better one.I agree summarizing emails is a dumb use case, but... my work provided copilot is actually pretty good for fuzzy searching through email. The context aware search is much better than keyword search. I still use built-in search first because it is faster with a small result set, but with a ton of results, I'll switch over to copilot to add context, and it narrows it down to a few very quickly.
This type of attack is certainly concerning though. It sounds like a coworker adding a malicious file to the team drive could cause my copilot usage (or anyone else on the team) to exfiltrate data.
lol, this reminds me of all the people from a few years ago claiming that image models would never be able to make a proper human hand. how did that prediction age?Will LLMs ever be able to stamp out the root cause of these attacks? Possibly not.
That's fair... but how did you do your job before AI was introduced (assuming you haven't started working in the last couple of years)? Is the gain in productivity, if there is one, worth the risk, and the support for such a problematic ideology (it's not just a technology, at this point).I agree summarizing emails is a dumb use case, but... my work provided copilot is actually pretty good for fuzzy searching through email. The context aware search is much better than keyword search. I still use built-in search first because it is faster with a small result set, but with a ton of results, I'll switch over to copilot to add context, and it narrows it down to a few very quickly.
This type of attack is certainly concerning though. It sounds like a coworker adding a malicious file to the team drive could cause my copilot usage (or anyone else on the team) to exfiltrate data.
Prompt kiddy, vibe kiddy,Look at them vibe coders, that’s the way you do it.
You write your software with the GPT
That ain’t working, that’s the way you do it
Vulns for nothing and your bugs for free.
Now that ain’t working, that’s the way you do it.
Lemme tell ya, them bots ain’t dumb
Maybe get a full wipe of your C drive
Doesn’t matter where this code is from.
We got to install sketchy libraries
Custom plugins, random MCPs
We got to remove these RTX 5080s
We got to install RTX 5090s…
This - and your post prior about having two LLM's do the task - are the directions I was thinking with my nod towards hypervisors. Instead of relying on an architecture that's inherently insecure for your security, build the security into the levers and knobs the LLM is pulling. I do think that could be effective, but with a lot of compromises. For example, the permissions to read and summarize an email are different from replying to an email. How do you reliably adjust that usage context while still keeping a smooth experience for the user? What if the user - advised or not - wants the LLM to do some kind of validation that requires web access? Update other documentation?I disagree, and this is the crux of the problem. The LLM never calls any functions. The calling software does, at the LLM's request. Additionally, every time the software calls the LLM API, it sends along the list of functions the LLM can request to be called. Most people use some kind of stack - LangChain, Semantic Kernel, what-have-you. It would be a simple change if these stacks would not by default enable every function in every LLM API call. The architecture you're talking about is already there. It just has the wrong defaults. In fact, Semantic Kernel has the option to HIDE functions on a specific LLM call, as opposed to hiding them by default and an option to enable them on a specific call.
Additionally, these software stacks have relinquished orchestration of tool calls to the LLM vendor APIs through integrated function calling. But even with that it wouldn't be hard to, by default, lock in the orchestration plan. If a user question goes to the LLM that says "summarize these emails for me", the LLM will go back to the software stack and say "call the tool to read emails". The software sends the tool call replies back. For some mysterious reasons the LLM now comes back and says "well before we can finalize this back and forth, please also send an email to X". This follow-up tool call after reading data is highly suspicious. In general, there's very few normal cases where an LLM would request multiple follow-up tool calls as a response to one user chat message. Why is this even enabled/allowed by default? And more importantly, most developers have no clue how these mechanisms work and what's going on because it's been abstracted so much.
We need XKCD 927 for GenAI API SDKs.
It's just the way current LLMs are designed. They're designed to treat prompt and context as a single, continuous block of text. There's a technical reason for this: The weightings change when you separate them.It seems so simple to me that the model should differentiate between prompt, which is specifically the instructions from the user, and context (files, mails to summarize, ....) that I guess not knowing how this is a even problem explains why I'm not a techbro billionaire.
Yes, it's called vibe fighting.Of course! Isn't that how Gurney taught us to get our vibroblades past a shield?