AI-generated code could be a disaster for the software supply chain. Here’s why.

sarusa

Ars Praefectus
3,274
Subscriptor++
The findings are the latest to demonstrate the inherent untrustworthiness of LLM output. With Microsoft CTO Kevin Scott predicting that 95 percent of code will be AI-generated within five years, here’s hoping developers heed the message.
Well that certainly helps explain why Windows 11 has completely gone to shit.

But the jury seems to still be out on whether it's even possible to design an LLM that never hallucinates.
It's not possible. LLMs hallucinate by design. They are stochastic parrots spewing back tokens they have seen in the vicinity of similar tokens, with probabilistic weights. When you train one on trillions of (stolen) documents, it tends to spew back coherent-sounding things because they're stolen from previous documents. But as soon as it starts mixing and matching, the risk of bullshit skyrockets. Again, there is zero thinking going on; it's just 'I saw these tokens near these tokens in a couple of documents before'. This is why the wrong package names here are not random: it tends to hallucinate similar wrong ones repeatedly.

The best you can do at this point is have the LLM 'watch' its own output and attempt to cross-check it, which does somewhat work, except the checker attention head is just as prone to bullshit as the original one, so you need at least three to 'vote' on it, which of course skyrockets the energy cost and still doesn't guarantee anything. I have, when playing around with OpenAI (know the enemy), told it it was wrong about something it was right on just to see what it would do, and it completely accepted that it was wrong and rewired everything to justify that. Claude does the same thing.
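The 'vote' idea is simple enough to sketch. A toy illustration (not any vendor's actual mechanism), assuming each checker returns an independent yes/no verdict:

```python
from collections import Counter

# Toy majority vote over several checkers' verdicts. This only
# reduces the error rate if the checkers' errors are uncorrelated --
# which, per the point above, the checker heads often aren't.
def majority_vote(verdicts: list[bool]) -> bool:
    counts = Counter(verdicts)
    return counts[True] > counts[False]

print(majority_vote([True, True, False]))   # True: two of three agree
```

And of course each extra voter is another full inference pass, which is where the energy cost multiplies.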

Now, you know what LLMs are really good at? Writing malware! Because it's fine if it only works 50% of the time if you can test it, keep the ones that work, then get it out NOW. China has really ramped up on this.
 
Upvote
9 (10 / -1)

el_oscuro

Ars Praefectus
3,179
Subscriptor++
It is an industry-wide problem. These LLMs all do it; the biggest ones are bad about it, and the smaller ones are slightly worse.

When the creators of these LLMs and the “experts” are saying “well, we can’t really say why it does what it does, we don’t really understand it,” that’s the big red warning sign that we shouldn’t be depending on them for anything.
It seems like this old XKCD is still applicable.
https://xkcd.com/2030/
 
Upvote
7 (7 / 0)

Dmytry

Ars Legatus Legionis
11,497
The best you can do at this point is have the LLM 'watch' its own output and attempt to cross-check it, which does somewhat work, except the checker attention head is just as prone to bullshit as the original one, so you need at least three to 'vote' on it, which of course skyrockets the energy cost and still doesn't guarantee anything. I have, when playing around with OpenAI (know the enemy), told it it was wrong about something it was right on just to see what it would do, and it completely accepted that it was wrong and rewired everything to justify that. Claude does the same thing.
I think for the package names you can probably have a white listed set of modules that exist at the training cut off date and just filter all generated import statements. Simple and stupid.

Except people now expect it to ingest their codebase and guess calls to modules that are internal or past the training cut off, so it would break that.
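The filter itself is only a few lines. A minimal sketch using Python's stdlib `ast`, where `KNOWN_PACKAGES` is a tiny placeholder for a real snapshot of the index at the training cutoff:

```python
import ast

# Hypothetical allowlist: top-level package names known to exist at
# the model's training cutoff (illustrative subset, not a real list).
KNOWN_PACKAGES = {"requests", "numpy", "flask"}

def unknown_imports(source: str) -> list[str]:
    """Return top-level imported names not on the allowlist."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return sorted(names - KNOWN_PACKAGES)

print(unknown_imports("import requests\nimport totally_made_up_pkg"))
```

Which also shows exactly where it breaks: an internal module or anything published after the cutoff gets flagged right alongside the hallucinations.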

As far as stochastic parrotry goes, AI industry's answer to that argument is to do a sort of haruspicy on the neural network's internals. It could be plagiarizing a piece of text verbatim, but the internal mess is always complicated and that supports the point that it was thinking up the plagiarized text on its own. Or something.

Now, you know what LLMs are really good at? Writing malware! Because it's fine if it only works 50% of the time if you can test it, keep the ones that work, then get it out NOW. China has really ramped up on this.
Yeah, that's a great point.

Well, it also applies to some marginal uses of programming - e.g. displaying a graph of something with matplotlib, a lot of code for that is essentially throwaway and as long as it looks about right nobody cares.
 
Upvote
1 (1 / 0)

Fatesrider

Ars Legatus Legionis
25,296
Subscriptor
I find it rather ironic that, after climbing out of the cesspool of ignorance and achieving a certain level of intelligence, mankind is confidently striving toward extinction by creating a machine he thinks is smarter than he is and elevating it to insane levels of dependency, while ignoring the fact that it's actually stupider.

I mean, look at how many stupid fucks voted for Trump. If that wasn't a mass appeal on the part of the human race to "end it all" from sheer idiocy, I don't know what else you'd call it.

AI is just a symptom of the issue. The problem isn't, and never has been, AI. The problem is that people are too fucking stupid to exist for long before that stupidity ends us as a species.
 
Upvote
1 (3 / -2)

Zncon

Smack-Fu Master, in training
90
Subscriptor
Oh, that's fascinating!

Best quote I heard on the subject was "Why are we using AI to create new problems instead of solving old problems?" and that, of course, is the heart of the matter. LLMs do not solve old problems.

I was wondering how the heck do you detect hallucinations, but I did not at all think of package names as an attack vector. How remarkably insidious! Of course, this has always been a problem with people dropping package names with typos and just waiting for someone to bite, but now your code copilot brings the exploit to you!

I wouldn't even know where I'd start with coding today, since you apparently need to understand supply chain first.
To your point about solving problems, they actually do. Just not to the extent they're being hyped to.

They're a very useful tool for information discovery, because old-style search engines have been on their deathbed for years. SEO is making search engines useless, so it's good that LLMs came along when they did.
 
Last edited:
Upvote
4 (4 / 0)

hillspuck

Ars Scholae Palatinae
2,179
Can we stop echoing marketing-speak like "hallucinations" or "misalignment" and just call it what it is - "garbage data"?
Can we stop being prescriptive about language because even if you had a good point it never actually works?

I don't think the term "hallucination" is doing marketing any favors. Do you usually consider someone who is hallucinating to be a reliable source of information, or a person you would want to entrust with a task? I would argue that someone who is simply wrong/incompetent is actually a more trustworthy person than someone who is hallucinating.
 
Upvote
5 (5 / 0)
LLMs make great helpers for searching obtuse documentation but they're all too happy to regurgitate someone else's Stack Overflow solution which won't be designed for your specific circumstances unless your cases are super generic.

Don't let them write your code, but don't be afraid to use them to find stuff for you.
I usually say that Copilot gets me 90% of the way there, but that 90% of the way was absolutely the grunt work that took me most of the time in the past. That last 10% was the part that was novel to the project anyway, so it basically let me concentrate on the actual problem instead of wasting time on that preparatory nonsense.

If you're not debugging before you ship, it doesn't matter if it's human or AI generated.
 
Upvote
-1 (1 / -2)
To your point about solving problems, they actually do. Just not to the extent they're being hyped to.

They're a very useful tool for information discovery, because old-style search engines have been on their deathbed for years. SEO is making search engines useless, so it's good that LLMs came along when they did.
When it comes to the Copilots and GPTs of the world, the old "trust but verify" saying is very applicable.
 
Upvote
5 (5 / 0)
As an aside though, I get a kick out of people who like to very loudly let the world know that they consider LLMs and similar systems to "not be thinking". As if that makes them any less useful.

Though, from my POV, we're quickly entering the P-Zombie realm of AI. At which point, at least from my viewpoint, it's neither here nor there.
 
Upvote
-7 (0 / -7)
I suspect code-support agents will eventually be modded with a fair bit of pre- and post-processing code which attempts to avoid this type of blunder. It will be whack-a-mole, but some classes of problem like this can be mostly solved, especially with other tools like a curated list of allowed libs.
Except LLMs ignore hard-coded instructions at a rate >1% in my experience.
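A check that runs outside the model can't be "ignored" the way prompted instructions can. A rough sketch, where `registry_has` is a stand-in for a real index lookup (and mere existence isn't enough once a slopsquatted name has actually been uploaded, so a real gate would also weigh package age and provenance):

```python
# Post-processing gate that vets generated dependencies outside the
# model, so it can't be skipped the way prompt instructions can be.
# `registry_has` is a placeholder; a real check would query the
# package index rather than a hard-coded set.
def registry_has(name: str) -> bool:
    known = {"requests", "numpy"}   # placeholder for a real lookup
    return name in known

def vet_dependencies(deps: list[str]) -> tuple[list[str], list[str]]:
    """Split generated deps into (installable, needs-human-review)."""
    ok = [d for d in deps if registry_has(d)]
    suspect = [d for d in deps if not registry_has(d)]
    return ok, suspect

ok, suspect = vet_dependencies(["requests", "graph-transpose-utils"])
print(suspect)   # the hypothetical hallucinated name gets quarantined
```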
 
Upvote
5 (5 / 0)
Can we stop being prescriptive about language because even if you had a good point it never actually works?

I don't think the term "hallucination" is doing marketing any favors. Do you usually consider someone who is hallucinating to be a reliable source of information, or a person you would want to entrust with a task? I would argue that someone who is simply wrong/incompetent is actually a more trustworthy person than someone who is hallucinating.
It personifies an algorithm to the point that people read everything that isn't a hallucination as reasoning and thinking. Which is dangerous, and is exactly why the term needs to be eliminated.
 
Upvote
8 (10 / -2)

VividVerism

Ars Tribunus Angusticlavius
8,640
It doesn't; people hit run, get an error, and edit it out. Newer AI may itself take several passes and edit it out in the end.

Except when a malicious party runs the AI and sees it generate a plausible package name. They can then upload a malicious package with that name. From that point onward, when people (or AIs) try to run generated code that uses the same made-up name, they install that package, which runs the installer script and potentially compromises their system, or worse yet their customers' systems.
The really insidious miscreants will make their slopsquatted package actually do what it says on the tin in addition to their intended mischief. So the code may even work and leave the developer none the wiser about the malware that hitched a ride with the mostly functional package.
 
Upvote
12 (12 / 0)

graylshaped

Ars Legatus Legionis
68,227
Subscriptor++
Oh, that's fascinating!

Best quote I heard on the subject was "Why are we using AI to create new problems instead of solving old problems?" and that, of course, is the heart of the matter. LLMs do not solve old problems.

I was wondering how the heck do you detect hallucinations, but I did not at all think of package names as an attack vector. How remarkably insidious! Of course, this has always been a problem with people dropping package names with typos and just waiting for someone to bite, but now your code copilot brings the exploit to you!

I wouldn't even know where I'd start with coding today, since you apparently need to understand supply chain first.
Throughout history, the one class that has always prospered is the one that mastered schlepping things efficiently.
 
Upvote
1 (1 / 0)
D

Deleted member 192806

Guest
As an aside though, I get a kick out of people who like to very loudly let the world know that they consider LLMs and similar systems to "not be thinking". As if that makes them any less useful.

Though, from my POV, we're quickly entering the P-Zombie realm of AI. At which point, at least from my viewpoint, it's neither here nor there.
Haven't reached the uncanny valley stage yet.
 
Upvote
2 (2 / 0)

Dmytry

Ars Legatus Legionis
11,497
The really insidious miscreants will make their slopsquatted package actually do what it says on the tin in addition to their intended mischief. So the code may even work and leave the developer none the wiser about the malware that hitched a ride with the mostly functional package.
May actually be quite easy to do, when the functionality is just one or two method calls.

It's kind of like typosquatting on steroids - the package name doesn't need to sound like any existing package, and there's a set of bullshitted functionality to go with it.
 
Upvote
6 (6 / 0)

cerberusTI

Ars Tribunus Angusticlavius
7,194
Subscriptor++
It is hard to get worked up about this.

I long ago basically just banned external code repositories as the quality mostly seemed to be student level work. If someone wants to do that, they need to defend it in code review. That turned out to be basically never worth it, and if you make that clear ahead of time nobody even really wants to seriously try (real standards already mostly cover this to the degree you would want.)

The maybe 30% success rate for the AI is basically fucking amazing. I started giving them their $200 a month when their C and ASM solutions from o3 were maybe only 15% working but also 5% something I would not have thought of myself, and otherwise frequently helped clarify the goals to begin with.

You can do what you like with the thousand dimensional prism refracting human knowledge, but I mostly just see the application of lensing, and a desire to build a better one upon a better foundation.
 
Upvote
1 (1 / 0)

42Kodiak42

Ars Scholae Palatinae
1,439
One of the things that makes package hallucinations potentially useful in supply-chain attacks is that 43 percent of package hallucinations were repeated over 10 queries. “In addition,” the researchers wrote, “58 percent of the time, a hallucinated package is repeated more than once in 10 iterations, which shows that the majority of hallucinations are not simply random errors, but a repeatable phenomenon that persists across multiple iterations. This is significant because a persistent hallucination is more valuable for malicious actors looking to exploit this vulnerability and makes the hallucination attack vector a more viable threat.”

This is bad, this is really bad.
At first glance, this makes it apparent that attackers can use an LLM's hallucinations to gauge, at least in relative terms, how many targets might be impacted by their attacks.
And it gets even worse when you consider the fact that attackers can use a specific coding request as input:
An attacker who wants to take a stab at image editing software can ask the LLM to "Write me a function that can transpose points in one arbitrary quadrilateral to points in a target quadrilateral" and take note of the hallucinated packages.
Or, if they want to take a stab at radar software: "Write me a function that can break down a complex waveform into its real and imaginary components" and use the hallucinated packages there.

This goes beyond repeatability and finding common targets, but being able to pick and choose targets based on the software queries you'd expect from them.
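The persistence measurement itself is trivial for an attacker to reproduce. A sketch, with `runs` standing in for the hallucinated dependency names pulled out of ten identical queries:

```python
from collections import Counter

# Sketch of the persistence check described in the article: re-run
# the same prompt and keep the nonexistent package names that recur.
# A name that comes back repeatedly is the one worth slopsquatting.
def persistent_hallucinations(samples: list[list[str]], min_repeats: int = 2) -> set[str]:
    counts = Counter(name for deps in samples for name in set(deps))
    return {name for name, n in counts.items() if n >= min_repeats}

# Simulated hallucinated deps from 10 identical queries; the package
# names are made up for illustration.
runs = [["fake-quad-warp"]] * 6 + [["one-off-pkg"]] + [[]] * 3
print(persistent_hallucinations(runs))   # only the recurring name survives
```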
 
Upvote
4 (4 / 0)

hillspuck

Ars Scholae Palatinae
2,179
It personifies an algorithm to the point that people read everything that isn't a hallucination as reasoning and thinking. Which is dangerous, and is exactly why the term needs to be eliminated.
But it personifies algorithms as something other than an infallible thinking machine that must be right, because it's not prone to human failures.

This isn't dangerous at all. It's basically making people think that computers shouldn't be trusted just because they're computers. I would find that kind of thinking to be far more dangerous.
 
Upvote
-3 (1 / -4)

42Kodiak42

Ars Scholae Palatinae
1,439
May actually be quite easy to do, when the functionality is just one or two method calls.

It's kind of like typosquatting on steroids - the package name doesn't need to sound like any existing package, and there's a set of bullshitted functionality to go with it.
It might not be quite as easy as you're suggesting, but I still think you're exactly right. Unless the package is claiming to do something practically impossible, a programmer capable of writing the malicious code is more than capable of writing the actual code to accomplish the intended goal.

This is already how a lot of attacks on Open Source Software exist: If you want to get a malicious change into a project, no reviewers are going to accept the change if it doesn't do what it's supposed to do, even if you hide its malicious purpose successfully.
 
Upvote
2 (2 / 0)
But it personifies algorithms as something other than an infallible thinking machine that must be right, because it's not prone to human failures.

This isn't dangerous at all. It's basically making people think that computers shouldn't be trusted just because they're computers. I would find that kind of thinking to be far more dangerous.
No, if that was the case then they would call it what it truly is - statistical errors. There's a reason the LLM companies chose the term "hallucination", and it isn't because they want to point out the fact that computers shouldn't be trusted. These are the same companies that are calling their new models "reasoning models".

The fact that it personifies them at all is a huge cultural problem. This is why there are now "agentic" models coming out. This is why companies are even thinking that these agentic models can replace developers and other things (a company I work for is currently exploring this sad excuse for productivity). The LLM companies are influencing high-level decisions by choice of words, and unless you correct those terms, people who are completely ignorant of the tech will personify the algorithms to the point that they trust them the same way they trust people who just make mistakes sometimes.

Even on Ars, whenever there's a PR fluff piece about OpenAI features and commenters bring up LLMs making mistakes, the common argument for the LLMs is "well, people make mistakes too!" It's a branding term, and it's inappropriate and dangerous.
 
Upvote
4 (6 / -2)

hillspuck

Ars Scholae Palatinae
2,179
No, if that was the case then they would call it what it truly is - statistical errors. There's a reason the LLM companies chose the term "hallucination", and it isn't because they want to point out the fact that computers shouldn't be trusted. These are the same companies that are calling their new models "reasoning models".

The fact that it personifies them at all is a huge cultural problem. This is why there are now "agentic" models coming out. This is why companies are even thinking that these agentic models can replace developers and other things (a company I work for is currently exploring this sad excuse for productivity). The LLM companies are influencing high-level decisions by choice of words, and unless you correct those terms, people who are completely ignorant of the tech will personify the algorithms to the point that they trust them the same way they trust people who just make mistakes sometimes.

Even on Ars, whenever there's a PR fluff piece about OpenAI features and commenters bring up LLMs making mistakes, the common argument for the LLMs is "well, people make mistakes too!" It's a branding term, and it's inappropriate and dangerous.
This seems more of a you (and people like you) thing.

There's a pretty common saying along the lines of "computers don't make mistakes". It's already ingrained in people's thinking. All those horses have not only left the barn but are completely out of sight.

I stand by it: saying that the computer is hallucinating is actually a great thing.

But in any case, it's a fool's errand to try to police language. Apart from deciding that we shouldn't use derogatory language, it's never really worked. And even then it wasn't specific language policing but the culture as a whole.
 
Upvote
2 (2 / 0)
This seems more of a you (and people like you) thing.

There's a pretty common saying along the lines of "computers don't make mistakes". It's already ingrained in people's thinking. All those horses have not only left the barn but are completely out of sight.

I stand by it: saying that the computer is hallucinating is actually a great thing.

But in any case, it's a fool's errand to try to police language. Apart from deciding that we shouldn't use derogatory language, it's never really worked. And even then it wasn't specific language policing but the culture as a whole.

That's a pretty poor argument: because people already have a tendency to believe computers are infallible, we should apparently accept additional language that not only encourages that belief, but casts them as actual reasoning beings that can think and analyze and, yes, also "hallucinate", because they make errors just like humans.

Hallucinations are a PR term created by the companies that benefit the most from perpetuating the belief.
 
Upvote
1 (3 / -2)

42Kodiak42

Ars Scholae Palatinae
1,439
Any car can get in an accident; that doesn't mean that if VW has a safety issue we can brand all cars as unsafe.

Sure, there's also potential for hallucinations, but it varies based on the model, the model's guardrails, grounded data, the prompt, and other mechanisms such as RAG.
The difference is at what type of usage these things become unsafe and unreliable, and the extent to which they are unreliable.

To quote the article:
The study, which used 16 of the most widely used large language models to generate 576,000 code samples, found that 440,000 of the package dependencies they contained were “hallucinated,” meaning they were non-existent.
Imagine if out of 576,000 car trips across a variety of manufacturers, they counted 440,000 instances of the car breaking down mid-trip and requiring fixing. Not because any one manufacturer was just complete fucking shit at their job, but because steering wheels just start spinning sometimes and nobody understands why.

We can't give you a good, sound estimate on the comparable hallucination rates between LLMs because the conditions and prompts which are likely to cause hallucinations vary from LLM to LLM, and your activities cannot be assumed to reflect the sample pool.

In practice, we cannot tell you the odds of ChatGPT or Gemini hallucinating in their responses to your prompts, or even the relative error rates between two LLMs. We cannot tell you how likely things are to go wrong for you. Benchmarks will not predict practical success and failure rates, only indicate the prevalence of an ongoing and unsolved problem. What we can tell you is that dependency hallucination is a common occurrence that you need to account for, and an unsolved problem for all LLMs.
 
Upvote
7 (7 / 0)

hillspuck

Ars Scholae Palatinae
2,179
That's a pretty poor argument: because people already have a tendency to believe computers are infallible, we should apparently accept additional language that not only encourages that belief, but casts them as actual reasoning beings that can think and analyze and, yes, also "hallucinate", because they make errors just like humans.
You think hallucinating is just "making errors"? Talk about a poor argument. Hallucinations are a thing that can get you committed. That's not just an "oopsie" level. That's a "this person probably shouldn't be trusted with any sort of responsibility" category.

Our most fundamental level of judging whether a person is capable of looking after themselves and others is if they are living in the same shared reality. And not just whether they agree with our politics or whether or not we landed on the moon. Rather, whether it's okay to drive on the opposite side of the freeway because you got a great idea that it would be more efficient.


(I want to note here that I don't agree with society's attitudes towards mental illness, specifically those who experience hallucinations. It's a harmful stereotype that they are a danger to others, or that they cannot live a normal life. But since we're talking about language here, I have to recognize the reality of the way people use that language today.)
 
Upvote
1 (1 / 0)
You think hallucinating is just "making errors"? Talk about a poor argument. Hallucinations are a thing that can get you committed. That's not just an "oopsie" level. That's a "this person probably shouldn't be trusted with any sort of responsibility" category.

Our most fundamental level of judging whether a person is capable of looking after themselves and others is if they are living in the same shared reality. And not just whether they agree with our politics or whether or not we landed on the moon. Rather, whether it's okay to drive on the opposite side of the freeway because you got a great idea that it would be more efficient.


(I want to note here that I don't agree with society's attitudes towards mental illness, specifically those who experience hallucinations. It's a harmful stereotype that they are a danger to others, or that they cannot live a normal life. But since we're talking about language here, I have to recognize the reality of the way people use that language today.)


So you have no problem with false advertising? Tesla calling their driving assistance "Full Self-Driving" isn't deceptive? Words matter; terms matter. Which is why it matters when we allow companies to dictate terms that give impressions of services that frankly don't exist.

I have never met anyone who actually associates LLM hallucinations with the same negative context that human hallucinations occur in. In fact, they use it in the context of the model making a human-level mistake.
 
Upvote
-4 (0 / -4)

hillspuck

Ars Scholae Palatinae
2,179
So you have no problem with false advertising? Tesla calling their driving assistance as Full Self Driving isn't deceptive?
A bit of a strawman you are building there.

Which is why when we allow companies to dictate terms that give impressions of services that frankly don't exist.
Your very premise on its origins is faulty.
https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)#Origin

It's not corporate speak; it's geek-speak.

I have never met anyone who actually associates LLM hallucinations with the same negative context that human hallucinations occur in. In fact, they use it in the context of the model making a human-level mistake.
Maybe that's because you were too busy telling people not to use words you don't like rather than listening to just how derisively people were using it.

In any case, hi there. Nice to meet you.

You know who I haven't met? Anyone who thinks AIs use reasoning that got there because of the term "hallucinations". Rather, all the ones I've met got there because LLMs are just really good at creating output that tricks humans into thinking there's more behind the curtain than a fancy autocomplete.
 
Upvote
4 (4 / 0)
AI-generated computer code is rife with references to non-existent third-party libraries, creating a golden opportunity for supply-chain attacks that poison legitimate programs with malicious packages ...
You mean like the junior-level code backed with senior-sounding loudmouth keyword-dropping like "test coverage" and "agile" ... code that you only casually audit and, in minutes, find their package.json including libraries that don't even do what the developer thought they did?

It was a library for what they thought was a web cache; they wired it right into their .use() .. and left it there, never even noticed their API wasn't caching at all. Confronted? Replied ink-cloud style with emojis, and then eventually flamed out with something about how interviews that ask hard coding questions and don't allow consulting an AI to provide "business value" are unfair.

No lie. Considers themselves a thought leader. Was apparently just re-Tweeting the recently aired tech-bro zeitgeist to our faces in meetings.

Their stuff would break or not work the way they expected and [thing] was "just being weird."

Then you go and find blocking awaits everywhere to explain the "weird."

And this is before they started humping AI as a way to skip all the inconvenient learning and come out ahead of actual senior developers.

It's only a matter of time before that type of "developer" is the one interviewing.
 
Upvote
-3 (0 / -3)

silverboy

Ars Tribunus Militum
2,110
Subscriptor++
Oh, that's fascinating!

Best quote I heard on the subject was "Why are we using AI to create new problems instead of solving old problems?" and that, of course, is the heart of the matter. LLMs do not solve old problems.

I was wondering how the heck do you detect hallucinations, but I did not at all think of package names as an attack vector. How remarkably insidious! Of course, this has always been a problem with people dropping package names with typos and just waiting for someone to bite, but now your code copilot brings the exploit to you!

I wouldn't even know where I'd start with coding today, since you apparently need to understand supply chain first.
We can reuse the old joke about regex here:

"You have a problem to solve with code. You use an LLM to write it. Now you have two problems."
 
Upvote
3 (3 / 0)

silverboy

Ars Tribunus Militum
2,110
Subscriptor++
Lying is a big word and projects too much intelligence on these models. It would be better applied to the people selling them.
I'd say lying is a small word, and what's more, it applies well, because intelligence (really being used here as a substitute for self-awareness) is irrelevant. It is telling us things that are factually false as though they are true. If that's not lying, then the definition of lying is meaningless.
 
Upvote
1 (1 / 0)

silverboy

Ars Tribunus Militum
2,110
Subscriptor++
To take your point further, there are no such things as hallucinations. All LLM output is a statistically chosen best guess. Some of those guesses happen to be correct, some incorrect, but there is no difference beyond that.
Factually true but misleading. LLMs are intended to act like us, and so if they do things that would be hallucinations coming from us, or lies, then to all intents and purposes they are hallucinating or lying.

It's nice to remember that they can't actually think, that they are just creating a web of floating-point numbers and so on, but that doesn't diminish the human-like flaws that result, nor does it negate the value to us humans of treating them the way we would treat those human flaws.

Saying that it's all statistics doesn't make the answers better. Calling them hallucinations and lies makes us not trust them, or not ask in the first place. That is where we need to be.
 
Upvote
-2 (1 / -3)
I think all F/OSS projects should put an anti-cheat policy in place. No LLM generated or even LLM-assisted code contributions.
I've seen generative AI slop swamp communities and make it nearly impossible for the quality assurance and content safety groups to keep up. The only viable solution right now is simply to close the floodgates.
How is this going to solve anything, though?

If a contribution is actually good, and happened to be assisted in some way by an LLM, then blocking that contribution isn't helping the project. If someone just doesn't care and is going to try to submit LLM-generated slop, then is this the kind of person who's really going to pay much attention to policy in general?

It seems the main problem is primarily that LLMs make it easy to generate a lot of slop. Unfortunately, there isn't really any solution to submission volume other than to close the proverbial floodgates and require some level of vetting for the people submitting code.

Which, honestly, seems like the obvious solution. Sure, your first submission or 3 might be hard to get through the door, but if someone is going to seriously take the time to contribute to open source projects, they'll probably be doing a lot more than one or two submissions. All you need is a centralized registry where contributors earn reputation points of some form, and that's a pretty well solved problem.
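That reputation gate can be sketched in a few lines; the class, threshold, and names here are made up purely for illustration:

```python
# Toy sketch of the reputation gate described above: the first few
# submissions from an unknown contributor get full review; once
# enough have been accepted, they earn the fast path. The threshold
# is an arbitrary illustrative choice.
REVIEW_THRESHOLD = 3

class ContributorRegistry:
    def __init__(self):
        self.accepted = {}   # contributor name -> accepted submission count

    def needs_full_review(self, contributor: str) -> bool:
        return self.accepted.get(contributor, 0) < REVIEW_THRESHOLD

    def record_accept(self, contributor: str):
        self.accepted[contributor] = self.accepted.get(contributor, 0) + 1

reg = ContributorRegistry()
print(reg.needs_full_review("newcomer"))   # True: no track record yet
for _ in range(3):
    reg.record_accept("newcomer")
print(reg.needs_full_review("newcomer"))   # False: earned the fast path
```

The hard part, of course, is making the registry centralized and Sybil-resistant, not the bookkeeping itself.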
 
Upvote
0 (0 / 0)