Supply-chain attack using invisible code hits GitHub and other repositories

HiggsForce

Ars Scholae Palatinae
680
Subscriptor
Unicode is a never-ending source of security problems, from unnormalized UTF-8 to punycode lookalike domains, to this.

A lot of the problems come down to canonicalization being pervasive throughout Unicode processing, combined with widespread (and often intentional) differences in how systems canonicalize Unicode: some systems treat two Unicode characters as distinct things (for example, displaying one of them as empty), while other systems canonicalize them to the same thing. That mismatch is a perennial source of security trouble.
 
Upvote
84 (89 / -5)
How hard would it be for Github, NPM, etc. to check for and flag these somehow? "Hey, this source code has invisible characters in it, look out!"
Yes, that seems easy to automate. Checking millions of files would take a huge amount of processing, but it would be worth it to remove the uncertainty.

For the future, pulls and check-ins could be scanned automatically. For the typo-squatters, their project pages could have a red banner added stating that their code includes invisible characters so be cautious about using it.

Package managers could scan new downloads and pop up warnings saying the package might be tainted, and ask for confirmation before adding it.
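As a rough sketch of what such a scan could look like (the function name and reporting format here are illustrative, not any registry's actual tooling), it only needs to walk each file and flag codepoints in the variation-selector ranges described in the article:

```javascript
// Hypothetical registry-side check: flag source text containing invisible
// variation selectors (U+FE00-U+FE0F and U+E0100-U+E01EF).
function findInvisibleCharacters(source) {
  const findings = [];
  let index = 0;
  for (const ch of source) {
    const cp = ch.codePointAt(0);
    if ((cp >= 0xfe00 && cp <= 0xfe0f) || (cp >= 0xe0100 && cp <= 0xe01ef)) {
      findings.push({ index, codepoint: "U+" + cp.toString(16).toUpperCase() });
    }
    index += ch.length; // track UTF-16 offsets; astral codepoints take two units
  }
  return findings;
}
```

A registry could surface the red banner, or a package manager its confirmation prompt, whenever `findings` is non-empty.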
 
Upvote
48 (48 / 0)
I'm surprised that editors and terminals render those characters as blank, instead of inserting a □ missing/unsupported character or an � invalid/unrecognizable character. (Edit: or the  object replacement character. Another choice would be 􏿮 .notdef character that is used when the font is missing the character.) A quick search shows that most programming languages support a wide range of unicode characters (E.g., the private use range in some version of Clang C++). Unicode has a technical standard for mitigating some of the problems: https://www.unicode.org/reports/tr55/

Personally, I avoid non-ascii characters in my code, with unicode strings the only exception.
 
Last edited:
Upvote
82 (82 / 0)

Uncivil Servant

Ars Scholae Palatinae
4,751
Subscriptor
With the caveat that all my formal education on coding comes from law classes, I cannot wrap my head around how this weakness exists.

I cannot be the only person in the entire world who has had code fail to compile because of one little typo in a variable or because I forgot to end bracket a clause.

But invisible unicode? That compiles just fine? Sure, the problem could be ingenious hackers, it could be LLMs, or maybe the problem is compilers written by people so gullible they use phrases like "zero trust" unironically.

This just seems like such an obvious behavior to never implement, under any circumstances, I don't care how many drinks you've had you'll regret it in the morning. Not "whoopsie, how'd we let it run that code".

Seriously, does anyone in IT do policy & planning and view "chaotic neutral" as more of an orientation than an alignment?
 
Upvote
-6 (29 / -35)

HiggsForce

Ars Scholae Palatinae
680
Subscriptor
With the caveat that all my formal education on coding comes from law classes, I cannot wrap my head around how this weakness exists.

I cannot be the only person in the entire world who has had code fail to compile because of one little typo in a variable or because I forgot to end bracket a clause.

But invisible unicode? That compiles just fine? Sure, the problem could be ingenious hackers, it could be LLMs, or maybe the problem is compilers written by people so gullible they use phrases like "zero trust" unironically.

This just seems like such an obvious behavior to never implement, under any circumstances, I don't care how many drinks you've had you'll regret it in the morning. Not "whoopsie, how'd we let it run that code".

Seriously, does anyone in IT do policy & planning and view "chaotic neutral" as more of an orientation than an alignment?
Programming languages often contain string constants, which can contain arbitrary Unicode. There are many invisible Unicode characters, and some of them are necessary for properly rendering text in some human scripts. A blanket ban on invisible characters would cause internationalization chaos.
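For a concrete illustration (these particular characters are my examples, not from the article): U+FE0F, which sits in one of the abused variation-selector ranges, is also what requests emoji presentation for a text glyph, and zero-width joiners are how composite emoji are built:

```javascript
// Legitimate, invisible-by-design Unicode in ordinary string data:
const textHeart = "\u2764";        // heart rendered as a text glyph
const emojiHeart = "\u2764\uFE0F"; // VS16 requests emoji presentation
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man+ZWJ+woman+ZWJ+girl

// The two hearts can look nearly identical but are different strings,
// and one visible "family" glyph is eight UTF-16 code units:
console.log(textHeart === emojiHeart); // false
console.log(family.length);            // 8
```

Stripping or rejecting the invisible characters here would silently change what users see.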
 
Upvote
62 (68 / -6)
eval is just a convenient example. This exploit works just fine in languages without it.
It looks like the language isn't interpreting the extended characters as ASCII directly. There is a loop that modifies the characters to shift them back into the ASCII range. A similar exploit might be able to write the decoded characters into a file and load the file, but it doesn't look like a function definition could be directly created with the technique.
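To illustrate the shape of such a loop (a hypothetical reconstruction, not the actual malware; the byte-to-selector mapping here is an assumption), each invisible variation selector can carry one byte, and a short decoder shifts it back down into the ASCII range before the result is handed to eval:

```javascript
// Hypothetical scheme: bytes 0-15 -> U+FE00..U+FE0F, bytes 16-255 -> U+E0100..U+E01EF.
// Every output character is an invisible variation selector.
function hideInSelectors(ascii) {
  let out = "";
  for (const ch of ascii) {
    const b = ch.codePointAt(0); // assumes ASCII input
    out += String.fromCodePoint(b < 16 ? 0xfe00 + b : 0xe0100 + (b - 16));
  }
  return out;
}

// The matching decoder: shifts each selector back down to a byte.
function revealFromSelectors(hidden) {
  let out = "";
  for (const ch of hidden) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xfe00 && cp <= 0xfe0f) out += String.fromCharCode(cp - 0xfe00);
    else if (cp >= 0xe0100 && cp <= 0xe01ef) out += String.fromCharCode(cp - 0xe0100 + 16);
  }
  return out;
}

const payload = hideInSelectors("console.log('pwned')");
// `payload` renders as an empty string in most editors, yet
// eval(revealFromSelectors(payload)) would run the hidden code.
```

The string literal itself does nothing; it is the innocuous-looking decoder plus eval that turns it into code.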
 
Upvote
44 (44 / 0)

FelipeBG

Smack-Fu Master, in training
76
Subscriptor
Good old eval, always popping up in new fun ways. Seems like any code that uses eval should be flagged and auto-quarantined across the board
It does get flagged in npm if you click "Analyze security with Socket" -> "Alerts" from a package page. I don't know how long it's been doing that though.
 
Upvote
6 (6 / 0)

HamHands_

Ars Centurion
213
Subscriptor
How hard would it be for Github, NPM, etc. to check for and flag these somehow? "Hey, this source code has invisible characters in it, look out!"
It costs money to do that sort of scan. They could offer it gratis to open source projects; Microsoft certainly has enough money to do it. But it seems unlikely.

The solution from our (the dev) side is to incorporate scanning tools like SonarCloud, which do this for us, into our workflow. I haven't checked for this exact issue, but Google tells me there is a rule for flagging invisible characters.
 
Upvote
9 (10 / -1)

HamHands_

Ars Centurion
213
Subscriptor
The best way to protect against the scourge of supply-chain attacks is to carefully inspect packages and their dependencies before incorporating them into projects.
This is true but nearly impossible at the moment. NPM (Node.js) is a horrifying nest of dependencies on dependencies on dependencies. The left-pad incident highlighted this, so there's been a decent amount of wrangling of the problem with tools like npm audit, but fundamentally my node_modules folder is still going to hold hundreds of dependencies even if I only install big-name, "first party" packages (React/Angular/etc.). I don't think this is going to be solved: IIRC the ethos of NPM was to make sharing code trivial, so you can get going faster by leveraging this huge repository of libraries instead of writing it yourself. It's possible things could get a little better as more functionality gets added to the standard library of JS. E.g., I used to have to install a third-party package like axios for making network requests with a nice API, but now fetch is standard in the browser and in Node (mostly), so I don't have to.

But still, security is one of the reasons (there are many) I usually recommend not using NodeJS/NPM where possible. Restrict its usage to the frontends alone. Though, unfortunately, the industry as a whole seems to be moving in the opposite direction: very tightly coupled front and backends, written exclusively in JS.
 
Upvote
24 (25 / -1)

HamHands_

Ars Centurion
213
Subscriptor
I'm a hobbyist rather than a professional developer, but my tools of choice allow me to change the typeface(s) used to display code (my longtime preference is Verdana). Would it not be relatively easy to create and use a typeface where all the "invisible" character codes become visible?
Should be a little easier than that. VSCode at least highlights these invisible characters. I'm sure other editors do as well. Granted, I haven't tested to see if the exploit code in this article shows up.
 
Upvote
31 (31 / 0)

nxg

Ars Centurion
221
Subscriptor
It looks like the language isn’t interpreting the extended characters as ascii directly. There is a loop that is modifying the characters to shift them back into the ascii range.
This seems correct, to me.

If I'm understanding the attack correctly (and I'm fairly sure I am), it consists of encoding malicious code in an otherwise entirely normal, or at least abnormal-but-legitimate, UTF-8 string within the code, which decodes to code which is then executed.

It is functionally equivalent to, say, ‘encrypting’ that malicious code in a ROT13 string, and including an inline ROT13 decoder, which is run on the string before executing it. The ‘encryption’ here is barely more sophisticated than that. The only difference – and it's a crucial one – is that any reviewer would surely notice a sodding big block of ROT13 code in a patch, whereas in this case (I would lay money) most editors and renderers would display the block of ‘encrypted’ code as an empty string, which is easy to miss.

The clever thing is that the editors are not malfunctioning when doing this, and any attempt to make them display the characters would potentially count as a bug. Even if a code-reviewer thought that the eval looked weird, they'd have to work through the decoder, and know a relatively unusual amount about Unicode, in order to work out what was going on.

The codepoints in question are actually not in any of the Unicode ‘private use areas’ (despite what the article suggests; and yes, I think it's ‘private use area’ that's intended, since there's no term ‘public use area’). The codepoint ranges U+FE00 to U+FE0F and U+E0100 to U+E01EF are ‘variation selectors’. I'm moderately familiar with the Unicode spec and... I've never heard of them before! There's a handy Wikipedia page which tells us that they exist in order to do funky things to preceding CJK characters in selected East Asian languages. You'd have to go head-first into the Unicode spec for the details (rather you than me), but I wouldn't be at all surprised if the required rendering behaviour for these in certain circumstances is... to show nothing.

That is, this apparently isn't exploiting any UTF-8 decoding bugs, or Unicode manipulation edge-cases. It seems quite likely that the rendering behaviour of these codepoints in strings is specified, and any editor which displayed the strings as other than empty ones might well be defective.

What a clever hack! Bastards.
 
Upvote
119 (120 / -1)

norton_I

Ars Praefectus
5,867
Subscriptor++
eval is just a convenient example. This exploit works just fine in languages without it.

Sort of. But at least the code-execution attack does require some way to execute strings. It could be eval(), it could be system() or execve(). The PUA characters would be meaningless in the body of the program. In fact, they are meaningless in the string literal until the attacker-produced function translates them down to the basic ASCII set. It's an innocuous-looking function, but the fact that the result is passed to eval should raise eyebrows if anyone looked at it.

Of course, every non-trivial language has some way to execute strings, either internally like eval() or externally like system(). The point is not that "JavaScript is weak because it has eval"; the point is that in any language these functions are well known and should have extra scrutiny applied.
 
Upvote
22 (22 / 0)

adamsc

Ars Praefectus
4,279
Subscriptor++
I'm surprised that editors and terminals render those characters as blank, instead of inserting a □ missing/unsupported character or an � invalid/unrecognizable character. (Edit: or the  object replacement character. Another choice would be 􏿮 .notdef character that is used when the font is missing the character.) A quick search shows that most programming languages support a wide range of unicode characters (E.g., the private use range in some version of Clang C++). Unicode has a technical standard for mitigating some of the problems: https://www.unicode.org/reports/tr55/

Personally, I avoid non-ascii characters in my code, with unicode strings the only exception.

It depends on the system and tools you're using, which is what makes it nasty. For example, the Unicode private use area (planes 15 and 16) variant of the attack does not work on macOS (Terminal, popular editors, etc.) because those characters fall back to the .LastResort system font and display as a square with a question mark inside.

Problem solved, time to buy some AAPL before telling everyone to switch to Macs, right? Nope.

There are a lot of other characters in Unicode which do not render visibly because they're required not to. Other variations of this attack used things like the Mongolian variation selectors and vowel separator (U+180B–E, which may or may not render depending on the active font!), joiners, the right-to-left / left-to-right embedding and override characters, etc.

There are already many tools and editors which will warn about suspicious mixing of language blocks (added in response to phishers doing cute things with Cyrillic letters), and those will often flag Mongolian formatting codes used in an otherwise non-Mongolian context. But that's still not enough, because technically you could just encode in binary using things like “ ” and “ ” (EN SPACE and EM SPACE, respectively), which are considered language-neutral.

There are some libraries, like https://github.com/lirantal/anti-trojan-source and https://docs.astral.sh/ruff/rules/ambiguous-unicode-character-string/, which implement layers of rules looking for things like that, but at some point this is also going to need to fall back on detecting the way these things are misused. That should buy us some time, because there really aren't cases where you need a gigantic string with no printing characters, but you'd also have to look for things like runs of paired RTL-LTR values which are syntactically meaningless but could be used to hide information in a not-entirely-empty string.
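A sketch of that kind of misuse heuristic (the thresholds and the set of characters treated as invisible are my assumptions, not what anti-trojan-source or Ruff actually implement): flag any long string whose visible length is tiny compared to its code-unit length.

```javascript
// Heuristic sketch: flag string values that are long in code units but render
// almost nothing visible. Threshold values here are arbitrary assumptions.
const INVISIBLE =
  /[\u200B-\u200F\u202A-\u202E\u2060-\u2064\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/gu;

function looksLikeHiddenPayload(literal, minLength = 20, maxVisibleRatio = 0.1) {
  if (literal.length < minLength) return false;
  const visible = literal.replace(INVISIBLE, "");
  return visible.length / literal.length <= maxVisibleRatio;
}
```

A rule like this catches the "gigantic string with no printing characters" case while leaving ordinary text, and even ordinary emoji, alone.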

It's more than just eval‌(‌): the act of decoding an embedded string constant and passing it into an open or execution function is inherently suspicious, and we have a growing number of control-flow analysis tools which we realistically have to put into every code review tool, since you'd also want to catch things like using more normal Unicode to load a payload from a public blockchain or other hard-to-remove outside hosting service. Making every path which populates a variable passed to a sensitive function really prominent would be a good win for multiple reasons.

EDIT: in a truly hilarious bit of synchronicity, I have learned that if you put the literal eval followed by the opening parenthesis into a comment here, the server will reject it, but eval\N{ZERO WIDTH NON-JOINER}(\N{ZERO WIDTH NON-JOINER} will bypass that check. If you tried to exploit a Python system that way, it'd fail with a “SyntaxError: invalid non-printable character” exception, but that would totally work against Node.js…
 
Last edited:
Upvote
35 (35 / 0)

norton_I

Ars Praefectus
5,867
Subscriptor++
Not being a developer, I wonder what legitimate uses are there for code points from this plane in source code at all?

Could sed or a more customized tool identify it and strip it out?

It would be trivial to strip them out. But since they can appear in string literals, doing so would break any application which was encoding them to, e.g., display as part of text to a user.

These are non-printing characters used to change how surrounding characters are rendered. It would probably be acceptable for a programming language to prohibit them in source code, even in string literals. For instance, if you wanted text strings with advanced Unicode control characters, you would more likely load them from a language-dependent template or database; you could make that a requirement in a specific language.

However, applying that unilaterally in existing code across multiple languages risks breaking stuff that's working. You would want to audit each instance (hopefully rare) for legitimate uses before deciding on that.
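A quick illustration of that breakage risk (the sanitizer and the emoji string are my examples, not from the thread): a naive strip of variation selectors also mangles legitimate string literals.

```javascript
// Naive sanitizer: removes all variation selectors from source text.
function stripVariationSelectors(text) {
  return text.replace(/[\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/gu, "");
}

const ui = 'const LABEL = "\u2764\uFE0F Favorites";'; // heart + VS16 (emoji presentation)
const stripped = stripVariationSelectors(ui);
// The hidden-payload case is neutralized, but this legitimate string lost
// its VS16 and may now render as a plain text heart instead of an emoji.
```

This is why auditing each instance, rather than stripping unilaterally, is the safer default.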
 
Upvote
6 (6 / 0)

Wandering Monk

Ars Centurion
266
Subscriptor
This seems like an extremely simple linting rule: if there are these invisible characters, fail the build.

Also, for NPM projects, add a flag (on by default after a couple releases) that strips out the invisible characters during the “build” (the vast majority of npm projects use something like webpack to “compile” the JS).

In both cases, if there’s a legitimate reason to have these invisible characters, just require them to have an annotation that effectively screams, “this string has invisible characters!”

Now that the cat is out of the bag, I expect it to be mitigated pretty quickly.
 
Upvote
0 (7 / -7)

Chai T. Rex

Wise, Aged Ars Veteran
152
Even if a code-reviewer thought that the eval looked weird, they'd have to work through the decoder, and know an relatively unusual amount about Unicode, in order to work out what was going on.
To totally work out what's going on, sure, but a good starting step is to replace eval with console.log or something like that.
 
Upvote
4 (4 / 0)
In this case the problem is specifically that the non-printing characters are in a Unicode string literal.
Evaluating string literals as code is a very well known vulnerability. I assumed the problem was similar to the following obfuscated C-code:
View: https://youtu.be/RMI5oT9U4vc?t=2m32s
where invisible characters are used in the code itself and not just in string literals.
 
Upvote
7 (7 / 0)

arobert3434

Ars Scholae Palatinae
1,161
Subscriptor
eval is just a convenient example. This exploit works just fine in languages without it.
How? There's no compiler or interpreter that would treat those characters as ASCII directly. You need something to translate them, and then a runtime that supports evaluation of code from data.
 
Upvote
4 (7 / -3)