Amid Mythos’ hyped cybersecurity prowess, researchers find GPT-5.5 is just as good

MilanKraft

Ars Tribunus Angusticlavius
6,875
Every time one of you guys perpetuates the idea (intentionally or otherwise)...

"general improvements in long-horizon autonomy, reasoning, and coding,"

Some of us are going to remind you LLMs can't reason by any plausible definition. When you do this, you are basically acting as a kind of PR amplifier for whichever LLM developer you're writing about. Until such time as there is a truly novel development in this space that indicates actual reasoning going on, please stop. I'm beggin' ya.

(Sadly, I expect this kind of thing from television news outlets, or more generalist online outlets who don't know any better, but given the technical chops of its staff, Ars should be better than this.)
 
Upvote
49 (86 / -37)
Every time one of you guys perpetuates the idea (intentionally or otherwise)...

"general improvements in long-horizon autonomy, reasoning, and coding,"

Some of us are going to remind you LLMs can't reason by any plausible definition. When you do this, you are basically acting as a kind of PR amplifier for whichever LLM developer you're writing about. Until such time as there is a truly novel development in this space that indicates actual reasoning going on, please stop. I'm beggin' ya.

(Sadly, I expect this kind of thing from television news outlets, or more generalist online outlets who don't know any better, but given the technical chops of its staff, Ars should be better than this.)

The problem is that a lot of people have trouble distinguishing between reasoning and a description of reasoning. And a lot of people argue (incorrectly, in my opinion) that there isn't a difference between the two.

However, I think this is a quote from AISI that you're objecting to? So it's not really Ars's fault here, other than perhaps a lack of challenge to a contentious idea.
 
Upvote
63 (65 / -2)
The problem is that a lot of people have trouble distinguishing between reasoning and a description of reasoning. And a lot of people argue (incorrectly, in my opinion) that there isn't a difference between the two.

However, I think this is a quote from AISI that you're objecting to? So it's not really Ars's fault here, other than perhaps a lack of challenge to a contentious idea.
Some people argue, incorrectly in my opinion, that the world is flat. So should Ars give them the benefit of the doubt? Or should we keep an open mind?
 
Upvote
16 (33 / -17)
OpenAI CEO Sam Altman criticized what he calls “fear-based marketing” in promoting limited releases for certain AI models. While he said he’s “sure Mythos is a great model for cybersecurity,” he added that “it is clearly incredible marketing to say, ‘We have built a bomb. We are about to drop it on your head. We will sell you a bomb shelter for $100 million.’”

... this has literally been OpenAI's marketing and political strategy for years. Altman at congressional hearings talking about how scary and dangerous and existential this all is, that it's a threat to humanity
 
Upvote
53 (53 / 0)

quamquam quid loquor

Ars Tribunus Militum
2,914
Subscriptor++
🥱 Do any of these "models" do anything valuable yet?
Apple just released their latest software update with a claude.md in it, so it's safe to say Apple is using it internally to write their code. Though they removed it once caught, to cover their tracks.

 
Upvote
36 (39 / -3)
Some people argue, incorrectly in my opinion, that the world is flat. So should Ars give them the benefit of the doubt? Or should we keep an open mind?

Eh? I'm agreeing that Ars should have challenged this. But it's also true that if someone else says something incorrect and is quoted, you should hold the original source accountable.

More broadly though, there is a philosophical difference between the two situations: we can show the world to be round, and reality doesn't care about anyone's beliefs. But the answer to the question "can X reason" is a bit different, because the question "what is reasoning?" is less well defined. I'd argue that Gödel's Incompleteness Theorem applies (it certainly does in mathematical reasoning, for example) and therefore it's not actually possible to answer the question without assumed axioms. So while I personally do not think that an LLM could ever perform reasoning for definitions of reasoning that I'm aware of, this does feel more like the type of thing one should keep an open mind on.
 
Upvote
23 (25 / -2)

RoryEjinn

Smack-Fu Master, in training
81
Subscriptor
“There will be a lot more rhetoric about models that are too dangerous to release,” Altman continued. “There will also be very dangerous models that will have to be released in different ways.”
"Fear based marketing is bad as long as I'm not the one doing it. If I'm doing it, it's totally fine." - Sam Altman probably.
 
Upvote
38 (38 / 0)

thearcher

Ars Scholae Palatinae
728
Subscriptor++
Benchmarks aren't reality. Hype also isn't reality. But the tendency to assume because A and B are similar on benchmarks, that that implies they are similar in the real world is not a very safe tendency IMO.
Eh, they're more similar than if one scores 60% on the tests and the other scores 0%. Example: several years ago I compiled clFFT for both Nvidia and Qualcomm GPUs. I ran the tests that come with the clFFT code. Nvidia passed all the tests. Qualcomm only passed half of them, making it unusable, not very similar. (The Qualcomm GPU's barrier function seemed to be broken, so anything requiring a local thread barrier -- __syncthreads() in CUDA -- couldn't be trusted. clFFT needs the barrier when processing FFTs with sizes that aren't powers of 2.)
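The failure mode described above can be sketched with a CPU analogy in Python: `threading.Barrier` stands in for the GPU work-group barrier, and the write-barrier-read pattern mirrors how FFT kernels stage data in local memory before threads exchange it. This is an illustrative sketch, not actual clFFT or OpenCL code; all names here are made up.

```python
import threading

# Each "thread" writes its value into a shared staging buffer, then all
# threads must synchronize before any of them reads a neighbor's slot.
# If the barrier is broken (or skipped), a thread may read a slot before
# it has been written -- the same failure mode as a broken OpenCL
# barrier() / CUDA __syncthreads().

N = 8
shared = [None] * N            # stand-in for GPU local memory
barrier = threading.Barrier(N) # stand-in for the work-group barrier
results = [None] * N

def worker(tid):
    shared[tid] = tid * tid               # stage phase: write own slot
    barrier.wait()                        # without this, the read races
    results[tid] = shared[(tid + 1) % N]  # exchange phase: read neighbor

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With a working barrier, every thread sees its neighbor's staged value.
assert results == [((i + 1) % N) ** 2 for i in range(N)]
```

Remove the `barrier.wait()` and the reads can observe `None`, which is why half the clFFT test suite (the non-power-of-2 sizes that need the exchange step) would fail on a GPU whose barrier doesn't work.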
 
Upvote
-1 (3 / -4)

Pitabred

Wise, Aged Ars Veteran
168
Subscriptor
🥱 Do any of these "models" do anything valuable yet?
I've been using it a bit at work, and it has made writing tests and some functions a bit easier, it's pretty good at spelunking areas that you're unfamiliar with much faster than you can. But you still have to verify all of what it does.
 
Upvote
25 (28 / -3)
🥱 Do any of these "models" do anything valuable yet?
When hooked up to a harness like Claude Code or VS Studio or Claude Cowork, people use this to conduct lots of common white-collar tasks. I use Claude Cowork every day to build Excel models and distill insights into PPT decks.

I basically don't need analyst-level employees anymore.
 
Upvote
-14 (17 / -31)
Pot meet kettle, he was peddling the same fear based marketing not too long ago himself.
There's some speculation that the whole 'too dangerous to release' was (aside from hype) an excuse to cover that Mythos is too expensive to release (i.e. they'd lose money a lot faster than they are on the current models).
 
Upvote
16 (17 / -1)
Eh, they're more similar than if one scores 60% on the tests and the other scores 0%. Example: several years ago I compiled clFFT for both Nvidia and Qualcomm GPUs. I ran the tests that come with the clFFT code. Nvidia passed all the tests. Qualcomm only passed half of them, making it unusable, not very similar. (The Qualcomm GPU's barrier function seemed to be broken, so anything requiring a local thread barrier -- __syncthreads() in CUDA -- couldn't be trusted. clFFT needs the barrier when processing FFTs with sizes that aren't powers of 2.)
For many years I worked in this industry, and for several years I worked literally on the team that made the benchmarks fast and also made code run fast on important industry customers. We typically led the industry in the late 2000s in benchmark scores for things like Viewperf, which is as far as I'll go toward naming the company. The two types of work (and even frankly the code, though this was kept pretty quiet) were completely separate. Do with that information what you will.
 
Last edited:
Upvote
10 (10 / 0)

quamquam quid loquor

Ars Tribunus Militum
2,914
Subscriptor++
For many years I worked in this industry, and for several years I worked literally on the team that made the benchmarks fast and also made code run fast on important industry customers. The two types of work (and even frankly the code, though this was kept pretty quiet) were completely separate. Do with that information what you will.
That's why customer "vibes" are more important these days. Sadly, anything measured will be gamed. Even sadder is the rampant bot astroturfing trying to sway public opinion on vibes.
 
Upvote
5 (6 / -1)
🥱 Do any of these "models" do anything valuable yet?
In my opinion the issue isn't whether they can do anything valuable in isolation (they can). The issue is whether what they can do is, on balance, worth the total cost to society.

I don't think what they offer even begins to justify the immense harms these things have caused, and will cause. But that is as much to do with how people have chosen to build, hype, and push them as with anything inherent to the technology.

They could have trained them with vetted and ethical datasets. They could have been rolled out ethically and safely, after appropriate research on the harms, climate costs, and risks.

They... did not.
 
Upvote
4 (12 / -8)

RoryEjinn

Smack-Fu Master, in training
81
Subscriptor
In my opinion the issue isn't whether they can do anything valuable in isolation (they can). The issue is whether what they can do is, on balance, worth the total cost to ~~society~~ the elite.
FTFY. The cost to society should matter, but it does not. It only matters that it makes someone money. If cost to society were enough to stop stupid things, then we wouldn't still be polluting the atmosphere or burning down the rainforest.
 
Upvote
11 (12 / -1)
🥱 Do any of these "models" do anything valuable yet?

When hooked up to a harness like Claude Code or VS Studio or Claude Cowork, people use this to conduct lots of common white-collar tasks. I use Claude Cowork every day to build Excel models and distill insights into PPT decks.

I basically don't need analyst-level employees anymore.

I'm not sure that "I can fire lots of my employees now" is the sort of 'value' they were hoping for
 
Upvote
6 (9 / -3)

justsomebytes

Wise, Aged Ars Veteran
196
Subscriptor
Once we have Gen AI, anyone not in the supply chain for semiconductors or working at a frontier lab will be a cost to society, and we will have to think long and hard about whether we want to bear that cost.
AI people projecting that is why we're almost at the point of pitchforks and torches at new data center builds. Despite no new proof that Gen AI is even possible.
 
Upvote
3 (9 / -6)
It's sad how far Anthropic has fallen. Their developer perception and goodwill has plummeted over the past month. They lied about nerfing 4.6, they lied about the capabilities of 4.7, and their API is consistently unstable.

Dario hyped up Mythos like it was the second coming of SkyNet, yet the truth came out that 5.5 xhigh is literally better than Mythos at cybersecurity.

Not only is Codex 5.5 xhigh better than Opus 4.7, the plan limits and API token price are way more generous. I can run Codex 5.5 xhigh all day and get nowhere near the $200 plan limits with 10 chats going simultaneously. I only keep my Claude max20 subscription around for Claude Design.
You can't say good things about AI in general and OpenAI in particular here on Ars.

Everything that LLMs stand for is evil/exploitation/bad/ugly by definition according to the audience here.

Never mind that millions of people nowadays use LLMs to simply diagnose and treat themselves to stay alive. Never mind that even tens of millions of Americans cannot afford healthcare.

"LLMs are bad, period."
"LLMs don't reason, period."
"LLMs only synthesize and hallucinate and they do it by using what human beings have created."
"If you use LLMs, you're extremely incompetent and don't deserve your job."

Now you can leave proper comments here. "Oh, and Sam Altman is the worst."

BTW is AlphaFold also ... bad? And a Nobel prize given for it? Ah, never mind. That's a bit confusing. And all the other AI applications.

BTW, here's a nice news piece:

https://www.science.org/content/article/ai-starting-beat-doctors-making-correct-diagnoses

Too bad Ars won't run a story on it.
 
Upvote
-17 (16 / -33)

Fred Duck

Ars Tribunus Angusticlavius
7,301
Upvote
9 (9 / 0)

stk5

Ars Scholae Palatinae
990
Subscriptor++
Can't say I'm surprised that something else was able to match Mythos, given that Anthropic did zero comparative analysis with anything but their previous version. Not even a comparison with existing static analyzers, let alone other LLMs. Something like a dozen names on the blog post's byline, and apparently nobody thought to do one of the most basic things that'd be expected in a typical CS academic paper.
 
Upvote
2 (3 / -1)
Post content hidden for low score. Show…

arsisloam

Ars Scholae Palatinae
1,359
Subscriptor
I've been using it a bit at work, and it has made writing tests and some functions a bit easier, it's pretty good at spelunking areas that you're unfamiliar with much faster than you can. But you still have to verify all of what it does.
Even on a mature, large codebase, they do inexplicable things sometimes. Like when building some new UI elements, 90% of the time it follows the patterns in the existing code. 10% of the time it does... Something else. It's like a talented junior programmer with ADD. Mostly it's brilliant, but sometimes you're just like "wtf is this non working bullshit?" And you don't really know which one you'll get for any particular prompt; the brilliant, or the bullshit.
 
Upvote
21 (21 / 0)
You can't say good things about AI in general and OpenAI in particular here on Ars.

Everything that LLMs stand for is evil/exploitation/bad/ugly by definition according to the audience here.

Never mind that millions of people nowadays use LLMs to simply diagnose and treat themselves to stay alive. Never mind that even tens of millions of Americans cannot afford healthcare.

"LLMs are bad, period."
"LLMs don't reason, period."
"LLMs only synthesize and hallucinate and they do it by using what human beings have created."
"If you use LLMs, you're extremely incompetent and don't deserve your job."

Now you can leave proper comments here. "Oh, and Sam Altman is the worst."

BTW is AlphaFold also ... bad? And a Nobel prize given for it? Ah, never mind. That's a bit confusing. And all the other AI applications.

BTW, here's a nice news piece:

https://www.science.org/content/article/ai-starting-beat-doctors-making-correct-diagnoses

Too bad Ars won't run a story on it.
Thank you. I have been here for decades and I have never seen such bias against a new technology, never mind one of the most important advances in computer science... ever.

There are so many interesting stories about how various bio/physics/chem etc. labs, pharma, aerospace, and energy grid companies are modifying and using LLMs. The different ways that next-token prediction can be harnessed is completely unexpected and fascinating. It doesn't matter if it is technically 'reasoning' or 'thinking'. But at Ars there is no exploration of that, it's just an echo chamber of psychosis.
 
Upvote
-9 (20 / -29)

quamquam quid loquor

Ars Tribunus Militum
2,914
Subscriptor++
Thank you. I have been here for decades and I have never seen such bias against a new technology, nevermind one of the most important advances in computer science... ever.

There are so many interesting stories about how various bio/physics/chem etc labs, pharma, aerospace, energy grid companies are modifying and using LLMs. The different ways that next token prediction can be harnessed is completely unexpected and fascinating. It doesn't matter if it is technically 'reasoning' or 'thinking'. But at Ars there is no exploration of that, it’s just an echo chamber of psychosis.
It’s mostly defensiveness about the economic/social devaluation of skills built over decades that are a part of their core identity.

Similar principle as this Upton Sinclair quote: "It is difficult to get a man to understand something, when his salary depends on his not understanding it."
 
Upvote
-13 (8 / -21)