Amid Mythos’ hyped cybersecurity prowess, researchers find GPT-5.5 is just as good

MilanKraft

Ars Tribunus Angusticlavius
6,875
Every time one of you guys perpetuates the idea (intentionally or otherwise)...

"general improvements in long-horizon autonomy, reasoning, and coding,"

Some of us are going to remind you LLMs can't reason by any plausible definition. When you do this, you are basically acting as a kind of PR amplifier for whichever LLM developer you're writing about. Until such time as there is a truly novel development in this space that indicates actual reasoning going on, please stop. I'm beggin' ya.

(Sadly, I expect this kind of thing from television news outlets, or more generalist online outlets who don't know any better, but given the technical chops of its staff, Ars should be better than this.)
 
Upvote
49 (86 / -37)
Every time one of you guys perpetuates the idea (intentionally or otherwise)...

"general improvements in long-horizon autonomy, reasoning, and coding,"

Some of us are going to remind you LLMs can't reason by any plausible definition. When you do this, you are basically acting as a kind of PR amplifier for whichever LLM developer you're writing about. Until such time as there is a truly novel development in this space that indicates actual reasoning going on, please stop. I'm beggin' ya.

(Sadly, I expect this kind of thing from television news outlets, or more generalist online outlets who don't know any better, but given the technical chops of its staff, Ars should be better than this.)

The problem is that a lot of people have trouble distinguishing between reasoning and a description of reasoning. And a lot of people argue (incorrectly, in my opinion) that there isn't a difference between the two.

However, I think this is a quote from AISI that you're objecting to? So it's not really Ars's fault here, other than perhaps a lack of challenge to a contentious idea.
 
Upvote
63 (65 / -2)
The problem is that a lot of people have trouble distinguishing between reasoning and a description of reasoning. And a lot of people argue (incorrectly, in my opinion) that there isn't a difference between the two.

However, I think this is a quote from AISI that you're objecting to? So it's not really Ars's fault here, other than perhaps a lack of challenge to a contentious idea.
Some people argue, incorrectly in my opinion, that the world is flat. So should Ars give them the benefit of the doubt? Or should we keep an open mind?
 
Upvote
16 (33 / -17)
OpenAI CEO Sam Altman criticized what he calls “fear-based marketing” in promoting limited releases for certain AI models. While he said he’s “sure Mythos is a great model for cybersecurity,” he added that “it is clearly incredible marketing to say, ‘We have built a bomb. We are about to drop it on your head. We will sell you a bomb shelter for $100 million.’”

... this has literally been OpenAI's marketing and political strategy for years. Altman at congressional hearings talking about how scary and dangerous and existential this all is, that it's a threat to humanity
 
Upvote
53 (53 / 0)

quamquam quid loquor

Ars Tribunus Militum
2,914
Subscriptor++
🥱 Do any of these "models" do anything valuable yet?
Apple just released their latest software update with a claude.md in it, so it's safe to say Apple is using it internally to write their code. Though they removed it once caught, to cover their tracks.

 
Upvote
36 (39 / -3)
Some people argue, incorrectly in my opinion, that the world is flat. So should Ars give them the benefit of the doubt? Or should we keep an open mind?

Eh? I'm agreeing that Ars should have challenged this. But it's also true that if someone else says something incorrect and is quoted, you should hold the original source accountable.

More broadly though, there is a philosophical difference between the two situations: we can show the world to be round, and reality doesn't care about anyone's beliefs. But the answer to the question "can X reason" is a bit different, because the question "what is reasoning?" is less well defined. I'd argue that Gödel's Incompleteness Theorem applies (it certainly does in mathematical reasoning, for example) and therefore it's not actually possible to answer the question without assumed axioms. So while I personally do not think that an LLM could ever perform reasoning for definitions of reasoning that I'm aware of, this does feel more like the type of thing one should keep an open mind on.
 
Upvote
23 (25 / -2)

RoryEjinn

Smack-Fu Master, in training
81
Subscriptor
“There will be a lot more rhetoric about models that are too dangerous to release,” Altman continued. “There will also be very dangerous models that will have to be released in different ways.”
"Fear based marketing is bad as long as I'm not the one doing it. If I'm doing it, it's totally fine." - Sam Altman probably.
 
Upvote
38 (38 / 0)

thearcher

Ars Scholae Palatinae
728
Subscriptor++
Benchmarks aren't reality. Hype also isn't reality. But the tendency to assume because A and B are similar on benchmarks, that that implies they are similar in the real world is not a very safe tendency IMO.
Eh, they're more similar than if one scores 60% on the tests and the other scores 0%. Example: several years ago I compiled clFFT for both Nvidia and Qualcomm GPUs. I ran the tests that come with the clFFT code. Nvidia passed all the tests. Qualcomm only passed half of them, making it unusable, not very similar. (The Qualcomm GPU's barrier function seemed to be broken, so anything requiring a local thread barrier -- __syncthreads() in CUDA -- couldn't be trusted. clFFT needs the barrier when processing FFTs with sizes that aren't powers of 2.)
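The failure mode described above can be sketched with a CPU analogy in Python: `threading.Barrier` stands in for the GPU work-group barrier, and the write-barrier-read pattern mirrors how FFT kernels stage data in local memory before threads exchange it. This is an illustrative sketch, not actual clFFT or OpenCL code; all names here are made up.

```python
import threading

# Each "thread" writes its value into a shared staging buffer, then all
# threads must synchronize before any of them reads a neighbor's slot.
# If the barrier is broken (or skipped), a thread may read a slot before
# it has been written -- the same failure mode as a broken OpenCL
# barrier() / CUDA __syncthreads().

N = 8
shared = [None] * N            # stand-in for GPU local memory
barrier = threading.Barrier(N) # stand-in for the work-group barrier
results = [None] * N

def worker(tid):
    shared[tid] = tid * tid               # stage phase: write own slot
    barrier.wait()                        # without this, the read races
    results[tid] = shared[(tid + 1) % N]  # exchange phase: read neighbor

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With a working barrier, every thread sees its neighbor's staged value.
assert results == [((i + 1) % N) ** 2 for i in range(N)]
```

Remove the `barrier.wait()` and the reads can observe `None`, which is why half the clFFT test suite (the non-power-of-2 sizes that need the exchange step) would fail on a GPU whose barrier doesn't work.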
 
Upvote
-1 (3 / -4)

Pitabred

Wise, Aged Ars Veteran
168
Subscriptor
🥱 Do any of these "models" do anything valuable yet?
I've been using it a bit at work, and it has made writing tests and some functions a bit easier, it's pretty good at spelunking areas that you're unfamiliar with much faster than you can. But you still have to verify all of what it does.
 
Upvote
25 (28 / -3)
🥱 Do any of these "models" do anything valuable yet?
When hooked up to a harness like Claude Code or VS Studio or Claude Cowork, people use this to conduct lots of common white-collar tasks. I use Claude Cowork every day to build Excel models and distill insights into PPT decks.

I basically don't need analyst-level employees anymore.
 
Upvote
-14 (17 / -31)
Pot meet kettle, he was peddling the same fear based marketing not too long ago himself.
There's some speculation that the whole 'too dangerous to release' was (aside from hype) an excuse to cover that Mythos is too expensive to release (i.e. they'd lose money a lot faster than they are on the current models).
 
Upvote
16 (17 / -1)
Eh, they're more similar than if one scores 60% on the tests and the other scores 0%. Example: several years ago I compiled clFFT for both Nvidia and Qualcomm GPUs. I ran the tests that come with the clFFT code. Nvidia passed all the tests. Qualcomm only passed half of them, making it unusable, not very similar. (The Qualcomm GPU's barrier function seemed to be broken, so anything requiring a local thread barrier -- __syncthreads() in CUDA -- couldn't be trusted. clFFT needs the barrier when processing FFTs with sizes that aren't powers of 2.)
For many years I worked in this industry, and for several years I worked literally on the team that made the benchmarks fast and also made code run fast on important industry customers. We typically led the industry in the late 2000s in benchmark scores for things like Viewperf, which is as far as I'll go toward naming the company. The two types of work (and even frankly the code, though this was kept pretty quiet) were completely separate. Do with that information what you will.
 
Last edited:
Upvote
10 (10 / 0)

quamquam quid loquor

Ars Tribunus Militum
2,914
Subscriptor++
For many years I worked in this industry, and for several years I worked literally on the team that made the benchmarks fast and also made code run fast on important industry customers. The two types of work (and even frankly the code, though this was kept pretty quiet) were completely separate. Do with that information what you will.
That's why customer "vibes" are more important these days. Sadly, anything measured will be gamed. Even sadder is the rampant bot astroturfing trying to sway public opinion on vibes.
 
Upvote
5 (6 / -1)
🥱 Do any of these "models" do anything valuable yet?
In my opinion the issue isn't whether they can do anything valuable in isolation (they can). The issue is whether what they can do is, on balance, worth the total cost to society.

I don't think what they offer even begins to justify the immense harms these things have caused, and will cause. But that is as much to do with how people have chosen to build, hype, and push them as with anything inherent to the technology.

They could have trained them with vetted and ethical datasets. They could have been rolled out ethically and safely, after appropriate research on the harms, climate costs, and risks.

They... did not.
 
Upvote
4 (12 / -8)

RoryEjinn

Smack-Fu Master, in training
81
Subscriptor
In my opinion the issue isn't whether they can do anything valuable in isolation (they can). The issue is whether what they can do is, on balance, worth the total cost to ~~society~~ the elite.
FTFY. The cost to society should matter, but it does not. It only matters that it makes someone money. If cost to society were enough to stop stupid things, then we wouldn't still be polluting the atmosphere or burning down the rainforest.
 
Upvote
11 (12 / -1)
🥱 Do any of these "models" do anything valuable yet?

When hooked up to a harness like Claude Code or VS Studio or Claude Cowork, people use this to conduct lots of common white-collar tasks. I use Claude Cowork every day to build Excel models and distill insights into PPT decks.

I basically don't need analyst-level employees anymore.

I'm not sure that "I can fire lots of my employees now" is the sort of 'value' they were hoping for
 
Upvote
6 (9 / -3)

justsomebytes

Wise, Aged Ars Veteran
196
Subscriptor
Once we have Gen AI, anyone not in the supply chain for semiconductors or working at a frontier lab will be a cost to society, and we will have to think long and hard about whether we want to bear that cost.
AI people projecting that is why we're almost at the point of pitchforks and torches at new data center builds. Despite no new proof that Gen AI is even possible.
 
Upvote
3 (9 / -6)
It's sad how far Anthropic has fallen. Their developer perception and goodwill has plummeted over the past month. They lied about nerfing 4.6, they lied about the capabilities of 4.7, and their API is consistently unstable.

Dario hyped up Mythos like it was the second coming of SkyNet, yet the truth came out that 5.5 xhigh is literally better than Mythos at cybersecurity.

Not only is Codex 5.5 xhigh better than Opus 4.7, the plan limits and API token price are way more generous. I can run Codex 5.5 xhigh all day and get nowhere near the $200 plan limits with 10 chats going simultaneously. I only keep my Claude max20 subscription around for Claude Design.
You can't say good things about AI in general and OpenAI in particular here on Ars.

Everything that LLMs stand for is evil/exploitation/bad/ugly by definition according to the audience here.

Never mind that millions of people nowadays use LLMs to simply diagnose and treat themselves to stay alive. Never mind that even tens of millions of Americans cannot afford healthcare.

"LLMs are bad, period."
"LLMs don't reason, period."
"LLMs only synthesize and hallucinate and they do it by using what human beings have created."
"If you use LLMs, you're extremely incompetent and don't deserve your job."

Now you can leave proper comments here. "Oh, and Sam Altman is the worst."

BTW is AlphaFold also ... bad? And a Nobel prize given for it? Ah, never mind. That's a bit confusing. And all the other AI applications.

BTW, here's a nice news piece:

https://www.science.org/content/article/ai-starting-beat-doctors-making-correct-diagnoses

Too bad Ars won't run a story on it.
 
Upvote
-17 (16 / -33)

Fred Duck

Ars Tribunus Angusticlavius
7,301
Upvote
9 (9 / 0)

stk5

Ars Scholae Palatinae
990
Subscriptor++
Can't say I'm surprised that something else was able to match Mythos, given that Anthropic did zero comparative analysis with anything but their previous version. Not even a comparison with existing static analyzers, let alone other LLMs. Something like a dozen names on the blog post's byline, and apparently nobody thought to do one of the most basic things that'd be expected in a typical CS academic paper.
 
Upvote
2 (3 / -1)
Post content hidden for low score. Show…

arsisloam

Ars Scholae Palatinae
1,359
Subscriptor
I've been using it a bit at work, and it has made writing tests and some functions a bit easier, it's pretty good at spelunking areas that you're unfamiliar with much faster than you can. But you still have to verify all of what it does.
Even on a mature, large codebase, they do inexplicable things sometimes. Like when building some new UI elements, 90% of the time it follows the patterns in the existing code. 10% of the time it does... Something else. It's like a talented junior programmer with ADD. Mostly it's brilliant, but sometimes you're just like "wtf is this non working bullshit?" And you don't really know which one you'll get for any particular prompt; the brilliant, or the bullshit.
 
Upvote
21 (21 / 0)
You can't say good things about AI in general and OpenAI in particular here on Ars.

Everything that LLMs stand for is evil/exploitation/bad/ugly by definition according to the audience here.

Never mind that millions of people nowadays use LLMs to simply diagnose and treat themselves to stay alive. Never mind that even tens of millions of Americans cannot afford healthcare.

"LLMs are bad, period."
"LLMs don't reason, period."
"LLMs only synthesize and hallucinate and they do it by using what human beings have created."
"If you use LLMs, you're extremely incompetent and don't deserve your job."

Now you can leave proper comments here. "Oh, and Sam Altman is the worst."

BTW is AlphaFold also ... bad? And a Nobel prize given for it? Ah, never mind. That's a bit confusing. And all the other AI applications.

BTW, here's a nice news piece:

https://www.science.org/content/article/ai-starting-beat-doctors-making-correct-diagnoses

Too bad Ars won't run a story on it.
Thank you. I have been here for decades and I have never seen such bias against a new technology, never mind one of the most important advances in computer science... ever.

There are so many interesting stories about how various bio/physics/chem etc. labs, pharma, aerospace, and energy grid companies are modifying and using LLMs. The different ways that next-token prediction can be harnessed is completely unexpected and fascinating. It doesn't matter if it is technically 'reasoning' or 'thinking'. But at Ars there is no exploration of that, it's just an echo chamber of psychosis.
 
Upvote
-9 (20 / -29)

quamquam quid loquor

Ars Tribunus Militum
2,914
Subscriptor++
Thank you. I have been here for decades and I have never seen such bias against a new technology, nevermind one of the most important advances in computer science... ever.

There are so many interesting stories about how various bio/physics/chem etc labs, pharma, aerospace, energy grid companies are modifying and using LLMs. The different ways that next token prediction can be harnessed is completely unexpected and fascinating. It doesn't matter if it is technically 'reasoning' or 'thinking'. But at Ars there is no exploration of that, it’s just an echo chamber of psychosis.
It’s mostly defensiveness about the economic/social devaluation of skills built over decades that are a part of their core identity.

Similar principle as this Upton Sinclair quote: "It is difficult to get a man to understand something, when his salary depends on his not understanding it."
 
Upvote
-13 (8 / -21)