Before launching, GPT-4o broke records on chatbot leaderboard under a secret name

Beyond Opinion

Smack-Fu Master, in training
92
Subscriptor
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.
 
Upvote
15 (38 / -23)

Dan Homerick

Ars Praefectus
5,483
Subscriptor++
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.
I don't play any competitive games that use Elo rankings, but if I'm understanding this table correctly, a 50-point gap means something more like "the new model was judged better about 57% of the time", i.e. only 7.15 points above a coin flip.
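For reference, the standard Elo expected-score formula converts a rating gap into a win probability; here's a quick sketch in Python (the specific ratings are made up for illustration):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 50-point gap corresponds to roughly a 57% win rate,
# i.e. only ~7 points above a coin flip.
print(round(elo_expected_score(1300, 1250), 4))  # 0.5715
```

Plugging in any 50-point gap gives an expected score of about 0.571, so the higher-rated model is preferred in roughly 57% of head-to-head judgments.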

I do expect we're well into diminishing returns ... of the test rankings. Once the bots are good enough, a large number of people won't throw a hard enough challenge at the bots to see a difference, and will judge more-or-less randomly.

That is, the new bot could be drastically better (or worse!) at solving differential equations, but given that most people won't ask about something that hard...
 
Upvote
77 (77 / 0)

Malister

Smack-Fu Master, in training
69
Subscriptor
It's an open secret that all the GPT-wielding emperors have no clothes - the moat is not a lone AI, it's about being useful to humans, just like the rest of us attempt to do at our day jobs.
'Being useful to humans' is one of the things I think gets lost in the arguments around the current AI hype. The questions around whether they're 'true' AI, whether they make stuff up, whether they violate copyright, etc. might not change the course if they become increasingly useful. And they are increasingly useful. Compare now to 5 years ago and tell me the Claude/ChatGPT/Gemini style LLMs are a joke.

There are still a lot of problems - the energy consumption (holy energy sink, Batman), the really stupid hype that doesn't match reality, and what this means for workers, to name a few. But as tools they are getting better, so we need to pay attention and not completely dismiss them the way a bunch of people have been.
 
Upvote
51 (56 / -5)
ChatGPT made its own image that was far less disturbing, lol chatbot_version.jpg
 
Upvote
77 (78 / -1)

MHStrawn

Ars Scholae Palatinae
1,432
Subscriptor
'Being useful to humans' is one of the things I think gets lost in the arguments around the current AI hype. The questions around whether they're 'true' AI, whether they make stuff up, whether they violate copyright, etc. might not change the course if they become increasingly useful. And they are increasingly useful. Compare now to 5 years ago and tell me the Claude/ChatGPT/Gemini style LLMs are a joke.

There are still a lot of problems - the energy consumption (holy energy sink, Batman), the really stupid hype that doesn't match reality, and what this means for workers, to name a few. But as tools they are getting better, so we need to pay attention and not completely dismiss them the way a bunch of people have been.
I agree with this.

But regardless of what "usefulness" AI provides I fear it will be grossly outweighed by the negatives.

Why? Because even if AI provides many positives, it seems inevitable that soon we won't be able to easily distinguish between what's real and what's AI-generated. When that occurs, the fear isn't that people will believe things that aren't real... it's that people won't believe anything at all.

More precisely, they won't believe anything at all THAT CONTRADICTS THEIR BELIEFS OR VIEWS. Instead, anything that conflicts with these will be dismissed as "AI" (you see this phenomenon already with "fake news").

When each person believes whatever they choose - with no "authority" able to validate anything - then every person has their own distinct truth.

That's utterly unsustainable and it seems to me no society can really thrive in an environment where no one can agree on basic facts.

This seems likely in the very near future to me (10-15 years) and that's a scary proposition.
 
Upvote
35 (46 / -11)

Kjella

Ars Tribunus Militum
2,081
(...) I do expect we're well into diminishing returns ... of the test rankings. Once the bots are good enough, a large number of people won't throw a hard enough challenge at the bots to see a difference, and will judge more-or-less randomly. (...)
Well, you might get to the point where both answers are perfectly adequate, but when you see them side by side, most people get pretty picky about who explained it more quickly and clearly, or who followed the instructions more precisely. Consistently finding all the right words does have a certain value of its own. There are plenty of more formal benchmarks for testing their capabilities on specific tasks.
 
Upvote
29 (29 / 0)
This seems likely in the very near future to me (10-15 years) and that's a scary proposition.
10-15 years? I think we're there now. We're already seeing court cases where people claim the evidence is AI-generated. Even blood evidence will likely be fakeable within a couple of years.
 
Upvote
20 (27 / -7)
There are still a lot of problems - the energy consumption (holy energy sink, Batman), the really stupid hype that doesn't match reality, and what this means for workers, to name a few. But as tools they are getting better, so we need to pay attention and not completely dismiss them the way a bunch of people have been.
It's not that these things aren't useful and never will be - that's absurd. Tying into what you're saying, what we're going to see is bean counters and C-suite types who have huffed the hype cycle and are going to deploy unfit technology that's "good enough", and that will become the new baseline. How much of customer service has already migrated from IVR to chatbots?

The steam loom wasn't better, it was lower cost per unit. Sure, it cranked out an inferior product, but the bit of that cost drop that got passed on to the consumer meant they were willing to tolerate it. Where do you think it's gonna go when the labor cost can be dropped to near-zero for what are perceived as pure cost positions, no matter how badly quality degrades? How low are you willing to see customer satisfaction drop if it saves you 98% of the cost of an entire division?
 
Upvote
38 (39 / -1)
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.
4%, sure, but compounding over what period? And this is an intermediate model. Let's see where GPT-5 and Claude 4 land. Maybe we'll see a decline in the rate of change. It'll be a big deal either way: have we capped out this core technology, forcing us to look to other CS techniques to make the overall system smarter? The point at which this phase of innovation slows down will be hugely significant for the next few decades (for good or bad, on either outcome).
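The compounding question matters more than the headline number. A rough illustration (the 4% figure is from the comment above; the release count is a made-up placeholder):

```python
# If each release improved quality by ~4% over its predecessor,
# the gains compound across releases rather than add linearly.
per_release_gain = 0.04
releases = 5
cumulative = (1 + per_release_gain) ** releases - 1
print(f"{cumulative:.1%}")  # roughly 21.7% over five releases
```

Whether that's impressive or disappointing depends entirely on how long each release cycle takes.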
 
Upvote
3 (4 / -1)

k h

Ars Centurion
369
Subscriptor
the fear isn't that people will believe things that aren't real....it's that people won't believe anything at all.
More precisely, they won't believe anything at all THAT CONTRADICTS THEIR BELIEFS OR VIEWS. Instead, anything that conflicts with these will be dismissed as "AI" (you see this phenomenon already with "fake news").
It will go the other way too: people will only believe what comes from an AI or is confirmed by an AI. Not just any AI - they'll have a favorite brand that they trust. Some will only accept facts endorsed by Bing, some will only believe Google-branded facts.

Social media influencers will become a thing of the past. Instead we'll have AI influencers: people hired to convince AIs to swallow the client's PR. AI influencers who can prove they are human will command higher prices and more prestige than AI influencers that are AIs. Of course to prove you are human, you will have to get a widely respected AI to vouch for you.
 
Upvote
-6 (5 / -11)

TimeWinder

Ars Tribunus Militum
1,818
Subscriptor
When that occurs the fear isn't that people will believe things that aren't real....it's that people won't believe anything at all.

More precisely, they won't believe anything at all THAT CONTRADICTS THEIR BELIEFS OR VIEWS. Instead, anything that conflicts with these will be dismissed as "AI" (you see this phenomenon already with "fake news").

...

This seems likely in the very near future to me (10-15 years) and that's a scary proposition.
Your time estimate is 25 years too far out. This is basically the state of discourse since about 2014. AI might help it along, but critical thinking was dead the moment it became political.
 
Upvote
21 (21 / 0)

One off

Ars Tribunus Militum
1,547
10-15 years? I think we're there now. We're already seeing court cases where people claim the evidence is AI-generated. Even blood evidence will likely be fakeable within a couple of years.
Criminal courts rely on chain of custody, not unfakeability. A person testifies that they took the photo or that it came from their video feed. A police officer confirms they picked up the knife at the scene, and then it is sealed and tracked through the system, including forensic testing by a person willing to swear that those results are correct to the best of their professional knowledge. Where there is room to muddy the waters is, say, CCTV 'evidence' that you were elsewhere when the crime was committed, but a decent prosecutor will be sure to highlight any doubts. YMMV in more authoritarian or corrupt legal systems.

Civil cases may need more computer forensics people to give an opinion on disputed email chains, video footage, etc. because such evidence is usually provided by one of the motivated parties.
 
Upvote
31 (31 / 0)
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.

There's plenty of reason to think there will be diminishing returns. It's quite probable we'll never get an LLM that can give good legal advice, for instance. Last I checked, it was still a mess. There's just not enough data out there to keep producing large gains. The more specialized the knowledge, the more inept these systems are and will be.

Computerphile recently did a video on a paper that studied this.
 
Upvote
-1 (9 / -10)
Your time estimate is 25 years too far out. This is basically the state of discourse since about 2014. AI might help it along, but critical thinking was dead the moment it became political.

It's not like this is the first time in history that's happened.
Generative AI will make it a lot worse and a lot harder to break out of, though. And it sure seems like it's causing more harm than benefit.
 
Upvote
1 (3 / -2)

stackman

Wise, Aged Ars Veteran
165
...

More precisely, they won't believe anything at all THAT CONTRADICTS THEIR BELIEFS OR VIEWS. Instead, anything that conflicts with these will be dismissed as "AI" (you see this phenomenon already with "fake news").

When each person can choose to believe what they choose - with no "authority" able to validate - then every person has a distinct truth.

...
The thing is, that's already happened. And not in the last year, or the last few years - it's been the case pretty much since, well, since people.
 
Upvote
4 (6 / -2)

Hispalensis

Ars Tribunus Militum
1,904
Subscriptor
There is increasing suspicion that LLM benchmarks are leaking into the training sets (i.e., people asking benchmark questions in interactive sessions that then get rolled into the training data for the next generation). The scores are impressive, but without further validation of what went in and what came out, they're still closer to marketing than to an actual metric of performance.
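A naive sketch of the kind of contamination check that suspicion calls for: flag benchmark items whose word n-grams appear verbatim in a training corpus. The function names and the n-gram approach here are illustrative, not any lab's actual methodology:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams in a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(benchmark_items: list, training_corpus: str, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(training_corpus, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

Real contamination studies are far more sophisticated (fuzzy matching, paraphrase detection), but even a check like this only works if the lab discloses its training data, which is exactly the validation that's missing.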
 
Upvote
1 (5 / -4)
OpenAI submitting anonymously to a leaderboard doesn’t feel very open.

None of these AI projects act as if real-world ethics is a real concern. Instead they pretend the biggest concern is Skynet, which LLMs are never going to be. That draws attention away from things like their inability to eliminate bigotry from their products, and it also makes people think the models are more capable than they are. OpenAI itself sold out on its principles long ago, and in the last year purged anyone who still cared from its board.
 
Upvote
0 (6 / -6)

alors

Ars Centurion
228
Subscriptor++
Kinda makes you realize that you don't need generative models to create engaging stock photos?
No, but AI can come up with them for cheaper than a Getty Images subscription!*

* because they trained the model on Getty Images without paying**

** okay, I don't condone it, but I can kind of understand this one
 
Upvote
2 (4 / -2)