Before launching, GPT-4o broke records on chatbot leaderboard under a secret name

Beyond Opinion

Smack-Fu Master, in training
92
Subscriptor
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.
 
Upvote
15 (38 / -23)

Dan Homerick

Ars Praefectus
5,483
Subscriptor++
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.
I don't play any competitive games that use Elo rankings, but if I'm understanding this table correctly, a 50-point gap means something more like "the new model was judged better about 57% of the time", i.e. only 7.15 points above a coin flip.
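For reference, the standard Elo expected-score formula converts a rating gap into a win probability; here's a quick sketch in Python (the specific ratings are made up for illustration):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 50-point gap corresponds to roughly a 57% win rate,
# i.e. only ~7 points above a coin flip.
print(round(elo_expected_score(1300, 1250), 4))  # 0.5715
```

Plugging in any 50-point gap gives an expected score of about 0.571, so the higher-rated model is preferred in roughly 57% of head-to-head judgments.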

I do expect we're well into diminishing returns ... of the test rankings. Once the bots are good enough, a large number of people won't throw a hard enough challenge at the bots to see a difference, and will judge more-or-less randomly.

That is, the new bot could be drastically better (or worse!) at solving differential equations, but given that most people won't ask about something that hard...
 
Upvote
77 (77 / 0)

Malister

Smack-Fu Master, in training
69
Subscriptor
It's an open secret that all the GPT-wielding emperors have no clothes - the moat is not a lone AI, it's about being useful to humans, just like the rest of us attempt to do at our day jobs.
'Being useful to humans' is one of the things I think gets lost in the arguments around the current AI hype. The questions around whether they're 'true' AI, whether they make stuff up, whether they violate copyright, etc. might not change the course if they become increasingly useful. And they are increasingly useful. Compare now to 5 years ago and tell me the Claude/ChatGPT/Gemini style LLMs are a joke.

There are still a lot of problems - the energy consumption (holy energy sink, Batman), the really stupid hype that doesn't match reality, and what this means for workers, to name a few. But as tools they are getting better, so we need to pay attention and not completely dismiss them the way a bunch of people have been.
 
Upvote
51 (56 / -5)
ChatGPT made its own image that was far less disturbing, lol chatbot_version.jpg
 
Upvote
77 (78 / -1)

MHStrawn

Ars Scholae Palatinae
1,432
Subscriptor
'Being useful to humans' is one of the things I think gets lost in the arguments around the current AI hype. The questions around whether they're 'true' AI, whether they make stuff up, whether they violate copyright, etc. might not change the course if they become increasingly useful. And they are increasingly useful. Compare now to 5 years ago and tell me the Claude/ChatGPT/Gemini style LLMs are a joke.

There are still a lot of problems - the energy consumption (holy energy sink, Batman), the really stupid hype that doesn't match reality, and what this means for workers, to name a few. But as tools they are getting better, so we need to pay attention and not completely dismiss them the way a bunch of people have been.
I agree with this.

But regardless of what "usefulness" AI provides I fear it will be grossly outweighed by the negatives.

Why? Because even if AI provides many positives, it seems inevitable that soon we won't be able to easily distinguish between what's real and what's AI-generated. When that occurs, the fear isn't that people will believe things that aren't real... it's that people won't believe anything at all.

More precisely, they won't believe anything at all THAT CONTRADICTS THEIR BELIEFS OR VIEWS. Instead, anything that conflicts with these will be dismissed as "AI" (you see this phenomenon already with "fake news").

When each person believes whatever they choose - with no "authority" able to validate anything - then every person has their own distinct truth.

That's utterly unsustainable and it seems to me no society can really thrive in an environment where no one can agree on basic facts.

This seems likely in the very near future to me (10-15 years) and that's a scary proposition.
 
Upvote
35 (46 / -11)

Kjella

Ars Tribunus Militum
2,081
(...) I do expect we're well into diminishing returns ... of the test rankings. Once the bots are good enough, a large number of people won't throw a hard enough challenge at the bots to see a difference, and will judge more-or-less randomly. (...)
Well, you might get to the point where both answers are perfectly adequate, but when you see them side by side, most people get pretty picky about who explained it more quickly and clearly, or who followed the instructions more precisely. Consistently finding all the right words does have a certain value of its own. There are plenty of more formal benchmarks for testing their capabilities on specific tasks.
 
Upvote
29 (29 / 0)
This seems likely in the very near future to me (10-15 years) and that's a scary proposition.
10-15 years? I think we're there now. We're already seeing court cases where people claim the evidence is AI-generated. Even blood evidence will likely be fakeable within a couple of years.
 
Upvote
20 (27 / -7)
There are still a lot of problems - the energy consumption (holy energy sink, Batman), the really stupid hype that doesn't match reality, and what this means for workers, to name a few. But as tools they are getting better, so we need to pay attention and not completely dismiss them the way a bunch of people have been.
It's not that these things aren't useful and never will be - that's absurd. Tying into what you're saying, what we're going to see is bean counters and C-suite types who have huffed the hype cycle and are going to deploy unfit technology that's "good enough", and that will become the new baseline. How much of customer service has already migrated from IVR to chatbots?

The steam loom wasn't better, it was lower cost per unit. Sure, it cranked out an inferior product, but the bit of that cost drop that got passed on to the consumer meant they were willing to tolerate it. Where do you think it's gonna go when the labor cost can be dropped to near-zero for what are perceived as pure cost positions, no matter how badly quality degrades? How low are you willing to see customer satisfaction drop if it saves you 98% of the cost of an entire division?
 
Upvote
38 (39 / -1)
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.
4%, sure, but compounding over what period? And this is an intermediate model. Let's see where GPT-5 and Claude 4 land. Maybe we'll see a decline in the rate of change. It'll be a big deal either way: have we capped out this core technology, forcing us to look to other CS techniques to make the overall system smarter? The point at which this phase of innovation slows down will be hugely significant for the next few decades (for good or bad, on either outcome).
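The compounding question matters more than the headline number. A rough illustration (the 4% figure is from the comment above; the release count is a made-up placeholder):

```python
# If each release improved quality by ~4% over its predecessor,
# the gains compound across releases rather than add linearly.
per_release_gain = 0.04
releases = 5
cumulative = (1 + per_release_gain) ** releases - 1
print(f"{cumulative:.1%}")  # roughly 21.7% over five releases
```

Whether that's impressive or disappointing depends entirely on how long each release cycle takes.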
 
Upvote
3 (4 / -1)

k h

Ars Centurion
369
Subscriptor
the fear isn't that people will believe things that aren't real....it's that people won't believe anything at all.
More precisely, they won't believe anything at all THAT CONTRADICTS THEIR BELIEFS OR VIEWS. Instead, anything that conflicts with these will be dismissed as "AI" (you see this phenomenon already with "fake news").
It will go the other way too: people will only believe what comes from an AI or is confirmed by an AI. Not just any AI - they'll have a favorite brand that they trust. Some will only accept facts endorsed by Bing, some will only believe Google-branded facts.

Social media influencers will become a thing of the past. Instead we'll have AI influencers: people hired to convince AIs to swallow the client's PR. AI influencers who can prove they are human will command higher prices and more prestige than AI influencers that are AIs. Of course to prove you are human, you will have to get a widely respected AI to vouch for you.
 
Upvote
-6 (5 / -11)

TimeWinder

Ars Tribunus Militum
1,818
Subscriptor
When that occurs the fear isn't that people will believe things that aren't real....it's that people won't believe anything at all.

More precisely, they won't believe anything at all THAT CONTRADICTS THEIR BELIEFS OR VIEWS. Instead, anything that conflicts with these will be dismissed as "AI" (you see this phenomenon already with "fake news").

...

This seems likely in the very near future to me (10-15 years) and that's a scary proposition.
Your time estimate is 25 years too far out. This is basically the state of discourse since about 2014. AI might help it along, but critical thinking was dead the moment it became political.
 
Upvote
21 (21 / 0)

One off

Ars Tribunus Militum
1,547
10-15 years? I think we're there now. We're already seeing court cases where people claim the evidence is AI-generated. Even blood evidence will likely be fakeable within a couple of years.
Criminal courts rely on chain of custody, not unfakeability. A person testifies that they took the photo or that it came from their video feed. A police officer confirms they picked up the knife at the scene, and then it is sealed and tracked through the system, including forensic testing by a person willing to swear that those results are correct to the best of their professional knowledge. Where there is room to muddy the waters is, say, CCTV 'evidence' that you were elsewhere when the crime was committed, but a decent prosecutor will be sure to highlight any doubts. YMMV in more authoritarian or corrupt legal systems.

Civil cases may need more computer forensics people to give an opinion on disputed email chains, video footage, etc. because such evidence is usually provided by one of the motivated parties.
 
Upvote
31 (31 / 0)
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.

There's plenty of reason to think there will be diminishing returns. It's quite probable we'll never get an LLM that can give good legal advice, for instance. Last I checked, it was still a mess. There's just not enough data out there to keep producing large gains. The more specialized the knowledge, the more inept these systems are and will be.

Computerphile recently did a video on a paper that studied this.
 
Upvote
-1 (9 / -10)
Your time estimate is 25 years too far out. This is basically the state of discourse since about 2014. AI might help it along, but critical thinking was dead the moment it became political.

It's not like this is the first time in history that's happened.
Generative AI will make it a lot worse and a lot harder to break out of, though. And it sure seems like it's causing more harm than benefit.
 
Upvote
1 (3 / -2)

stackman

Wise, Aged Ars Veteran
165
...

More precisely, they won't believe anything at all THAT CONTRADICTS THEIR BELIEFS OR VIEWS. Instead, anything that conflicts with these will be dismissed as "AI" (you see this phenomenon already with "fake news").

When each person can choose to believe what they choose - with no "authority" able to validate - then every person has a distinct truth.

...
The thing is, that's already happened. And not in the last year, or the last few years - it's been the case pretty much since, well, since people.
 
Upvote
4 (6 / -2)

Hispalensis

Ars Tribunus Militum
1,904
Subscriptor
There is increasing suspicion that LLM benchmarks are leaking into the training sets (i.e., people asking benchmark questions in interactive sessions that then get rolled into the training data for the next generation). The scores are impressive, but without further validation of what went in and what came out, they're still closer to marketing than to an actual metric of performance.
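A naive sketch of the kind of contamination check that suspicion calls for: flag benchmark items whose word n-grams appear verbatim in a training corpus. The function names and the n-gram approach here are illustrative, not any lab's actual methodology:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams in a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(benchmark_items: list, training_corpus: str, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(training_corpus, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

Real contamination studies are far more sophisticated (fuzzy matching, paraphrase detection), but even a check like this only works if the lab discloses its training data, which is exactly the validation that's missing.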
 
Upvote
1 (5 / -4)
OpenAI submitting anonymously to a leaderboard doesn’t feel very open.

None of these AI projects act as if real-world ethics is a real concern. Instead they pretend the biggest concern is Skynet, which LLMs are never going to be. That draws attention away from things like their inability to eliminate bigotry from their products, and it also makes people think the models are more capable than they are. OpenAI itself sold out on its principles long ago, and in the last year purged anyone who still cared from its board.
 
Upvote
0 (6 / -6)

alors

Ars Centurion
228
Subscriptor++
Kinda makes you realize that you don't need generative models to create engaging stock photos?
No, but AI can come up with them for cheaper than a Getty Images subscription!*

* because they trained the model on Getty Images without paying**

** okay, I don't condone it, but I can kind of understand this one
 
Upvote
2 (4 / -2)