Before launch, GPT-4o broke records on the chatbot leaderboard under a secret name

I don't understand one thing: if "AI experts" are "frustrated" about the non-transparent, non-scientific aspects of the LMSYS process, then stop complaining about your hurt feelings and make your own objective test. Make it now. Nobody is stopping you. We can have many, many tests. It will probably cost money and require effort and careful planning, but making a better version of something always does. Versus just criticizing something others have built, tempting though that might be.

And once you release your new Awesome Fully Scientific Chatbot Scoring System, tell everyone about it, explain how much better it is, and you will be in charge of "the vibe check".

If you can't make one because of a never-ending fight over what would be "an objective" test, then that's your problem. Consider "good enough" versus elusively perfect. The public will gladly use a test that is "good enough" instead of waiting 10+ years for researchers to finish duking it out over methodology.
 
No, but AI can come up with them for cheaper than a Getty Images subscription!*

* because they trained the model on Getty Images without paying**

** okay, I don't condone it, but I can kind of understand this one
How do you think humans learn? They train on existing content. Should an artist have to pay Getty Images every time they see one of its pictures somewhere online?
 

MHStrawn

Ars Scholae Palatinae
1,432
Subscriptor
The thing is, that's already happened. And not in the last year, or the last few years - it's been the case pretty much since, well, since people.
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.
 
Chess has objective outcomes. What is the objective test for chatbots?
Fair point, but I don't think using Elo for gymnastics would be weird, and gymnastics has subjective scoring just like this system. Granted, those judges are trained, but since LLMs are supposed to work for untrained users, it doesn't seem weird to me to evaluate them based on the responses and experience of untrained users, at least as one benchmark. (Obviously there should also be safety evaluation by the manufacturers.)
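For what it's worth, a leaderboard built from pairwise human votes typically applies an Elo-style update after each comparison. A minimal sketch of that update (the K-factor of 32 and the 1000 starting rating here are arbitrary illustration choices, not LMSYS's actual parameters):

```python
def expected_score(r_a, r_b):
    """Elo model: probability that A beats B, given their ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Update both ratings after one pairwise vote.

    a_won: 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    e_b = 1.0 - e_a
    return r_a + k * (a_won - e_a), r_b + k * ((1.0 - a_won) - e_b)

# Two models start equal; a single user vote for A moves 16 points between them.
a, b = elo_update(1000, 1000, 1.0)
print(a, b)  # 1016.0 984.0
```

Summed over thousands of votes, these small updates converge toward a stable ordering - which is all the leaderboard number is: aggregated untrained-user preference.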
 
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.
You seem to be talking about the explosion of cable media, which preceded the rise of right-wing media and which initially was mainly radio. The universal cable-channel support for the Iraq war shows the Washington consensus still had a vice grip on public dialog at that time; dissenters like Donahue were kicked off the air. The thing that really changed this was the internet and social media.
 

JoHBE

Ars Praefectus
4,231
Subscriptor++
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.
The technology is now there to absolutely maximize fragmentation. Because of the stochastic nature of these models, each query to an LLM produces a unique answer, even for identical input. Even if we simply disregard intentional biasing, we're all about to be served at least slightly different information from now on (if ChatGPT replaces, or becomes the interface to, classic search). And think about what happens when everyone can start generating personal entertainment media...
 

Dan Homerick

Ars Praefectus
5,483
Subscriptor++
And what is the Elo of an average human?
Especially with a deadline of no more than a minute or two in which to research and compose their answer.

It's not a Turing test; it's more of a usefulness test. The "human" score has to be pretty low for most knowledge-based probing*. A human could probably get a few wins if the judge was asking questions that need a lot of "reasoning" in the response.

[*] LLMs get stuff wrong all the time, but the judge and the human would need some expertise in a shared subject, the questions would need to hit that subject, and it would have to be a question that isn't commonly asked and answered on the Internet.
 

bugsbony

Ars Scholae Palatinae
1,036
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.
I do think that the limited number of TV channels had a huge impact on society. But I'm not sure it was all that rosy either: I wasn't born then, but I heard about this whole McCarthy thing, and, as a Bob Dylan fan, I often think about the lyrics of "Talkin' John Birch Paranoid Blues".
 

jhesse

Ars Scholae Palatinae
746
Subscriptor
Gotta say, that might be the weirdest stock image I have seen all year.
Perfect metaphor for the de-humanizing aspect of this technology.

Also a good metaphor for how poorly thought out all of this is. I mean... for the photoshoot they didn't even bother to zip up the suit.
 

Dan Homerick

Ars Praefectus
5,483
Subscriptor++
Chat bots are a big step down in the realm of automated differential equation solvers.
Sure, but the bot can use one of those on the backend. Or even the main chat bot could delegate the task to a different bot that is specifically trained on using sci/math packages.

With delegation, one public-facing bot/API really can become an expert in all things.

"I know kung fu"
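That delegation pattern can be sketched in a few lines (every name here is invented for illustration; real tool-use APIs look different): a front-end bot classifies the request and hands math off to an exact solver instead of answering from the language model itself.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Specialized backend 'tool': solve a*x + b = 0 exactly, no hallucination."""
    return Fraction(-b, a)

def chat_answer(prompt):
    """Stand-in for the general chat model (hypothetical)."""
    return f"[chat model response to: {prompt}]"

def dispatch(prompt):
    """Front-end bot: route math-looking requests to the exact solver."""
    if prompt.startswith("solve "):
        # Toy grammar for this sketch: "solve 3x+6" means 3x + 6 = 0.
        coef, const = prompt[len("solve "):].split("x+")
        return f"x = {solve_linear(int(coef), int(const))}"
    return chat_answer(prompt)

print(dispatch("solve 3x+6"))      # x = -2
print(dispatch("tell me a joke"))  # falls through to the chat model
```

The user only ever talks to `dispatch`; whether the answer came from a stochastic model or a deterministic solver is invisible to them, which is the whole appeal.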
 

name99

Ars Tribunus Angusticlavius
6,241
I don't understand one thing: if "AI experts" are "frustrated" about the non-transparent, non-scientific aspects of the LMSYS process, then stop complaining about your hurt feelings and make your own objective test. Make it now. Nobody is stopping you. We can have many, many tests. It will probably cost money and require effort and careful planning, but making a better version of something always does. Versus just criticizing something others have built, tempting though that might be.

And once you release your new Awesome Fully Scientific Chatbot Scoring System, tell everyone about it, explain how much better it is, and you will be in charge of "the vibe check".

If you can't make one because of a never-ending fight over what would be "an objective" test, then that's your problem. Consider "good enough" versus elusively perfect. The public will gladly use a test that is "good enough" instead of waiting 10+ years for researchers to finish duking it out over methodology.
More than that, they're being stupid. People can read something like Seeing Like a State all they like in school, then take away absolutely nothing from it...

The problem is not "bad benchmarks run by naughty people", it's that what's happening is fundamentally illegible. You cannot compress into one single benchmark score the Cambrian Explosion of variety that's happening right now.
Do you care more about accurate retrieval of factoids, or the ability to analyze video?
Do you care more about math reasoning ability or the quality of AI-faked emotion in the conversation?
Do you care more about speed of response, or the size of the context window the model supports?
etc etc

Would you take seriously someone who complained that the biggest problem in transportation right now is that there are no benchmarks that compare hopping on one leg to bicycling to sailing a catamaran to flying on a 787, and it's all a conspiracy by big Transport to keep these numbers from us?
 

name99

Ars Tribunus Angusticlavius
6,241
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.

Uh, yes, all true.
And people like Noam Chomsky complained about this endlessly, and said those facts were wrong, and that it was an outrage that everyone was blinkered into agreeing with them. That's basically the entire content of his pre-internet oeuvre (most obviously Manufacturing Consent).

So the left got exactly what it wanted, and now it's even more unhappy...
There are multiple lessons there, starting with "do you want to give power to a bunch of people incapable of extrapolating the most obvious consequences of any change to society?"
 

Snark218

Ars Legatus Legionis
36,743
Subscriptor
An impressive achievement nobody was really asking for. Nobody actually wants to talk to a fucking chatbot, even if it can chuckle at probabilistically determined times. It's a novelty and a toy, but there are only so many times you can crank the music box before the clown popping out the top loses its punch. So what does it do? AI is a tool. It's part of a workflow. It's not the workflow itself, it's not the product, and it's not the goal - except in the minds of weirdos like Sam Altman and Marc Andreessen. So once everybody gets bored with it, what does -4o do?
 
The steam loom wasn't better, it was lower cost per unit. Sure, it cranked out an inferior product, but the bit of that cost drop that got passed on to the consumer meant they were willing to tolerate it. Where do you think it's gonna go when the labor cost can be dropped to near-zero for what are perceived as pure cost positions, no matter how bad quality degrades? How low are you willing to see customer satisfaction drop if it saves you 98% of the cost of an entire division?
I'd kind of expect customer-service-like positions to improve in quality. At least after they figure out how to use the technology.

Other stuff will be more of a compromise. The very weird cases will be the many white-collar professions that require paying someone $100/hr.
 
I'd kind of expect customer-service-like positions to improve in quality.
Why? At least with a human, you can occasionally push them out of the scripted loop. A chatbot simply can't. "Sorry, even though we clearly screwed you and something sapient can easily determine this edge case is beyond the pale, this is what the policy says. If you'd like to further dispute this outcome, please e-mail sitandspin@utilitycompany.com between the hours of 3pm and 5pm on alternating Thursdays. Is there anything else I can help you with? Is there anything else I can help you with? Is there anything else I can help you with?" Think infuriating IVR, but with less shouting single keywords and more shouting conversationally.
 
Why? At least with a human, you can occasionally push them out of the scripted loop. A chatbot simply can't. "Sorry, even though we clearly screwed you and something sapient can easily determine this edge case is beyond the pale, this is what the policy says. If you'd like to further dispute this outcome, please e-mail sitandspin@utilitycompany.com between the hours of 3pm and 5pm on alternating Thursdays. Is there anything else I can help you with? Is there anything else I can help you with? Is there anything else I can help you with?" Think infuriating IVR, but with less shouting single keywords and more shouting conversationally.
I suspect, with a little effort, you could get an LLM to identify what counts as "beyond the pale" better than a typical trained monkey. The question is more how it would be rolled out.
 
This was interesting, by the way: ChatGPT-4o is pretty good at playing GeoGuessr. It's basically a mix of image recognition, simple reasoning, geographic knowledge, and copying how humans play. None of it is a huge scary deal, but it's impressive for something that wasn't explicitly built to, for example, look at a picture and know which side of the road the cars are driving on, and do something with that information.


View: https://www.youtube.com/watch?v=dZr1tsFHxag
 
It's not that these things are not, and never will be, useful - that's absurd. Tied in to what you're saying, what we're going to see is bean counters and C-suite types who have huffed the hype cycle deploying unfit technology that's "good enough", and that will become the new baseline. How much of customer service has already migrated from IVR to chatbots?

The steam loom wasn't better, it was lower cost per unit. Sure, it cranked out an inferior product, but the bit of that cost drop that got passed on to the consumer meant they were willing to tolerate it. Where do you think it's gonna go when the labor cost can be dropped to near-zero for what are perceived as pure cost positions, no matter how bad quality degrades? How low are you willing to see customer satisfaction drop if it saves you 98% of the cost of an entire division?
I'm just waiting for the day a midsize company uses an AI as its CEO just barely successfully enough that the owners of larger companies start eyeing their own CEOs and questioning whether hundreds of millions of dollars for a single human is worth it.

Plus, at least a computer won't sexually harass employees, adding more cost savings.
 

name99

Ars Tribunus Angusticlavius
6,241
An impressive achievement nobody was really asking for. Nobody actually wants to talk to a fucking chatbot, even if it can chuckle at probabilistically determined times. It's a novelty and a toy, but there are only so many times you can crank the music box before the clown popping out the top loses its punch. So what does it do? AI is a tool. It's part of a workflow. It's not the workflow itself, it's not the product, and it's not the goal - except in the minds of weirdos like Sam Altman and Marc Andreessen. So once everybody gets bored with it, what does -4o do?

How old are you? Are you willing to learn from experience?
In MY time people have said
  • no-one would be willing to talk into a bluetooth headset
  • no-one would be willing to use videochat
  • no-one would be willing to talk to their phone (eg Siri, then things like Alexa)

And yes, for the first few months to years, all of those were kinda true. It takes time for people to get used to a new way of doing things, and to work out when the new way feels better or feels appropriate. I'm no different, each of those felt slightly uncomfortable to me at first. But we adapt, and we learn.

To insist that talking to a device a la ChatGPT will never work, for the reasons you give, shows a remarkably clueless attitude toward the history of technology. I'm prepared to entertain serious arguments about the value (or not) of ChatGPT, but I'm not going to listen further to an argument that's essentially "this is a new UI, and new UIs will never succeed, QED".
 

grahammayer

Smack-Fu Master, in training
1
I think it's great that OpenAI's GPT-4o model has achieved such high scores on the chatbot leaderboard. It's a testament to the hard work and dedication of the team that developed it. I'm also glad that OpenAI has decided to release the model to the public, as it will allow researchers and developers to build even more powerful and sophisticated language models.

However, I do have some concerns about the lack of transparency surrounding the development of GPT-4o. OpenAI kept the name of the model a secret while it was being tested, which frustrated some experts. I believe that it's important for companies to be more transparent about their work, especially when it comes to artificial intelligence.

Overall, I'm excited to see what the future holds for GPT-4o and other large language models. I believe that these models have the potential to revolutionize the way we interact with computers.
 

Snark218

Ars Legatus Legionis
36,743
Subscriptor
How old are you? Are you willing to learn from experience?
In MY time people have said
  • no-one would be willing to talk into a bluetooth headset
  • no-one would be willing to use videochat
  • no-one would be willing to talk to their phone (eg Siri, then things like Alexa)
False equivalencies. Those are all technologies that fill a need. A Bluetooth headset offers convenience. Videochat has always had plenty of appeal, for obvious reasons. Siri lets you control your phone (badly) if your hands aren't free. With the exception of Siri, which in my experience people mostly tolerate to send texts while driving, all those examples serve a concrete need and use case. They do something. What does a chuckling chatbot.....do?

My point is not that "no-one will be willing to talk to a chatbot." My point is that a glorified chat bot is not actually filling a need. It could be part of something that fills a need, potentially. But it's not actually useful in and of itself. It's certainly not a reliable text generator. Talking to a chat bot is kind of an interesting novelty, for a few minutes, but it's hardly an end unto itself.

To insist that talking to a device a la ChatGPT will never work, for the reasons you give, shows a remarkably clueless attitude toward the history of technology. I'm prepared to entertain serious arguments about the value (or not) of ChatGPT, but I'm not going to listen further to an argument that's essentially "this is a new UI, and new UIs will never succeed, QED".
Good thing I'm not making that argument, then. I don't think ChatGPT is bullshit because it's a new UI. I think it's bullshit because it's not actually a UI at all. You're not interfacing with anything but a mindless probabilistic generator of bullshit that sounds vaguely like it's written or spoken by a human, until it doesn't. Businesses keep getting bitten in the ass trying to use it for customer service because it keeps giving customers wrong information. Lawyers get reamed out by judges because the model makes up nonexistent precedent. Students get accused of plagiarism. Every model hallucinates, at some point, and every one of them is functionally hamstrung by it.

Now, under the uselessly broad definition currently being used for AI, there's plenty of "AI" that could be very useful - data harvesting, large dataset analysis for pharmaceuticals and biochemistry, on-device small language models that could make Siri suck less abominably, et cetera and so on. But this particular manifestation is bullshit. Not because it's unfamiliar or novel, but because it's a solution looking for a problem.
 
How do you think humans learn? They train on existing content. Should an artist have to pay Getty Images every time they see one of its pictures somewhere online?

If humans were a product owned and controlled by a corporation for purposes of profit, then this analogy would be relevant. However, we are not.
 
With the exception of Siri, which in my experience people mostly tolerate to send texts while driving, all those examples serve a concrete need and use case. They do something. What does a chuckling chatbot.....do?
In this case, it's obviously being sold as a new Alexa, which is already a chatbot that chuckles.

But I don't know. I think generative AI is the most useful technology of the last 20 years. These models are going to fill countless spaces, and you'd have to be the most pessimistic person in the world not to see that. Or maybe I'm the most optimistic person in the world. Either way, it feels like a weird conversation.

If you need a starting place, think about the things you already use technology for: transcription, translation, editing, proofreading, debugging, internet searches, file search, image search, phone control, computer control, games, etc. Those are all things even you should feel comfortable with, and all of them will be improved immediately by generative AI.

But yeah, I'm also more interested in what it will do to the economy as a whole. And what it will do to the global south. And the future of humanity. And the whole wild west of possibilities that exist right now.

And yes, even the gimmicks are amazing. One thing my little kid likes to do is ask Alexa for a story. Alexa lets her choose a little cartoon character and a setting and have it generate a little story. The characters are layered sprite models, and the story is one of a small number of permutations. It is quite stupid. And I can't help but think: my God, these things can generate an excellent, living, choose-your-own-adventure story, complete with high-quality art, voice acting, music, sound effects, and a character, plot, and style of your choosing. Maybe there is something deeply wrong about all that, but holy shit, no, I don't think people are going to stop using it.
 