Before launch, GPT-4o broke records on the chatbot leaderboard under a secret name

I don't understand one thing: if "AI experts" are "frustrated" about the non-transparent, non-scientific aspects of the LMSYS process, then stop complaining about your hurt feelings and make your own objective test. Make it now. Nobody is stopping you. We can have many, many tests. It will probably cost money and require effort and careful planning, but making a better version of something always does. Versus just criticizing something others have built, tempting though that might be.

And once you release your new Awesome Fully Scientific Chatbot Scoring System, tell everyone about it, explain how much better it is, and you will be in charge of "the vibe check".

If you can't make one because of a never-ending fight over what would be "an objective" test, then that's your problem. Consider "good enough" versus elusively perfect. The public will gladly use a test that is "good enough" instead of waiting 10+ years for researchers to finish duking it out over methodology.
 
No, but AI can come up with them for cheaper than a Getty Images subscription!*

* because they trained the model on Getty Images without paying**

** okay, I don't condone it, but I can kind of understand this one
How do you think humans learn? They train on existing content. Should an artist have to pay Getty Images every time they see one of its pictures somewhere online?
 

MHStrawn

Ars Scholae Palatinae
1,432
Subscriptor
The thing is, that's already happened. And not in the last year, or the last few years - it's been the case pretty much since, well, since people.
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.
 
Chess has objective outcomes. What is the objective test for chatbots?
Fair point, but I don't think using Elo for gymnastics would be weird, and gymnastics has subjective scoring just like this system. Granted, those judges are trained, but since LLMs are supposed to work for untrained users, it doesn't seem weird to me to evaluate them based on the responses and experience of untrained users, at least as one benchmark. (Obviously there should also be safety evaluation by the manufacturers.)
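For what it's worth, a leaderboard built from pairwise human votes typically applies an Elo-style update after each comparison. A minimal sketch of that update (the K-factor of 32 and the 1000 starting rating here are arbitrary illustration choices, not LMSYS's actual parameters):

```python
def expected_score(r_a, r_b):
    """Elo model: probability that A beats B, given their ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Update both ratings after one pairwise vote.

    a_won: 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    e_b = 1.0 - e_a
    return r_a + k * (a_won - e_a), r_b + k * ((1.0 - a_won) - e_b)

# Two models start equal; a single user vote for A moves 16 points between them.
a, b = elo_update(1000, 1000, 1.0)
print(a, b)  # 1016.0 984.0
```

Summed over thousands of votes, these small updates converge toward a stable ordering - which is all the leaderboard number is: aggregated untrained-user preference.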
 
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.
You seem to be talking about the explosion of cable media, which preceded the rise of right-wing media and which initially was mainly radio. The universal cable-channel support for the Iraq war shows the Washington consensus still had a vice grip on public dialog at that time; dissenters like Donahue were kicked off the air. The thing that really changed this was the internet and social media.
 

JoHBE

Ars Praefectus
4,231
Subscriptor++
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.
The technology is now there to absolutely maximize fragmentation. Because of the stochastic nature of these models, each query to an LLM produces a unique answer, even for identical input. Even if we simply disregard intentional biasing, we're all about to be served at least slightly different information from now on (if ChatGPT replaces, or becomes the interface to, classic search). And think about what happens when everyone can start generating personal entertainment media...
 

Dan Homerick

Ars Praefectus
5,483
Subscriptor++
And what is the Elo of an average human?
Especially with a deadline of no more than a minute or two in which to research and compose their answer.

It's not a Turing test; it's more of a usefulness test. The "human" score has to be pretty low for most knowledge-based probing*. A human could probably get a few wins if the judge was asking questions that need a lot of "reasoning" in the response.

[*] LLMs get stuff wrong all the time, but the judge and the human would need some expertise in a shared subject, the questions would need to hit that subject, and it would have to be a question that isn't commonly asked and answered on the Internet.
 

bugsbony

Ars Scholae Palatinae
1,036
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.
I do think that the limited number of TV channels had a huge impact on society. But I'm not sure it was all that rosy either: I wasn't born then, but I heard about this whole McCarthy thing, and, as a Bob Dylan fan, I often think about the lyrics of "Talkin' John Birch Paranoid Blues".
 

jhesse

Ars Scholae Palatinae
746
Subscriptor
Gotta say, that might be the weirdest stock image I have seen all year.
Perfect metaphor for the de-humanizing aspect of this technology.

Also a good metaphor for how poorly thought out all of this is. I mean... for the photoshoot they didn't even bother to zip up the suit.
 

Dan Homerick

Ars Praefectus
5,483
Subscriptor++
Chat bots are a big step down in the realm of automated differential equation solvers.
Sure, but the bot can use one of those on the backend. Or even the main chat bot could delegate the task to a different bot that is specifically trained on using sci/math packages.

With delegation, one public-facing bot/API really can become an expert in all things.

"I know kung fu"
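That delegation pattern can be sketched in a few lines (every name here is invented for illustration; real tool-use APIs look different): a front-end bot classifies the request and hands math off to an exact solver instead of answering from the language model itself.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Specialized backend 'tool': solve a*x + b = 0 exactly, no hallucination."""
    return Fraction(-b, a)

def chat_answer(prompt):
    """Stand-in for the general chat model (hypothetical)."""
    return f"[chat model response to: {prompt}]"

def dispatch(prompt):
    """Front-end bot: route math-looking requests to the exact solver."""
    if prompt.startswith("solve "):
        # Toy grammar for this sketch: "solve 3x+6" means 3x + 6 = 0.
        coef, const = prompt[len("solve "):].split("x+")
        return f"x = {solve_linear(int(coef), int(const))}"
    return chat_answer(prompt)

print(dispatch("solve 3x+6"))      # x = -2
print(dispatch("tell me a joke"))  # falls through to the chat model
```

The user only ever talks to `dispatch`; whether the answer came from a stochastic model or a deterministic solver is invisible to them, which is the whole appeal.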
 

name99

Ars Tribunus Angusticlavius
6,241
I don't understand one thing: if "AI experts" are "frustrated" about the non-transparent, non-scientific aspects of the LMSYS process, then stop complaining about your hurt feelings and make your own objective test. Make it now. Nobody is stopping you. We can have many, many tests. It will probably cost money and require effort and careful planning, but making a better version of something always does. Versus just criticizing something others have built, tempting though that might be.

And once you release your new Awesome Fully Scientific Chatbot Scoring System, tell everyone about it, explain how much better it is, and you will be in charge of "the vibe check".

If you can't make one because of a never-ending fight over what would be "an objective" test, then that's your problem. Consider "good enough" versus elusively perfect. The public will gladly use a test that is "good enough" instead of waiting 10+ years for researchers to finish duking it out over methodology.
More than that, they're being stupid. People can read something like Seeing Like a State all they like in school, then take away absolutely nothing from it...

The problem is not "bad benchmarks run by naughty people", it's that what's happening is fundamentally illegible. You cannot compress into one single benchmark score the Cambrian Explosion of variety that's happening right now.
Do you care more about accurate retrieval of factoids, or the ability to analyze video?
Do you care more about math reasoning ability or the quality of AI-faked emotion in the conversation?
Do you care more about speed of response, or the size of the context window the model supports?
etc etc

Would you take seriously someone who complained that the biggest problem in transportation right now is that there are no benchmarks that compare hopping on one leg to bicycling to sailing a catamaran to flying on a 787, and it's all a conspiracy by big Transport to keep these numbers from us?
 

name99

Ars Tribunus Angusticlavius
6,241
Have to disagree.

From the time I was born in the mid-'60s until, oh, the explosion of right-wing media, the vast majority of Americans agreed on general facts. This might have been a unique time period, but it was the norm for many. IMO this was due to the limited means of media distribution: there were only three TV channels, and those three channels expressed similar attitudes about the rule of law, the value of democracy, etc.

Uh, yes, all true.
And people like Noam Chomsky complained about this endlessly, and said those facts were wrong, and that it was an outrage that everyone was blinkered into agreeing with them. That's basically the entire content of his pre-internet oeuvre (most obviously Manufacturing Consent).

So the left got exactly what it wanted, and now it's even more unhappy...
There are multiple lessons there, starting with "do you want to give power to a bunch of people incapable of extrapolating the most obvious consequences of any change to society?"
 

Snark218

Ars Legatus Legionis
36,743
Subscriptor
An impressive achievement nobody was really asking for. Nobody actually wants to talk to a fucking chatbot, even if it can chuckle at probabilistically determined times. It's a novelty and a toy, but there are only so many times you can crank the music box before the clown popping out the top loses its punch. So what does it do? AI is a tool. It's part of a workflow. It's not the workflow itself, it's not the product, and it's not the goal - except in the minds of weirdos like Sam Altman and Marc Andreessen. So once everybody gets bored with it, what does -4o do?
 
The steam loom wasn't better, it was lower cost per unit. Sure, it cranked out an inferior product, but the bit of that cost drop that got passed on to the consumer meant they were willing to tolerate it. Where do you think it's gonna go when the labor cost can be dropped to near-zero for what are perceived as pure cost positions, no matter how bad quality degrades? How low are you willing to see customer satisfaction drop if it saves you 98% of the cost of an entire division?
I'd kind of expect customer-service-like positions to improve in quality. At least after they figure out how to use the technology.

Other stuff will be more of a compromise. The very weird cases will be the many white-collar professions that require paying someone $100/hr.
 
I'd kind of expect customer-service-like positions to improve in quality.
Why? At least with a human, you can occasionally push them out of the scripted loop. A chatbot simply can't. "Sorry, even though we clearly screwed you and something sapient can easily determine this edge case is beyond the pale, this is what the policy says. If you'd like to further dispute this outcome, please e-mail sitandspin@utilitycompany.com between the hours of 3pm and 5pm on alternating Thursdays. Is there anything else I can help you with? Is there anything else I can help you with? Is there anything else I can help you with?" Think infuriating IVR, but with less shouting single keywords and more shouting conversationally.
 
Why? At least with a human, you can occasionally push them out of the scripted loop. A chatbot simply can't. "Sorry, even though we clearly screwed you and something sapient can easily determine this edge case is beyond the pale, this is what the policy says. If you'd like to further dispute this outcome, please e-mail sitandspin@utilitycompany.com between the hours of 3pm and 5pm on alternating Thursdays. Is there anything else I can help you with? Is there anything else I can help you with? Is there anything else I can help you with?" Think infuriating IVR, but with less shouting single keywords and more shouting conversationally.
I suspect, with a little effort, you could get an LLM to identify what counts as "beyond the pale" better than a typical trained monkey. The question is more how it would be rolled out.
 
This was interesting, by the way: ChatGPT-4o is pretty good at playing GeoGuessr. It's basically a mix of image recognition, simple reasoning, geographic knowledge, and copying how humans play. None of it is a huge scary deal, but it's impressive for something that wasn't explicitly built to, for example, look at a picture and know which side of the road the cars are driving on, and do something with that information.


View: https://www.youtube.com/watch?v=dZr1tsFHxag
 
It's not that these things are not, and never will be, useful - that's absurd. Tied in to what you're saying, what we're going to see is bean counters and C-suite types who have huffed the hype cycle deploying unfit technology that's "good enough", and that will become the new baseline. How much of customer service has already migrated from IVR to chatbots?

The steam loom wasn't better, it was lower cost per unit. Sure, it cranked out an inferior product, but the bit of that cost drop that got passed on to the consumer meant they were willing to tolerate it. Where do you think it's gonna go when the labor cost can be dropped to near-zero for what are perceived as pure cost positions, no matter how bad quality degrades? How low are you willing to see customer satisfaction drop if it saves you 98% of the cost of an entire division?
I'm just waiting for the day a midsize company uses an AI as its CEO just barely successfully enough that the owners of larger companies start eyeing their own CEOs and questioning whether hundreds of millions of dollars for a single human is worth it.

Plus, at least a computer won't sexually harass employees, adding more cost savings.
 

name99

Ars Tribunus Angusticlavius
6,241
An impressive achievement nobody was really asking for. Nobody actually wants to talk to a fucking chatbot, even if it can chuckle at probabilistically determined times. It's a novelty and a toy, but there are only so many times you can crank the music box before the clown popping out the top loses its punch. So what does it do? AI is a tool. It's part of a workflow. It's not the workflow itself, it's not the product, and it's not the goal - except in the minds of weirdos like Sam Altman and Marc Andreessen. So once everybody gets bored with it, what does -4o do?

How old are you? Are you willing to learn from experience?
In MY time people have said
  • no-one would be willing to talk into a bluetooth headset
  • no-one would be willing to use videochat
  • no-one would be willing to talk to their phone (eg Siri, then things like Alexa)

And yes, for the first few months to years, all of those were kinda true. It takes time for people to get used to a new way of doing things, and to work out when the new way feels better or feels appropriate. I'm no different, each of those felt slightly uncomfortable to me at first. But we adapt, and we learn.

To insist that talking to a device a la ChatGPT will never work, for the reasons you give, shows a remarkably clueless attitude toward the history of technology. I'm prepared to entertain serious arguments about the value (or not) of ChatGPT, but I'm not going to listen further to an argument that's essentially "this is a new UI, and new UIs will never succeed, QED".
 

grahammayer

Smack-Fu Master, in training
1
I think it's great that OpenAI's GPT-4o model has achieved such high scores on the chatbot leaderboard. It's a testament to the hard work and dedication of the team that developed it. I'm also glad that OpenAI has decided to release the model to the public, as it will allow researchers and developers to build even more powerful and sophisticated language models.

However, I do have some concerns about the lack of transparency surrounding the development of GPT-4o. OpenAI kept the name of the model a secret while it was being tested, which frustrated some experts. I believe that it's important for companies to be more transparent about their work, especially when it comes to artificial intelligence.

Overall, I'm excited to see what the future holds for GPT-4o and other large language models. I believe that these models have the potential to revolutionize the way we interact with computers.
 

Snark218

Ars Legatus Legionis
36,743
Subscriptor
How old are you? Are you willing to learn from experience?
In MY time people have said
  • no-one would be willing to talk into a bluetooth headset
  • no-one would be willing to use videochat
  • no-one would be willing to talk to their phone (eg Siri, then things like Alexa)
False equivalencies. Those are all technologies that fill a need. A Bluetooth headset offers convenience. Videochat has always had plenty of appeal, for obvious reasons. Siri lets you control your phone (badly) if your hands aren't free. With the exception of Siri, which in my experience people mostly tolerate to send texts while driving, all those examples serve a concrete need and use case. They do something. What does a chuckling chatbot.....do?

My point is not that "no-one will be willing to talk to a chatbot." My point is that a glorified chat bot is not actually filling a need. It could be part of something that fills a need, potentially. But it's not actually useful in and of itself. It's certainly not a reliable text generator. Talking to a chat bot is kind of an interesting novelty, for a few minutes, but it's hardly an end unto itself.

To insist that talking to a device a la ChatGPT will never work, for the reasons you give, shows a remarkably clueless attitude toward the history of technology. I'm prepared to entertain serious arguments about the value (or not) of ChatGPT, but I'm not going to listen further to an argument that's essentially "this is a new UI, and new UIs will never succeed, QED".
Good thing I'm not making that argument, then. I don't think ChatGPT is bullshit because it's a new UI. I think it's bullshit because it's not actually a UI at all. You're not interfacing with anything but a mindless probabilistic generator of bullshit that sounds vaguely like it's written or spoken by a human, until it doesn't. Businesses keep getting bitten in the ass trying to use it for customer service because it keeps giving customers wrong information. Lawyers get reamed out by judges because the model makes up nonexistent precedent. Students get accused of plagiarism. Every model hallucinates, at some point, and every one of them is functionally hamstrung by it.

Now, under the uselessly broad definition currently being used for AI, there's plenty of "AI" that could be very useful - data harvesting, large dataset analysis for pharmaceuticals and biochemistry, on-device small language models that could make Siri suck less abominably, et cetera and so on. But this particular manifestation is bullshit. Not because it's unfamiliar or novel, but because it's a solution looking for a problem.
 
How do you think humans learn? They train on existing content. Should an artist have to pay Getty Images every time they see one of its pictures somewhere online?

If humans were a product owned and controlled by a corporation for purposes of profit, then this analogy would be relevant. However, we are not.
 
With the exception of Siri, which in my experience people mostly tolerate to send texts while driving, all those examples serve a concrete need and use case. They do something. What does a chuckling chatbot.....do?
In this case, it's obviously being sold as a new Alexa, which is already a chatbot that chuckles.

But I don't know. I think generative AI is the most useful technology of the last 20 years. These models are going to fill countless spaces, and you'd have to be the most pessimistic person in the world not to see that. Or maybe I'm the most optimistic person in the world. Either way, it feels like a weird conversation.

If you need a starting place, think about the things you already use technology for: transcription, translation, editing, proofreading, debugging, internet searches, file search, image search, phone control, computer control, games, etc. Those are all things even you should feel comfortable with, and all of them will be improved immediately by generative AI.

But yeah, I'm also more interested in what it will do to the economy as a whole. And what it will do to the global south. And the future of humanity. And the whole wild west of possibilities that exist right now.

And yes, even the gimmicks are amazing. One thing my little kid likes to do is ask Alexa for a story. Alexa lets her choose a little cartoon character and a setting and have it generate a little story. The characters are layered sprite models, and the story is one of a small number of permutations. It is quite stupid. And I can't help but think: my God, these things can generate an excellent, living, choose-your-own-adventure story, complete with high-quality art, voice acting, music, sound effects, and a character, plot, and style of your choosing. Maybe there is something deeply wrong about all that, but holy shit, no, I don't think people are going to stop using it.
 