Before launching, GPT-4o broke records on chatbot leaderboard under a secret name

Status
You're currently viewing only Dan Homerick's posts. Click here to go back to viewing the entire thread.

Dan Homerick

Ars Praefectus
5,483
Subscriptor++
It seems to me we're already starting to see diminishing returns here. According to this metric, this new model is ~4% better than the previous? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.
I don't play any competitive sports that use Elo rankings, but if I'm understanding this table correctly, a 50 point gap is more like "the new model was judged as better about 57.15% of the time", i.e. only 7.15 percentage points more often than a coin flip.
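
For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the leaderboard uses the standard logistic Elo formula with a 400-point scale (I haven't confirmed that); the function name is just mine for illustration.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score for the higher-rated side (a tie counts as half a win).

    Assumes the standard Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


gap = 50  # the rating gap discussed above
p = elo_expected_score(gap)
print(f"expected score: {p:.4f}")            # about 0.5715
print(f"edge over a coin flip: {p - 0.5:.4f}")  # about 0.0715
```

So a 50 point lead is a real but modest edge in head-to-head votes, not a blowout.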

I do expect we're well into diminishing returns ... of the test rankings. Once the bots are good enough, a large number of people won't throw a hard enough challenge at the bots to see a difference, and will judge more-or-less randomly.

That is, the new bot could be drastically better (or worse!) at solving differential equations, but given that most people won't ask about something that hard...
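
To put rough numbers on that, here is a toy model, not anything measured. Assume a made-up fraction of prompts is hard enough to separate the two bots, the new bot wins those at some made-up rate, and everything else is judged as a coin flip. All names and figures below are assumptions for illustration.

```python
import math


def observed_win_rate(true_win_rate: float, discriminating_fraction: float) -> float:
    """Blend real wins on discriminating prompts with coin-flip votes on the rest."""
    return discriminating_fraction * true_win_rate + (1.0 - discriminating_fraction) * 0.5


def elo_gap_from_win_rate(p: float) -> float:
    """Invert the standard 400-point Elo curve: win rate -> rating gap."""
    return 400.0 * math.log10(p / (1.0 - p))


true_win_rate = 0.9  # hypothetical: new bot wins 90% of the hard prompts
for fraction in (1.0, 0.5, 0.2):
    p = observed_win_rate(true_win_rate, fraction)
    print(f"hard prompts: {fraction:.0%}  observed win rate: {p:.2f}  "
          f"apparent Elo gap: {elo_gap_from_win_rate(p):.0f}")
```

Under those assumptions, the same underlying improvement shows up as a much smaller rating gap as the share of hard prompts shrinks.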

Dan Homerick

And what is the Elo of an average human?
Especially with a deadline of no more than a minute or two in which to research and compose their answer.

It's not a Turing test, it's more of a usefulness test. "Human" score has to be pretty low for most knowledge-based probing*. Could probably get a few wins if the judge was asking questions that need a lot of "reasoning" in their response.

[*] LLMs get stuff wrong all the time, but the judge and the human would need some expertise in a shared subject, the questions would need to hit that subject, and it would have to be a question that isn't commonly asked and answered on the Internet.

Dan Homerick

Chat bots are a big step down in the realm of automated differential equation solvers.
Sure, but the bot can use one of those on the backend. Or even the main chat bot could delegate the task to a different bot that is specifically trained on using sci/math packages.

With delegation, one public-facing bot/API really can become an expert in all things.
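
A very rough sketch of what I mean by delegation. Everything here is hypothetical: the keyword routing, the handler names, and the "specialist" itself, which is a stand-in that evaluates the closed-form solution of y' = k*y rather than calling a real sci/math package.

```python
import math
from typing import Callable, Dict


def ode_specialist(query: str) -> str:
    # Hypothetical stand-in for a backend trained to drive a real math package.
    # It ignores the query text and solves one fixed example: y' = k*y, y(0) = y0.
    k, y0, t = 0.5, 2.0, 1.0
    return f"y({t}) = {y0 * math.exp(k * t):.4f} for y' = {k}*y, y(0) = {y0}"


def general_chat(query: str) -> str:
    # Hypothetical fallback for everything the specialists don't claim.
    return "general answer for: " + query


# Keyword -> specialist table; a real system would route far more carefully.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "differential equation": ode_specialist,
}


def front_end(query: str) -> str:
    """The single public-facing entry point that delegates when it can."""
    lowered = query.lower()
    for keyword, handler in SPECIALISTS.items():
        if keyword in lowered:
            return handler(query)
    return general_chat(query)


print(front_end("Can you solve this differential equation?"))
print(front_end("What's a good name for a cat?"))
```

The user only ever talks to `front_end`; which backend did the work is an implementation detail.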

"I know kung fu"