AI models are terrible at betting on soccer

Systems from Google, OpenAI, Anthropic, and xAI struggle with the Premier League.
> Let me introduce you to Rugby football. Or maybe Aussie rules football.
> As for rounders, you go around the bases. Same sort of etymology as "soccer" and "rugger": using "-er" as a diminutive.

You hardly need to. I went to a school that played Rugby (hated it), and we had to learn the boring origins of the game.
That’s quite a goalpost shift though: saying that if it isn’t AGI, then calling it “glorified autocomplete” is accurate.
I don’t think it was reasonable to pit LLMs (or any machine learning for that matter) against something as unpredictable as sports betting when historical data only helps marginally. I get why people say “duh, LLMs can’t do this”. However, the completely dismissive and hyperbolic comments here do get grating. LLMs are way more useful than a “glorified autocomplete”.
I was just thinking something similar, but for Fantasy Football. If Claude was only off by 11%, I mean, that means it was correct 89% of the time. That is pretty impressive, honestly.

Historical data is useless in this scenario, as the make-up of teams changes so frequently. It's the equivalent of trying to predict dice rolls based on what happened last time.
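(For intuition on why the bookies win regardless: a minimal sketch, with made-up numbers rather than anything from the article, of flat-stake betting against a typical bookmaker margin. Even a bettor who backs the right side 60% of the time loses steadily once the overround is priced in.)

```python
import random

# Minimal sketch (hypothetical numbers, not from the article):
# a bettor who correctly identifies the favourite 60% of the time
# still loses money once the bookmaker's margin ("overround") is applied.

random.seed(0)

TRUE_P = 0.60        # assumed true probability the favourite wins
OVERROUND = 1.06     # bookmaker's prices sum to 106% instead of 100%

# Fair decimal odds would be 1 / TRUE_P; the bookmaker shades them down.
fair_odds = 1 / TRUE_P
offered_odds = 1 / (TRUE_P * OVERROUND)

bankroll = 0.0
N = 100_000
for _ in range(N):
    stake = 1.0
    if random.random() < TRUE_P:           # favourite wins
        bankroll += stake * (offered_odds - 1)
    else:                                   # favourite loses
        bankroll -= stake

print(f"fair odds {fair_odds:.3f}, offered {offered_odds:.3f}")
print(f"return per unit staked: {bankroll / N:+.4f}")  # expected ~ -0.057
```

The point being: accuracy alone doesn't beat the margin; you need probability estimates better calibrated than the market's.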
The bookies must love this!
> Let me introduce you to Rugby football. Or maybe Aussie rules football.
> As for rounders, you go around the bases. Same sort of etymology as "soccer" and "rugger": using "-er" as a diminutive.

I feel like Aussies call every sport (except cricket) “footie”.
> I feel like Aussies call every sport (except cricket) “footie”.

We do, it just depends where you live.
> I hope you can come up with a more intelligent rebuttal than your one about the UK legal system.

Snort!
> Of course this is the top voted comment. Smh

Because that’s literally how these things work. If you want to dispute it, the onus is on you to prove how you believe they work.
> So your argument is that LLMs won't improve at chess or poker?

They won’t, because literally their purpose is to predict text. That is what they are made to do. Trying to pretend everyone is a “Luddite” because they actually know how the technology works won’t change that.
> They won’t, because literally their purpose is to predict text. That is what they are made to do. Trying to pretend everyone is a “Luddite” because they actually know how the technology works won’t change that.

Right...

So how do you explain their improvement from GPT-3 etc. to now, where the frontier SOTA Elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/
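(Aside: "better than 99% of chess players" is a claim about rating gaps, and the standard Elo expected-score formula makes the conversion explicit. A quick sketch, using hypothetical ratings:)

```python
# Standard Elo expected-score formula: what a rating gap implies
# about win probability. Ratings below are hypothetical examples,
# nothing specific to the linked site.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win=1, draw=0.5) for player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# e.g. a model rated 2000 against a 1400-rated club player:
print(f"{expected_score(2000, 1400):.3f}")  # ~0.969
# and against a 2700-rated grandmaster:
print(f"{expected_score(2000, 2700):.3f}")  # ~0.017
```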
> So how do you explain their improvement from GPT-3 etc. to now, where the frontier SOTA Elo shows they're better than 99% of chess players?

Couple of reasons:

1. They are training the models to learn chess?
2. They are calling a chess AI system (in existence since 1978) to play?
> No mention of how humans performed in similar tests; isn't gambling by definition a losing prospect for most? I really don't see how this test shows much about AI vs. humans. "Don't use Grok" seems to be the more definitive conclusion.

Indeed. It seems a safe bet to me that "MechaHitler" would be wrong about everything.
> Apparently we've had intelligent AIs since the mid-90s, if "better than most chess players" is a benchmark.

First, I am refuting his argument here. I said I can foresee LLMs getting better at chess, and he said LLMs are 'next token prediction models' so they can never get better at chess.
Considering chess is mostly about large number crunching on possibilities and predictive moves, I don't know why it's surprising that an LLM is good at chess. Particularly with all the work that has gone into improving math/coding capabilities.
> Couple of reasons:
> 1. They are training the models to learn chess?

1. In 2023, my friend and I tried to teach an LLM chess. It is nontrivial. LLMs are next-token predictors, so they do not play chess the same way as engines (you can't hard-code in rules etc., as they're a stochastic engine).
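(To make the "stochastic engine" point concrete: nothing in next-token sampling guarantees a legal move, so the usual harness validates each proposed move with a real rules engine and re-prompts on failure. A minimal sketch using the python-chess library; `ask_llm` is a hypothetical stand-in for whatever model API you use:)

```python
import chess

def ask_llm(history: str) -> str:
    """Hypothetical stand-in for an LLM API call returning a move in SAN."""
    raise NotImplementedError

def play_one_llm_move(board: chess.Board, history: str) -> chess.Move:
    # An LLM samples tokens; nothing forces its output to be a legal move,
    # so we must check legality ourselves and retry on failure.
    for _ in range(3):
        candidate = ask_llm(history)
        try:
            return board.parse_san(candidate)  # raises on illegal/garbled SAN
        except ValueError:
            continue  # illegal move: re-prompt
    # Fall back to any legal move rather than forfeit.
    return next(iter(board.legal_moves))
```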
> So polo is called horseball? Got it.

Err, what a load of nonsense. The first polo club for Europeans was formed by tea planters in Assam in 1859.
Mildly amusing that you seem to have cited Google "AI"'s summary of the Wikipedia article, which itself is much less definitive in its attribution.
The problem with that argument is that mediaeval Europe didn't go in for ball games on horseback. That reached us from Persia via India, which is why it is called polo - a Tibetan-language word just meaning "ball". So there was no contrast in Europe between ball games on horses or on foot.
The word "Polo" entered English after the game of Association Football was defined.
I hope you can come up with a more intelligent rebuttal than your one about the UK legal system.
> Modern models are orders of magnitude larger and trained on vastly more diverse data.
> More parameters → better pattern recognition
> More data → exposure to countless chess positions, games, and analyses
> Even without being a dedicated chess engine, this lets the model “recognize” strong moves the way a very well-read player might.

Yes, but chess gets out of book / 'training data' very, very quickly (given the mathematics of it).
> Right...
> So how do you explain their improvement from GPT-3 etc. to now, where the frontier SOTA Elo shows they're better than 99% of chess players?
> https://chessbenchllm.onrender.com/

If you read the small print, that site specifies that the rankings are produced primarily by LLMs playing each other, and the person doing the benchmarking states:

"I predict that LLMs would perform below these ratings against humans, who are better able to find and exploit systematic weaknesses in play."
> So to answer your question:
> 1) the benchmark is bad

By all means we can use other benchmarks - all of them demonstrate unmistakable progress.
> 2) the new models have a much bigger context window, which is probably sufficient

What does context window have to do with finding the best move?
> By all means we can use other benchmarks - all of them demonstrate unmistakable progress.
> Other benchmarks have it at an Elo of 1400-ish, which is still better than 95% of chess players and, more importantly, way ahead of where it was 3 years ago.

The source you’re citing is a press release, which cites another press release, which cites a website which shows an animated gif of two LLMs setting fire to the planet to play the most inept game of Connect 4 I have ever seen. Please read your own sources before citing them, because this is just a waste of everyone’s time.
> 3) who really cares anyway, it’s been possible to brute force chess for longer than any of us has been alive. Chess is a poor benchmark for general intelligence in humans, let alone machines.

These models are not brute forcing chess:
"While traditional chess engines like Stockfish function as specialized super-calculators, evaluating millions of positions per second to find the optimal move, large language models do not approach the game through brute-force calculation. Instead, they rely on pattern recognition and ‘intuition’ to drastically reduce the search space — an approach that mirrors human play.
Gemini 3 Pro and Gemini 3 Flash currently have the top Elo ratings on the leaderboard. The models’ internal ‘thoughts’ reveal the use of strategic reasoning grounded in familiar chess concepts like piece mobility, pawn structure, and king safety. This significant performance increase over the Gemini 2.5 generation highlights the rapid pace of model progress and demonstrates Game Arena’s value in tracking these improvements over time."
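(For contrast with the press release's "super-calculator" description: a toy version of the brute-force approach, again with python-chess. Real engines add alpha-beta pruning, quiescence search, and far richer evaluation, but the shape is the same: enumerate positions and score them with a hand-written function.)

```python
import chess

# Toy brute-force search in the Stockfish mould: enumerate positions,
# score them with a hand-written evaluation. Purely illustrative; real
# engines prune aggressively and evaluate far more than material.

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board: chess.Board) -> int:
    """White-positive material count (ignores mate, position, etc.)."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

def negamax(board: chess.Board, depth: int) -> int:
    """Score from the perspective of the side to move."""
    if depth == 0 or board.is_game_over():
        sign = 1 if board.turn == chess.WHITE else -1
        return sign * material(board)
    best = -10**9
    for move in board.legal_moves:
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best
```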
> The source you’re citing is a press release, which cites another press release, which cites a website which shows an animated gif of two LLMs setting fire to the planet to play the most inept game of Connect 4 I have ever seen. Please read your own sources before citing them, because this is just a waste of everyone’s time.

I read the sources and found the insights useful.
The AIs were instructed to build models that would maximize returns and manage risk.
> Right...
> So how do you explain their improvement from GPT-3 etc. to now, where the frontier SOTA Elo shows they're better than 99% of chess players?
> https://chessbenchllm.onrender.com/

Err.

IIRC there was a chess tournament between all the LLMs a few months ago, and the games they all played were comically bad. Like, "they would frequently make illegal moves" bad.

I'd like to see some substantiation for the idea that any one of them is better than 99% of chess players.
The ratings shown on the web site you linked to are presumably ratings of the LLMs relative to each other, not relative to humans. I have no idea how they came up with the FIDE rating estimates.
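(That caveat is easy to demonstrate: Elo fitted from a closed pool of games is only identified up to an additive constant, so self-play rankings alone can't pin down a FIDE-equivalent number. A sketch with made-up ratings:)

```python
# Elo fitted from a closed pool is only identified up to an additive
# constant: shifting every rating by the same amount leaves every
# predicted head-to-head score unchanged. Ratings below are made up.

def expected_score(ra: float, rb: float) -> float:
    return 1 / (1 + 10 ** ((rb - ra) / 400))

pool = {"model_a": 2050, "model_b": 1900, "model_c": 1700}  # hypothetical
shifted = {name: r - 600 for name, r in pool.items()}       # equally valid

for x in pool:
    for y in pool:
        if x < y:
            assert abs(expected_score(pool[x], pool[y])
                       - expected_score(shifted[x], shifted[y])) < 1e-12
print("same pairwise predictions; the absolute level needs human games to pin down")
```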
> ... If they did a tool call to even the most basic chess engine, it would be at 3400+ Elo, not ~2000 Elo. ...

You dramatically overestimate how good "even the most basic chess engine" is.
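(For reference, a tool call to a real engine looks like the sketch below: python-chess can drive any UCI engine. This assumes a `stockfish` binary on PATH; how strongly any given engine plays depends on the build and the search limits, which is exactly what's in dispute here.)

```python
import chess
import chess.engine

# Driving a real UCI engine via python-chess. Assumes a `stockfish`
# binary is on PATH; playing strength depends on the build and the
# time/depth limits you give it.

board = chess.Board()
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
result = engine.play(board, chess.engine.Limit(time=0.1))
print("engine suggests:", board.san(result.move))
engine.quit()
```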