AI models are terrible at betting on soccer—especially xAI Grok

Erbium168

Ars Centurion
2,691
Subscriptor
Upvote
0 (0 / 0)
That’s quite a goalpost shift though, saying if it isn’t AGI then calling it “glorified autocomplete” is accurate.

I don’t think it was reasonable to pit LLMs (or any machine learning for that matter) against something as unpredictable as sports betting when historical data only helps marginally. I get why people say “duh, LLMs can’t do this”. However, the completely dismissive and hyperbolic comments here do get grating. LLMs are way more useful than a “glorified autocomplete”.

The poster was complaining about people saying that LLMs are not intelligent and cannot think. They aren't intelligent and they can't think. Something that can think and is intelligent would be AGI. It's hardly a goalpost shift.
 
Upvote
14 (14 / 0)

cleek

Ars Scholae Palatinae
1,059
That’s quite a goalpost shift though, saying if it isn’t AGI then calling it “glorified autocomplete” is accurate.

I don’t think it was reasonable to pit LLMs (or any machine learning for that matter) against something as unpredictable as sports betting when historical data only helps marginally. I get why people say “duh, LLMs can’t do this”. However, the completely dismissive and hyperbolic comments here do get grating. LLMs are way more useful than a “glorified autocomplete”.

“glorified autocomplete” is the essence of how they work.

personally, i find them to be more trouble than they're worth - and the amount of resources we're spending on them is absolutely ludicrous.
 
Upvote
14 (15 / -1)
Historical data is useless in this scenario as the makeup of teams changes so frequently. It's the equivalent of trying to predict dice rolls based on what happened last time.

The bookies must love this !
I was just thinking something similar, but for Fantasy Football. If Claude was only off by 11%, I mean, that means it was correct 89% of the time. That is pretty impressive, honestly.
 
Upvote
0 (1 / -1)
This study says a lot about the incompetence of the "researchers" involved and nothing about the LLMs.

"London-based General Reasoning tested eight top AI systems in a virtual re-creation of the 2023–24 Premier League season, providing them with detailed historical data and statistics about each team and previous games."

As far as the outcomes of the sports games are concerned, it's a typical GIGO (garbage in, garbage out) situation. Historical data is useful but not sufficient. What about the health of the players? What about how well (and with whom) the players slept before the game? Why didn't the researchers let the LLMs bet on the games in real time and give them access to the internet? And, in the end, the criterion for LLM failure/success in this exercise should be comparison to humans who, as we know, also lose (statistically speaking).
 
Upvote
-6 (2 / -8)

SeeUnknown

Ars Praetorian
592
Subscriptor
The so-called "democratizing of ability" attributed to AI only leads to the "double burden of incompetence," a core tenet of the Dunning-Kruger effect, where individuals with low ability at a task suffer from two things: they make poor decisions, and their lack of skill prevents them from recognizing their own incompetence.
 
Upvote
4 (4 / 0)
So your argument is that LLMs won't improve at chess or poker?
They won’t, because literally their purpose is to predict text. That is what they are made to do. Trying to pretend everyone is a “Luddite” because they actually know how the technology works won’t change that.
 
Upvote
11 (12 / -1)
They won’t, because literally their purpose is to predict text. That is what they are made to do. Trying to pretend everyone is a “Luddite” because they actually know how the technology works won’t change that.
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/
 
Upvote
-11 (1 / -12)

Eldorito

Ars Tribunus Angusticlavius
7,953
Subscriptor
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/

Apparently we've had intelligent AIs since the mid 90s, if "better than most chess players" is a benchmark.

Considering chess is mostly about large number crunching on possibilities and predictive moves, I don't know why it's surprising that an LLM is good at chess. Particularly with all the work that has gone into improving math/coding capabilities.
 
Upvote
8 (9 / -1)
So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?
Couple of reasons:

1. They are training the models to learn chess?
2. they are calling a chess AI system (in existence since 1978) to play?
 
Upvote
2 (2 / 0)

Nop666

Ars Praefectus
3,870
Subscriptor++
No mention of how humans performed in similar tests; isn't gambling by definition a losing prospect for most? I really don't see how this test shows much about AI vs. humans. "Don't use Grok" seems to be the more definitive conclusion.
Indeed. It seems a safe bet to me that "MechaHitler" would be wrong about everything.
 
Upvote
1 (1 / 0)

john29

Smack-Fu Master, in training
1
Modern models are orders of magnitude larger and trained on vastly more diverse data.

More parameters → better pattern recognition
More data → exposure to countless chess positions, games, and analyses

Even without being a dedicated chess engine, this lets the model “recognize” strong moves the way a very well-read player might.
 
Upvote
1 (1 / 0)
As an intelligent bettor (yes, you will not believe this; I am quite happy you don't), footyball is non-bettable as far as I have found. I found no inefficiencies in the lines or the odds (prop betting may have some, but I don't do props). Hockey is the same way. I believe it has to do with the infrequency of scoring, but I'm not sure. They should try the NFL for sure.

The only reason you can make money betting sports is that humans choose stupid on certain bets. This forces the line away from reality, because the casino tries to balance winners and losers. You aren't playing the casino. You are playing the other bettors.

I mean, that I can still make money betting sports is absolute proof that the AI the public is using is dumber than me. Or I couldn't do it. I bet on very specific things.

PS. It's not that complicated to find the inefficiencies (as I call them). You have to find reliable past data for lines, tho; for odds you don't. The gaps are closing tho (just like how it was easy to win at online poker when it first came out).
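To make the idea concrete, here is roughly what "finding an inefficiency" means in code: strip the price down to an implied probability, compare it with your own estimate, and only bet when the expected value is positive. The odds and the probability estimate below are made up purely for illustration, not real lines.

```python
# Illustrative only: compare a bookmaker's implied probability with your own
# estimate and compute the expected value of a 1-unit stake.
# The odds and probability numbers here are made up.

def implied_probability(decimal_odds: float) -> float:
    """Raw implied probability from decimal odds (ignores the bookmaker's margin)."""
    return 1.0 / decimal_odds

def expected_value(decimal_odds: float, true_prob: float, stake: float = 1.0) -> float:
    """EV of a bet: win (odds - 1) * stake with prob p, lose the stake with prob (1 - p)."""
    return true_prob * (decimal_odds - 1.0) * stake - (1.0 - true_prob) * stake

odds = 2.40                              # bookmaker offers 2.40 on an underdog
p_implied = implied_probability(odds)    # about 0.417
p_mine = 0.46                            # my own (hypothetical) estimate

ev = expected_value(odds, p_mine)
print(f"implied: {p_implied:.3f}, mine: {p_mine:.3f}, EV per unit: {ev:+.3f}")
# EV = 0.46 * 1.40 - 0.54 = +0.104, i.e. a "value" bet under these made-up numbers.
```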
 
Upvote
2 (3 / -1)
Apparently we've had intelligent AIs since the mid 90s, if "better than most chess players" is a benchmark.

Considering chess is mostly about large number crunching on possibilities and predictive moves, I don't know why it's surprising that an LLM is good at chess. Particularly with all the work that has gone into improving math/coding capabilities.
First, I am refuting his argument here. I said I can foresee LLMs getting better at chess, and he said LLMs are 'next token prediction models' so they can never get better at chess.

Second, 'large number crunching on possibilities' isn't how AlphaZero and modern engines are built. They are built via deep learning ie there is basically ZERO human input given to the model beyond identifying legal and illegal moves.

AlphaZero learnt its entire chess skill merely through massive self-play.

LLMs are not trained this way so they're fundamentally different. The fact that an LLM can reach 2k elo is actually quite surprising since you go out of book/distribution fairly quickly in chess.
 
Upvote
0 (3 / -3)
Couple of reasons:

1. They are training the models to learn chess?
2. they are calling a chess AI system (in existence since 1978) to play?
1. In 2023, my friend and I tried to teach an LLM chess. It is nontrivial. LLMs are next-token predictors, so they do not play chess the same way engines do (you can't hard-code in rules, etc., since they're stochastic); see the sketch after this list.

It is hard to teach it to play chess without making it dumb in other ways (see catastrophic forgetting).

2. This is a definite possibility but very unlikely given how 'weak' an LLM still is at chess. If they did a tool call to even the most basic chess engine, it would be at 3400+ elo not ~2000 elo. To give perspective, I'm at 2k elo and while I'm better than 99% of chess players, there's a massive gulf between me and a Grandmaster. And there's an even more massive gulf between a GM and the most basic chess engine.

3. I actually do not have a clear answer as to how they got better at chess, since you get 'out of book' very quickly and it isn't possible to brute-force or mathematically solve chess. But Google's article and model card give us a hint that this is the result of a deeper improvement in 'reasoning' inside the model.
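For what point 1 looked like in practice, here's a rough sketch. `ask_llm` is a placeholder for whatever completion API you happen to call, not a real library function: the model only ever emits text, so every candidate move has to be validated against the actual rules, with a retry or a fallback when it hallucinates an illegal one.

```python
# Sketch of driving a chess game with a text model. ask_llm is a placeholder for
# whatever completion API you use; python-chess ("pip install chess") does the
# rule-keeping, because the model itself has no hard-coded notion of legality.
import random
import chess

def ask_llm(pgn_so_far: str) -> str:
    """Placeholder: return the model's proposed next move in SAN, e.g. 'Nf3'."""
    raise NotImplementedError

def next_move(board: chess.Board, pgn_so_far: str, retries: int = 3) -> chess.Move:
    for _ in range(retries):
        candidate = ask_llm(pgn_so_far)
        try:
            return board.parse_san(candidate.strip())   # raises on illegal or garbled SAN
        except ValueError:
            continue                                    # hallucinated move: ask again
    # fall back to a random legal move so the game can at least continue
    return random.choice(list(board.legal_moves))
```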
 
Last edited:
Upvote
0 (2 / -2)
So polo is called horseball? Got it.
Mildly amusing that you seem to have cited Google "AI"s summary of the Wikipedia article, which itself is much less definitive in its attribution.


The problem with that argument is that mediaeval Europe didn't go in for ball games on horseback. That reached us from Persia via India, which is why it is called Polo, a Tibetan-language word just meaning "ball". So there was no contrast in Europe between ball games on horses or on foot.
The word "Polo" entered English after the game of Association Football was defined.

I hope you can come up with a more intelligent rebuttal than your one about the UK legal system.
Err what a load of nonsense. The first Polo club for Europeans was formed by Tea planters in Assam in 1859.
https://en.wikipedia.org/wiki/Cachar_Club
The FA was formed in 1863.
https://en.wikipedia.org/wiki/The_Football_Association
 
Upvote
4 (4 / 0)
Modern models are orders of magnitude larger and trained on vastly more diverse data.

More parameters → better pattern recognition
More data → exposure to countless chess positions, games, and analyses

Even without being a dedicated chess engine, this lets the model “recognize” strong moves the way a very well-read player might.
Yes but chess gets out of book/ 'training data' very very quickly (given the mathematics of it).

Most amateurs can go out of book by move 3-5.

Google's latest model card and article suggest that this is a deeper breakthrough (https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/).
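You can see how fast the tree explodes with a standard perft count; python-chess makes it a few lines. From the start position there are 20 legal first moves, 400 positions after one reply, and the count is already near 200,000 sequences after two moves each, so any "book" or memorized training data runs out almost immediately.

```python
# Quick sanity check of how fast chess leaves any opening book: count the move
# sequences (perft) reachable from the start position at increasing depth.
# Requires the python-chess package ("pip install chess").
import chess

def perft(board: chess.Board, depth: int) -> int:
    if depth == 0:
        return 1
    nodes = 0
    for move in board.legal_moves:
        board.push(move)
        nodes += perft(board, depth - 1)
        board.pop()
    return nodes

board = chess.Board()
for depth in range(1, 5):
    print(f"depth {depth} plies: {perft(board, depth):,} sequences")
# Known values from the start position: 20, 400, 8,902, 197,281.
```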
 
Upvote
-4 (0 / -4)
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/
If you read the small print, that site specifies that the rankings are produced primarily by LLMs playing each other and that the person doing the benchmarking states:

I predict that LLMs would perform below these ratings against humans, who are better able to find and exploit systematic weaknesses in play.

Chess.com assigns confidence to Elo scores based on the number of games played, and the confidence interval only narrows after 30 games. The highest-scoring model in the table played 13 games (see the rough illustration below).

So to answer your question:

1) the benchmark is bad
2) the new models have a much bigger context window, which is probably sufficient
3) who really cares anyway, it’s been possible to brute force chess for longer than any of us has been alive. Chess is a poor benchmark for general intelligence in humans, let alone machines.
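On point 1, here is a back-of-the-envelope way to see why 13 games pins a rating down so loosely. This is not Chess.com's actual rating math, just the usual noisy-measurement intuition: treat each game as a noisy observation of true strength and watch the standard error shrink with the square root of the number of games.

```python
# Back-of-the-envelope illustration (NOT Chess.com's rating system): if each game
# result is a noisy observation of "true strength", the uncertainty of the
# estimated rating shrinks roughly with 1/sqrt(n_games).
import math

PER_GAME_SIGMA = 200  # assumed rating-point noise of a single game result

for n_games in (5, 13, 30, 100):
    stderr = PER_GAME_SIGMA / math.sqrt(n_games)
    print(f"{n_games:3d} games -> rating known to roughly +/- {2 * stderr:.0f} points")
# Under this assumption, 13 games still leaves roughly a +/-110-point error bar,
# while 30 games tightens it to about +/-73.
```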
 
Upvote
6 (7 / -1)
So to answer your question:

1) the benchmark is bad
By all means we can use other benchmarks - all of them demonstrate unmistakable progress.

Other benchmarks have it at an ELO of 1400ish which is still better than 95% of chess players and more importantly it is way ahead of where it was 3 years ago.
2) the new models have a much bigger context window, which is probably sufficient
What does context window have to do with finding the best move?
3) who really cares anyway, it’s been possible to brute force chess for longer than any of us has been alive. Chess is a poor benchmark for general intelligence in humans, let alone machines.
These models are not brute forcing chess:
"While traditional chess engines like Stockfish function as specialized super-calculators, evaluating millions of positions per second to find the optimal move, large language models do not approach the game through brute-force calculation. Instead, they rely on pattern recognition and ‘intuition’ to drastically reduce the search space — an approach that mirrors human play.

Gemini 3 Pro and Gemini 3 Flash currently have the top Elo ratings on the leaderboard. The models’ internal ‘thoughts’ reveal the use of strategic reasoning grounded in familiar chess concepts like piece mobility, pawn structure, and king safety. This significant performance increase over the Gemini 2.5 generation highlights the rapid pace of model progress and demonstrates Game Arena’s value in tracking these improvements over time."
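For contrast, here is roughly what the "super-calculator" half of that comparison looks like in code: a fixed-depth search that visits every legal continuation and scores leaves with something as crude as a material count. Real engines layer alpha-beta pruning, quiescence search, transposition tables, and far better evaluations on top, but the skeleton is exhaustive search rather than pattern recognition. The sketch below uses the python-chess package and is illustrative only.

```python
# Illustrative sketch of brute-force search: a fixed-depth negamax that visits
# every legal continuation and scores leaves by material count alone.
# Requires the python-chess package ("pip install chess").
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board: chess.Board) -> int:
    """Material balance from the side-to-move's point of view."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == board.turn else -value
    return score

def negamax(board: chess.Board, depth: int) -> int:
    if board.is_checkmate():
        return -10**6                      # side to move has been mated
    if depth == 0 or board.is_game_over():
        return material(board)
    best = -10**9
    for move in board.legal_moves:         # exhaustive: every legal move is searched
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def best_move(board: chess.Board, depth: int = 3) -> chess.Move:
    best, best_score = None, -10**9
    for move in board.legal_moves:
        board.push(move)
        score = -negamax(board, depth - 1)
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best

print(best_move(chess.Board()))            # some opening move after a shallow search
```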
 
Last edited:
Upvote
-2 (2 / -4)

By all means we can use other benchmarks - all of them demonstrate unmistakable progress.

Other benchmarks have it at an ELO of 1400ish which is still better than 95% of chess players and more importantly it is way ahead of where it was 3 years ago.

What does context window have to do with finding the best move?

These models are not brute forcing chess:
"While traditional chess engines like Stockfish function as specialized super-calculators, evaluating millions of positions per second to find the optimal move, large language models do not approach the game through brute-force calculation. Instead, they rely on pattern recognition and ‘intuition’ to drastically reduce the search space — an approach that mirrors human play.

Gemini 3 Pro and Gemini 3 Flash currently have the top Elo ratings on the leaderboard. The models’ internal ‘thoughts’ reveal the use of strategic reasoning grounded in familiar chess concepts like piece mobility, pawn structure, and king safety. This significant performance increase over the Gemini 2.5 generation highlights the rapid pace of model progress and demonstrates Game Arena’s value in tracking these improvements over time."
The source you’re citing is a press release, which cites another press release, which cites a website which shows an animated gif of two LLMs setting fire to the planet to play the most inept game of Connect 4 I have ever seen. Please read your own sources before citing them, because this is just a waste of everyone’s time.
 
Upvote
4 (5 / -1)

Rising

Smack-Fu Master, in training
90
House always wins! Seriously though, with sports betting, punters need an 'edge'. Simply knowing that the favourite team is probably going to win isn't rocket science; what makes the difference is knowing when the price is incorrect. Over a long enough timeframe, it doesn't matter whether some of your bets fail, as long as your value and money management are locked in.
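As a purely illustrative example of the "money management" half, the Kelly criterion is the textbook way to size a stake once you believe the price is wrong. The odds, probability, and bankroll below are made up for the example.

```python
# Illustrative only: Kelly stake sizing for a bet at decimal odds, given your own
# estimate of the true win probability. All numbers are made up.

def kelly_fraction(decimal_odds: float, true_prob: float) -> float:
    """Fraction of bankroll to stake: (b*p - q) / b, with b = net odds, q = 1 - p."""
    b = decimal_odds - 1.0
    q = 1.0 - true_prob
    f = (b * true_prob - q) / b
    return max(f, 0.0)           # never bet when there is no edge

bankroll = 1000.0
odds, p = 2.10, 0.52             # hypothetical mispriced line
stake = bankroll * kelly_fraction(odds, p)
print(f"Kelly stake: {stake:.2f} of a {bankroll:.0f} bankroll")
# (1.1*0.52 - 0.48)/1.1 is about 0.084, i.e. stake roughly 8.4% of the bankroll;
# many bettors use a fraction of Kelly to reduce variance.
```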
 
Upvote
2 (2 / 0)
The source you’re citing is a press release, which cites another press release, which cites a website which shows an animated gif of two LLMs setting fire to the planet to play the most inept game of Connect 4 I have ever seen. Please read your own sources before citing them, because this is just a waste of everyone’s time.
I read the sources and found the insights useful.

If you don't trust that, here is an independent study that got an LLM to play with an elo of 1700 odd:

https://aclanthology.org/2025.naacl-short.1/

Edit:
Another old article where Nicholas Carlini describes how even old LLMs like gpt3.5 do more than simple next token prediction - they model the world (https://nicholas.carlini.com/writing/2023/chess-llm.html)
 
Upvote
-7 (0 / -7)
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/
Err.

IIRC there was a chess tournament between all the LLMs a few months ago and the games they all played were comically bad.

Like, they-would-frequently-make-illegal-moves bad.

I'd like to see some substantiation for the idea that any one of them is better than 99% of chess players.

The ratings shown on the web site you linked to are presumably ratings of the LLMs relative to each other, not relative to humans. I have no idea how they came up with the FIDE rating estimates.
 
Upvote
5 (6 / -1)

cleek

Ars Scholae Palatinae
1,059
Err.

IIRC there was a chess tournament between all the LLMs a few months ago and the games they all played were comically bad.

Like, they would frequently make illegal moves bad.

I'd like to see some substantiation for the idea that any one of them is better than 99% of chess players.

The ratings shown on the web site you linked to are presumably ratings of the LLMs relative to each other, not relative to humans. I have no idea how they came up with the FIDE rating estimates.

here's a fun site that shows a bunch of different models with ELOs.

the highest is gpt-5-2025-08-07-medium at 1086. and, it costs $5/game in tokens.

https://maxim-saplin.github.io/llm_chess/
 
Upvote
3 (3 / 0)
... If they did a tool call to even the most basic chess engine, it would be at 3400+ elo not ~2000 elo. ...
You dramatically overestimate how good "even the most basic chess engine" is.

An engine with material-only evaluation and no tree extensions or pruning would probably be rated under 1000.
 
Upvote
0 (1 / -1)
What does context window have to do with finding the best move?

A human player would at least remember what he or she was trying to do between consecutive moves. It seems to me that this would also be useful to an LLM. (But not to a brute force engine like Stockfish.)
 
Upvote
-1 (0 / -1)