AI models are terrible at betting on soccer—especially xAI Grok

Erbium168

Ars Centurion
2,691
Subscriptor
Upvote
0 (0 / 0)
That’s quite a goalpost shift though, saying if it isn’t AGI then calling it “glorified autocomplete” is accurate.

I don’t think it was reasonable to pit LLMs (or any machine learning for that matter) against something as unpredictable as sports betting when historical data only helps marginally. I get why people say “duh, LLMs can’t do this”. However, the completely dismissive and hyperbolic comments here do get grating. LLMs are way more useful than a “glorified autocomplete”.

The poster was complaining about people saying that LLMs are not intelligent and cannot think. They aren't intelligent and they can't think. Something that can think and is intelligent would be AGI. It's hardly a goalpost shift.
 
Upvote
14 (14 / 0)

cleek

Ars Scholae Palatinae
1,059
That’s quite a goalpost shift though, saying if it isn’t AGI then calling it “glorified autocomplete” is accurate.

I don’t think it was reasonable to pit LLMs (or any machine learning for that matter) against something as unpredictable as sports betting when historical data only helps marginally. I get why people say “duh, LLMs can’t do this”. However, the completely dismissive and hyperbolic comments here do get grating. LLMs are way more useful than a “glorified autocomplete”.

“glorified autocomplete” is the essence of how they work.

personally, i find them to be more trouble than they're worth - and the amount of resources we're spending on them is absolutely ludicrous.
 
Upvote
14 (15 / -1)
Historical data is useless in this scenario as the makeup of teams changes so frequently. It's the equivalent of trying to predict dice rolls based on what happened last time.

The bookies must love this !
I was just thinking something similar, but for Fantasy Football. If Claude was only off by 11%, I mean, that means it was correct 89% of the time. That is pretty impressive, honestly.
 
Upvote
0 (1 / -1)
This study says a lot about the incompetence of the "researchers" involved and nothing about the LLMs.

"London-based General Reasoning tested eight top AI systems in a virtual re-creation of the 2023–24 Premier League season, providing them with detailed historical data and statistics about each team and previous games."

As far as the outcomes of the sports games are concerned, it's a typical GIGO (garbage in, garbage out) situation. Historical data is useful but not sufficient. What about the health of the players? What about how well (and with whom) the players slept before the game? Why didn't the researchers let the LLMs bet on the games in real time and give them access to the internet? And, in the end, the criterion for LLM failure/success in this exercise should be comparison to humans who, as we know, also lose (statistically speaking).
 
Upvote
-6 (2 / -8)

SeeUnknown

Ars Praetorian
592
Subscriptor
The so-called "democratizing of ability" attributed to AI only leads to the "double burden of incompetence," a core tenet of the Dunning-Kruger effect, where individuals with low ability at a task suffer from two things: they make poor decisions, and their lack of skill prevents them from recognizing their own incompetence.
 
Upvote
4 (4 / 0)
So your argument is that LLMs won't improve at chess or poker?
They won’t, because literally their purpose is to predict text. That is what they are made to do. Trying to pretend everyone is a “Luddite” because they actually know how the technology works won’t change that.
 
Upvote
11 (12 / -1)
They won’t, because literally their purpose is to predict text. That is what they are made to do. Trying to pretend everyone is a “Luddite” because they actually know how the technology works won’t change that.
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/
 
Upvote
-11 (1 / -12)

Eldorito

Ars Tribunus Angusticlavius
7,953
Subscriptor
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/

Apparently we've had intelligent AIs since the mid 90s, if "better than most chess players" is a benchmark.

Considering chess is mostly about large number crunching on possibilities and predictive moves, I don't know why it's surprising that an LLM is good at chess. Particularly with all the work that has gone into improving math/coding capabilities.
 
Upvote
8 (9 / -1)
So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?
Couple of reasons:

1. They are training the models to learn chess?
2. they are calling a chess AI system (in existence since 1978) to play?
 
Upvote
2 (2 / 0)

Nop666

Ars Praefectus
3,870
Subscriptor++
No mention of how humans performed in similar tests; isn't gambling by definition a losing prospect for most? I really don't see how this test shows much about AI vs. humans. "Don't use Grok" seems to be the more definitive conclusion.
Indeed. It seems a safe bet to me that "MechaHitler" would be wrong about everything.
 
Upvote
1 (1 / 0)

john29

Smack-Fu Master, in training
1
Modern models are orders of magnitude larger and trained on vastly more diverse data.

More parameters → better pattern recognition
More data → exposure to countless chess positions, games, and analyses

Even without being a dedicated chess engine, this lets the model “recognize” strong moves the way a very well-read player might.
 
Upvote
1 (1 / 0)
As an intelligent bettor (yes, you will not believe this; I am quite happy you don't), footyball is non-bettable as far as I have found. I found no inefficiencies in the lines or the odds (prop betting may have some, but I don't do props). Hockey is the same way. I believe it has to do with the infrequency of scoring, but I'm not sure. They should try the NFL for sure.

The only reason you can make money betting sports is that humans choose stupid on certain bets. This forces the line away from reality, because the casino tries to balance winners and losers. You aren't playing the casino. You are playing the other bettors.

I mean, that I can still make money betting sports is absolute proof that the AI the public is using is dumber than me. Or I couldn't do it. I bet on very specific things.

PS. It's not that complicated to find the inefficiencies (as I call them). You have to find reliable past data for lines, tho; for odds you don't. The gaps are closing tho (just like how it was easy to win at online poker when it first came out).
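To make the idea concrete, here is roughly what "finding an inefficiency" means in code: strip the price down to an implied probability, compare it with your own estimate, and only bet when the expected value is positive. The odds and the probability estimate below are made up purely for illustration, not real lines.

```python
# Illustrative only: compare a bookmaker's implied probability with your own
# estimate and compute the expected value of a 1-unit stake.
# The odds and probability numbers here are made up.

def implied_probability(decimal_odds: float) -> float:
    """Raw implied probability from decimal odds (ignores the bookmaker's margin)."""
    return 1.0 / decimal_odds

def expected_value(decimal_odds: float, true_prob: float, stake: float = 1.0) -> float:
    """EV of a bet: win (odds - 1) * stake with prob p, lose the stake with prob (1 - p)."""
    return true_prob * (decimal_odds - 1.0) * stake - (1.0 - true_prob) * stake

odds = 2.40                              # bookmaker offers 2.40 on an underdog
p_implied = implied_probability(odds)    # about 0.417
p_mine = 0.46                            # my own (hypothetical) estimate

ev = expected_value(odds, p_mine)
print(f"implied: {p_implied:.3f}, mine: {p_mine:.3f}, EV per unit: {ev:+.3f}")
# EV = 0.46 * 1.40 - 0.54 = +0.104, i.e. a "value" bet under these made-up numbers.
```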
 
Upvote
2 (3 / -1)
Apparently we've had intelligent AIs since the mid 90s, if "better than most chess players" is a benchmark.

Considering chess is mostly about large number crunching on possibilities and predictive moves, I don't know why it's surprising that an LLM is good at chess. Particularly with all the work that has gone into improving math/coding capabilities.
First, I am refuting his argument here. I said I can foresee LLMs getting better at chess, and he said LLMs are 'next token prediction models' so they can never get better at chess.

Second, 'large number crunching on possibilities' isn't how AlphaZero and modern engines are built. They are built via deep learning ie there is basically ZERO human input given to the model beyond identifying legal and illegal moves.

AlphaZero learnt its entire chess skill merely through massive self-play.

LLMs are not trained this way so they're fundamentally different. The fact that an LLM can reach 2k elo is actually quite surprising since you go out of book/distribution fairly quickly in chess.
 
Upvote
0 (3 / -3)
Couple of reasons:

1. They are training the models to learn chess?
2. they are calling a chess AI system (in existence since 1978) to play?
1. In 2023, my friend and I tried to teach an LLM chess. It is nontrivial. LLMs are next-token predictors, so they do not play chess the same way engines do (you can't hard-code in rules, etc., since they're stochastic); see the sketch after this list.

It is hard to teach it to play chess without making it dumb in other ways (see catastrophic forgetting).

2. This is a definite possibility but very unlikely given how 'weak' an LLM still is at chess. If they did a tool call to even the most basic chess engine, it would be at 3400+ elo not ~2000 elo. To give perspective, I'm at 2k elo and while I'm better than 99% of chess players, there's a massive gulf between me and a Grandmaster. And there's an even more massive gulf between a GM and the most basic chess engine.

3. I actually do not have a clear answer as to how they got better at chess, since you get 'out of book' very quickly and it isn't possible to brute-force or mathematically solve chess. But Google's article and model card give us a hint that this is the result of a deeper improvement in 'reasoning' inside the model.
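For what point 1 looked like in practice, here's a rough sketch. `ask_llm` is a placeholder for whatever completion API you happen to call, not a real library function: the model only ever emits text, so every candidate move has to be validated against the actual rules, with a retry or a fallback when it hallucinates an illegal one.

```python
# Sketch of driving a chess game with a text model. ask_llm is a placeholder for
# whatever completion API you use; python-chess ("pip install chess") does the
# rule-keeping, because the model itself has no hard-coded notion of legality.
import random
import chess

def ask_llm(pgn_so_far: str) -> str:
    """Placeholder: return the model's proposed next move in SAN, e.g. 'Nf3'."""
    raise NotImplementedError

def next_move(board: chess.Board, pgn_so_far: str, retries: int = 3) -> chess.Move:
    for _ in range(retries):
        candidate = ask_llm(pgn_so_far)
        try:
            return board.parse_san(candidate.strip())   # raises on illegal or garbled SAN
        except ValueError:
            continue                                    # hallucinated move: ask again
    # fall back to a random legal move so the game can at least continue
    return random.choice(list(board.legal_moves))
```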
 
Last edited:
Upvote
0 (2 / -2)
So polo is called horseball? Got it.
Mildly amusing that you seem to have cited Google "AI"s summary of the Wikipedia article, which itself is much less definitive in its attribution.


The problem with that argument is that mediaeval Europe didn't go in for ball games on horseback. That reached us from Persia via India, which is why it is called Polo, a Tibetan-language word just meaning "ball". So there was no contrast in Europe between ball games on horses or on foot.
The word "Polo" entered English after the game of Association Football was defined.

I hope you can come up with a more intelligent rebuttal than your one about the UK legal system.
Err what a load of nonsense. The first Polo club for Europeans was formed by Tea planters in Assam in 1859.
https://en.wikipedia.org/wiki/Cachar_Club
The FA was formed in 1863.
https://en.wikipedia.org/wiki/The_Football_Association
 
Upvote
4 (4 / 0)
Modern models are orders of magnitude larger and trained on vastly more diverse data.

More parameters → better pattern recognition
More data → exposure to countless chess positions, games, and analyses

Even without being a dedicated chess engine, this lets the model “recognize” strong moves the way a very well-read player might.
Yes but chess gets out of book/ 'training data' very very quickly (given the mathematics of it).

Most amateurs can go out of book by move 3-5.

Google's latest model card and article suggest that this is a deeper breakthrough (https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/).
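You can see how fast the tree explodes with a standard perft count; python-chess makes it a few lines. From the start position there are 20 legal first moves, 400 positions after one reply, and the count is already near 200,000 sequences after two moves each, so any "book" or memorized training data runs out almost immediately.

```python
# Quick sanity check of how fast chess leaves any opening book: count the move
# sequences (perft) reachable from the start position at increasing depth.
# Requires the python-chess package ("pip install chess").
import chess

def perft(board: chess.Board, depth: int) -> int:
    if depth == 0:
        return 1
    nodes = 0
    for move in board.legal_moves:
        board.push(move)
        nodes += perft(board, depth - 1)
        board.pop()
    return nodes

board = chess.Board()
for depth in range(1, 5):
    print(f"depth {depth} plies: {perft(board, depth):,} sequences")
# Known values from the start position: 20, 400, 8,902, 197,281.
```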
 
Upvote
-4 (0 / -4)
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/
If you read the small print, that site specifies that the rankings are produced primarily by LLMs playing each other and that the person doing the benchmarking states:

I predict that LLMs would perform below these ratings against humans, who are better able to find and exploit systematic weaknesses in play.

Chess.com assigns confidence to Elo scores based on the number of games played, and the confidence interval only narrows after 30 games. The highest-scoring model in the table played 13 games (see the rough illustration below).

So to answer your question:

1) the benchmark is bad
2) the new models have a much bigger context window, which is probably sufficient
3) who really cares anyway, it’s been possible to brute force chess for longer than any of us has been alive. Chess is a poor benchmark for general intelligence in humans, let alone machines.
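On point 1, here is a back-of-the-envelope way to see why 13 games pins a rating down so loosely. This is not Chess.com's actual rating math, just the usual noisy-measurement intuition: treat each game as a noisy observation of true strength and watch the standard error shrink with the square root of the number of games.

```python
# Back-of-the-envelope illustration (NOT Chess.com's rating system): if each game
# result is a noisy observation of "true strength", the uncertainty of the
# estimated rating shrinks roughly with 1/sqrt(n_games).
import math

PER_GAME_SIGMA = 200  # assumed rating-point noise of a single game result

for n_games in (5, 13, 30, 100):
    stderr = PER_GAME_SIGMA / math.sqrt(n_games)
    print(f"{n_games:3d} games -> rating known to roughly +/- {2 * stderr:.0f} points")
# Under this assumption, 13 games still leaves roughly a +/-110-point error bar,
# while 30 games tightens it to about +/-73.
```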
 
Upvote
6 (7 / -1)
So to answer your question:

1) the benchmark is bad
By all means we can use other benchmarks - all of them demonstrate unmistakable progress.

Other benchmarks have it at an ELO of 1400ish which is still better than 95% of chess players and more importantly it is way ahead of where it was 3 years ago.
2) the new models have a much bigger context window, which is probably sufficient
What does context window have to do with finding the best move?
3) who really cares anyway, it’s been possible to brute force chess for longer than any of us has been alive. Chess is a poor benchmark for general intelligence in humans, let alone machines.
These models are not brute forcing chess:
"While traditional chess engines like Stockfish function as specialized super-calculators, evaluating millions of positions per second to find the optimal move, large language models do not approach the game through brute-force calculation. Instead, they rely on pattern recognition and ‘intuition’ to drastically reduce the search space — an approach that mirrors human play.

Gemini 3 Pro and Gemini 3 Flash currently have the top Elo ratings on the leaderboard. The models’ internal ‘thoughts’ reveal the use of strategic reasoning grounded in familiar chess concepts like piece mobility, pawn structure, and king safety. This significant performance increase over the Gemini 2.5 generation highlights the rapid pace of model progress and demonstrates Game Arena’s value in tracking these improvements over time."
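For contrast, here is roughly what the "super-calculator" half of that comparison looks like in code: a fixed-depth search that visits every legal continuation and scores leaves with something as crude as a material count. Real engines layer alpha-beta pruning, quiescence search, transposition tables, and far better evaluations on top, but the skeleton is exhaustive search rather than pattern recognition. The sketch below uses the python-chess package and is illustrative only.

```python
# Illustrative sketch of brute-force search: a fixed-depth negamax that visits
# every legal continuation and scores leaves by material count alone.
# Requires the python-chess package ("pip install chess").
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board: chess.Board) -> int:
    """Material balance from the side-to-move's point of view."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == board.turn else -value
    return score

def negamax(board: chess.Board, depth: int) -> int:
    if board.is_checkmate():
        return -10**6                      # side to move has been mated
    if depth == 0 or board.is_game_over():
        return material(board)
    best = -10**9
    for move in board.legal_moves:         # exhaustive: every legal move is searched
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def best_move(board: chess.Board, depth: int = 3) -> chess.Move:
    best, best_score = None, -10**9
    for move in board.legal_moves:
        board.push(move)
        score = -negamax(board, depth - 1)
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best

print(best_move(chess.Board()))            # some opening move after a shallow search
```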
 
Last edited:
Upvote
-2 (2 / -4)

By all means we can use other benchmarks - all of them demonstrate unmistakable progress.

Other benchmarks have it at an ELO of 1400ish which is still better than 95% of chess players and more importantly it is way ahead of where it was 3 years ago.

What does context window have to do with finding the best move?

These models are not brute forcing chess:
"While traditional chess engines like Stockfish function as specialized super-calculators, evaluating millions of positions per second to find the optimal move, large language models do not approach the game through brute-force calculation. Instead, they rely on pattern recognition and ‘intuition’ to drastically reduce the search space — an approach that mirrors human play.

Gemini 3 Pro and Gemini 3 Flash currently have the top Elo ratings on the leaderboard. The models’ internal ‘thoughts’ reveal the use of strategic reasoning grounded in familiar chess concepts like piece mobility, pawn structure, and king safety. This significant performance increase over the Gemini 2.5 generation highlights the rapid pace of model progress and demonstrates Game Arena’s value in tracking these improvements over time."
The source you’re citing is a press release, which cites another press release, which cites a website which shows an animated gif of two LLMs setting fire to the planet to play the most inept game of Connect 4 I have ever seen. Please read your own sources before citing them, because this is just a waste of everyone’s time.
 
Upvote
4 (5 / -1)

Rising

Smack-Fu Master, in training
90
House always wins! Seriously though, with sports betting, punters need an 'edge'. Simply knowing that the favourite team is probably going to win isn't rocket science; what makes the difference is knowing when the price is incorrect. Over a long enough timeframe, it doesn't matter whether some of your bets fail, as long as your value and money management are locked in.
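As a purely illustrative example of the "money management" half, the Kelly criterion is the textbook way to size a stake once you believe the price is wrong. The odds, probability, and bankroll below are made up for the example.

```python
# Illustrative only: Kelly stake sizing for a bet at decimal odds, given your own
# estimate of the true win probability. All numbers are made up.

def kelly_fraction(decimal_odds: float, true_prob: float) -> float:
    """Fraction of bankroll to stake: (b*p - q) / b, with b = net odds, q = 1 - p."""
    b = decimal_odds - 1.0
    q = 1.0 - true_prob
    f = (b * true_prob - q) / b
    return max(f, 0.0)           # never bet when there is no edge

bankroll = 1000.0
odds, p = 2.10, 0.52             # hypothetical mispriced line
stake = bankroll * kelly_fraction(odds, p)
print(f"Kelly stake: {stake:.2f} of a {bankroll:.0f} bankroll")
# (1.1*0.52 - 0.48)/1.1 is about 0.084, i.e. stake roughly 8.4% of the bankroll;
# many bettors use a fraction of Kelly to reduce variance.
```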
 
Upvote
2 (2 / 0)
The source you’re citing is a press release, which cites another press release, which cites a website which shows an animated gif of two LLMs setting fire to the planet to play the most inept game of Connect 4 I have ever seen. Please read your own sources before citing them, because this is just a waste of everyone’s time.
I read the sources and found the insights useful.

If you don't trust that, here is an independent study that got an LLM to play with an elo of 1700 odd:

https://aclanthology.org/2025.naacl-short.1/

Edit:
Another old article where Nicholas Carlini describes how even old LLMs like gpt3.5 do more than simple next token prediction - they model the world (https://nicholas.carlini.com/writing/2023/chess-llm.html)
 
Upvote
-7 (0 / -7)
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/
Err.

IIRC there was a chess tournament between all the LLMs a few months ago and the games they all played were comically bad.

Like, they-would-frequently-make-illegal-moves bad.

I'd like to see some substantiation for the idea that any one of them is better than 99% of chess players.

The ratings shown on the web site you linked to are presumably ratings of the LLMs relative to each other, not relative to humans. I have no idea how they came up with the FIDE rating estimates.
 
Upvote
5 (6 / -1)

cleek

Ars Scholae Palatinae
1,059
Err.

IIRC there was a chess tournament between all the LLMs a few months ago and the games they all played were comically bad.

Like, they would frequently make illegal moves bad.

I'd like to see some substantiation for the idea that any one of them is better than 99% of chess players.

The ratings shown on the web site you linked to are presumably ratings of the LLMs relative to each other, not relative to humans. I have no idea how they came up with the FIDE rating estimates.

here's a fun site that shows a bunch of different models with ELOs.

the highest is gpt-5-2025-08-07-medium at 1086. and, it costs $5/game in tokens.

https://maxim-saplin.github.io/llm_chess/
 
Upvote
3 (3 / 0)
... If they did a tool call to even the most basic chess engine, it would be at 3400+ elo not ~2000 elo. ...
You dramatically overestimate how good "even the most basic chess engine" is.

An engine with material-only evaluation and no tree extensions or pruning would probably be rated under 1000.
 
Upvote
0 (1 / -1)
What does context window have to do with finding the best move?

A human player would at least remember what he or she was trying to do between consecutive moves. It seems to me that this would also be useful to an LLM. (But not to a brute force engine like Stockfish.)
 
Upvote
-1 (0 / -1)