AI models are terrible at betting on soccer—especially xAI Grok

SraCet · Monday at 12:39 PM

cleek said:
here's a fun site that shows a bunch of different models with ELOs.

the highest is gpt-5-2025-08-07-medium at 1086. and, it costs $5/game in tokens.

https://maxim-saplin.github.io/llm_chess/

Interesting site, thanks.

The "average chess.com player" rating in that list is probably meaningless. Presumably dragged down by all the random Joe Sixpacks who barely know the rules of chess, sign up for chess.com, play a game or two, lose, and then never play again.

IIRC the median rating of a USCF club player is around 1200. These are the people I would call "chess players."

Dr. Awkward · Monday at 12:43 PM

There's nothing more tedious than the "soccer vs. rugby" debate that occurs whenever a Brit discovers that American English differs a little from British English. I must have stumbled on this same exact exchange a hundred times over my years on the Internet.

Yup, different dialects sometimes have different words or spellings for things. Incredible, I know. Can we move on?

Hadrian's Waller · Monday at 1:16 PM

Dr. Awkward said:
There's nothing more tedious than the "soccer vs. rugby" debate that occurs whenever a Brit discovers that American English differs a little from British English. I must have stumbled on this same exact exchange a hundred times over my years on the Internet.

You seem to be associating surprise to the the wrong person. "Brits" are only too aware of the linguistic foibles of USAns.

JustYourAverageJDP · Monday at 2:13 PM

Did they try using anthropics newest scariest terrifying model Mythos on this. I was told its so powerful it cant be used by the public to audit its effectiveness. A data driven decision like betting surely should be dominated by highest end AI systems right? I guess if we spend a few more trillion on data centers and training they will be able to do better than chance.

cleek · Monday at 2:26 PM

Hadrian's Waller said:
You seem to be associating surprise to the the wrong person. "Brits" are only too aware of the linguistic foibles of USAns.

say, how do you pronounce "Cholmondeley" and "Quernmore" ?

s73v3r · Monday at 2:30 PM

wildsman said:
Right...

So how do you explain their improvement from gpt3 etc to now where the frontier sota elo shows they're better than 99% of chess players?

https://chessbenchllm.onrender.com/

No, you need to prove that, and you need to provide evidence for how they are different.

wildsman · Monday at 2:36 PM

SraCet said:
You dramatically overestimate how good "even the most basic chess engine" is.

An engine with material-only evaluation and no tree extensions or pruning would probably be rated under 1000.

1000?

You gotta be joking.

You can get stockfish 18 (3600+ ELO) to run on your phone. Pretty sure these big companies won't have a problem running and doing tool calling to that. So they definitely don't do it.

cleek · Monday at 2:39 PM

wildsman said:
1000? You gotta be joking.

You can get stockfish 18 (3600+ ELO) to run on your phone. Pretty sure these big companies won't have a problem running and doing tool calling to that. So they definitely don't do it.

just note that Stockfish is not an LLM. it is a NNUE

wildsman · Monday at 2:40 PM

cleek said:
just note that Stockfish is not an LLM. it is a NNUE

I agree - I didn't say it was. The prior poster was saying that the reason LLMs are improving at chess is because they're calling engines as tools in the bg.

I was replying that if that were the case - they wouldn't be at 1400 ELO.

cleek · Monday at 2:43 PM

wildsman said:
I agree - I didn't say it was. The prior poster was saying that the reason LLMs are improving at chess is because they're calling engines as tools in the bg.

so LLMs aren't improving at all. they're just being shown how to do what we would call "cheating" if they were human.

great.

NM. brainfart.

wildsman · Monday at 2:46 PM

s73v3r said:
No, you need to prove that, and you need to provide evidence for how they are different.

I have provided evidence:
https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/

From the article:

"While traditional chess engines like Stockfish function as specialized super-calculators, evaluating millions of positions per second to find the optimal move, large language models do not approach the game through brute-force calculation. Instead, they rely on pattern recognition and ‘intuition’ to drastically reduce the search space — an approach that mirrors human play.

Gemini 3 Pro and Gemini 3 Flash currently have the top Elo ratings on the leaderboard. The models’ internal ‘thoughts’ reveal the use of strategic reasoning grounded in familiar chess concepts like piece mobility, pawn structure, and king safety. This significant performance increase over the Gemini 2.5 generation highlights the rapid pace of model progress and demonstrates Game Arena’s value in tracking these improvements over time."

wildsman · Monday at 2:47 PM

cleek said:
so LLMs aren't improving at all. they're just being shown how to do what we would call "cheating" if they were human.

great.

Huh? Are you not following the thread?

They're not being 'shown' (no tools are being called) if they were being called their ELO would be much greater than 1400.

SraCet · Monday at 2:53 PM

wildsman said:
1000? You gotta be joking.

You can get stockfish 18 (3600+ ELO) to run on your phone. Pretty sure these big companies won't have a problem running and doing tool calling to that. So they definitely don't do it.

... and you think Stockfish 18 is "the most basic chess engine"?

cleek · Monday at 2:53 PM

wildsman said:
Huh? Are you not following the thread?

They're not being 'shown' (no tools are being called) if they were being called their ELO would be much greater than 1400.

my bad. mixed signals.

wildsman · Monday at 2:59 PM

SraCet said:
Interesting site, thanks.

The "average chess.com player" rating in that list is probably meaningless. Presumably dragged down by all the random Joe Sixpacks who barely know the rules of chess, sign up for chess.com, play a game or two, lose, and then never play again.

No you can take cull that list down to people who have played more than 100 games and it would still be in the 500s.

SraCet said:
IIRC the median rating of a USCF club player is around 1200. These are the people I would call "chess players."

USCF club players need to pay for membership and official ratings - not to mention that you have play 20 odd games to get a non-provisional rating. You can't use a paid tier as a purity test for something as simple as a 'chess player'.

SraCet said:
... and you think Stockfish 18 is "the most basic chess engine"?

It can run on your phone. Exactly how small do you want to go without deliberately nerfing it? Do you think these big companies like Google or Anthropic will have any trouble running them and calling them as tools?

SraCet · Monday at 3:07 PM

wildsman said:
...
It can run on your phone. Exactly how small do you want to go without deliberately nerfing it? Do you think these big companies like Google or Anthropic will have any trouble running them and calling them as tools?

Over the decades, people have created thousands of different chess engines.

They're all different from each other in terms of how "basic" they are.

Stockfish 18 is one of the most complicated, sophisticated chess engines ever made. Maybe THE most complicated and sophisticated.

Suggesting that it might be "basic" to any degree at all is catastrophically wrong.

SraCet · Monday at 3:14 PM

wildsman said:
...
USCF club players need to pay for membership and official ratings - not to mention that you have play 20 odd games to get a non-provisional rating. You can't use a paid tier as a purity test for something as simple as a 'chess player'.
...

I mean, this is getting to be a question of personal interpretation more than anything.

Personally, I wouldn't call somebody an "XYZ player" if they didn't play XYZ in some kind of organized fashion, usually being on a team or belonging to a league, paying dues, etc.

But if you think "XYZ player" just means somebody who knows how to play XYZ, or has done it a few times, fine.

wildsman · Monday at 3:53 PM

SraCet said:
Over the decades, people have created thousands of different chess engines.

They're all different from each other in terms of how "basic" they are.

Stockfish 18 is one of the most complicated, sophisticated chess engines ever made. Maybe THE most complicated and sophisticated.

Suggesting that it might be "basic" to any degree at all is catastrophically wrong.

Whether you want to call something that can run on a phone (vs something that needs a supercomputer to run) as 'basic' is a matter of interpretation.

But we have really deviated from the main point that these companies like Google will have no trouble calling Stockfish 18 as an MCP tool.

You seem to think that they might and I'm saying that there is absolutely no way they would choose a 1400 ELO engine that sometimes gives illegal moves (if such an engine exists).

When you give Gemini a FEN/PGN (chess game notation), its suggested move is all pure LLM next token prediction.

SraCet · Monday at 4:09 PM

wildsman said:
Whether you want to call something that can run on a phone (vs something that needs a supercomputer to run) as 'basic' is a matter of interpretation.

I feel like we're speaking different languages. What does hardware have to do with how "basic" a chess engine is or isn't?

Yes, you can run Stockfish on a phone. What does that matter? How is that relevant to anything? You can run any chess engine on a phone. That has nothing to do with how "basic" a chess engine is or isn't. There is no chess engine that requires a supercomputer to run.

wildsman said:
But we have really deviated from the main point that these companies like Google will have no trouble calling Stockfish 18 as an MCP tool.

You seem to think that they might ...

Huh? No I don't. I had no idea that was your main point, nor do I care.

I took issue with this sentence that you wrote:

"If they did a tool call to even the most basic chess engine, it would be at 3400+ elo not ~2000 elo."

Because the most basic chess engine is dramatically weaker than 3400+ Elo.

Here's a list of 602 chess engines, with ratings:

https://www.computerchess.org/

89% of them are rated under 3400. So I don't know why you think "even the most basic chess engine" would be rated 3400+.

cleek · Monday at 4:14 PM

Play CHess Against ChatGPT

it is extremely bad, and absolutely loves illegal moves.

ex. it's playing black. and it either moved its white bishop from c8 to b4, leaping over pawns and landing on a black square. or it moved its black bishop from g8 to b4, also jumping a pawn to get there.

(the site that's drawing the board has no way to know which illegal and nonsensical move it was. 'Bb4' is an illegal move either way)

cleek · Monday at 4:28 PM

Large language models (LLMs) are bad at chess.

And yet, as a three-time National Chess Champion and a two-time U.S. Women’s Chess Champion, I love to play against them. Not because they push me to play my best, but because of what they reveal about human nature.
Playing chess with LLMs has taught me how uniquely creative and diverse human beings are, how susceptible humans are to flattery and sycophancy, and how AI is beginning to shape human behavior.

LLMs are not meant to play chess well at all. After all, they are designed to predict what’s most likely to come next and to flatter us. AI-powered chess algorithms aren’t trying to crush you; they are trying to keep you playing. But in their interestingly bad chess play, we can learn lessons beyond the table or the token.

..

When I first challenged ChatGPT4 to a chess game, it played decently, but I still got a great position after 15 moves and won a knight. Just as my advantage mounted, it hallucinated a phantom piece to recapture my queen. In other words, it cheated! At first, this didn’t make much sense. Aren’t off-the-rack LLMs more known for sycophancy than for stealing?

So I started to play the worst moves I could think of against ChatGPT. It bent the rules yet again, but this time in my favor. Phantom pieces replaced the pieces I had blundered. Whether I played better or worse than ChatGPT, it ended up making me the same level as it was. It wasn’t always cheating, but it was always confabulating. When humans confabulate, we try to fill in the gaps of our memories or dreams with the most logical sequence. ChatGPT was doing the same thing.

https://time.com/article/2026/04/13/why-i-play-chess-against-chatgpt/

Erbium168 · Monday at 4:33 PM

Lemonhead78 said:
The only reason you can make money betting sports is because humans chose stupid on certain bets. This forces the line from reality because the casino tries to make equal winners and losers. You aren't playing the casino. You are playing the other bettors.

Back in the day the head of my father's firm of solicitors relied on local chauvinism for betting on football. Bookmakers lay off the odds based on their exposure.
So, London matches. E.g. Chelsea versus Tottenham. Quick trip to Chelsea where locals are putting money on Chelsea to win, so the odds for Tottenham are better. Reverse process in Tottenham. Based on who wins, collect profit. This got harder when gambling tax came in, but worked surprisingly well as a strategy till computers came in.

As I mentioned above, it was the British Prime Minister and brilliant statistician Harold Wilson who demonstrated in a paper that the results of First Division football matches were indistinguishable from randomness. Not that it mattered. The "Football pools" were as big a scam as the Lottery, with often less than a third of money bet being returned to winners.

wildsman · Monday at 4:38 PM

SraCet said:
I feel like we're speaking different languages. What does hardware have to do with how "basic" a chess engine is or isn't?

Yes, you can run Stockfish on a phone. What does that matter? How is that relevant to anything? You can run any chess engine on a phone. That has nothing to do with how "basic" a chess engine is or isn't. There is no chess engine that requires a supercomputer to run.

Huh? No I don't. I had no idea that was your main point, nor do I care.

I took issue with this sentence that you wrote:

"If they did a tool call to even the most basic chess engine, it would be at 3400+ elo not ~2000 elo."

Because the most basic chess engine is dramatically weaker than 3400+ Elo.

Here's a list of 602 chess engines, with ratings:

https://www.computerchess.org/

89% of them are rated under 3400. So I don't know why you think "even the most basic chess engine" would be rated 3400+.

Fine if you have a problem with my usage of the term 'most basic engine' - I will concede that. It doesn't detract from my main point about LLMs not calling engines as tools (they'd be stronger than 1400 ELO).

kinpin · Monday at 4:40 PM

Kavinsky said:
Yes, because the current "AI" models are nothing more than glorified predictive text and search systems.

They guess at what the most likely next text token is based on the knowledge they've trained on - they aren't "intelligent" in any meaning of the word.

Again what does “intelligence” ? I’m willing to bet you can’t define it.

benwaggoner · Monday at 4:49 PM

The takeaway from this may be that "general purpose AI models fail in a highly competitive market where bespoke AI is likely a significant factor."

Given the size of the sport betting market, there is no way that we're not already seeing a good number of bets made with AI assistance, and those would be refined models for this specific task.

It would be interesting to simulate how well the betting would go against games and odds from a decade ago, where there would have been a fair amount of computer assistance but not modern AI.

arsisloam · Monday at 8:56 PM

wildsman said:
1000? You gotta be joking.

You can get stockfish 18 (3600+ ELO) to run on your phone. Pretty sure these big companies won't have a problem running and doing tool calling to that. So they definitely don't do it.

Anyone with a rating above 400 knows what they're doing. People at that level are setting traps, understand and can avoid all of the early mating scenarios, and are thinking three or four moves ahead. Indian chess hustlers have taken over that middle area ever since Iran lost its internet. It's interesting how the ratings stratify like that. I miss chewing through the IRGC players.

I know you guys are enthralled in your mechanical dick measuring contest, but I actually enjoy chess. With my brain.

wildsman · Monday at 9:17 PM

arsisloam said:
Anyone with a rating above 400 knows what they're doing. People at that level are setting traps, understand and can avoid all of the early mating scenarios, and are thinking three or four moves ahead. Indian chess hustlers have taken over that middle area ever since Iran lost its internet. It's interesting how the ratings stratify like that. I miss chewing through the IRGC players.

I know you guys are enthralled in your mechanical dick measuring contest, but I actually enjoy chess. With my brain.

Hey don't blame me. I'm not the one putting random purity tests like 1200 elo and uscf membership.

zogus · Tuesday at 1:53 AM

halse said:
Nate Silver has written on how ChatGPT is lousy at poker, an interesting read
“But if there are examples where LLMs already seem to have superhuman capabilities, they’re very far from it in poker. And I’d argue that poker is a better test of general intelligence than some of the more discrete tasks that ChatGPT performs so well.”

https://www.natesilver.net/p/chatgpt-is-shockingly-bad-at-poker

I read the article expecting to hear that ChatGPT lost badly at poker, which would have been understandable—almost all humans suck at poker, too, and one shouldn’t expect a general chat engine to bluff with perfect precision. What I didn’t expect to hear was that it couldn’t even declare the correct winners, or keep accurate account balances of the players.

SraCet · Tuesday at 12:13 PM

wildsman said:
Hey don't blame me. I'm not the one putting random purity tests like 1200 elo and uscf membership.

Yeah, "sorry" for trying to nail down what people mean when they say XYZ is better than 99% of "chess players."

Heaven forbid we have a common understanding of what the term "chess player" means. We might end up talking about the same thing.

AI_Skeptic · 2026-04-16T14:46:14-0400

wildsman said:
It is hard to teach it to play chess without making it dumb in other ways (see catastrophic forgetting).

For Chess, I know there have been chess engines since 1979. See https://en.wikipedia.org/wiki/Video_Chess . The question becomes, why reinvent the wheel? Clearly Chess engines are well known, and serve a purpose. Why reinvent the wheel?

zogus · 2026-04-17T02:53:33-0400

AI_Skeptic said:
For Chess, I know there have been chess engines since 1979. See https://en.wikipedia.org/wiki/Video_Chess . The question becomes, why reinvent the wheel? Clearly Chess engines are well known, and serve a purpose. Why reinvent the wheel?

This is a pretty superficial take on things, akin to asking why we needed to invent the automobile when wheels have been around for thousands of years. Chess engines from 1979 bear little resemblance to the ones we have today, especially after the introduction of AlphaZero in 2017 made neural networks pretty much a required feature.

AI models are terrible at betting on soccer—especially xAI Grok

Ars Legatus Legionis

Smack-Fu Master, in training

Ars Praetorian

Ars Praetorian

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Tribunus Militum

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Tribunus Militum

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Legatus Legionis

Ars Legatus Legionis

Ars Tribunus Militum

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Centurion

Ars Tribunus Militum

Ars Tribunus Militum

Ars Praefectus

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Wise, Aged Ars Veteran

Ars Tribunus Angusticlavius