Systems from Google, OpenAI, Anthropic, and xAI struggle with the Premier League.
> here's a fun site that shows a bunch of different models with ELOs. the highest is gpt-5-2025-08-07-medium at 1086. and, it costs $5/game in tokens.
> https://maxim-saplin.github.io/llm_chess/

Interesting site, thanks.
> There's nothing more tedious than the "soccer vs. rugby" debate that occurs whenever a Brit discovers that American English differs a little from British English. I must have stumbled on this same exact exchange a hundred times over my years on the Internet.

You seem to be attributing surprise to the wrong person. "Brits" are only too aware of the linguistic foibles of USAns.
> Right...

No, you need to prove that, and you need to provide evidence for how they are different.
So how do you explain their improvement from GPT-3 etc. to now, where the frontier SOTA Elo shows they're better than 99% of chess players?
https://chessbenchllm.onrender.com/
You dramatically overestimate how good "even the most basic chess engine" is. An engine with material-only evaluation and no tree extensions or pruning would probably be rated under 1000.
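For concreteness, "material-only evaluation" can be sketched in a few lines. This is illustrative Python with a made-up board format (square -> piece letter, uppercase = White), not any engine's actual code:

```python
# Material-only evaluation: sum fixed piece values, no search extensions,
# no pruning. Board format here is invented for illustration.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

def material_eval(board):
    """Material balance in pawns; positive means White is ahead."""
    score = 0
    for piece in board.values():
        value = PIECE_VALUES[piece.lower()]
        score += value if piece.isupper() else -value
    return score

# Toy position: White has an extra knight.
position = {"e1": "K", "e8": "k", "d4": "N", "a7": "p", "a2": "P"}
print(material_eval(position))  # 3
```

An engine built on little more than this (plus legal-move generation) will grab material and hang everything else, which is why it lands at the rating level described above.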
1000? You gotta be joking. You can get Stockfish 18 (3600+ Elo) to run on your phone. Pretty sure these big companies won't have a problem running it and tool-calling it. So they definitely don't do it.
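For reference, "tool-calling" an engine like Stockfish just means speaking the UCI protocol over its stdin/stdout. A minimal sketch of the harness side; the two helper functions are hypothetical glue, but `position fen ...`, `go movetime ...`, and the `bestmove ...` reply are real UCI commands:

```python
# Sketch of the harness side of a UCI tool call; the engine process itself
# (e.g. Stockfish launched via subprocess) is omitted. Helper names are
# hypothetical; the command strings follow the actual UCI protocol.

def uci_request(fen, movetime_ms=1000):
    """Commands a harness would write to the engine's stdin."""
    return f"position fen {fen}\ngo movetime {movetime_ms}\n"

def parse_bestmove(engine_output):
    """Pull the move out of the engine's 'bestmove <move> ...' line."""
    for line in engine_output.splitlines():
        if line.startswith("bestmove"):
            return line.split()[1]
    raise ValueError("no bestmove line in engine output")

start_fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(uci_request(start_fen, 500))

# Canned example of an engine reply:
print(parse_bestmove("info depth 20 score cp 31\nbestmove e2e4 ponder e7e5"))  # e2e4
```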
> just note that Stockfish is not an LLM. it is a NNUE

I agree - I didn't say it was. The prior poster was saying that the reason LLMs are improving at chess is because they're calling engines as tools in the bg.
> No, you need to prove that, and you need to provide evidence for how they are different.

I have provided evidence:
> so LLMs aren't improving at all. they're just being shown how to do what we would call "cheating" if they were human.

Huh? Are you not following the thread?
great.
> 1000? You gotta be joking. You can get Stockfish 18 (3600+ Elo) to run on your phone.

... and you think Stockfish 18 is "the most basic chess engine"?
> Huh? Are you not following the thread?

my bad. mixed signals.
They're not being 'shown' (no tools are being called); if tools were being called, their Elo would be much greater than 1400.
> Interesting site, thanks.

No, you can cull that list down to people who have played more than 100 games and it would still be in the 500s.
The "average chess.com player" rating in that list is probably meaningless. Presumably dragged down by all the random Joe Sixpacks who barely know the rules of chess, sign up for chess.com, play a game or two, lose, and then never play again.
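The effect described above is easy to make concrete with invented numbers: churn accounts pull the median down, and filtering by games played shifts it, but not dramatically:

```python
# Illustrative only: ratings and game counts below are made up to show how
# filtering out near-inactive accounts shifts the median rating.
from statistics import median

players = [
    {"rating": 350, "games": 2},    # signed up, lost a couple, quit
    {"rating": 400, "games": 1},
    {"rating": 550, "games": 300},  # regulars
    {"rating": 600, "games": 150},
    {"rating": 900, "games": 500},
]

all_ratings = [p["rating"] for p in players]
active_ratings = [p["rating"] for p in players if p["games"] > 100]
print(median(all_ratings), median(active_ratings))  # 550 600
```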
> IIRC the median rating of a USCF club player is around 1200. These are the people I would call "chess players."

USCF club players need to pay for membership and official ratings - not to mention that you have to play 20-odd games to get a non-provisional rating. You can't use a paid tier as a purity test for something as simple as a 'chess player'.
> ... and you think Stockfish 18 is "the most basic chess engine"?

It can run on your phone. Exactly how small do you want to go without deliberately nerfing it? Do you think these big companies like Google or Anthropic will have any trouble running them and calling them as tools?
I mean, this is getting to be a question of personal interpretation more than anything....
> Over the decades, people have created thousands of different chess engines. They're all different from each other in terms of how "basic" they are. Stockfish 18 is one of the most complicated, sophisticated chess engines ever made. Maybe THE most complicated and sophisticated. Suggesting that it might be "basic" to any degree at all is catastrophically wrong.

Whether you want to call something that can run on a phone (vs. something that needs a supercomputer to run) 'basic' is a matter of interpretation.
> Whether you want to call something that can run on a phone (vs something that needs a supercomputer to run) as 'basic' is a matter of interpretation.

I feel like we're speaking different languages. What does hardware have to do with how "basic" a chess engine is or isn't?
> But we have really deviated from the main point that these companies like Google will have no trouble calling Stockfish 18 as an MCP tool. You seem to think that they might ...

Huh? No I don't. I had no idea that was your main point, nor do I care.
Large language models (LLMs) are bad at chess.
And yet, as a three-time National Chess Champion and a two-time U.S. Women’s Chess Champion, I love to play against them. Not because they push me to play my best, but because of what they reveal about human nature.
Playing chess with LLMs has taught me how uniquely creative and diverse human beings are, how susceptible humans are to flattery and sycophancy, and how AI is beginning to shape human behavior.
LLMs are not meant to play chess well at all. After all, they are designed to predict what’s most likely to come next and to flatter us. AI-powered chess algorithms aren’t trying to crush you; they are trying to keep you playing. But in their interestingly bad chess play, we can learn lessons beyond the table or the token.
When I first challenged ChatGPT4 to a chess game, it played decently, but I still got a great position after 15 moves and won a knight. Just as my advantage mounted, it hallucinated a phantom piece to recapture my queen. In other words, it cheated! At first, this didn’t make much sense. Aren’t off-the-rack LLMs more known for sycophancy than for stealing?
So I started to play the worst moves I could think of against ChatGPT. It bent the rules yet again, but this time in my favor. Phantom pieces replaced the pieces I had blundered. Whether I played better or worse than ChatGPT, it ended up keeping me at the same level it was playing at. It wasn't always cheating, but it was always confabulating. When humans confabulate, we try to fill in the gaps of our memories or dreams with the most logical sequence. ChatGPT was doing the same thing.
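The "phantom piece" behavior is exactly what a thin validation layer around the model would catch. A minimal sketch of one such check, with an invented board format (square -> piece letter, uppercase = White); a real harness would verify full move legality:

```python
# Toy check: a claimed capture is "phantom" unless the target square
# actually holds an opponent's piece. Board format invented for illustration.

def is_phantom_capture(board, target, by_white):
    """True if a claimed capture on `target` takes a piece that isn't there."""
    piece = board.get(target)
    if piece is None:
        return True  # empty square: nothing there to capture
    # A real capture must take an opponent's piece, not your own.
    return piece.isupper() != (not by_white)

board = {"d8": "q", "e1": "K"}  # Black queen on d8, White king on e1
print(is_phantom_capture(board, "d8", by_white=True))  # False: real capture
print(is_phantom_capture(board, "h5", by_white=True))  # True: nothing on h5
```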
> The only reason you can make money betting sports is because humans choose stupid on certain bets. This forces the line from reality because the casino tries to make equal winners and losers. You aren't playing the casino. You are playing the other bettors.

Back in the day, the head of my father's firm of solicitors relied on local chauvinism when betting on football. Bookmakers lay off the odds based on their exposure.
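The "equal winners and losers" point is just arithmetic. With a balanced book and payout odds set slightly below fair value (numbers invented), the bookmaker's profit is the same whichever side wins:

```python
# Balanced book at 1.91 decimal odds each way (a fair 50/50 would pay 2.00).
stake_home = 1000.0   # total staked on the home side
stake_away = 1000.0   # total staked on the away side
payout_odds = 1.91

pool = stake_home + stake_away
profit_if_home = pool - stake_home * payout_odds  # ~90 either way
profit_if_away = pool - stake_away * payout_odds
print(profit_if_home, profit_if_away)
```

Which is why the line tracks where the money goes rather than reality: the book balances exposure, and any edge comes from the other bettors' mispricing.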
> I feel like we're speaking different languages. What does hardware have to do with how "basic" a chess engine is or isn't? Yes, you can run Stockfish on a phone. What does that matter? How is that relevant to anything? You can run any chess engine on a phone. That has nothing to do with how "basic" a chess engine is or isn't. There is no chess engine that requires a supercomputer to run.
> I took issue with this sentence that you wrote: "If they did a tool call to even the most basic chess engine, it would be at 3400+ elo not ~2000 elo." Because the most basic chess engine is dramatically weaker than 3400+ Elo.
> Here's a list of 602 chess engines, with ratings:
> https://www.computerchess.org/
> 89% of them are rated under 3400. So I don't know why you think "even the most basic chess engine" would be rated 3400+.

Fine, if you have a problem with my usage of the term 'most basic engine', I will concede that. It doesn't detract from my main point about LLMs not calling engines as tools (they'd be stronger than 1400 Elo).
> Yes, because the current "AI" models are nothing more than glorified predictive text and search systems. They guess at what the most likely next text token is based on the knowledge they've trained on - they aren't "intelligent" in any meaning of the word.

Again, what does "intelligence" mean? I'm willing to bet you can't define it.
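"Glorified predictive text" has a literal minimal form: count which token follows which, then always guess the most frequent successor. A toy bigram model (corpus invented) shows the mechanic, though modern LLMs learn the distribution with a neural network over long contexts rather than a lookup table:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count, for every token, what followed it in the training text.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Guess the most frequently observed successor of `token`."""
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # cat ("cat" followed "the" twice, others once)
```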
> 1000? You gotta be joking. You can get Stockfish 18 (3600+ Elo) to run on your phone.

Anyone with a rating above 400 knows what they're doing. People at that level are setting traps, understand and can avoid all of the early mating scenarios, and are thinking three or four moves ahead. Indian chess hustlers have taken over that middle area ever since Iran lost its internet. It's interesting how the ratings stratify like that. I miss chewing through the IRGC players.
> Anyone with a rating above 400 knows what they're doing. People at that level are setting traps, understand and can avoid all of the early mating scenarios, and are thinking three or four moves ahead.

Hey, don't blame me. I'm not the one putting up random purity tests like 1200 Elo and USCF membership.
I know you guys are enthralled by your mechanical dick-measuring contest, but I actually enjoy chess. With my brain.
> Nate Silver has written on how ChatGPT is lousy at poker, an interesting read
> "But if there are examples where LLMs already seem to have superhuman capabilities, they're very far from it in poker. And I'd argue that poker is a better test of general intelligence than some of the more discrete tasks that ChatGPT performs so well."
> https://www.natesilver.net/p/chatgpt-is-shockingly-bad-at-poker

I read the article expecting to hear that ChatGPT lost badly at poker, which would have been understandable: almost all humans suck at poker, too, and one shouldn't expect a general chat engine to bluff with perfect precision. What I didn't expect to hear was that it couldn't even declare the correct winners, or keep accurate account balances of the players.
> Hey don't blame me. I'm not the one putting random purity tests like 1200 elo and uscf membership.

Yeah, "sorry" for trying to nail down what people mean when they say XYZ is better than 99% of "chess players."
> It is hard to teach it to play chess without making it dumb in other ways (see catastrophic forgetting).

For chess, I know there have been chess engines since 1979. See https://en.wikipedia.org/wiki/Video_Chess . The question becomes: why reinvent the wheel? Clearly chess engines are well known, and serve a purpose.
> For Chess, I know there have been chess engines since 1979. The question becomes, why reinvent the wheel?

This is a pretty superficial take on things, akin to asking why we needed to invent the automobile when wheels have been around for thousands of years. Chess engines from 1979 bear little resemblance to the ones we have today, especially after the introduction of AlphaZero in 2017 made neural networks pretty much a required feature.