Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.
"No, really, I can totally do that. Do it? Now? Ehhh, not feeling it, y'know? I'll do it next time, you'll see"Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it
> Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 "immediately decides 'generating all those moves manually is impossible,' because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails." Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

My dude, LLMs do not even know what the task is; all they know is statistical relationships between words.
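(For scale: a Tower of Hanoi solution for n disks takes 2^n − 1 moves, so "over 1,000 moves" corresponds to just 10 disks. A minimal Python sketch of the standard recursion; the function and variable names here are my own:)

```python
def hanoi(n, source, target, spare, moves):
    """Append the (source, target) moves that shift n disks onto target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller disks

moves = []
hanoi(10, "A", "C", "B", moves)
print(len(moves))  # 1023, i.e. 2**10 - 1; the list doubles with every added disk
```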
> Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

I tried that in middle school. Didn't go well for me either.
> My dude, LLMs do not even know what the task is; all they know is statistical relationships between words.
>
> I feel like I am going insane. An entire industry's worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.

Every day I make this argument, and I feel like I'm losing my mind at how people just religiously form beliefs and call it science without describing what a system does, especially people with PhDs.
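(To make the "fancy Markov chain" jab concrete: real transformers condition on far more than the previous word, so this is the commenter's caricature rather than how an LLM works, but a word-level Markov chain really does just roll weighted dice over observed next words. A toy sketch, all names my own:)

```python
import random
from collections import defaultdict

# A word-level bigram "Markov chain": count which words follow which,
# then generate by rolling dice over the observed successors.
corpus = "the model solves the puzzle and the model fails the test".split()
chain = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    chain[prev].append(nxt)

word, output = "the", ["the"]
for _ in range(8):
    successors = chain.get(word)
    if not successors:                # dead end: word never had a successor
        break
    word = random.choice(successors)  # the dice roll
    output.append(word)
print(" ".join(output))
```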
> the 'engineered constraint' in this case was the removal of tool use. it's like asking a human to multiply two 20-digit numbers by hand and then writing a paper on the limits of human cognition when they fail.
>
> the interesting part isn't the failure in isolation, but the intelligence of knowing when to use a tool. connecting to tools isn't a limitation to be disclosed, it's the entire point of a modern reasoning system.

If your product needs to be connected to a tool, why wouldn't you disclose that to users?
> We don't know what intelligence is. Every test someone builds that AIs can't do, some humans can't do also (e.g. solve Tower of Hanoi from this paper ... a very large percentage of humans can't either, even with the same clues as given to the AI).

But you missed the key step in the paper.
> The fact that LLMs can't handle some reasoning tasks or that they fail on certain puzzles does not necessarily mean they're not reasoning (even if it isn't analogous to human reasoning).

What I keep thinking about is that the brain has multiple lobes that work in multiple ways.
> people keep trying to define intelligence in a way that excludes silicon. it's like watching the first airplane and saying 'further evidence that these are not at all birds'.
>
> the new thing is not the old thing, it is the new thing.

Depends. If you need to provide some custom tool for the problem at hand, it's no longer the LLM solving it.
the 'engineered constraint' in this case was the removal of tool use. it's like asking a human to multiply two 20-digit numbers by hand and then writing a paper on the limits of human cognition when they fail.
the interesting part isn't the failure in isolation, but the intelligence of knowing when to use a tool. connecting to tools isn't a limitation to be disclosed, it's the entire point of a modern reasoning system.
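(The 20-digit multiplication analogy is easy to make concrete: a task that's hopeless by hand, or token by token, is trivial once it's delegated to a calculator-style tool. A minimal sketch, assuming Python's arbitrary-precision integers stand in for the "tool"; the numbers are arbitrary examples:)

```python
# Two 20-digit numbers: error-prone to multiply by hand,
# trivial for an arbitrary-precision integer "tool".
a = 12345678901234567890
b = 98765432109876543210
print(a * b)  # exact product, no digit-tracking required
```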
> Great write up. I have always seen what we are calling AI as just advanced machine learning. No real reasoning behind it, just fancy, high-speed pattern recognition being sold as intelligence to raise massive VC funds. It is still useful, as you point out, but it is far from "thinking".

Right. If you word very simple logic scenarios in an unusual way, LLMs just completely fall apart. They can't go step by step determining what can and can't be true if you don't talk like most people or if you imagine unique puzzles.
To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work.
> But you missed the key step in the paper.
>
> They ran all of the models with Tower of Hanoi at increasing difficulty from easy to very hard. Then, as part of the prompt, they explicitly gave it the algorithm on how to solve the problem and reran all of the models. What is damning is that when given the algorithm, the performance did not change in any way. Not just the point where they failed, but at each level of difficulty. Being given how to solve the problem in an algorithmic way did not change how they tried to solve the problem.
>
> While it is true that when problems get hard many people will give up, when given the cheat code I'd expect them to do somewhat better, particularly on problems that are just a little hard. But the reasoning models couldn't use the solution algorithm even when explicitly given it.

Could the critics be saying that the models simply reject the given solution as being outside the bounds in which they are trying to solve the problem, and then go through the same process as when not having been given the answer? I'm not sure how the model would determine such a thing without giving some hint, though, given that the models under consideration (IIUC) are designed such that they "talk through" the problem, using the text output to try to simulate a reasoning process, rather than arriving at a result through a gestalt pass through the neural network. That "talk through" system doesn't mean that the text output tells the actual tale of how any part of the response is determined, though, so various potential parts of the decision tree would still end up hidden.
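(One reason this experiment is checkable at all: a Hanoi solution is just a move list, and every move can be verified against the puzzle rules. A hedged sketch of such a checker, my own code rather than the paper's evaluation harness:)

```python
def valid_hanoi_solution(n, moves):
    """Replay a list of (source, target) peg moves and verify legality."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # peg A holds n..1
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk onto smaller
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks on the goal peg

# The classic 3-disk solution passes:
print(valid_hanoi_solution(3, [("A","C"),("A","B"),("C","B"),
                               ("A","C"),("B","A"),("B","C"),("A","C")]))
```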
"In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs." Fat chance that will happen with all the capital and political inertia behind this scam. Besides, we have the morons in charge in the White House and Capitol building who have no clue. I guess this also explains why Apple is not so gung ho about incorporating AI in its products.
You referred to " connecting to tools" which implies the tools are external. You seem to be intentionally talking in circles. Good Day.the premise that the 'mind' and the 'tool' are separate things that need to be disclosed is the error. a modern reasoning system generates and executes its own ad-hoc tools (python code, for example) as an intrinsic part of its thought process.
asking it to 'disclose' this is like asking a human to 'disclose' that they use their prefrontal cortex to access memory. it's a fundamental misunderstanding of the system's architecture.
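(A minimal sketch of what "generates and executes its own ad-hoc tools" means in practice. Everything here is hypothetical scaffolding: the `ask_model` stub stands in for a real LLM call, and the hard-coded solver only shows the shape of the loop, in which the model writes code, the harness runs it, and the result feeds back in:)

```python
import subprocess, sys, tempfile

def ask_model(prompt):
    # Hypothetical stub for an LLM call. A real system would return
    # model-written code here; we hard-code a Hanoi solver for illustration.
    return (
        "def hanoi(n, a, c, b):\n"
        "    if n == 0: return []\n"
        "    return hanoi(n-1, a, b, c) + [(a, c)] + hanoi(n-1, b, c, a)\n"
        "print(len(hanoi(10, 'A', 'C', 'B')))\n"
    )

# The harness runs the model-written code instead of asking the model
# to enumerate 1,023 moves token by token.
code = ask_model("Solve 10-disk Tower of Hanoi.")
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(code)
result = subprocess.run([sys.executable, f.name], capture_output=True, text=True)
print(result.stdout)  # "1023", computed rather than recalled
```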
> My dude, LLMs do not even know what the task is; all they know is statistical relationships between words.
>
> I feel like I am going insane. An entire industry's worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.

If you have spent enough time with a certain type of engineer (and I have), you will not be surprised. Many CSEs, or just self-trained programmers and developers, take their narrow (albeit real) skillset in computer programming and extrapolate it confidently into so many other fields.
> My dude, LLMs do not even know what the task is; all they know is statistical relationships between words.
>
> I feel like I am going insane. An entire industry's worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.

This is my life right now as well, and I don't think I've ever seen anything like it. I'm in constant awe at how extremely intelligent and talented tech professionals who know damn well how these things work refuse to say that the emperor has no clothes on.