New Apple study challenges whether AI models truly “reason” through problems


Killdozer77

Ars Scholae Palatinae
642
The suggestions that the issue is engineered constraints, or that puzzles like these aren't appropriate for LLMs, raise other questions in my mind: If there are engineered constraints, shouldn't the LLMs' manufacturers tell their users about those constraints? And shouldn't they also be clear about what their LLMs are and are not good for?
 
Upvote
206 (210 / -4)

WokStation

Wise, Aged Ars Veteran
103
Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it
"No, really, I can totally do that. Do it? Now? Ehhh, not feeling it, y'know? I'll do it next time, you'll see"

Can't do it, won't do it, functionally it's not very different.

edit, ninja'd by Derecho Imminent. Curses!
 
Upvote
152 (153 / -1)
Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 "immediately decides 'generating all those moves manually is impossible,' because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails." Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.
My dude LLMs do not even know what the task is, all it knows is statistical relationships between words.

I feel like I am going insane. An entire industry's worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.
 
Upvote
412 (429 / -17)

Chuckstar

Ars Legatus Legionis
37,378
Subscriptor
I’ve pointed this out before: simply because of their design, LLMs would be expected to perform better at interpolation within areas of the problem space well represented in their training set, progressively worse at interpolation within areas poorly represented in their training set, and worst at extrapolating out to areas outside their training set. This should not be at all controversial, especially given that we’d expect it to be true, in a very general way, of just about any intelligence: it will be better at tasks similar to those it has seen before.

Simulated reasoning, for instance, could possibly improve results across the board, but would not fundamentally be expected to overcome that basic better, worse, worst pattern.

The interesting question Apple is trying to get at, IMHO, is not whether performance degrades as you move away from the training set, but whether it falls off a cliff in such a way as to preclude the claims of “reasoning”. The criticisms cited against Apple’s data may preclude their paper from providing any useful insight, though, since the cliff they’ve fallen off of might represent purposeful limitations designed into the models, rather than fundamental limitations of their reasoning.
 
Upvote
73 (89 / -16)

Legatum_of_Kain

Ars Praefectus
4,083
Subscriptor++
My dude LLMs do not even know what the task is, all it knows is statistical relationships between words.

I feel like I am going insane. An entire industry's worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.
Every day I make this argument, and I feel like I'm losing my mind at how people just religiously form beliefs and call it science without describing what a system does, especially people with PhDs.
 
Upvote
180 (186 / -6)

Killdozer77

Ars Scholae Palatinae
642
the 'engineered constraint' in this case was the removal of tool use. it's like asking a human to multiply two 20-digit numbers by hand and then writing a paper on the limits of human cognition when they fail.

the interesting part isn't the failure in isolation, but the intelligence of knowing when to use a tool. connecting to tools isn't a limitation to be disclosed, it's the entire point of a modern reasoning system.
If your product needs to be connected to a tool, why wouldn't you disclose that to users?
 
Upvote
72 (72 / 0)

mmorales

Ars Praetorian
476
Subscriptor
We don't know what intelligence is. Every test someone builds that AIs can't do, some humans can't do also (e.g. solve towers of hanoi from this paper ... very large percentage of humans can't either, even with the same clues as given to the AI).
But you missed the key step in the paper.

They ran all of the models on Tower of Hanoi at increasing difficulty, from easy to very hard. Then, as part of the prompt, they explicitly gave the models the algorithm for solving the problem and reran everything. What is damning is that when given the algorithm, the performance did not change in any way: not just at the point where they failed, but at each level of difficulty. Being handed the solution procedure did not change how the models tried to solve the problem.

It is true that when problems get hard, many people will give up. But when given the cheat code, I'd expect them to do somewhat better, particularly on problems that are just a little hard. The reasoning models, though, couldn't use the solution algorithm even when explicitly given it.
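
For anyone who hasn't seen it, the algorithm in question is tiny. A rough Python sketch (my own illustration, not the paper's exact prompt wording):

```python
def hanoi(n, src, aux, dst, moves):
    """Standard recursive Tower of Hanoi: move n disks from src to dst."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((src, dst))            # move the largest remaining disk
    hanoi(n - 1, aux, src, dst, moves)  # stack the smaller disks back on top of it

moves = []
hanoi(10, "A", "B", "C", moves)
print(len(moves))  # 1023 -- the count is 2**n - 1, so it grows exponentially with disk count
```

No search, no insight required, just bookkeeping, and the models still couldn't execute it.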
 
Upvote
260 (262 / -2)

colincornaby

Seniorius Lurkius
32
Subscriptor
The fact that LLMs can't handle some reasoning tasks or that they fail on certain puzzles does not necessarily mean they're not reasoning (even if it isn't analogous to human reasoning).
What I keep thinking about is the brain has multiple lobes that work multiple ways.

There is a certain amount of the brain that is maybe like an LLM, where we store and retrieve information in a similar way. Where LLMs seem to fall down is in what we'd call, for humans, "executive function." That is, you can retrieve something from your long-term memory, but there is a higher-level function that processes it, edits it, refines it, works out discrepancies, chews through it algorithmically, etc.

I'm not a neurologist - but I'd have to assume executive function in humans is more like factory firmware, not as much LLM. We humans all have enough in common with our executive function that it seems to be inherited. We all share a similar set of emotions. And it may have also been shaped by evolution. But it's not an open ended thing where it's just piles of data. It's clearly a more defined set of rules.

If I ask you what color the sky is, you'll probably answer blue because you've seen a sky before. That's memory, not really reasoning. If you've never seen the sky before, you'd have to figure out how to observe the sky - and that seems to be more the sort of task LLMs have trouble with. And that's more like reasoning.
 
Upvote
78 (80 / -2)
I feel like I had this happen to me. I asked an LLM to review all 50 state laws regarding particular aspects of state trade secret law and give me a summary chart. It did that fine.

Then I asked it to put it into an Excel file and the model shit the bed. First, all the formatting was wrong. I said try again and it only gave me the first 5 states. I said try again and the formatting was gone. Each time, it wrote a Python script to extract the data from the PDF it had generated. I told it to "copy manually" and it essentially refused in its chain of reasoning, generating the Python script again and saying the size of the request was unreasonable.

I suspect the critique saying that LLMs are RL'd to be efficient in reasoning is right on the money. I think there is room in the market for an LLM that is dumb but very power-efficient in generating tokens, so it can take basic formatting or other mTurk-like requests and happily iterate across 1,000+ page documents.
 
Upvote
-5 (10 / -15)

mmorales

Ars Praetorian
476
Subscriptor
Following my previous post, this is the performance vs. difficulty when they have to figure out how to solve the tower of hanoi problem (solid) and when given the solution algorithm as part of the prompt (dashed).

I struggle to see how this fits with a context-window limitation or the other plausible explanations. If given the algorithm, it should need far fewer tokens to work out a general solution, tokens it could then use to actually do the problem. It looks more like it can't actually follow reasoning steps, even when explicitly told how to do them.

[Attached chart: performance vs. difficulty, solid lines without the solution algorithm in the prompt, dashed lines with it]
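
For scale, a quick back-of-envelope on output length, assuming (my guess, not a number from the paper) roughly 10 tokens per printed move:

```python
TOKENS_PER_MOVE = 10  # rough assumption for a line like "move disk 3 from peg A to peg C"
for disks in (7, 10, 12, 15):
    moves = 2**disks - 1
    print(f"{disks} disks: {moves} moves, ~{moves * TOKENS_PER_MOVE:,} output tokens")
```

Under that estimate, only the largest sizes get anywhere near a typical context window; the smaller sizes come to thousands or tens of thousands of tokens.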
 
Upvote
133 (133 / 0)

Chuckstar

Ars Legatus Legionis
37,378
Subscriptor
I would point out that there are other AI systems that have proven quite good at solving novel puzzles, except the puzzles have to be translated into well-defined rules. Maybe the answer is to figure out how to get an LLM to do the translation and work synergistically with such a puzzle-solving AI.

That strikes me as much closer to how a human solves some conundrum:

1) If you’ve seen something like it before, one often tries to use the solution from before, or close variants, not altogether dissimilar to an LLM pattern-matching its way to a solution.

2) If you’ve never seen anything like it, one often just starts exploring the problem space in a way much more similar to the trial-and-error style of something like AlphaZero, by building a database of partial results until you can string some together into a full solution.
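
Something like this toy sketch is what I have in mind for the second mode: once the rules are written down explicitly, plain breadth-first search over game states solves the puzzle with no pattern matching at all (purely illustrative, and obviously not how AlphaZero itself works):

```python
from collections import deque

def solve_by_search(n_disks):
    """Brute-force BFS over explicit Tower of Hanoi states, defined only by the rules."""
    start = (tuple(range(n_disks, 0, -1)), (), ())  # all disks on the first peg, largest at bottom
    goal = ((), (), start[0])                       # everything moved to the third peg
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for src in range(3):
            if not state[src]:
                continue
            disk = state[src][-1]  # only the top disk on a peg can move
            for dst in range(3):
                if dst == src or (state[dst] and state[dst][-1] < disk):
                    continue  # never place a larger disk on a smaller one
                pegs = list(state)
                pegs[src] = pegs[src][:-1]
                pegs[dst] = pegs[dst] + (disk,)
                nxt = tuple(pegs)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [(src, dst)]))

print(len(solve_by_search(4)))  # 15 moves, found purely by exploring legal states
```

The LLM's job in the hybrid I'm imagining would be only the translation step: turning a messy natural-language puzzle into that kind of explicit rule encoding, then handing it off.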

EDIT: It’s not exactly the same as how humans do it, of course, since we don’t seem to treat those two as entirely separate modules. We seem to be able to simultaneously make use of the historical lessons that would be analogous to an LLM’s internal state and the newly learned, problem-specific results analogous to AlphaZero’s internal state.
 
Last edited:
Upvote
36 (37 / -1)
people keep trying to define intelligence in a way that excludes silicon. it's like watching the first airplane and saying 'further evidence that these are not at all birds'.

the new thing is not the old thing, it is the new thing.
the 'engineered constraint' in this case was the removal of tool use. it's like asking a human to multiply two 20-digit numbers by hand and then writing a paper on the limits of human cognition when they fail.

the interesting part isn't the failure in isolation, but the intelligence of knowing when to use a tool. connecting to tools isn't a limitation to be disclosed, it's the entire point of a modern reasoning system.
Depends. If you need to provide some custom tool for the problem at hand, it's no longer the LLM solving it.
Now, tools are of course useful for reducing the tokens needed, but if the tools are generic enough, why aren't they built into the model? And if they are not generic, then I think the critique is valid, since the model would be unable to solve a "new" problem that it lacks the tools for.

A truly reasoning agent should be able, given enough time, to solve the problem without tools, even if that means taking longer.

I can multiply two 20-digit numbers; I will need a generic tool in the form of pen and paper or similar external memory to make up for my inability to keep all the numbers in my head.

But with that pen and paper, I can apply the tool to more or less any problem.

If, on the other hand, I had required a calculator, that is a much more specific tool that, while useful for math, is quite useless for text or drawings or ...
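
To make the pen-and-paper point concrete, here is a toy sketch of schoolbook long multiplication over digit lists; the lists play the role of the paper, and the procedure itself is nothing but small, repeatable steps (purely illustrative, of course):

```python
def schoolbook_multiply(a: str, b: str) -> str:
    """Multiply two numbers given as digit strings, the way you'd do it on paper."""
    da = [int(c) for c in reversed(a)]   # least-significant digit first
    db = [int(c) for c in reversed(b)]
    out = [0] * (len(da) + len(db))      # the "paper": one column per result digit
    for i, x in enumerate(da):
        carry = 0
        for j, y in enumerate(db):
            total = out[i + j] + x * y + carry
            out[i + j] = total % 10      # write down the digit
            carry = total // 10          # carry the rest to the next column
        out[i + len(db)] += carry
    while len(out) > 1 and out[-1] == 0: # drop leading zeros, if any
        out.pop()
    return "".join(str(d) for d in reversed(out))

a, b = "12345678901234567890", "98765432109876543210"
assert schoolbook_multiply(a, b) == str(int(a) * int(b))  # cross-check against native big ints
```

The same trick of externalizing memory works for more or less anything you can write down, which is what makes pen and paper a generic tool rather than a calculator.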

And the most important lesson is that while the models are presented as reasoning models, they are not truly reasoning. They are using very advanced pattern matching on both the supplied prompt and all the training data, and as far as I know they also do some recursive processing that lets them form new patterns to search for based on the results of the first pass.

And knowing the limits of that reasoning allows for more realistic expectations of what can be done.

LLMs will keep evolving, and adding more recursive pattern matching and ways to store more intermediate data will improve this, because our own brain is also an expert at pattern matching.

The reasoning comes from doing millions of such pattern matches, often with slight variations, selecting the "good" results, and then doing more matching over and over until we have a result we are fine with.

I think current LLMs are just at the beginning of this: they do not learn over time, and they cannot do quite the same level of parallel processing and cross-matching we do, at least not at reasonable cost in power and performance.

Figuring out how to store temporary results, how to branch off into different theories, and how to validate them will require a lot more science, I think.

Though I could be wrong. Had you asked me six years ago whether today's AI would be possible in the near future, I would have laughed and called it impossible. My god, was I wrong on that :D

But I still think too many people overestimate what AI can do and underestimate it at the same time. You just need to find the areas where it really does shine, and many of those involve hallucinations; you just need to find the cases where hallucinations are what you actually want ;)
 
Upvote
11 (18 / -7)
Great write-up. I have always seen what we are calling AI as just advanced machine learning. No real reasoning behind it, just fancy, high-speed pattern recognition being sold as intelligence to raise massive VC funds. It is still useful, as you point out, but it is far from "thinking".
Right. If you word very simple logic scenarios in an unusual way, LLMs just completely fall apart. They can't go step by step working out what can and can't be true if you don't talk the way most people do, or if you invent unique puzzles.
 
Upvote
53 (53 / 0)

FlyingGoat

Wise, Aged Ars Veteran
117
Subscriptor++
To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work.

Neural nets, however, are not new, and this seems to be a common property of them: if they're large and complex enough to do something interesting that you don't know how to do without them, then merely having the neural net doesn't tell you much about the problem, or about how the net solves it.
 
Upvote
30 (30 / 0)

asharkinasuit

Ars Centurion
239
Subscriptor
For me, apart from the obvious and persistent reliability issue, one other major issue with the proposition of outsourcing generation of problem solutions to AI is that it implicitly seeks to relegate humans to the position of verifier. We see this with use cases such as generating various kinds of documents, which invariably have to be checked for accuracy or for satisfaction of the domain's specific constraints.

The problem with this, as I see it, is that it is widely known or at least assumed that solving a problem is generally harder than verifying a given solution. This leads to at least two possible conclusions:
  • Optimistically speaking, perhaps AI will help us generate solutions to more complex problems than we could hope to solve ourselves, where even verification gets tricky. Perhaps the recent advances in protein folding could be an example of that.
  • Pessimistically speaking, this may lead to a situation where humans slowly lose their ability to generate solutions themselves, which seems undesirable.

It is worth considering, too, how people gain the ability to verify solutions to nontrivial problems. In creative domains, this is called taste; in technical domains, I guess it's expertise. But how does one gain these things except by trying things oneself? Ironically, there already seems to be a temptation to act exactly like an LLM, blindly learning the patterns of elegant solutions or desirable creations in order to seem smart; the hope, of course, is that we can actually motivate our preferences.

I'm reminded of that recent article about teachers' difficulties dealing with the consequences of AI, with an especially poignant article (https://www.404media.co/teachers-are-not-ok-ai-chatgpt/) linked to by one of the commenters.
 
Upvote
18 (19 / -1)

Chuckstar

Ars Legatus Legionis
37,378
Subscriptor
But you missed the key step in the paper.

They ran all of the models on Tower of Hanoi at increasing difficulty, from easy to very hard. Then, as part of the prompt, they explicitly gave the models the algorithm for solving the problem and reran everything. What is damning is that when given the algorithm, the performance did not change in any way: not just at the point where they failed, but at each level of difficulty. Being handed the solution procedure did not change how the models tried to solve the problem.

It is true that when problems get hard, many people will give up. But when given the cheat code, I'd expect them to do somewhat better, particularly on problems that are just a little hard. The reasoning models, though, couldn't use the solution algorithm even when explicitly given it.
Could the critics be saying that the models simply reject the given solution as being outside the bounds within which they are trying to solve the problem, and then go through the same process as when they haven't been given the answer? I’m not sure how the model would determine such a thing without giving some hint, though, given that the models under consideration (IIUC) are designed to “talk through” the problem, using the text output to try to simulate a reasoning process, rather than arriving at a result through a single gestalt pass through the neural network. That “talk through” approach doesn’t mean the text output tells the actual tale of how any part of the response is determined, though, so various parts of the decision tree could still end up hidden.
 
Upvote
5 (5 / 0)

tharpold

Smack-Fu Master, in training
68
Once we survive, if we ever do, the tulip-mania phase of the "AI" field as it now operates, surely these tools will be useful in many ways, probably immensely so for any number of daunting tasks associated with "reasoning" that require time-intensive, large-scale investigation and case-testing.

But if we equate even extremely elaborate, iterative pattern matching of the intertwingles of a finite corpus of material (I'll leave aside the reality that most of the record of human thought, writing, speaking, art, and, yes, reasoning is still outside the LLMs, and most of it will never get into them) with genuinely "generative/general" intelligence, then we're missing a key fact. What's missing from all the LLM hubris and enthusiasm is a reflexive consciousness of the limits of language, of the aspects of experience that exceed its reach and are also, paradoxically, the source of its actual innovations.

If the principal function of speech were to (only) describe things and actions, then every speech act would be no more interesting than a cookie recipe. Lovely for making yummy cookies but not so good for speaking to the world.

Two examples of what I mean –



Tristan Tzara, "To Make a Dadaist Poem" (1920)

Take a newspaper.
Take some scissors.
Choose from this paper an article the length you want to make your poem.
Cut out the article.
Next carefully cut out each of the words that make up this article and put them all in a bag.
Shake gently.
Next take out each cutting one after the other.
Copy conscientiously in the order in which they left the bag.
The poem will resemble you.
And there you are—an infinitely original author of charming sensibility, even though unappreciated by the vulgar herd.



William Shakespeare, "Sonnet 106" (1609)

When in the chronicle of wasted time
I see descriptions of the fairest wights,
And beauty making beautiful old rhyme
In praise of ladies dead and lovely knights,

Then, in the blazon of sweet beauty’s best,
Of hand, of foot, of lip, of eye, of brow,
I see their antique pen would have express’d
Even such a beauty as you master now.

So all their praises are but prophecies
Of this our time, all you prefiguring;
And, for they look’d but with divining eyes,
They had not skill enough your worth to sing:

For we, which now behold these present days,
Have eyes to wonder, but lack tongues to praise.



One of these is not like the other, though they have much in common. I don't mean simply that Shakespeare is a greater poet than Tzara (love me some Tzara, though yeah, the Bard is greater).

What I mean is that the first poem is a parody of how we produce linguistic invention and poetic expression and a compelling demonstration of the power of merely combinatorial play. (Try Tzara’s recipe sometime. You'll be surprised by what can happen.) Whatever poetry comes out of it is the product of how we read beyond the sequences of the words we pull from the bag.

The second poem is not merely one of the greatest examples of poetic art in the history of the world; it is that precisely because it's a knowing, unforgiving critique of the hubris of all confectors of poetry, including Bill Shakespeare, and of the limits of language's capture of experience, no matter how many words you try out, recombine, and eventually use. Both poems are, in different ways, informed by an actual theory of language that doesn't mistake the exhaustion of combinations or the received meanings of words for the expression of the truth of things.

What Tzara and Shakespeare have in common is that both understand that what's left unsaid is where the action is.
 
Upvote
43 (45 / -2)

Unclebugs

Ars Praefectus
3,105
Subscriptor++
"In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs." Fat chance that will happen with all the capital and political inertia behind this scam. Besides, we have the morons in charge in the White House and Capitol building who have no clue. I guess this also explains why Apple is not so gung ho about incorporating AI in its products.
 
Upvote
33 (34 / -1)

LordInternet

Ars Scholae Palatinae
871
"In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs." Fat chance that will happen with all the capital and political inertia behind this scam. Besides, we have the morons in charge in the White House and Capitol building who have no clue. I guess this also explains why Apple is not so gung ho about incorporating AI in its products.

What else can one expect, so long as the person with the most notoriety in AI is also a crypto guy with no bachelor's degree?

Good on Apple for being conservative here. The last thing I want is a chatbot sucking up large amounts of system RAM and storage space whilst providing almost no benefit and no way to disable it.
 
Upvote
49 (50 / -1)

Killdozer77

Ars Scholae Palatinae
642
the premise that the 'mind' and the 'tool' are separate things that need to be disclosed is the error. a modern reasoning system generates and executes its own ad-hoc tools (python code, for example) as an intrinsic part of its thought process.

asking it to 'disclose' this is like asking a human to 'disclose' that they use their prefrontal cortex to access memory. it's a fundamental misunderstanding of the system's architecture.
You referred to "connecting to tools," which implies the tools are external. You seem to be intentionally talking in circles. Good Day.
 
Upvote
33 (36 / -3)

Hyoubu

Ars Scholae Palatinae
736
My dude LLMs do not even know what the task is, all it knows is statistical relationships between words.

I feel like I am going insane. An entire industry's worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.
If you have spent enough time with a certain type of engineer, and I have, you will not be surprised. Many CS grads or just self-taught programmers and developers take their narrow (albeit real) skill set in computer programming and confidently extrapolate it into so many other fields.

I recall an argument I had with a very famous game developer who had broadly concluded something a bit stereotypical about all of his workers' behavior, personality, and other traits you'd need to be a psychologist doing hours of individual sessions to conclude anything about, and when I called him out he said, "I do this thing called pattern matching so I know what I am talking about." The irony is this was before LLMs were even a thing!
 
Upvote
45 (45 / 0)

shakethedisease

Smack-Fu Master, in training
22
My dude LLMs do not even know what the task is, all it knows is statistical relationships between words.

I feel like I am going insane. An entire industry's worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.
This is my life right now, as well, and I don't think I've ever seen anything like it. I'm in constant awe at how extremely intelligent and talented tech professionals who know damn well how these things work refuse to say that the emperor has no clothes on.
 
Upvote
70 (70 / 0)