Has AI improved in terms of reasoning and thinking ability?

RagingWarGod · Sunday at 9:48 PM

It's been a while since I checked up on the status of where AI is currently at, is it at the point to where LLMs and the overview on google are reliable sources of knowledge or are we still stuck in them being kinda nonsense engines or sycophants? I know the odds of people not using them is slim to none but are we closer to actual "sentient" machines or are people still just hyping it up.

So far the only thing I've seen with AI is the new datacenters erupting everywhere and just ruining people's lives. I'm asking about AI here because some of the main hubs for it border on religious thinking, making it sound like some second coming or more capable than it really is.

Bardon · Sunday at 9:51 PM

RagingWarGod said:
It's been a while since I checked up on the status of where AI is currently at, is it at the point to where LLMs and the overview on google are reliable sources of knowledge or are we still stuck in them being kinda nonsense engines or sycophants? I know the odds of people not using them is slim to none but are we closer to actual "sentient" machines or are people still just hyping it up.

So far the only thing I've seen with AI is the new datacenters erupting everywhere and just ruining people's lives. I'm asking about AI here because some of the main hubs for it border on religious thinking, making it sound like some second coming or more capable than it really is.

Honest question, why are you asking here rather than looking up the information yourself? You stated it's been a while since you looked into it... why do you expect Ars to do your research for you?

RagingWarGod · Sunday at 10:48 PM

Bardon said:
Honest question, why are you asking here rather than looking up the information yourself? You stated it's been a while since you looked into it... why do you expect Ars to do your research for you?

I ask people who know this stuff better than I do. I also know how my mind works and how easily influenced I am which is why I’m being cautious since I could just easily fall for hype as much as genuine information.

To avoid repeating old news due to how my brain works I have to ask other folks who aren’t like me and aren’t prone to my weaknesses in thinking.

Kyuu · Monday at 3:31 AM

What is currently being called "AI" can neither reason nor think. The question itself reveals a gross misunderstanding of what it is.

VividVerism · Monday at 8:11 AM

Anecdotally, the search summaries are still hot garbage for me probably close to half the time. I don't think I've ever found the information they present in any of the "cited" sources they turn up. They consistently generate fake manpages and commands.

I've started finding a few areas where the full chatbot interface can be helpful. Mostly they involve things I can and do immediately confirm myself, but where traditional search for whatever reason just completely fails to find what I'm looking for because it is buried in irrelevant SEO bullshit or in somewhat similar, not actually relevant, but vastly more common/more popular content.

But really...I'm kind of surprised you haven't noticed this for yourself at this point.

Edit: typo. I meant SEO, not SSO. Stupid mixing acronyms before caffeine.

wxfisch · Monday at 8:49 AM

Like many tools, LLMs can be useful in the right situations and cause a lot of problems in all of the others. I think of our tooling at work (Copilot) as a junior intern that can get me a good start on a lot of things, but all of the work needs to be checked and I will probably still need to complete the last 20%. It is still a time-saver in many situations (usually around drafting documentation and emails, and summarizing documentation, emails, and meetings; as well as finding information that is squirreled away somewhere I can't pin down). LLMs, though, cannot think or reason; they will never be able to because of how they work.

Search summaries can be useful as a way to sort through results when something may have multiple meanings, since they generally include reference links, but otherwise I find the information is wrong as often as it is right, is often misleading at best, and is always very confident regardless. But again, it's not thinking or reasoning on anything, just summarizing search results.

As another example, I was having issues with my NAS at home. I ran my support file through Claude to see if it could highlight any logs to point me in the right direction, and it confidently concluded that one of my SSD cache drives was on the verge of failing. When the NAS continued to have issues I reached out to the vendor for email support, and after a week of working through the development team they finally gave me some commands to clear out some stale health data in the cache config. It has been working fine ever since. Claude was very confident in the problem and potential solutions, but it was wrong because it didn't really understand what it was looking through at a real level; it just knew that some terms related to SMART, SSDs, and RAID caches often lead to more terms related to drive failures and replacement of drives, so that is what it provided me.

Agentic and reasoning models, I guess, are a little better in this regard since they run their responses through multiple phases of LLM generation to "think", but the results are just ever so slightly more useful for a lot more compute used.

Pino90 · Monday at 9:44 AM

Kyuu said:
What is currently being called "AI" can neither reason nor think. The question itself reveals a gross misunderstanding of what it is.

The "stochastic parrot" conjecture is... kinda outdated tbh, and shows that whoever is repeating it (parroting it?) doesn't actually understand the tech -- or, more probably, is repeating an understanding that was maybe true in 2016 but hasn't been for... years.

The "parrot" objection rests on a picture of cognition that was philosophically... wrong? long before LLMs existed: that somewhere behind the symbols there's "real" understanding, and that symbol manipulation is merely a surface representation of it. But this gets the relationship between thought and representation exactly backwards.

Consider where arithmetic actually came from. Early humans didn't first develop an abstract concept of addition and then decide to represent it with pebbles or notches on bones. The manipulation of physical objects was the arithmetic — the concept and its representation were born simultaneously and inseparably. There was never a stage where "pure" addition existed independently of some instantiation. Peano's axioms don't point at addition from the outside; they constitute it from within.
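
For anyone who wants to see the "constitute it from within" point spelled out, here's a small Lean 4 sketch. The names `Peano` and `Peano.add` are my own, chosen for illustration; this is a toy definition, not the standard library's `Nat`. Addition here is nothing more than the two rewrite rules, and the final line is checked purely by applying them.

```lean
-- A toy copy of the natural numbers: zero, and "one more than" something.
inductive Peano where
  | zero : Peano
  | succ : Peano → Peano

-- Addition is defined by two rules and nothing else.
def Peano.add : Peano → Peano → Peano
  | n, .zero   => n
  | n, .succ m => .succ (Peano.add n m)

-- 1 + 1 = 2, verified by unfolding the rules above.
example :
    Peano.add (.succ .zero) (.succ .zero) = .succ (.succ .zero) := rfl
```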

This isn't just true at the primitive end, but it scales all the way to the most abstract mathematics. A decision problem isn't merely associated with a formal language, but it's actually one. Problem and representation aren't two things in correspondence; they're one thing. The most foundational framework we have for mathematical reasoning defines reasoning as representation.

So the gap the "parrot" objection assumes has nowhere to exist. Not at the primitive end, not at the formal end, and nowhere in between. The burden falls entirely on the objector to specify where, in this continuous history from pebbles to formal languages, this extra ingredient called "real understanding" (or reasoning or thinking) appears and what it actually consists of beyond manipulating the representation of a concept (which sometimes happens to be the physical object itself).

The question isn't whether symbol manipulation can constitute understanding (it clearly can), but whether a given system's manipulation is rich enough, which is the empirical question. And it's the burden of the objector to prove that it actually can't, rather than just saying "I understand it better than you" without specifying anything more.

This is deeply connected to the structural understanding of neural networks: evidence tells us that LLMs don't actually memorize math, but they understand it. I'm not referring to arithmetic which is so mechanical that it still gets memorized (there was an interesting advance about it by Anthropic and reported here on Ars but I can't find it RN), but to "actual" math.

In fact, there's evidence from inside the networks themselves: mechanistic interpretability research has found that neural networks don't just store math facts, but they independently rediscover actual mathematical algorithms. The "parrot" framing assumes the model is replaying sequences it has seen, but what we actually observe, when we look inside, is a model that has built its own mathematical machinery (sometimes new math machinery!!!).

Example of the machinery from the scientific literature:

- The 2022 paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" documented a sudden phase transition in neural network training: a model trained on modular arithmetic memorized its training examples quickly, then appeared to plateau with zero improvement on unseen data. However, that was until it suddenly achieved perfect generalization. Not "memorization," but actual structural understanding of the math.

- Subsequent research by Neel Nanda and collaborators showed that while the transition appears sudden externally, internally the network is going through a gradual, two-stage process: first it builds a generalizing algorithm alongside the memorizing one, then it discards the memorizing algorithm entirely. It's only when both happen that you observe the outward "snap" to generalization. (attached: super interesting read from Quanta Magazine https://www.quantamagazine.org/how-do-machines-grok-data-20240412/ )

- Researchers have literally opened up the networks and reverse-engineered what algorithm they're running. Neel Nanda reverse-engineered the weights of the model and found it had discovered a Fourier multiplication algorithm with a recognizable, mathematically elegant strategy for computing modular arithmetic, not a pattern-matching heuristic. So not a stochastic Fourier parrot but rather... an actual algorithm that the machine "generalized."
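
To make the "Fourier multiplication" idea a bit more concrete, here's a minimal Python sketch of that style of computation. To be clear about what's assumed: this is a hand-written toy, not the network's weights and not code from the paper. The function name `modular_add_via_fourier` and the default frequencies are my own picks for illustration. The idea is to turn `a` and `b` into rotations, combine them with the angle-addition identities, and pick the residue `c` whose rotation lines up best.

```python
import math

def modular_add_via_fourier(a, b, p, freqs=(1, 2, 5)):
    """Return (a + b) mod p using only trig identities and an argmax.

    freqs is an illustrative choice; keeping frequency 1 in the set
    guarantees a unique maximum for any modulus p >= 2.
    """
    scores = []
    for c in range(p):
        score = 0.0
        for k in freqs:
            w = 2 * math.pi * k / p
            # Angle-addition identities: build cos/sin of w*(a+b)
            # from the separate cos/sin of w*a and w*b.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # This equals cos(w*(a+b-c)), which reaches 1 when c == (a+b) mod p.
            score += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        scores.append(score)
    # The best-aligned residue is the answer.
    return max(range(p), key=scores.__getitem__)

print(modular_add_via_fourier(100, 50, 113))  # (100 + 50) mod 113
```

Note there is no lookup table of sums anywhere in there, which is the point of the finding: the answer falls out of constructive interference across a few frequencies rather than from remembered examples.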

Now let's get to the actual evidence. While some... "haters," for lack of a better word, keep on repeating that LLMs can't reason, we have LLMs solving hard mathematical problems in real life. Critically, these are unsolved problems whose solutions couldn't have been picked up during training, which I think is a better argument for the claim that some AIs can in fact reason, backed by evidence in the actual real world.

The pattern is quite obvious: LLMs are not just solving known problems with known answers that might have gotten into their training set. Instead, they're solving open problems that the most brilliant minds of humanity couldn't advance for decades. Getting to the actual unsolved problems, before someone tells me they were actually solved:
- Erdős unit-distance problem. Upper bound stuck for decades (I believe it was 70+ years) moved by AI: https://openai.com/index/model-disproves-discrete-geometry-conjecture/

- Just days after OpenAI's announcement, DeepMind's AlphaProof Nexus autonomously solved 9 out of 353 open Erdős problems at an inference cost of just a few hundred dollars per problem. Two of the nine problems had been open for 56 years. https://arxiv.org/html/2605.22763v1

- Beyond Erdős problems, AlphaProof Nexus also proved 44 out of 492 open conjectures from the Online Encyclopedia of Integer Sequences (OEIS), resolved a 15-year-old open question in algebraic geometry concerning Hilbert functions, and improved a bound in convex optimization by discovering a novel algorithmic parameter schedule.

- In collaboration with Fields Medalist Terence Tao and mathematician Javier Gómez-Serrano, AlphaEvolve discovered a new construction for the finite field Kakeya conjecture; Gemini Deep Think then proved it correct and AlphaProof formalized that proof in Lean.

To wrap it up: I think that when people tell you "it can't think because you don't understand the technology," well, they're stuck on the 2016 understanding of neural networks and are disregarding everything that came after. The stochastic parrot thesis is well more than dead and has been for a couple of years now, despite what the Ars commentariat will keep on... parroting!

I must, however, recognize that this framing that's repeated ad nauseam here is a very powerful rhetorical move that resonates a lot with the skeptics.

Let me say this: there's a rich irony, @Kyuu , in the "gross misunderstanding" framing. The people who understand these systems deeply are the ones who keep finding mathematical structure where the parrot theory predicts there should be none. Mechanistic interpretability is the closest thing we have to actually understanding what these systems are. And what it keeps finding is not autocomplete or "parroting", but systems that independently derive mathematical algorithms, undergo phase transitions into structural generalization, and solve problems their training data couldn't possibly have contained. If there's a gross misunderstanding here, it belongs to the position that treats "I don't see how it could work" as a substitute for looking.

My 2 cents.

PS: I know that the stochastic parrot is a paper from 2021, but it was so outdated even at the time that... yeah, I say 2016 but it's more 1980. Anyone who's ever worked with a neural network for a decently complex task knows that they can go way above stochastic parroting.

PPS: @RagingWarGod hope this helps with your original question

wxfisch · Monday at 11:00 AM

Pino90 said:
The "stochastic parrot" conjecture is... kinda outdated tbh, and shows that whoever is repeating it (parroting it?) doesn't actually understand the tech -- or, more probably, is repeating an understanding that was maybe true in 2016 but hasn't been for... years.

You make very valid points and have for sure given me some reading material that looks interesting at least, but I do think it is worth separating out research neural network systems from the LLMs that the average person is interacting with and generally referring to in 2026 when they say AI. LLMs on their own are not reasoning in the same way that AlphaProof is, since that is not what they are designed to do. They are literal token generators. They may pass off some tasks to other engines (as an example, arithmetic is often handed off not to a neural network but to a normal calculator process, since even major LLM creators like OpenAI and Anthropic admit their models are not good at basic arithmetic; they cannot even count since, again, that is not what they were created to do).

When Ars commenters refer to the stochastic parrot, they are almost universally referring to mainstream LLM-based tools, not neural networks in general. I argue with people at work all the time about the difference, since companies today have (I think purposely) confused the terms and conflate LLM engines with AI in general, and that is disingenuous at best. On top of that, every software company seems to include "AI" as a selling point for anything that has even a basic algorithm in it, which makes the entire thing worse. Trying to work through governance on use of AI tools when industry doesn't agree on what AI really is, and the common person is not equipped to distinguish between the applications we need to be cautious with and the ones that just rebranded 20-year-old technology, makes the entire thing mostly unworkable.

I assumed OP was asking about mainstream LLM-based tools, in which case I stand by the stochastic parrot framing. Those tools do not really "understand" anything in the same way that a sentient being does or even that a more specialized NN does (for sure, AlphaFold largely understands proteins, how they work and the rules that govern them; Gemini surely does not). LLMs store weights of most likely tokens, and generate responses based on only that. More recent tools will go through multiple rounds of this, trying to break up prompts more logically to get at what the user is really requesting by adding additional information to the prompt to help with token generation, but unless they are handing off to non-LLM tools under the hood, reasoning is not a great way to talk about what they are actually doing IMHO.

What OpenAI appears to have done with their general reasoning model is plumb together an LLM with deeper NN based models that the LLM can hand things off to. It is notable that they don't really describe this general reasoning model in their announcement or even name it (other than noting that it is an internal research model), or really provide much detail at all on how the model was set up or prompted, though they do provide a 125-page PDF of the revised reasoning the model used.

I would argue then that even if a model can do these things, what OP is asking about is more in line with what is available for use by the average person in day to day work, and on that face, solutions available today are still largely in the "they can be helpful, but are still mostly wrong in various ways" stage in the same way that I can get a room full of monkeys to eventually type out a full sentence, and if you train them well that sentence might even make sense most of the time, but that is very different from a room full of humans creating newspaper articles. Both are primates, and both are using a typewriter or word processor, both create words and even make sense, but to say that the monkeys are reasoning in the same way the humans are is really not accurate. Saying internal research models reason the same way commercially available models do doesn't help the case that AI is advancing in real ways that help the normal person.

Pino90 · Monday at 12:17 PM

Quite a bit of confusion in this last post.

The scope move (I'm only talking about the LLM the average person uses, not the research systems) doesn't do the work you need it to, because the wall between those two categories is mostly gone in 2026.
AlphaProof Nexus, the thing that just solved nine open Erdős problems... That's Gemini 3.1 Pro paired with a Lean verifier. The "research NN" is an LLM with a proof-checker bolted on. You can do it yourself at home by installing Lean and editing a JSON. It's literally 10 minutes of work away (you can check the AI coding thread because we do similar things all the time).
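
If the "LLM plus proof-checker" shape sounds abstract, here's a minimal Python sketch of the general propose-and-verify loop. Big caveat: everything in it is a stand-in I made up for illustration. `propose` plays the role of the model (an unreliable source of candidates), `verify` plays the role of the checker (cheap, deterministic, can't be talked into anything), and the toy task is just exhibiting a factor of a number. It is not how AlphaProof Nexus or any Lean integration is actually wired up.

```python
from typing import Callable, Iterable, Optional

def propose_and_verify(
    propose: Callable[[], Iterable[int]],
    verify: Callable[[int], bool],
    budget: int = 1000,
) -> Optional[int]:
    """Return the first proposal the verifier accepts, or None if the budget runs out."""
    for attempt, candidate in enumerate(propose()):
        if attempt >= budget:
            break
        if verify(candidate):
            return candidate
    return None

# Toy task: "prove" that n is composite by exhibiting a nontrivial factor.
n = 8633  # 89 * 97

def propose():
    # Stand-in for the model: a stream of guesses, most of them wrong.
    return range(2, n)

def verify(d):
    # Stand-in for the proof checker: accepts only genuine factors.
    return 1 < d < n and n % d == 0

print(propose_and_verify(propose, verify))
```

The design point is that the checker never has to trust the proposer: a wrong guess costs a rejected attempt, not a wrong answer.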

OpenAI's unit-distance result came from what they explicitly call a general-purpose reasoning model (most likely o5) "rather than from a system trained specifically for mathematics" or "scaffolded to search through proof strategies."

And the models actual people use day to day (GPT-5.x, Gemini 3.x, Claude with extended thinking) are reasoning models now. So "consumer LLM vs. real research system" isn't a clean divide, but rather it's the same family of system at different price points. Which BTW I use to help me with my research and they're fairly formidable at tackling unsolved problems.

So... Let's get to the core contradiction in your post: you said AlphaFold "largely understands proteins, how they work and the rules that govern them," while "Gemini surely does not" understand anything, yet both are neural networks. So by your own words we've already granted that a neural net can surely understand a domain... and then drawn a line in the sand for a different one for some reason.

I'm happy that we both agree that NN actually can understand things, but... Where is that line, exactly, and what's the criterion?

That's the whole question. "AlphaFold understands but Gemini doesn't" isn't an answer to it, it's just the conclusion you wanted, asserted.

The architecture thing has the same issue. You wrote that OpenAI "appears to have" plumbed an LLM into "deeper NN based models that the LLM can hand things off to."

First things first, an LLM is as deep as a NN will get. Then... That's just false because it directly contradicts OpenAI's own description of the model as general-purpose and not a math-specialized or scaffolded system. You could say that you don't trust their own description, but that's a different argument.

I understand the appeal of what you're saying: if the real reasoning is always happening in some deeper module the LLM defers to, the parrot thesis survives. But you're manufacturing the mechanism to save the thesis, which is the exact "I don't see how it could work, so it must secretly be doing X" move I called out earlier. I don't think you're arguing in bad faith to be clear, I just think you slipped on this one.

Same with "LLMs store weights of most likely tokens and generate responses based on only that." That's not a description of the mechanism, it's a restatement of the parrot claim, which in turn is the thing I'm actually arguing about.

The interpretability work already on the table (grokking, the reverse-engineered Fourier circuit, Anthropic's addition work) is evidence that the internal computation is structured algorithms and learned features, not token lookup. You're free to argue that evidence doesn't show what I think it shows, but you're not free to just assert the opposite as a premise because that's not how things work.

The arithmetic point cuts the wrong way for you, too. Offloading math to a code tool is a reliability-and-cost decision, not proof the model "can't." Tellingly, the interpretability research shows models do compute addition internally with their own algorithms. "Can't even count letters" is a tokenization artifact, not a statement about reasoning. It's because models don't receive letters, so it's basically impossible for a model to count something they can't see.
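
Here's a toy Python sketch of why letter counting is a tokenization problem. The vocabulary and the greedy longest-match tokenizer are hypothetical and deliberately tiny; real tokenizers (BPE and friends) use learned vocabularies and different splitting rules, so treat this as an illustration of the principle only.

```python
# Hypothetical toy vocabulary, chosen only for illustration.
TOY_VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization into a list of integer token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

word = "strawberry"
print(word.count("r"))  # what a human counts from the letters: 3
print(tokenize(word))   # what the model receives: [101, 102]
```

Nothing in `[101, 102]` says how many times "r" occurs; that information has to be learned indirectly, if at all.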

And the definition you land on, that reasoning is only "a good way to talk about it" when the system hands off to non-LLM tools, is unfalsifiable. Any time an LLM reasons well, you can say "ah, but that wasn't really the LLM." There's no result I could show you that would count, because you've defined the win condition so that the LLM can never be the thing doing it.

So... Not really, no. It's happening in the consumer models, with things you can do at home or using existing tools (if you have the money).

Xanrael · Monday at 12:51 PM

wxfisch said:
LLMs on their own are not reasoning in the same way that AlphaProof is since that is not what they are designed to do. They are literal token generators.

LLMs generate text one token at a time, but then so do I. This is not proof that they "think" one token at a time (for whatever definition of think). Last year Anthropic demonstrated that they plan several tokens ahead: https://www.anthropic.com/research/tracing-thoughts-language-model

They claimed this surprised them, but I'm surprised they were surprised. It seems fairly obvious to me that the models wouldn't be able to do what they're clearly capable of doing if they were really only proceeding one token at a time.

demultiplexer · Monday at 12:54 PM

OK, this is indeed evolving into a kind of interesting discussion. I'll take on the non-statistical parrot.

The problem with literally all evidence I've seen for the idea that LLMs are able to 'think' and 'reason' is that there is a fundamental misunderstanding of what you can do without reasoning. It feels like people interpret the discoveries done through LLMs as necessarily proving that LLMs aren't next token generators. However, there is absolutely nothing new in LLM world past attention. All the bolt-ons that have happened aren't new technology. We're still thoroughly living in a token prediction machine + analytical addons + purely stochastic addons world. There is no novel or intelligent machinery going on here, unless you want to go into the emergence argument.

As it turns out, a token generator is able, given enough training data and filtering, to generate novel sentences that yield useful 'new' information. That's very interesting in itself, but does not prove either understanding or some kind of internal model state.

Because from both a neurological and phenomenological perspective there is no agreement on what 'understanding' actually is, this is also something that may be discussed until the cows come home.

Xanrael · Monday at 2:52 PM

I mean, the emergence argument is the whole reason the transformer architecture exists and has been so successful. The alternative is the amyloid-plaque-style mafia that dominated the field for decades with nothing to show for it.

demultiplexer · Monday at 3:39 PM

Xanrael said:
I mean, the emergence argument is the whole reason the transformer architecture exists and has been so successful. The alternative is the amyloid-plaque-style mafia that dominated the field for decades with nothing to show for it.

Except we now know that emergence doesn't work as a robust theory of functionality from increasingly larger LLMs. Emergence shouldn't stop working suddenly, yet as we scale LLMs we've seen only a very short period of significant improvement in useful ability, and then it all stalled again. That sounds a lot more like the maturation phase of a technology and not technology showing emergent abilities.

Emergence should work like emergence in animals, e.g. in the development of humans through infancy. There are very clear patterns of sudden and rapid cognitive development as a seemingly pure function of the number of connections within the brain. No particular functional groups of cells (as far as we understand) appear completely anew beyond a certain age, yet the expansion of those brain regions leads to the hallmarks of emergent ability. Like, s-curve improvements in rapid succession.

If LLMs were a true underlying universalizable mathematics of intelligence (or some other subset of useful ability), we would see relatively small increases in model size or effective complexity result in massive leaps in ability all the time. We don't see that at all, hence I don't believe in the emergence hypothesis for LLMs. In fact, I believe LLMs are entirely the wrong pathway to do these kinds of things, but that's a separate and completely off-topic discussion.

I also don't really understand what you mean by that mafia dominating 'the field'. What field, transformers? Neural networks? AI? None of the fields I can think of that graze this topic have any kind of mafia in them. Aside from transformers, these fields are many decades old, with multiple hype cycles going back to the 50s and new blood entering all the time and doing vastly different stuff every 2 decades or so. We're now halfway through this hype cycle, just to be clear - it's all going to be superseded by the new shiny thing in 10 years.

Pino90 · Monday at 4:33 PM

Who actually said "emergence via scaling" in this discussion? So... Straw man all the way down? I'll try to reply point by point.

You say real emergence looks like "s-curve improvements in rapid succession."

Two problems. First, that's exactly grokking which I've cited before (read the paper): the sudden phase transition in training. That's literally it. You're demanding s-curves while dismissing the documented s-curve.

Second, an s-curve includes the plateau. It's the literal definition of an s-curve. Yet you point at a plateau as proof there's no emergence... Something doesn't work here! Human development doesn't show massive leaps all the time either, but leaps at thresholds and critical periods, then consolidation. By your analogy, plateaus are consistent with emergence, not evidence against it.
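To illustrate the point about plateaus, here is a minimal sketch using the standard logistic function as a generic s-curve. The sample points are arbitrary and not fitted to any LLM or developmental data; it only shows that flat stretches at both ends are part of the curve's shape.

```python
import math

def logistic(x: float, midpoint: float = 0.0, steepness: float = 1.0) -> float:
    """Standard logistic (s-shaped) curve, rising from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

# Evenly spaced inputs; the values are illustrative only.
xs = [-6, -3, 0, 3, 6]
ys = [logistic(x) for x in xs]

# Improvement gained on each equal-sized step along the x axis.
gains = [later - earlier for earlier, later in zip(ys, ys[1:])]

for x, gain in zip(xs[1:], gains):
    print(f"step ending at x={x:+d}: gain {gain:.3f}")
```

The two middle steps carry almost all of the improvement and the outer steps almost none, so a flat stretch is what you would expect to observe on either side of a transition.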

Then... Is there a plateau? Real world evidence (e.g. what AI has achieved in the last 6 months alone) tells us there's no actual plateau. If you want to believe there's one, then please cite your data or papers or whatever. "Trust me bro" isn't an argument.

Your argument about emergence being a linear phenomenon is also quite funny. "Small increases should give massive leaps all the time" isn't what emergence means anywhere. Emergence and phase transitions are threshold phenomena. Water doesn't boil a little at every temperature; it transitions at a critical point. Nothing in physics or biology predicts continuous massive leaps from small increases. You defined emergence as something no emergent system on earth exhibits, then observed that LLMs don't exhibit it. That's... Tautological.

Maybe you were thinking of the scaling assumption (now pretty outdated) in "Are Emergent Abilities of Large Language Models a Mirage?", arguing emergent abilities are partly an artifact of discontinuous metrics. But... its claim is that capability rises smoothly with scale, not that it stalled. And anyway as the research moved forward the consensus is that size is not the right answer.
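The discontinuous-metric point can be sketched with a bit of arithmetic. This is a toy model, assuming every token of an answer is right independently with the same probability; the accuracies and the 30-token answer length are made up for illustration and are not taken from the paper.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that ALL tokens are right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 30  # hypothetical answer length in tokens

# Per-token accuracy improves smoothly across these hypothetical model sizes.
for accuracy in (0.80, 0.90, 0.95, 0.99):
    rate = exact_match_rate(accuracy, ANSWER_LENGTH)
    print(f"per-token accuracy {accuracy:.2f} -> exact match {rate:.4f}")
```

Under these assumptions the underlying skill rises steadily while the all-or-nothing score sits near zero and then shoots up, which is how a smooth trend can look like a sudden jump when scored with a strict metric.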

Let me address your final straw man: "I believe LLMs are entirely the wrong pathway, but that's off-topic"... agreed that it's off-topic! But there's a huge but. Whether LLMs are the eventual road to AGI is a totally different question from whether the current ones can reason. A system can reason without being the destined path to anything. You're conceding the pathway question is separate, which is why it can't carry the argument about what these systems are doing today. It's a separate claim on which I tend to agree with you.

w00key · Monday at 4:45 PM

wxfisch said:
What OpenAI appears to have done with their general reasoning model is plumb together an LLM with deeper NN based models that the LLM can hand things off to.

[Citation required]? On https://openai.com/index/model-disproves-discrete-geometry-conjecture/, I quote

The result is also notable for how it was found. The proof came from a new general-purpose reasoning model, rather than from a system trained specifically for mathematics, scaffolded to search through proof strategies, or targeted at the unit distance problem in particular. As part of a broader effort to test whether advanced models can contribute to frontier research, we evaluated it on a collection of Erdős problems. In this case, it produced a proof resolving the open problem.

wrylachlan · Monday at 5:29 PM

demultiplexer said:
Except we now know that emergence doesn't work as a robust theory of functionality from increasingly larger LLMs. Emergence shouldn't stop working suddenly, yet as we scale LLMs we've seen only a very short period of significant improvement in useful ability, and then it all stalled again.

(Looks around) What are you talking about? Both Anthropic and OpenAI are seeing improvements at an absolutely breakneck speed over the last 6 months. Jesus, Opus 4.8 came out just 6 weeks after Opus 4.7! Anthropic does a great job with their model cards (more than 100 pages each) summarizing how the models perform on standard benchmarks. I fundamentally don’t understand how you can look at the benchmark improvements and think to yourself: “stalled”.

w00key · Monday at 5:43 PM

One of the breakthroughs related to the OP's question is what is casually named "reasoning"; the more accurate name is "test time compute scaling".

Older models are forced, no matter what the internal state / attention looks like, to always output the final answer directly.

Chain of thought / reasoning models are given a scratch pad, where they can generate tokens, talk to themselves, align internal state, and double check before emitting the final output. By the way, what you see as a text description is often cleaned up, summarized and rewritten; the raw CoT is much messier.
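As a rough sketch of the scratch pad idea, assuming the raw output wraps the scratch pad in delimiters: the `<think>` / `</think>` tags and the `split_reasoning` helper here are hypothetical, and the exact markers and handling differ between vendors.

```python
def split_reasoning(raw: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> tuple[str, str]:
    """Split raw model output into (scratch_pad, final_answer).

    The tag names are assumptions for illustration only.
    """
    start = raw.find(open_tag)
    end = raw.find(close_tag)
    if start == -1 or end == -1 or end < start:
        # No well-formed scratch pad found: treat everything as the answer.
        return "", raw.strip()
    scratch_pad = raw[start + len(open_tag):end].strip()
    final_answer = (raw[:start] + raw[end + len(close_tag):]).strip()
    return scratch_pad, final_answer

raw_output = "<think>try x=2... no wait, x=3 fits</think>The answer is 3."
scratch_pad, final_answer = split_reasoning(raw_output)
print("hidden:", scratch_pad)
print("shown: ", final_answer)
```

Every scratch pad token is generated before the first answer token, so the answer is conditioned on all of it, and a user interface would normally show only the second part.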

Is this thinking and reasoning? If it isn't, it comes very close to emulating the human way of working through a problem. Claude often explores one direction, goes "no wait", backtracks, and explores a different solution. Gemini often uses this scratchpad to set up some reminders, so that once the output sequence begins, everything it needs has been pulled into the fresh part of the context, increasing accuracy. The app / model messed up and leaked part of it today; step 9, right before the final answer, was:

Final Polish: Check props formatting, XML line-start law, ensure no > inside props. Construct a crisp, actionable response.That classic Dutch gang crunch is real. A width of 2 meters sounds decent on paper, but once you subtract the 80cm staircase, you are left with a 1.2-meter-wide corridor. When you throw a stroller, a couple of jackets, and a grocery run into a 120cm-wide path, the space immediately chokes.

Claude often spends minutes drafting and going over code before deciding "this is it", then exits thinking mode and rapid-fires "edit tool" calls to output a whole new feature in a single turn.

The adaptive thinking effort feature feels very close to what humans do in Blitz, Rapid and Classical chess. In Blitz, the 3 or 5 minute time limit forces you to move on instincts and heuristics; in LLM terms, thinking=off or budget=low (no "off" in new Opus). Rapid has limited time but still enough "token budget" to let you play out a few critical lines in your mind. Classical gives you two hours, so it's like max effort Opus 4.8: sure, generate 50000 tokens first, which takes 10 minutes, but we have the time, and then decide the final answer. Okay, LLMs are completely trash at actually playing chess, but for code this works.

But no matter what the budget is you cannot compensate for lack of skill or knowledge with more intermediate tokens. Opus 4.8 min doing better than 4.7 max shows the ceiling on the previous version. And interestingly, on a few benchmarks, "overthinking" reduces accuracy while on others, like math, it keeps improving.

Xanrael · Monday at 6:29 PM

demultiplexer said:
I also don't really understand what you mean by that mafia dominating 'the field'. What field, transformers? Neural networks? AI? None of the fields I can think of that graze this topic have any kind of mafia in them

Apologies, I mixed up some stuff in my recollection of Bob Carpenter's various rants about computational linguistics over the years (Chomsky's Linguistic Wars, and the fact that most statistical folks bailed to industry because that's where the data was).

Coriolanus · 2026-06-02T00:10:03-0400

RagingWarGod said:
It's been a while since I checked up on the status of where AI is currently at, is it at the point to where LLMs and the overview on google are reliable sources of knowledge or are we still stuck in them being kinda nonsense engines or sycophants?

I am not qualified to discuss the underlying mechanics of LLMs and how much they "understand" concepts. I will leave it for other people to discuss.

As somebody who works in AI governance at a large company, I can tell you this - THEY ARE NOT RELIABLE SOURCES.

They may be useful for a lot of things that are tedious or take a lot of time, but if there is anything truly important, you never want to have the machine make the decision by itself and you must never rely on what it says on blind faith. Everything important needs to be validated by a human SME. If you don't do that, a screw-up will cost you tens of billions of dollars.

RagingWarGod · 2026-06-02T21:01:10-0400

VividVerism said:
Anecdotally the search summaries are still hot garbage for me probably close to half the time. I don't think I've ever seen the information presented in any of the "cited" sources it turns up. It consistently generates fake manpages and commands.

I started to severely question the validity of it when it sourced threads I made on other sites asking for help. That kinda gave me a red flag about how reliable it really is. I just don't get why people treat it like it's sentient or not like this page does:

View: https://x.com/anthrupad/status/1859013535344611510?s=20
