OpenAI’s math breakthrough played to AI’s strengths

Thank you for this write up - I had done a deep dive myself on the problem and how it was disproved. I found this paper quite useful - https://arxiv.org/html/2605.20695v1.

As the article says, it plays to AI's strengths to explore cross-domain connections (algebraic number theory to discrete geometry).

Terence Tao seems to think that this is the new research pattern: human experts converse with models, models propose constructions - experts verify, interpret, and either expand on it (often in conversation with these models) - and finally, they verify and integrate them into the field.

Last but not least, a bit of context that might help people understand how it was solved (I don't think I read this in the article, but this is my understanding):

This is the process OpenAI used (willing to be corrected if I missed anything):
  1. The model was not a specialised Math model - it was a general reasoning model.
  2. The model was given a formal problem statement.
  3. The successful response appears to come from one prompt/run, not a human-guided back-and-forth.
  4. OpenAI published an abridged chain of thought so you can see how the model got to this solution.
  5. OpenAI may have run the same or similar task multiple times since success was probabilistic.
  6. The result was later verified, cleaned up, and extended by humans.
Edit: added the bit about it not being a specialised model and chain of thought after I saw numerous commenters posting misinformation about it.
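
To put a rough number on point 5: if a single run succeeds with probability p, and runs are independent (an assumption on my part, as is the 1% figure below, which is purely illustrative and not something OpenAI has published), then the chance that at least one of n runs succeeds is 1 - (1 - p)^n. A minimal sketch:

```python
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent runs succeeds.

    Assumes every run succeeds with the same probability `p_single`
    and that runs do not influence each other.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Probability that every run fails is (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # Hypothetical per-run success rate of 1%, chosen only for illustration.
    for n in (1, 10, 100, 1000):
        print(n, round(p_at_least_one_success(0.01, n), 4))
```

The point is only that a low per-run success rate and a reliable overall result are compatible if you can afford many runs; it says nothing about how many runs were actually made.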
 
Upvote
140 (141 / -1)

Fifteen12

Smack-Fu Master, in training
94
We really need to stop anthropomorphizing AI systems in language. They are not smart, cannot reason, cannot think, and cannot conceptualize anything. They are markov chain generators stacked on top of each other until the Amazon burns down entirely.
To be pedantic, LLMs and neural nets are just as “smart” as any other smart technology, they perform reasoning as well, and if conception can be given by novel rearrangement, then certainly their outputs lead to new concepts. It’s not anthropomorphizing to describe them in these ways. “Them” refers to inanimate objects as well as people. This is a really clean take on AI that doesn’t talk about them feeling; it might be different if we were talking about AI being proud of its accomplishments.
 
Upvote
42 (79 / -37)

peterford

Ars Praefectus
4,295
Subscriptor++
It makes a lot of sense to me that even "non thinking" ML of whatever flavour could find a lot of previously unnoticed connections between the tree branches of knowledge, be this in Maths, Chemistry or Biology; as the article notes, finding these connections can take rare overlapping knowledge areas. Once you have the compute ability you can (if interested) just throw more compute at randomish directions until one of them returns something interesting.

Whilst genuinely new advances might come earlier in Maths, I think new advances in Chemistry, Biology and similar are going to be slower because of the lab requirements. Well, if and until those go dark too - and at that point we're possibly getting into weird "magic" technology.
 
Upvote
7 (12 / -5)

kale

Seniorius Lurkius
12
Firstly, by most definitions of "reasoning", an LLM can reason. LLMs aren't sentient, because they don't have volition and aren't experiencing anything the way a person does. But they can reason. Yes, it's a statistical model that is reducing error, but this is also the way that the human brain works. A recently proposed theoretical model frames the entire concept of sentience as a solution to "minimizing surprise": you build an internal model of the world around you to predict what happens next, and you learn when predicted behavior deviates from reality.

One question I have: Did OpenAI use one of the publicly-available models? Or is this an internal model? I figure the toolchain around it is custom, the way any of us would make one with the API, but I was wondering if the API calls themselves were to the standard LLM model they make available to customers, or if it's some kind of supercharged model with extra resources used for research problems like this.
 
Upvote
-17 (36 / -53)

ghub005

Ars Tribunus Angusticlavius
8,702
I studied mathematics in the 1990s and completed a university degree in the subject. Reading this article is making me feel the same feelings that I experienced when I encountered Stephen Wolfram’s Mathematica for the very first time.

Sufficiently advanced technology has - within my lifetime - become indistinguishable from magic.
 
Upvote
54 (62 / -8)

Nakamalkat

Smack-Fu Master, in training
26
I saw a YouTube video on this by Dr Trefor Bazett. He made a follow-up video about the Sum Product Conjecture, which has also been proven false (for real numbers). Apparently some of the reviewers of the AI's Unit Distance Conjecture result realised that the techniques the AI had used could be applied to the Sum Product Conjecture. Their paper was entirely human generated, but it does show that AI tools can provide useful insights to human mathematicians.
 
Upvote
54 (55 / -1)
The AI model cleverly applied existing ideas drawn from several subfields of mathematics to create a full proof. But it didn’t pioneer any genuinely new techniques. The result has since been cleaned up and extended by human mathematicians.
So it did the equivalent of a uniquely-wide metastudy that still required actual intelligence to have any (potential) value? Am I wrong or is the actual innovation the multi-domain breadth of the human-prompted search?
 
Upvote
35 (38 / -3)

Lexus Lunar Lorry

Ars Scholae Palatinae
921
Subscriptor++
Terence Tao seems to think that this is the new research pattern: human experts converse with models, models propose constructions - experts verify, interpret, and either expand on it (often in conversation with these models) - and finally, they verify and integrate them into the field.
Hopefully mathematics doesn't end up like software engineering, where people thought this too, but the trend seems to be towards Warhammer 40k:
  • The boss tells the engineer engineseer to do something
  • The engineseer chants holy mantras to the sacred machine
  • The machine produces something that no human understands
 
Upvote
45 (52 / -7)

Uncivil Servant

Ars Scholae Palatinae
4,776
Subscriptor
This is going to be a lot more difficult in other fields. Mathematics tends to be unambiguous; words have very specific meanings within a mathematical context.

This is going to be a mess if you try it with medicine. There will simply be too many spurious connections where a single word has multiple different meanings. There are changes in diagnostic language as well. So as an absurd example, an LLM may conclude that "mental retardation" no longer exists, when the reality is that those patients were subsumed under a larger concept of neurodevelopmental disorders, etc.

Maybe other fields will be as simple as mathematics, but I doubt it. Finally, it's worth remembering that at their core, all computers are just fancy calculators with better UI and IO. A computer helped solve/prove/disprove a math theory. 1952 is calling and wants its headlines back.
 
Upvote
0 (14 / -14)

Danathar

Ars Praefectus
4,559
Subscriptor
What I’m finding interesting is the human reaction to all this. The people who are extremely anti-AI are just shaking their heads left and right saying “this didn’t happen. This didn’t happen. This didn’t happen. It can’t reason it can’t reason it can’t reason.”

Well… it can and it did.

I am as skeptical about AI as the next person, but I’m not about to ignore evidence that’s in front of me.

Mathematicians are about as cold-eyed and analytical as one can get, and if they’re saying that this was done, then it was, as far as I’m concerned.
 
Upvote
-18 (38 / -56)
Mathematicians are about as cold-eyed and analytical as one can get, and if they’re saying that this was done, then it was, as far as I’m concerned.
C'mon man. Mathematicians fight about shit all the time. They're no less susceptible to hype and marketing than anyone else.
 
Upvote
73 (77 / -4)
It looks as if human intelligence is going the way of God.

Theology has increasingly led to the "god of the gaps" - attempting to develop god-based explanations for things we cannot yet explain, only for them to get explained one after another.

The uniqueness of human intelligence seems to be going the same way, with one barrier after another falling.

The questions have to be: is there some bastion of human creativity that is beyond the reach of AI? If Baroque art or Beethoven had never existed, would some future AI somehow reproduce them? Would we get Haydn's Creation by sticking a copy of the Bible and Catholic liturgy into the training data? And would it matter if we didn't?

I would like the answers to be yes, no, no, yes. But I can't prove it, and perhaps some self-serving future AI will provide convincing reasons why I am wrong.
Yes, I've used that term before as well to refer to this tendency. I'm happy to see that more people are seeing the same thing - Terence Tao used exactly that term as well ('god of the gaps') in his latest paper:

Terence Tao:

And as AI performance continues to advance, such a human-chauvinistic viewpoint risks degenerating into an increasingly untenable “god of the gaps” philosophy, in which an ever-shrinking list of qualities are touted as indicators of essential human achievement that AI is still not yet able to replicate.
 
Upvote
-6 (17 / -23)
Hopefully mathematics doesn't end up like software engineering, where people thought this too, but the trend seems to be towards Warhammer 40k:
  • The boss tells the engineer engineseer to do something
  • The engineseer chants holy mantras to the sacred machine
  • The machine produces something that no human understands
Yes, this is a serious risk. I'll give you a real-life analogy with chess: at one point, grandmasters were good enough to verify moves suggested by AI, but since AlphaZero this is no longer the case.

AI is so much smarter than us in this domain that we spend time trying to understand why one move is better than another but we have lost the 'right' to question it (for the most part).
 
Upvote
-12 (10 / -22)

AlbatrossMoss

Wise, Aged Ars Veteran
108
Subscriptor
The title of the article seems misleading. The OpenAI model did not "solve a math problem". It has simply generated a few thousand (?) proofs and it was up to humans to figure out which one (if any) was correct. How many of those were wrong?

The LLM cannot tell right from wrong, as is too often demonstrated, so I expect there have been many failed attempts.

But I need to read the article again.

It feels too much like https://xkcd.com/1838/ (on Machine Learning).
 
Upvote
13 (40 / -27)

MilanKraft

Ars Tribunus Angusticlavius
6,994
I’m finding what is interesting is the human reaction to all this. The people who are extremely anti-AI are just shaking their heads left and right saying “this didn’t happen. [snip]... It can’t reason it can’t reason it can’t reason.”

well… It can and it did.

I am as skeptical about AI as the next person, but I’m not about to ignore evidence [snip]
Unless I missed a comment somewhere, you're mischaracterizing most of the "LLMs can't reason" sentiments here. While many speak out here, that's because they're knowledgeable about the domain, rather than just being some rando Farcebook group where people are against something to be against it / to be edgy / whatever stupid motives they have.

Nobody is saying "it didn't happen;" they're saying things like this can happen, but in a context different than an LLM "understanding and reasoning" in the way humans do. And while the author here did an OK job of not anthropomorphizing the model (I think only one allusion to "reasoning"), it's objectively true these companies are misleading people about what LLMs are (and are not) when naming their features, functions, and speaking publicly.


Here is my overall take on what this article is saying:

The model (like all LLM models) basically ran many trial-and-errors and came up with a viable solution based on related things it had trained on, and patterns it had found. This is fine in the sense that, if an LLM can run a problem through 100s or 1000s of iterations that would take many years off a human math guru's life, then by all means use the tool for that.

But in the end this does sound a bit like "throw a bunch of solutions against the wall (without really understanding what it's doing) and see what stuck," then the humans can clean it up into some type of theorem (not sure if that's the right word here but one gets the idea).

It's also important to note the model in question is not ChatGPT (what 99% of OpenAI users will have access to); it's a specialized [LLM variant] that was [very likely] trained on a corpus of university math texts and validated papers floating around out there. Which is also fine, but in the end it still is not, to borrow a ridiculous phrase used by Sam Altman a couple years ago that brings my comment full circle, "a math PhD in your pocket". It doesn't "know math" per se, it simply finds patterns in a very specialized set of training data. Again this is not useless, but it is also not "working the problem" in the way OpenAI will likely promote it.
 
Upvote
49 (58 / -9)
It's been ~3 years from the release of ChatGPT to the general public. This technology is just getting started.

Sitting here in front of my box running 10x frontier agents who are supervising another ~30 sub agents running slightly less capable frontier models. For $20/day. Cranking out high quality code at a rate that completely blows my mind. I used to employ rooms full of developers for hundreds of thousands of dollars per month to do 1/100th as much (or less).
You've gotta get hoovered up by one of the behemoths like Goldman who've tried to do that and reported near-zero benefits.
 
Upvote
27 (28 / -1)

Danathar

Ars Praefectus
4,559
Subscriptor
IMHO you're mischaracterizing most of the "LLMs can't reason" comments. Nobody is saying "it didn't happen," they're saying things like this happen in a context different than an LLM understanding and reasoning in the way humans do. And while the author here did an OK job of not anthropomorphizing the model (I think only one allusion to "reasoning"), it's objectively true these companies are misleading people about what these systems are (and are not) when naming their features, functions, and speaking publicly about them (obviously to drive up capital infusions and ultimately as large an IPO as possible).

Here is my overall take on what this article is saying:

The model (like all LLM models) basically ran many trial-and-errors and came up with a viable solution based on related things it had trained on, and patterns it had found. This is fine in the sense that, if an LLM can run a problem through 100s or 1000s of iterations that would take many years off a human math guru's life, then it's OK for them to use that tool.

But in the end this does sound a bit like "throw a bunch of solutions against the wall and see what stuck, then the humans can clean it up into some type of theorem (not sure if that's the right word here but one gets the idea)."
It's also important to note the model in question is not ChatGPT; it's a specialized math model that was trained on whatever corpus of university math texts and validated papers are floating around out there. Which is also fine, but in the end it still is not (to borrow a ridiculous phrase used by Sam Altman a couple years ago that brings my comment full circle) "a math PhD in your pocket". It doesn't "know math" per se, it simply finds patterns in a very specialized set of training data. Again this is not useless, but it is also not "working the problem" in the way humans do AFAICT.
Yes, but isn’t that basically what people do too? They try different methods and see which one works. How is this fundamentally different?

The other thing is that I don’t see evidence, at least from this article, that the LLM generated a large pile of complete proofs and then humans combed through them until they found the correct one. It’s certainly possible that the model worked through many different approaches, permutations, and failed attempts internally before arriving at a proof it considered valid. But that also sounds a lot like how human mathematicians work: they try many approaches, abandon the ones that fail, and keep developing the one that succeeds.

The only thing I can clearly find in the article is that humans verified the proof the LLM produced. That is different from saying humans searched through a bunch of AI outputs and selected the right answer.
 
Upvote
27 (40 / -13)

JohnDeL

Ars Tribunus Angusticlavius
8,954
Subscriptor
It is important to note that a large part of the reason that a mathematical LLM like this or a protein LLM or any other science-based LLM works is because the data set has been scrupulously cleaned and QCd. For example, if someone had slipped π=3 into the training data set, the output would have had quite a few errors in it.

In contrast, the average LLM is trained on all sorts of nonsensical data (see: the internet) and so the LLM outputs all sorts of nonsense (GIGO, as we used to say back in the day when we carved the symbols by hand on clay tablets).

And, unlike a person, an LLM is incapable of deleting training data that is erroneous. As a result, those bad inputs end up creating bad outputs; sometimes in obvious ways, sometimes in not so obvious ones.

And that is why LLMs are good as research tools but not for much more. Because in research, the user is usually smart enough to know the limitations of the LLM and wise enough not to take its advice about using glue to hold the cheese on pizza. But in general use, those two qualifiers are more the exception than the rule.
 
Upvote
36 (42 / -6)

Qyygle

Ars Praetorian
507
Subscriptor
It is important to note that a large part of the reason that a mathematical LLM like this or a protein LLM or any other science-based LLM works is because the data set has been scrupulously cleaned and QCd. For example, if someone had slipped π=3 into the training data set, the output would have had quite a few errors in it.

In contrast, the average LLM is trained on all sorts of nonsensical data (see: the internet) and so the LLM outputs all sorts of nonsense (GIGO, as we used to say back in the day when we carved the symbols by hand on clay tablets).

And, unlike a person, an LLM is incapable of deleting training data that is erroneous. As a result, those bad inputs end up creating bad outputs; sometimes in obvious ways, sometimes in not so obvious ones.

And that is why LLMs are good as research tools but not for much more. Because in research, the user is usually smart enough to know the limitations of the LLM and wise enough not to take its advice about using glue to hold the cheese on pizza. But in general use, those two qualifiers are more the exception than the rule.
But but... Daddy altman promised me I could turn off my brain :pikachu:
 
Upvote
-4 (11 / -15)

JohnDeL

Ars Tribunus Angusticlavius
8,954
Subscriptor
Yes, but isn’t that basically what people do too? They try different methods and see which one works. How is this fundamentally different?

Humans will rule out entire classes of solutions based on patterns that they observe in the data. LLMs don't do that.
 
Upvote
-4 (11 / -15)

MilanKraft

Ars Tribunus Angusticlavius
6,994
Yes, but isn’t that basically what people do too? They try different methods and see which one works. How is this fundamentally different?
It is fundamentally different because the human mathematicians actually understand the rules, theorems, and other concepts [in a given math domain], and literally think-and-apply their way through the variables and boundaries of a problem. IOW, they are aware of what they are doing and WHY the rules they are applying work.

The LLM has literally zero understanding of what it is seeing / doing, or of any solutions it identifies. It is a blind pattern matcher, simple as that. This is the core thing people are being misled about.
 
Upvote
16 (35 / -19)
It is important to note that a large part of the reason that a mathematical LLM like this or a protein LLM or any other science-based LLM works is because the data set has been scrupulously cleaned and QCd....

In contrast, the average LLM is trained on all sorts of nonsensical data (see: the internet) and so the LLM outputs all sorts of nonsense (GIGO, as we used to say back in the day when we carved the symbols by hand on clay tablets).
Umm... No. This wasn't a 'mathematical LLM'

"The proof came from a new general-purpose reasoning model, rather than from a system trained specifically for mathematics, scaffolded to search through proof strategies, or targeted at the unit distance problem in particular"

RTFA
https://openai.com/index/model-disproves-discrete-geometry-conjecture/
 
Upvote
12 (21 / -9)

coopster

Ars Praetorian
405
Subscriptor
Definitely the best summation of the process/situation that I've seen, including stipulations that it wasn't really an "active" problem but one that was assumed mostly true and not worth the effort of fully proving/disproving.

The "agents checking agents work" does unlock some interesting workflows for certain types of problems. Throw enough crap against the wall and a little might stick.
 
Upvote
0 (3 / -3)
It's also important to note the model in question is not ChatGPT (what 99% of OpenAI users will have access to); it's a specialized math model that was trained solely on a corpus of university math texts and validated papers floating around out there.
Where do you guys get your info from? You spout such misinformation so confidently too...

"The proof came from a new general-purpose reasoning model, rather than from a system trained specifically for mathematics, scaffolded to search through proof strategies, or targeted at the unit distance problem in particular"
https://openai.com/index/model-disproves-discrete-geometry-conjecture/
 
Upvote
19 (28 / -9)

M_Binks

Seniorius Lurkius
41
Subscriptor
We tend to overestimate near term disruption and underestimate long term disruption. If AI is doing what it is doing today - less than 3 years from going mainstream - I can only imagine what 30 years will bring us.
And in 6 months my baby will weigh 7.5 billion lbs.

I'm convinced there's value in AI (there sure was a few years ago, when we called it "machine learning" or "computer vision" or any of a dozen other names). I'm just not sure we can count on it continuing to grow in an exponential, or even a linear, fashion forever.

It's already consumed all the written work on the internet and every book published; we can't just magically double our training corpus. You can use the AI to build more data to train on, but I'm skeptical that that works as well, or that you can keep doing that forever.

Getting that "last 10%" has a funny habit of consuming exponentially more effort than the previous 90%. I'm not sure existing techniques can get us that next little bit, and I'm even less sure that there's a business model that will support continuing to chase improvements forever.
 
Upvote
36 (40 / -4)