How OpenAI is using GPT-5 Codex to improve the AI tool itself


Sarty

Ars Tribunus Angusticlavius
7,816
Who knew OpenAI had all of these amazing, world-changing, earth-shattering developments waiting for (checks calendar) just a week and a half after Altman's freakout email?

Gosh, does Altman know what his own company actually has waiting in the wings? Or is this a flailing squid-ink-cloud of bullshit?
 
Upvote
263 (274 / -11)

Missing Minute

Wise, Aged Ars Veteran
1,386
It's interesting that despite there being countless different jobs that require composing text, the one job that appears to have most widely adopted LLMs to increase productivity is developers. Can't be a coincidence. Possible explanations:
  1. Developers are the knowledge workers that are the most adaptable to new ways of doing things.
  2. Developers are the most inefficient knowledge workers.
  3. Something about development makes LLMs particularly useful and effective.
  4. LLMs don't actually provide enough advantages to justify the incredibly wide adoption they have seen and developers are particularly vulnerable to believing that a given technology improves output when it doesn't actually do so.
  5. Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
 
Last edited:
Upvote
71 (93 / -22)

Missing Minute

Wise, Aged Ars Veteran
1,386
This is an awfully uncritical article about tech that has not delivered proven productivity improvements in real-world analysis.

Of course, it's quite effective at wasting the time of actual developers.
When all you do all day every day at work is read about AI it's really easy to drown your critical thinking in the torrent of content praising it.
 
Upvote
135 (145 / -10)

Robin-3

Ars Scholae Palatinae
1,127
Subscriptor
Given the well-known issues with confabulation in AI models when people attempt to use them as factual resources, could it be that coding has become the killer app for LLMs? We wondered if OpenAI has noticed that coding seems to be a clear business use case for today’s AI models with less hazard than, say, using AI language models for writing or as emotional companions.

I wonder if anyone's looked at how well AI-supported coding does, especially in complex environments where high quality output is important? Oh look, there's an article about that on this cool site I sometimes visit. A quote from that article:

These factors lead the researchers to conclude that current AI coding tools may be particularly ill-suited to “settings with very high quality standards, or with many implicit requirements (e.g., relating to documentation, testing coverage, or linting/formatting) that take humans substantial time to learn.” [...]
(and)
For now, however, METR’s study provides some strong evidence that AI’s much-vaunted usefulness for coding tasks may have significant limitations in certain complex, real-world coding scenarios.

... Not 100% the same thing, I know. But seriously - this writeup feels like there was no pushback against the talking points provided by people with a financial interest in selling this thing.
 
Upvote
215 (222 / -7)

hillspuck

Ars Scholae Palatinae
2,179
It's interesting that despite there being countless different jobs that require composing text, the one job that has most widely adopted LLMs to increase productivity is developers.
What's your source on that statistic?

As a programmer, I have USED LLMs quite often. But as far as "adopting" them goes, I'm far from it. They are highly faulty and of limited use. I typically give them a shot on a lot of things but don't trust anything they output; they frequently waste much more of my time than they save. I have just about given up on asking them anything about Unity, because the error rate is astronomical. I will use them for PowerShell scripts, though, because they usually get those right (and they are small enough to be easy to check).
 
Upvote
135 (141 / -6)
This is an awfully uncritical article about tech that has not delivered proven productivity improvements in real-world analysis.

Of course, it's quite effective at wasting the time of actual developers.
While it's fun to hate on AI (and justifiably so), I would argue that one of the only proven productivity improvements is in coding. The caveat is that it depends heavily on (1) the field of work and (2) the type of task being done.

For some things, like web or app development with lots of robust documentation and examples, these agents are fantastic and practically set-and-forget. For most well-defined, actionable prompts in other areas, agents will typically do a decent job. Designing unit tests and auditing code: they make some mistakes, but you can reject false positives and run multiple iterations. They're also really good at summarizing, helping to plan and organize, and drafting documentation.

What I would personally NEVER use them for? Large refactoring tasks -- their context lengths just aren't big enough to handle that much content at once, and they WILL get confused, hallucinate, or worse, sometimes commit outright fraud (and then lie to you about it). Anything related to science or research -- if you're working on something genuinely new, then by definition their probabilistic nature means they can't help you with creativity. Also, just avoid ambiguous prompts in general; as much as people want to, you can't outsource the actual thinking itself.

I treat them like I would treat an extremely brown-nosey first-year PhD student: highly motivated and (in the right environment) very productive, but not too bright independently. So, just like such a student, you'll need to set them up for success by doing the thinking for them and doing some hand-holding.

But if you put in that legwork, the productivity gains are very real. In the last year, we've probably multiplied our output -- peer-reviewed publications and open-source PRs, not AI slop -- at least five-fold. But as mentioned above, YMMV.

EDIT: A bunch of TDs, but no replies. If you disagree, I'd love to hear WHY. I'm giving a concrete anecdote of where AI has been extremely valuable for the group I lead, and I've also pointed out the clear limitations where, in my experience, it falls flat. I'm genuinely surprised that my opinion -- that AI in coding is not remotely "smart" or PhD-level as OpenAI claims, but is still a tremendously valuable tool -- is controversial or unpopular.

As for the link the person to whom I replied originally shared (worth a read, kind of sad/hilarious), that's a perfect example of someone who did NOT do any legwork but vibe coded their way to 10K+ lines of largely untested, unmaintainable code. And then doubled down on just trusting that the AI understands the code. Literally the opposite of what I'm espousing for anyone who wants to try incorporating LLMs into their workflows.
 
Last edited:
Upvote
10 (110 / -100)

DamanielH

Smack-Fu Master, in training
93
OpenAI has a long history of "grand exaggeration" when it comes to AI's supposed capabilities and achievements. Could it be that they are pumping up their coming IPO?

Can we see an independent review of these capabilities??
Not to mention a history of 'journalists' regurgitating their lies and exaggerations as 'news'.

Do better.
 
Upvote
163 (173 / -10)
This is an awfully uncritical article about tech that has not delivered proven productivity improvements in real-world analysis.

Of course, it's quite effective at wasting the time of actual developers.
Right. How can anyone evaluate any of these claims? Claims made, check.

Claims are worth their weight in gold!
 
Upvote
58 (59 / -1)

WaveMotionGum

Ars Centurion
369
Subscriptor
It's interesting that despite there being countless different jobs that require composing text, the one job that has most widely adopted LLMs to increase productivity is developers. Can't be a coincidence. Possible explanations:
  1. Developers are the knowledge workers that are the most adaptable to new ways of doing things.
  2. Developers are the most inefficient knowledge workers.
  3. Something about development makes LLMs particularly useful and effective.
  4. LLMs don't actually provide enough advantages to justify the incredibly wide adoption they have seen and developers are particularly vulnerable to believing that a given technology improves output when it doesn't actually do so.
  5. Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
6. Developers were already using auto complete and intellisense in their IDEs.
7. Developers were already reusing code wherever they could.
 
Upvote
118 (120 / -2)

Missing Minute

Wise, Aged Ars Veteran
1,386
What's your source on that statistic?

As a programmer, I have USED LLMs quite often. But as far as "adopting" them goes, I'm far from it. They are highly faulty and of limited use. I typically give them a shot on a lot of things but don't trust anything they output; they frequently waste much more of my time than they save. I have just about given up on asking them anything about Unity, because the error rate is astronomical. I will use them for PowerShell scripts, though, because they usually get those right (and they are small enough to be easy to check).
See:
5. Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
 
Upvote
35 (42 / -7)
I wonder if anyone's looked at how well AI-supported coding does, especially in complex environments where high quality output is important? Oh look, there's an article about that on this cool site I sometimes visit. A quote from that article:


(and)


... Not 100% the same thing, I know. But seriously - this writeup feels like there was no pushback against the talking points provided by people with a financial interest in selling this thing.
Thanks for the feedback. I have updated the piece to specifically mention the METR study.

We plan to compare the performance of these agentic coding tools (Codex, Claude Code, Gemini CLI, maybe Mistral Vibe) in a future piece very soon, so stay tuned.
 
Upvote
109 (118 / -9)
As a side note, I'd love to know how much of the increased use of AI is from employees being told in no uncertain terms "management expects everyone to integrate AI into their daily tasks ASAP," whether that integration makes any sense or not.
So far everywhere I looked, that’s what it was.
 
Upvote
57 (60 / -3)
While it's fun to hate on AI (and justifiably so), I would argue that one of the only proven productivity improvements is in coding. The caveat is that it depends heavily on (1) the field of work and (2) the type of task being done.

For some things, like web or app development with lots of robust documentation and examples, these agents are fantastic and practically set-and-forget. For most well-defined, actionable prompts in other areas, agents will typically do a decent job. Designing unit tests and auditing code: they make some mistakes, but you can reject false positives and run multiple iterations. They're also really good at summarizing, helping to plan and organize, and drafting documentation.

What I would personally NEVER use them for? Large refactoring tasks -- their context lengths just aren't big enough to handle that much content at once, and they WILL get confused, hallucinate, or worse, sometimes commit outright fraud (and then lie to you about it). Anything related to science or research -- if you're working on something genuinely new, then by definition their probabilistic nature means they can't help you with creativity. Also, just avoid ambiguous prompts in general; as much as people want to, you can't outsource the actual thinking itself.

I treat them like I would treat an extremely brown-nosey first-year PhD student: highly motivated and (in the right environment) very productive, but not too bright independently. So, just like such a student, you'll need to set them up for success by doing the thinking for them and doing some hand-holding.

But if you put in that legwork, the productivity gains are very real. In the last year, we've probably multiplied our output -- peer-reviewed publications and open-source PRs, not AI slop -- at least five-fold. But as mentioned above, YMMV.

EDIT: A bunch of TDs, but no replies. If you disagree, I'd love to hear WHY. I'm giving a concrete anecdote of where AI has been extremely valuable for the group I lead, and I've also pointed out the clear limitations where, in my experience, it falls flat. I'm genuinely surprised that my opinion -- that AI in coding is not remotely "smart" or PhD-level as OpenAI claims, but is still a tremendously valuable tool -- is controversial or unpopular.

As for the link the person to whom I replied originally shared (worth a read, kind of sad/hilarious), that's a perfect example of someone who did NOT do any legwork but vibe coded their way to 10K+ lines of largely untested, unmaintainable code. And then doubled down on just trusting that the AI understands the code. Literally the opposite of what I'm espousing for anyone who wants to try incorporating LLMs into their workflows.
I'd say in large part because of this study. It found that users who used AI even believed they had made performance gains when in fact they had not. So you saying you gained "at least five-fold" without any data to back you up gives off the same "yeah, it definitely made me faster" vibes as in the study.

2) I tested AI (ChatGPT 4.0? 3.X? Can't remember) on a relatively minor task I'd expect it to be able to handle (creating a custom gesture recognizer) and found that it missed a few very crucial lines in code that otherwise looked perfectly accurate. If I hadn't already written the code myself and compared the two, I might not have caught it at first. Perhaps I would have caught it, or caught it when testing, or someone else might have; but either way, it only got 95% of the way there.

For me, 95% good on a super simple task just doesn't cut it. I'm not interested in wasting my time debugging bad code, it's the least interesting and fulfilling part of the job so I'm not looking to expand on it. I can only imagine how much worse it could get when trying to tackle actual complex problems.
 
Upvote
124 (129 / -5)
What's your source on that statistic?

As a programmer, I have USED LLMs quite often. But as far as "adopting" them goes, I'm far from it. They are highly faulty and of limited use. I typically give them a shot on a lot of things but don't trust anything they output; they frequently waste much more of my time than they save. I have just about given up on asking them anything about Unity, because the error rate is astronomical. I will use them for PowerShell scripts, though, because they usually get those right (and they are small enough to be easy to check).
If you skim his posting history and are charitable, he was being facetious and the correct answer was to mash the '5' button.

The developers in question being loudest about it are OpenAI and other companies' paid marketing astroturfers posting here.
 
Upvote
50 (50 / 0)

clb2c4e

Wise, Aged Ars Veteran
145
But if you put in that legwork, the productivity gains are very real. In the last year, we've probably multiplied our output -- peer-reviewed publications and open-source PRs, not AI slop -- at least five-fold. But as mentioned above, YMMV.
This is where you got my downvote from. "Improving productivity and multiplying output" through 5x more publications is, I would say, almost worse than AI slop.

Publication quantity is a problem not a virtue of academia right now.

If you could use AI to produce 5x fewer publications that are much higher quality, more succinct, and higher impact then I'd be impressed.

As it is, you are just contributing to the degradation of research by making it into a publication output machine.
 
Upvote
161 (163 / -2)
This is where you got my downvote from. "Improving productivity and multiplying output" through 5x more publications is, I would say, almost worse than AI slop.

Publication quantity is a problem not a virtue of academia right now.

If you could use AI to produce 5x fewer publications that are much higher quality, more succinct, and higher impact then I'd be impressed.

As it is, you are just contributing to the degradation of research by making it into a publication output machine.
Which is then unknowingly used to train other AI models later down the track when scraped, both legally and illegally, creating even more of a shit sandwich.
 
Upvote
65 (66 / -1)

Sarty

Ars Tribunus Angusticlavius
7,816
The developers in question being loudest about it are OpenAI and other companies' paid marketing astroturfers posting here.
I wish. I've come to the regrettable conclusion that the biggest fanboys and boosters are, while utterly mistaken in their analyses, very real people displaying very real excitement.
 
Upvote
81 (81 / 0)

_crane

Wise, Aged Ars Veteran
214
This should be utterly unsurprising to anyone familiar with the history of build tools. In an effort to eat their own dog food, build tools have traditionally been used to build themselves whenever possible: Clang/LLVM builds Clang/LLVM. It is the natural next step for an AI code-development project to author and build itself. If it's producing poor-quality code, that is clearly a bug in the codebase itself; we can fix that and then hypothetically see the improvement not only on the problem in question but also across the rest of the codebase. It is the sensible way to move forward.

When the time comes we would also expect to see robots built not in factories but by other robots from a bucket of spare parts. A factory can only deliver the throughput that it was specced for and is forever limited to that until you build another. Robots building robots can grow exponentially and is basically only limited by the logistics of delivering parts to an ever expanding body of robots. (Insert comical descriptions of rabbits with an unlimited supply of food reproducing faster than the speed of sound.) While a factory can bootstrap this process, ultimately the repeated doubling will dwarf its capacity into irrelevance. The early winner in the robot arms race will be the one who makes robot assembling robots with the shortest generational time and cheapest parts list.
what do you think a "factory" is?
 
Upvote
68 (68 / 0)

maxoakland

Ars Scholae Palatinae
1,309
Bro. If you want a softball interview podcast, do that. If you want to do journalism, this ain't it. Even just a verbatim transcript would be of more worth.
Seriously. I'm getting tired of Ars Technica articles that could've been a press release. For some reason, they're always about AI too.

I came to Ars because it had in-depth journalism about tech. The writers were knowledgeable and not easily swayed by marketing speak.

Maybe Ars isn't the place to find that kind of journalism anymore?
 
Upvote
120 (138 / -18)