How OpenAI is using GPT-5 Codex to improve the AI tool itself


Sarty

Ars Tribunus Angusticlavius
7,816
Who knew OpenAI had all of these amazing, world-changing, earth-shattering developments waiting for (checks calendar) just a week and a half after Altman's freakout email?

Gosh, does Altman know what his own company actually has waiting in the wings? Or is this a flailing squid-ink-cloud of bullshit?
 
Upvote
263 (274 / -11)

Missing Minute

Wise, Aged Ars Veteran
1,386
It's interesting that despite there being countless different jobs that require composing text, the one job that appears to have most widely adopted LLMs to increase productivity is developers. Can't be a coincidence. Possible explanations:
  1. Developers are the knowledge workers that are the most adaptable to new ways of doing things.
  2. Developers are the most inefficient knowledge workers.
  3. Something about development makes LLMs particularly useful and effective.
  4. LLMs don't actually provide enough advantages to justify the incredibly wide adoption they have seen and developers are particularly vulnerable to believing that a given technology improves output when it doesn't actually do so.
  5. Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
 
Last edited:
Upvote
71 (93 / -22)

Missing Minute

Wise, Aged Ars Veteran
1,386
This is an awfully uncritical article about tech that has not delivered proven productivity improvements in real-world analysis.

Of course, it's quite effective at wasting the time of actual developers.
When all you do all day every day at work is read about AI it's really easy to drown your critical thinking in the torrent of content praising it.
 
Upvote
135 (145 / -10)

Robin-3

Ars Scholae Palatinae
1,127
Subscriptor
Given the well-known issues with confabulation in AI models when people attempt to use them as factual resources, could it be that coding has become the killer app for LLMs? We wondered if OpenAI has noticed that coding seems to be a clear business use case for today’s AI models with less hazard than, say, using AI language models for writing or as emotional companions.

I wonder if anyone's looked at how well AI-supported coding does, especially in complex environments where high quality output is important? Oh look, there's an article about that on this cool site I sometimes visit. A quote from that article:

These factors lead the researchers to conclude that current AI coding tools may be particularly ill-suited to “settings with very high quality standards, or with many implicit requirements (e.g., relating to documentation, testing coverage, or linting/formatting) that take humans substantial time to learn.” [...]
(and)
For now, however, METR’s study provides some strong evidence that AI’s much-vaunted usefulness for coding tasks may have significant limitations in certain complex, real-world coding scenarios.

... Not 100% the same thing, I know. But seriously - this writeup feels like there was no pushback against the talking points provided by people with a financial interest in selling this thing.
 
Upvote
215 (222 / -7)

hillspuck

Ars Scholae Palatinae
2,179
It's interesting that despite there being countless different jobs that require composing text, the one job that has most widely adopted LLMs to increase productivity is developers.
What's your source on that statistic?

As a programmer, I have USED LLMs quite often. But as far as "adopting" them goes, I'm far from it. They are highly faulty and of limited use. I typically give them a shot on a lot of things but don't trust anything they output; they frequently waste much more of my time than they save. I have just about given up on asking them anything about Unity, because the error rate is astronomical. I will use them for PowerShell scripts, though, because they usually get those right (and they are small enough to be easy to check).
 
Upvote
135 (141 / -6)
This is an awfully uncritical article about tech that has not delivered proven productivity improvements in real-world analysis.

Of course, it's quite effective at wasting the time of actual developers.
While it's fun to hate on AI (and justifiably so), I would argue that one of the only proven productivity improvements is in coding. The caveat is that it depends heavily on (1) the field of work and (2) the type of task being done.

For some things, like web or app development with lots of robust documentation and examples, these agents are fantastic and practically set-and-forget. For most well-defined, actionable prompts in other areas, agents will typically do a decent job. Designing unit tests and auditing code: they make some mistakes, but you can reject false positives and run multiple iterations. They're also really good at summarizing, helping to plan and organize, and drafting documentation.

What I would personally NEVER use them for? Large refactoring tasks -- their context lengths just aren't big enough to handle that much content at once, and they WILL get confused, hallucinate, or worse, sometimes commit outright fraud (and then lie to you about it). Anything related to science or research -- if you're working on something genuinely new, then by definition their probabilistic nature means they can't help you with creativity. Also, just avoid ambiguous prompts in general; as much as people want to, you can't outsource the actual thinking itself.

I treat them like I would treat an extremely brown-nosey first-year PhD student: highly motivated and (in the right environment) very productive, but not too bright independently. So, just like such a student, you'll need to set them up for success by doing the thinking for them and doing some hand-holding.

But if you put in that legwork, the productivity gains are very real. In the last year, we've probably multiplied our output -- peer-reviewed publications and open-source PRs, not AI slop -- at least five-fold. But as mentioned above, YMMV.

EDIT: A bunch of TDs, but no replies. If you disagree, I'd love to hear WHY. I'm giving a concrete anecdote of where AI has been extremely valuable for the group I lead, and I've also pointed out the clear limitations where, in my experience, it falls flat. I'm genuinely surprised that my opinion -- that AI in coding is not remotely "smart" or PhD-level as OpenAI claims, but is still a tremendously valuable tool -- is controversial or unpopular.

As for the link the person to whom I replied originally shared (worth a read, kind of sad/hilarious), that's a perfect example of someone who did NOT do any legwork but vibe coded their way to 10K+ lines of largely untested, unmaintainable code. And then doubled down on just trusting that the AI understands the code. Literally the opposite of what I'm espousing for anyone who wants to try incorporating LLMs into their workflows.
 
Last edited:
Upvote
10 (110 / -100)

DamanielH

Smack-Fu Master, in training
93
OpenAI has a long history of "grand exaggeration" when it comes to AI's supposed capabilities and achievements. Could it be that they are pumping up their coming IPO?

Can we see an independent review of these capabilities??
Not to mention a history of 'journalists' regurgitating their lies and exaggerations as 'news'.

Do better.
 
Upvote
163 (173 / -10)
This is an awfully uncritical article about tech that has not delivered proven productivity improvements in real-world analysis.

Of course, it's quite effective at wasting the time of actual developers.
Right. How can anyone evaluate any of these claims? Claims made, check.

Claims are worth their weight in gold!
 
Upvote
58 (59 / -1)

WaveMotionGum

Ars Centurion
369
Subscriptor
It's interesting that despite there being countless different jobs that require composing text, the one job that has most widely adopted LLMs to increase productivity is developers. Can't be a coincidence. Possible explanations:
  1. Developers are the knowledge workers that are the most adaptable to new ways of doing things.
  2. Developers are the most inefficient knowledge workers.
  3. Something about development makes LLMs particularly useful and effective.
  4. LLMs don't actually provide enough advantages to justify the incredibly wide adoption they have seen and developers are particularly vulnerable to believing that a given technology improves output when it doesn't actually do so.
  5. Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
6. Developers were already using auto complete and intellisense in their IDEs.
7. Developers were already reusing code wherever they could.
 
Upvote
118 (120 / -2)

Missing Minute

Wise, Aged Ars Veteran
1,386
What's your source on that statistic?

As a programmer, I have USED LLMs quite often. But as far as "adopting" them goes, I'm far from it. They are highly faulty and of limited use. I typically give them a shot on a lot of things but don't trust anything they output; they frequently waste much more of my time than they save. I have just about given up on asking them anything about Unity, because the error rate is astronomical. I will use them for PowerShell scripts, though, because they usually get those right (and they are small enough to be easy to check).
See:
5. Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
 
Upvote
35 (42 / -7)
I wonder if anyone's looked at how well AI-supported coding does, especially in complex environments where high quality output is important? Oh look, there's an article about that on this cool site I sometimes visit. A quote from that article:


(and)


... Not 100% the same thing, I know. But seriously - this writeup feels like there was no pushback against the talking points provided by people with a financial interest in selling this thing.
Thanks for the feedback. I have updated the piece to specifically mention the METR study.

We plan to compare the performance of these agentic coding tools (Codex, Claude Code, Gemini CLI, maybe Mistral Vibe) in a future piece very soon, so stay tuned.
 
Upvote
109 (118 / -9)
As a side note, I'd love to know how much of the increased use of AI is from employees being told in no uncertain terms "management expects everyone to integrate AI into their daily tasks ASAP," whether that integration makes any sense or not.
So far everywhere I looked, that’s what it was.
 
Upvote
57 (60 / -3)
While it's fun to hate on AI (and justifiably so), I would argue that one of the only proven productivity improvements is in coding. The caveat is that it depends heavily on (1) the field of work and (2) the type of task being done.

For some things, like web or app development with lots of robust documentation and examples, these agents are fantastic and practically set-and-forget. For most well-defined, actionable prompts in other areas, agents will typically do a decent job. Designing unit tests and auditing code: they make some mistakes, but you can reject false positives and run multiple iterations. They're also really good at summarizing, helping to plan and organize, and drafting documentation.

What I would personally NEVER use them for? Large refactoring tasks -- their context lengths just aren't big enough to handle that much content at once, and they WILL get confused, hallucinate, or worse, sometimes commit outright fraud (and then lie to you about it). Anything related to science or research -- if you're working on something genuinely new, then by definition their probabilistic nature means they can't help you with creativity. Also, just avoid ambiguous prompts in general; as much as people want to, you can't outsource the actual thinking itself.

I treat them like I would treat an extremely brown-nosey first-year PhD student: highly motivated and (in the right environment) very productive, but not too bright independently. So, just like such a student, you'll need to set them up for success by doing the thinking for them and doing some hand-holding.

But if you put in that legwork, the productivity gains are very real. In the last year, we've probably multiplied our output -- peer-reviewed publications and open-source PRs, not AI slop -- at least five-fold. But as mentioned above, YMMV.

EDIT: A bunch of TDs, but no replies. If you disagree, I'd love to hear WHY. I'm giving a concrete anecdote of where AI has been extremely valuable for the group I lead, and I've also pointed out the clear limitations where, in my experience, it falls flat. I'm genuinely surprised that my opinion -- that AI in coding is not remotely "smart" or PhD-level as OpenAI claims, but is still a tremendously valuable tool -- is controversial or unpopular.

As for the link the person to whom I replied originally shared (worth a read, kind of sad/hilarious), that's a perfect example of someone who did NOT do any legwork but vibe coded their way to 10K+ lines of largely untested, unmaintainable code. And then doubled down on just trusting that the AI understands the code. Literally the opposite of what I'm espousing for anyone who wants to try incorporating LLMs into their workflows.
I'd say in large part because of this study. It found that users who used AI even believed they had made performance gains when in fact they had not. So you saying you gained "at least five-fold" without any data to back you up gives off the same "yeah, it definitely made me faster" vibes as in the study.

2) I tested AI (ChatGPT 4.0? 3.X? Can't remember) on a relatively minor task I'd expect it to be able to handle (creating a custom gesture recognizer) and found that it missed a few very crucial lines in code that otherwise looked perfectly accurate. If I hadn't already written the code myself and compared the two, I might not have caught it at first. Perhaps I would have caught it, or caught it when testing, or someone else might have; but either way, it only got 95% of the way there.

For me, 95% good on a super simple task just doesn't cut it. I'm not interested in wasting my time debugging bad code, it's the least interesting and fulfilling part of the job so I'm not looking to expand on it. I can only imagine how much worse it could get when trying to tackle actual complex problems.
 
Upvote
124 (129 / -5)
What's your source on that statistic?

As a programmer, I have USED LLMs quite often. But as far as "adopting" them goes, I'm far from it. They are highly faulty and of limited use. I typically give them a shot on a lot of things but don't trust anything they output; they frequently waste much more of my time than they save. I have just about given up on asking them anything about Unity, because the error rate is astronomical. I will use them for PowerShell scripts, though, because they usually get those right (and they are small enough to be easy to check).
If you skim his posting history and are charitable, he was being facetious and the correct answer was to mash the '5' button.

The developers in question being loudest about it are OpenAI and other companies' paid marketing astroturfers posting here.
 
Upvote
50 (50 / 0)

clb2c4e

Wise, Aged Ars Veteran
145
But if you put in that legwork, the productivity gains are very real. In the last year, we've probably multiplied our output -- peer-reviewed publications and open-source PRs, not AI slop -- at least five-fold. But as mentioned above, YMMV.
This is where you got my downvote from. "Improving productivity and multiplying output" through 5x more publications is, I would say, almost worse than AI slop.

Publication quantity is a problem not a virtue of academia right now.

If you could use AI to produce 5x fewer publications that are much higher quality, more succinct, and higher impact then I'd be impressed.

As it is, you are just contributing to the degradation of research by making it into a publication output machine.
 
Upvote
161 (163 / -2)
This is where you got my downvote from. "Improving productivity and multiplying output" through 5x more publications is, I would say, almost worse than AI slop.

Publication quantity is a problem not a virtue of academia right now.

If you could use AI to produce 5x fewer publications that are much higher quality, more succinct, and higher impact then I'd be impressed.

As it is, you are just contributing to the degradation of research by making it into a publication output machine.
Which is then unknowingly used to train other AI models later down the track when scraped, both legally and illegally, creating even more of a shit sandwich.
 
Upvote
65 (66 / -1)

Sarty

Ars Tribunus Angusticlavius
7,816
The developers in question being loudest about it are OpenAI and other companies' paid marketing astroturfers posting here.
I wish. I've come to the regrettable conclusion that the biggest fanboys and boosters are, while utterly mistaken in their analyses, very real people displaying very real excitement.
 
Upvote
81 (81 / 0)

_crane

Wise, Aged Ars Veteran
214
This should be utterly unsurprising to anyone familiar with the history of build tools. In an effort to eat their own dog food, build tools have traditionally been used to build themselves whenever possible: Clang/LLVM builds Clang/LLVM. It is the natural next step for an AI code-development project to author and build itself. If it's producing poor-quality code, that is clearly a bug in the codebase itself; we can fix that and then hypothetically see the improvement not only on the problem in question but also across the rest of the codebase. It is the sensible way to move forward.

When the time comes we would also expect to see robots built not in factories but by other robots from a bucket of spare parts. A factory can only deliver the throughput that it was specced for and is forever limited to that until you build another. Robots building robots can grow exponentially and is basically only limited by the logistics of delivering parts to an ever expanding body of robots. (Insert comical descriptions of rabbits with an unlimited supply of food reproducing faster than the speed of sound.) While a factory can bootstrap this process, ultimately the repeated doubling will dwarf its capacity into irrelevance. The early winner in the robot arms race will be the one who makes robot assembling robots with the shortest generational time and cheapest parts list.
what do you think a "factory" is?
 
Upvote
68 (68 / 0)

maxoakland

Ars Scholae Palatinae
1,309
Bro. If you want a softball interview podcast, do that. If you want to do journalism, this ain't it. Even just a verbatim transcript would be of more worth.
Seriously. I'm getting tired of Ars Technica articles that could've been a press release. For some reason, they're always about AI too.

I came to Ars because it had in-depth journalism about tech. The writers were knowledgeable and not easily swayed by marketing speak.

Maybe Ars isn't the place to find that kind of journalism anymore?
 
Upvote
120 (138 / -18)