How OpenAI is using GPT-5 Codex to improve the AI tool itself

Been a dev for 18 years. I've always been the first on my team to pick up new tools because I enjoy test-driving the latest stuff, and for some reason I have patience for not getting shit done while I learn new tooling. Probably because I love the efficiency win once you do learn it.
So I was on point with eval'ing MS Copilot for the last couple of years. Like any other new tool, you need to know how to use it to exploit it effectively. I've been iterating on my approach to using Copilot. I work in .NET almost exclusively.

TLDR - initially a garbage time-waster, hallucinating constantly, couldn't trust it with more than a couple lines, or some UX cleanup. Then with GPT 4, could trust it more. Claude Sonnet was the real eye-opener. Now GPT-5-Codex-Max has been good enough to handle a fairly complex refactor on 6000 lines of code.
Everyone saying this is hype has probably not been on the ground using this stuff.
It's coming for my job, but IMHO the way to stave that off longest will be to learn how to wield the AI sword before it cuts off my head.

This last year I switched jobs and am working in a semi-unfamiliar framework (Blazor on .NET). For a greenfield project that is basically doing things by the book, AI's been awesome and allowed increasing the feature scope while decreasing the timeline. There are features I wouldn't even have tried to pick up because I knew they would take too long, but I was able to implement them using Copilot.
Tips:
  • I think you have to be an experienced dev to really use it effectively because you know what it should look like and your BS detector can shut down the sycophantic aspects of the LLM.
  • You have to be in the loop and learn how to use it effectively.
  • Don't ask it to do too much, it will waste your time and slop out garbage.
  • Sometimes they will grind on problems and just get screwed up and make things worse. You need to be able to recognize when they're grinding down the wrong path and just solve the problem yourself :D
  • Spec out how you would approach the problem first, and compare its approach.
  • Use different models for different tasks. They'll all try to do what you want but using GPT 4 for something you should be using Sonnet 4.5 for will result in wasted time and garbage.
  • Ask the same question to different models for variations on architecture and approach.
  • Provide as much focused context as you can. e.g. linking files by using the #foo.cs syntax.
  • Use copilot-instructions.md to describe your project architecture and quirks so you don't have to tell it the same thing over and over.
  • Visual Studio Code gets the latest stuff first.
  • Agent mode was a game changer, especially with Claude Sonnet 4 and 4.5
  • Plan mode + agent mode has been a game changer for bigger projects because it keeps the LLM on track.
  • Can it knock out stuff in 5 minutes that would take a junior dev a couple of days? Yes. "This page looks old and tired. Use the latest version of Bootstrap and UI/UX best practices to make this page accessible." That's a 1-minute job to take junk HTML to "probably better than I could do". Not 100%, but certainly 95%.
  • It's good at simple PowerShell problems and speeds up script writing so much that I'll use it to write scripts to automate repetitive tasks I never would have automated otherwise, because I'm not a PowerShell expert and it can be incredibly fiddly.
  • Along with that, working outside your known languages or frameworks becomes much easier. (and dangerous because you don't know what you don't know.)
  • Agent mode makes it tempting not to, but you must code review their stuff. The better models put out more subtle errors. Use your critical thinking.
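To make the copilot-instructions.md tip concrete, here is a minimal sketch of such a file (it lives at .github/copilot-instructions.md in the repo; the project rules below are invented placeholders, not something to copy verbatim):

```markdown
# Copilot instructions

- Blazor Server app on .NET 8; prefer the built-in DI container over service locators.
- Data access goes through repository interfaces (e.g. `IWidgetRepository`); no inline SQL in pages.
- UI is Bootstrap 5 only; do not introduce other CSS frameworks.
- Nullable reference types are enabled; new code must produce no nullable warnings.
```

Because Copilot picks this file up automatically as context, project quirks stated once here don't have to be repeated in every prompt.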
Part of my profession recently has been recovering vibecoded projects and turning them into real products. The people who built these prototypes thought they were nearly finished, when they were not. These things are nowhere near replacing talented engineers, and you don't have to use them.
 
Upvote
32 (32 / 0)

Ozy

Ars Tribunus Angusticlavius
7,448
Part of my profession recently has been recovering vibecoded projects and turning them into real products. The people who built these prototypes thought they were nearly finished, when they were not. These things are nowhere near replacing talented engineers, and you don't have to use them.
Are they any better than they were six months ago, in your opinion? How much have you used the tools yourself?
 
Upvote
-10 (1 / -11)
The critical question is "how do you differentiate an early disruptive technology from an early dead-end technology?". Coming swiftly on the heels of metaverse and blockchain hype, it is a particularly critical question.

If you've got an answer for that, at least some people here will listen. If you just assert "it's not going away, it's going to change everything" without explaining why you believe it's more likely that the technology is early s-curve than late s-curve, people will not take you seriously because you're not making a serious argument.
I'm not OP, but the comments on these articles largely mirror what devs in my company were saying around a year ago. The gap between then and now, though, keeps widening at an accelerating rate. At work, people around me are experimenting, seeing what works and what doesn't. Mind you, due to the risk-averse nature of my company, we don't have access to the latest and greatest LLMs. What's been happening is the devs are adapting and learning how to work with what we do have access to (generally Claude 3.x (3.7?) and, just last month, GPT 5).

So it's striking to me, every time I read the comments around here, how huge the dichotomy is. Meanwhile, we have a group piloting Claude Code for test generation and UX designers starting to get access to Figma Make for iterating on actual functional feature enhancements. Internal usage keeps growing at a high rate. I haven't seen technology get adopted this quickly and this diversely in my 20 years at the company.
 
Upvote
-1 (6 / -7)
EDIT: A bunch of TDs, but no replies. If you disagree, I'd love to hear about WHY you do.
IMO a lot of Ars commenters are very principled people, like using Linux because it's open source. So they see companies like OpenAI as criminals that stole art, ideas, comments, etc from the entire internet. Plus they're usually tech literate and educated, so the built-in failure rate (no matter how small it may get) means the thing is a total write-off.
 
Upvote
26 (26 / 0)
Are they any better than they were six months ago, in your opinion? How much have you used the tools yourself?
I use LLMs as a replacement for stack overflow and don't waste time vibecoding. It's a better search experience overall, but information on software libraries is drifting further out of date.
 
Upvote
13 (13 / 0)
This is an awfully uncritical article about tech that has not delivered proven productivity improvements in real-world analysis.

Of course, it's quite effective at wasting the time of actual developers.
As a dev myself, I feel like in many ways I have pushed current AI models to their absolute limits in their ability to "help" developers.

And yes....I think they ultimately waste more of my time than they save if I just blindly try to use them for assistance in every possible task.

They are handy for certain things. Layouts are something they are pretty good at, getting CSS/HTML in the correct place. They are useful for optimizing small snippets or cleaning up simple blocks of code.

I don't even bother to ask "big picture" architecture questions. It's more likely to lead you down a rabbit hole of stupidity than it is to actually solve a problem in a way that makes sense for your specific application.
 
Upvote
16 (16 / 0)

frogstomp

Seniorius Lurkius
23
Subscriptor
I wonder if anyone's looked at how well AI-supported coding does, especially in complex environments where high quality output is important? Oh look, there's an article about that on this cool site I sometimes visit.
I hate that we can do this on seemingly every AI story ars has published recently. What has happened? Where have the science and the skepticism (original meaning) gone?
If I want to read a regurgitated press release I can do that anywhere. It's not what I come to ars for.
 
Upvote
15 (16 / -1)

DeeplyUnconcerned

Ars Scholae Palatinae
1,017
Subscriptor++
I hate that we can do this on seemingly every AI story ars has published recently. What has happened? Where have the science and the skepticism (original meaning) gone?
If I want to read a regurgitated press release I can do that anywhere. It's not what I come to ars for.
The thing I value Ars for is not skepticism, it's curiosity. It's the articles where (I assume) the writer has kept asking "what really happened?" or "what does that mean?" until their own curiosity is satisfied, because that results in articles where my curiosity is also satisfied. The example I'd give by default is that whenever there's a big internet security issue, I'm made aware of it by regular news sites writing articles saying "omg Heartbleed", and I expect there to be a long Ars article a few days later that explains it thoroughly, because an Ars writer who knows their beat has gone and dug around and talked to people until they have an explanation they're satisfied with.
 
Upvote
17 (18 / -1)

VoterFrog

Smack-Fu Master, in training
74
1. Your claim that reviewing the code takes as much time as writing it is baseless.
2. This year I stopped reviewing the code I get generated by GH Copilot. There is simply no need for that. Obviously, for some projects the code reviews are needed more than for others, but consider this:
  • you can use a different LLM to review the code generated by your LLM of choice
  • an LLM automatically (and quickly) generates more tests than you ever could
  • the best LLMs now create throwaway tests and use them in the process of code generation to test the code/algorithms they are working on
  • in many cases, all you need is for the generated code to pass your own tests. Code quality has always been a somewhat subjective topic anyway (with lots of tradeoffs involved). If the code passes all your tests and the performance is adequate, why do you care about code quality? Sure, better code is good from a support/maintenance perspective, and that part is critical for long-term projects (like, say, MS Word), but there are relatively few software projects like that. Also keep in mind that the software upgrade will be performed by an LLM anyway.
See now, I'm the opposite in some ways. I spend a lot of time reviewing even my own code, tending to it like one would prune a bonsai. What the agent does for me is get me to that initial tending state more quickly, and then it lets me turn my thoughts on how to improve the code into reality in no time.

I just ask it to do some small refactor. It goes off and does it, writes tests, gets it all working. And while it's doing its thing I'm reading the code and deciding what, if anything, I want to tend to. And because it's handling its own testing and fixing, what I get out always works.

I care about code quality because it makes the entire system more reliable, testable, and able to be extended without regressions. The right kind of structure and abstractions makes testing it a lot more robust. But getting that right takes some foresight into what parts are most likely to change.

It might sound slow but all that has been more than fast enough to make peer review my main bottleneck these days.
 
Upvote
12 (12 / 0)
It's interesting that despite there being countless different jobs that require composing text, the one job that appears to have most widely adopted LLMs to increase productivity is software development. Can't be a coincidence. Possible explanations:
  1. Developers are the knowledge workers that are the most adaptable to new ways of doing things.
  2. Developers are the most inefficient knowledge workers.
  3. Something about development makes LLMs particularly useful and effective.
  4. LLMs don't actually provide enough advantages to justify the incredibly wide adoption they have seen and developers are particularly vulnerable to believing that a given technology improves output when it doesn't actually do so.
  5. Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
It's a good question. It's definitely a mix.
There are some things that LLMs do well and some things they do poorly.
There are also some things that people just don't feel like they need or want help with.
And some things where it feels like cheating or otherwise improper to use an LLM.

Coding is kind of a perfect case, where 1) LLMs are surprisingly capable, 2) coding is really quite a pain in the ass, and 3) programmers don't feel weird about it.

Coding with LLMs is actually a new-ish thing. I think until late 2024, the models weren't good enough, and developers only used them as a sort of Stack Overflow alternative. LLMs were more commonly used for writing, editing, proofreading, and translation. In 2023, Nature half-jokingly referred to ChatGPT as one of their Scientists of the Year.
 
Upvote
-3 (2 / -5)
I use LLMs as a replacement for stack overflow and don't waste time vibecoding. It's a better search experience overall, but information on software libraries is drifting further out of date.
That's where MCP comes into play. There are multiple public MCP servers that tell the LLM where to look for documentation for the versions of the APIs/libraries used by your project.
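As a sketch of what that looks like in practice: most MCP clients take a short JSON stanza per server. The example below registers Context7, one public documentation-lookup server (the file location and exact schema vary by client, so treat this as illustrative and check the server's own README for the current package name):

```json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}
```

Once registered, the LLM can call the server's tools to fetch version-specific docs instead of relying on whatever was in its training data.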
 
Upvote
-19 (2 / -21)
I use them every day: GitHub Copilot (Claude, ChatGPT, etc.). If you're getting AI slop then you need to change what you're doing. I'm looking like a superstar with fairly typical human-in-the-middle stuff. Python scripts, Java, JavaScript, etc. Just fine. Also great for Terraform and DevOps stuff.

Hell, I threw it at a bunch of COBOL and it did really well.

AI slop is a problem, but it's one addressable by understanding prompting techniques, context compression, and session management.
The hive mind has spoken. I've shared a similar anecdote but we're getting downvoted by people who probably haven't tried it (at least one person actually admitted they hadn't tried any of the latest generation of LLMs) or are in an area where it doesn't help them. And for some reason they are unwilling to entertain the possibility that their experiences don't apply to everyone and there are some real use cases that benefit.
 
Upvote
-16 (8 / -24)

Kenjitsuka

Ars Scholae Palatinae
1,196
“We’ve been shipping models almost every week or every other week.”
My neighbours ship a bag of trash off every single week. That's twice as fast!
the company has seen the model work independently for 24 hours on complex tasks.
Since when was "time spent" EVER an indicator of quality or real productivity?!!!

Another terrible article, as most comments here already pointed out!!!
 
Upvote
18 (19 / -1)

Ozy

Ars Tribunus Angusticlavius
7,448
IMO a lot of Ars commenters are very principled people, like using Linux because it's open source. So they see companies like OpenAI as criminals that stole art, ideas, comments, etc from the entire internet. Plus they're usually tech literate and educated, so the built-in failure rate (no matter how small it may get) means the thing is a total write-off.
Built in failure rate? Is this some sort of indication, in your mind, that humans write less error prone code than the most recent AI tools? I mean, I get it when the AI output is supposed to be the final output. You wouldn't want that "built-in" failure rate writing legal briefs, making final medical diagnoses, navigating our highways. But code ALWAYS has errors in it when it is written, whether it's by a human or AI agent, and code is NOT the desired output, it's whatever that code is supposed to do that is the desired output. So, as long as you have a way to find errors, again human or AI created, as long as you have tests, code reviews, strict compiler checks...basically all of the infrastructure in place to find HUMAN generated errors, why would a 'built-in' error rate for AI code generation matter in the slightest?

If that really is what's holding back all of your 'tech literate' colleagues, I wonder what they actually expect from human coders. Can you put me in touch with one of these educated, tech-literate people, so I can find out where they find software engineers who write code without errors? Pretty please?
 
Upvote
-16 (4 / -20)

el_oscuro

Ars Praefectus
3,129
Subscriptor++
I wonder if anyone's looked at how well AI-supported coding does, especially in complex environments where high quality output is important? Oh look, there's an article about that on this cool site I sometimes visit.

I hate that we can do this on seemingly every AI story ars has published recently. What has happened? Where have the science and the skepticism (original meaning) gone?
If I want to read a regurgitated press release I can do that anywhere. It's not what I come to ars for.
That story was published last July. Condé Nast has since announced its partnership with OpenAI in August. That might explain some things.
 
Upvote
12 (13 / -1)
And some things where it feels like cheating or otherwise improper to use an LLM.
Also-- this probably convinces some people not to use LLMs in their jobs. Other people probably do use LLMs, secretly, and simply get away with it.

If a doctor, for example, uses an LLM to diagnose patients, he's probably not going to get in trouble unless he's stupid enough to brag about it for some reason.
 
Upvote
6 (6 / 0)
Built in failure rate? Is this some sort of indication, in your mind, that humans write less error prone code than the most recent AI tools? I mean, I get it when the AI output is supposed to be the final output. You wouldn't want that "built-in" failure rate writing legal briefs, making final medical diagnoses, navigating our highways. But code ALWAYS has errors in it when it is written, whether it's by a human or AI agent, and code is NOT the desired output, it's whatever that code is supposed to do that is the desired output. So, as long as you have a way to find errors, again human or AI created, as long as you have tests, code reviews, strict compiler checks...basically all of the infrastructure in place to find HUMAN generated errors, why would a 'built-in' error rate for AI code generation matter in the slightest?
Because over 60 years of study of quality management, the development of which drove the Japanese economic miracle and has guided large-scale industry ever since, says that systems based on adding and catching errors are never as reliable as systems that identify and remove sources of error.

Any error that enters the productive flow multiplies effort and cost. At the very least, you’re paying someone to make the error and remove it - and the further the error persists, the more effort to remove. The 1-10-100 rule in data entry is another example of this principle.

Doing stuff badly the first time and catching errors later is a bad idea everywhere. That’ll get you fired from Starbucks. I didn’t expect the graduates I supervised to deliver me code with no errors, but I expected it to perform the function it was designed to in all but corner cases. Just like I expect a Big Mac to have two beef patties, special sauce, cheese, etc.
 
Upvote
29 (29 / 0)
The fact that comments like this get downvoted so aggressively makes me fear for the audience in this forum. Somehow Ars readers, whom I used to consider so astute and technical that they’d add significant content to most articles, have become devout AI deniers?
Maybe that should make you re-evaluate your position. But no, it's everyone else who is wrong.
 
Upvote
29 (30 / -1)
Maybe that should make you re-evaluate your position. But no, it's everyone else who is wrong.
The thing is though, it’s somewhat unique to Ars. Devs at my work used to sound like this last year but that’s changed a lot the past few months. A few other commenters here have commented similarly. I still read the comments here, but now it’s mostly as a check on unbridled hype.
 
Upvote
-16 (4 / -20)
The critical question is "how do you differentiate an early disruptive technology from an early dead-end technology?". Coming swiftly on the heels of metaverse and blockchain hype, it is a particularly critical question.
Speaking as a tech enthusiast...

I don't know of anyone who took the metaverse seriously.

The blockchain-- well-- Bitcoin-- had a lot of early followers, but they were more like "lol, one day we won't pay taxes", not "we are going to change the world with this technology". Blockchain was always a neat idea, but nobody put forward a serious case for it.

AI has been an open ended question, since the days of Turing, who basically tossed out the notion that Moore's Law would lead us to AGI. For decades we've been asking if and how and when AI will hit a wall instead of forever climbing. And the answer has basically always been "maybe"? Unlike the other two examples.
 
Upvote
-14 (2 / -16)
It's interesting that despite there being countless different jobs that require composing text, the one job that appears to have most widely adopted LLMs to increase productivity is software development. Can't be a coincidence. Possible explanations:
  1. Developers are the knowledge workers that are the most adaptable to new ways of doing things.
  2. Developers are the most inefficient knowledge workers.
  3. Something about development makes LLMs particularly useful and effective.
  4. LLMs don't actually provide enough advantages to justify the incredibly wide adoption they have seen and developers are particularly vulnerable to believing that a given technology improves output when it doesn't actually do so.
  5. Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
Developers can analyse the code and complain, and outsiders consider them educated enough on the issue to be worth listening to. Replace them and no one can answer questions about how the software works.
 
Upvote
11 (11 / 0)

Thundercloud

Smack-Fu Master, in training
5
Back when I studied programming, we had assignments where it was expressly forbidden to use the unix/linux shell to solve the problem. One of the reasons, of course, was that we were supposed to learn to program from scratch, but my point is that good use of the tooling could let you solve the task with minimal effort long before the advent of AI.

Having the AI write boilerplate definitions is neat, but you could do the same even quicker with regexps and good use of the old-school tooling, and without introducing stupid errors like the AI deciding to change the column names in your SQL statement because it is influenced by other uses of those words in unrelated contexts.

The reason newbies generally don't use the more effective techniques is that it takes real effort to learn the tooling, and doing the boring work by hand is often quicker than really learning it.

The big change with AI is IMHO that it lowers the threshold to use tooling that allows large-scale changes. You can just use natural language to make the machine guess what you mean, and it never gets impatient with your failed attempts. When it works you feel so smug that all the time you wasted before on the task is soon forgotten.

From another perspective, the AI revolution is real only when companies start to report that they get fewer bugs in production code by using AI. Very little suggests we are approaching such a state; rather, it seems like management is such a believer in vibe coding that the number of bugs and security problems is exploding.

The million-dollar question is whether using an AI really builds software expertise, so you learn to tell what is maintainable code. Being a senior developer is all about having a good BS detector for bad code, and IMHO you don't learn that without coding yourself.
 
Upvote
21 (22 / -1)

_crane

Wise, Aged Ars Veteran
214
Speaking as a tech enthusiast...

I don't know of anyone who took the metaverse seriously.

The blockchain-- well-- Bitcoin-- had a lot of early followers, but they were more like "lol, one day we won't pay taxes", not "we are going to change the world with this technology". Blockchain was always a neat idea, but nobody put forward a serious case for it.

AI has been an open ended question, since the days of Turing, who basically tossed out the notion that Moore's Law would lead us to AGI. For decades we've been asking if and how and when AI will hit a wall instead of forever climbing. And the answer has basically always been "maybe"? Unlike the other two examples.
a lot of the ai hype I see could be almost directly copied from metaverse/blockchain/nft equivalents, with just the relevant keywords changed. often it comes from the exact same people.
 
Upvote
19 (19 / 0)
One recent example: I wanted to code some JavaScript to fill out a form on a website. On many websites, it's one line of code, like document.getElementById('formField').value='FirstName'
But now most websites use fancy frameworks, so if you do that, the field is updated visually, but doesn't trigger an interaction that the website uses. A year ago, I spent hours trying to get it to work on my own, even searching sites like Stack Overflow, but was unsuccessful. Last week I decided to revisit the problem and asked ChatGPT how to do it, and it gave me 7 lines of code that worked the first time I tried it. I then asked ChatGPT for some code so I could click a button in Excel and have that data sent through an AutoHotKey script that would send the javascript code to Chrome to fill out the form. One part of the code didn't work, but I was able to replace it with some code from a different project.
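For the curious, the seven-ish-line fix for framework-backed forms is usually some variant of the sketch below (not the commenter's exact code; `formField` and `FirstName` are just the example values from the quote). Frameworks like React wrap the field's native value setter, so you write through the prototype's setter and then dispatch the `input` event their change tracking listens for:

```javascript
// Set a form field's value so that framework listeners (React, etc.) notice.
// Sketch only -- the selector and value are placeholders.
function setFieldValue(field, value) {
  // Use the prototype's value setter (on a real <input> this is the native
  // HTMLInputElement setter that frameworks wrap), falling back to a plain set.
  const desc = Object.getOwnPropertyDescriptor(Object.getPrototypeOf(field), 'value');
  if (desc && desc.set) desc.set.call(field, value);
  else field.value = value;
  // Fire the event the framework's change tracking listens for.
  field.dispatchEvent(new Event('input', { bubbles: true }));
}

// e.g. setFieldValue(document.getElementById('formField'), 'FirstName');
```

On a plain page the prototype setter is just the native one, so this degrades gracefully on non-framework sites too.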
However there’s quite a large gap between “I plagiarised seven lines of boilerplate scripting code when I couldn’t be bothered to RTFM” and “our IDE is so smart it writes itself! Please send money or we might accidentally make Roko’s Basilisk”.
 
Upvote
15 (15 / 0)

SirUna

Seniorius Lurkius
46
Subscriptor
[....]

This last year I switched jobs and am working in an semi-unfamiliar framework (Blazor on .NET). [...]
  • I think you have to be an experienced dev to really use it effectively because you know what it should look like and your BS detector can shut down the sycophantic aspects of the LLM. [...]

How long did you wait until you added AI to your workflow with this unfamiliar framework?
 
Upvote
3 (3 / 0)
a lot of the ai hype I see could be almost directly copied from metaverse/blockchain/nft... often it comes from the exact same people.
Well, you shouldn't listen to those people. You should ignore them.

But ignoring them doesn't mean jumping to the exact opposite conclusion. In doing so, you run from one position of ignorance to another.

You should, as much as possible, form your own conclusions. And when listening to people, you should look for who is the smartest in the room, and give them a chance to make their case.

I myself certainly never believed in NFTs, and I have never owned a bitcoin or a 'meta' device. The same goes for all the people I know in real life who use AI regularly and expect it to change the world.

When you define your opinions as the opposite of people you don't like, you are essentially giving them power.
 
Upvote
-13 (0 / -13)

columbrian

Wise, Aged Ars Veteran
104
Thanks for the feedback. I have updated the piece to specifically mention the METR study.

We plan to compare the performance of these agentic coding tools (Codex, Claude Code, Gemini CLI, maybe Mistral Vibe) in a future piece very soon, so stay tuned.

Yeah, OK. But try to give it a task that isn't asking it to write the same code everyone else has already written. If you can find an example of the code you're asking for online, asking the AI to regurgitate some (likely broken) form of it doesn't count.

Claude recently offered to show me how smart it is by designing the architecture of a project of my choosing. When I asked it to come up with a Zephyr based IoT device that captured GPS location and pushed it to a web service, it happily drew an SVG diagram that was nothing short of hilarious. It had queues, ring buffers, and all kinds of random stuff. It was fully buzzword compliant, but nothing in the diagram made any sense because nothing was connected in any meaningful way. It was like a failing undergraduate student's project submission after they pulled an all-nighter.

About the only things I have found LLMs useful for, other than copying other people's work without knowing who the original authors were, is to summarize other information such as search results, and to pad things out with narrative fluff. Of course, the person reading my AI fluff is probably using an LLM to summarize it so they don't have to waste their time reading it.
 
Upvote
20 (20 / 0)

richgroot

Smack-Fu Master, in training
62
Subscriptor++
What I would personally NEVER use them for? Large refactor tasks -- their context lengths just aren't big enough to handle too much content at once and they WILL get confused, hallucinate, or worse, sometimes commit outright fraud (and then lie to you about it). Anything related to science or research -- if you're working on something new, by definition, they're probabilistic so they won't be able to help you with creativity. Also just avoid ambiguous prompts in general; as much as people want to, you can't outsource the actual thinking itself.
Yes! I discovered that "large" is pretty small when I tried to refactor some code and had to revert from github. Grumble.
 
Upvote
0 (0 / 0)

Selethorme

Ars Praetorian
522
Subscriptor++
Citation needed.
It was provided multiple times in this thread.
But even though I'm the ceo of my company, I don't force a single employee to use AI because they're absolutely crap at it.

I have noticed in trial runs we have had that it actually hurts their performance. I'm still holding educational webinars and paying for llm tools but no one is required to use them.
This says so much about your management style, and very little about the capability of your employees or AI.
And for some reason they are unwilling to entertain the possibility that their experiences don't apply to everyone and there are some real use cases that benefit.
Oh the irony. No, there've been plenty of people in this thread talking about their use of AI, and how it's comically bad at so many of the things y'all claim it's good at.
Increasing the time that an LLM can work without losing context is a very big deal. It means the models can now be used to solve much more complex problems than before.
If it can spend more time on it and still get it wrong, it's literally less efficient.
it is genuinely hilarious to see the "wall" crowd in this thread trying to hand-wave the most significant event in the history of software engineering.
Because it's not. And it's very telling you have no actual response to the many detailed explanations. Y'all are high on your own supply.
openai just admitted that codex is building codex. we are officially in the recursive loop.
"Admitted" is a very telling bit of word choice in terms of believing your own hype. OpenAI wants you to believe that for the very reason that you choose to: it makes you believe more in the capability of the system. Even though, as pointed out, again, in the comments, it's utterly bullshit.
to the guy quoting the metr study from july:
The one that even the author of the piece, who is transparently on your side, admitted was a good point?
the fact that four engineers shipped sora for android in 18 days from scratch should be a siren for anyone still arguing that this is just "fancy intellisense."
My guy, a single dev can spin up an app for android in an afternoon. It's really not hard to build an app today. That they wrapped a slick GUI around it is not meaningfully impressive across 18 days.
No, dude, if AI can write unit tests, it can self-correct errors. I've seen it happen.
But it can't.
You are still operating under the assumption that AI codes with MORE errors than human engineers, and/or that these are qualitatively different errors than human-generated ones.
Because it does. Demonstrably. As was referenced multiple times in the thread.
 
Upvote
13 (15 / -2)
Yeah, OK. But try to give it a task that isn't asking it to write the same code everyone else has already written. If you can find an example of the code you're asking for online, asking the AI to regurgitate some (likely broken) form of it doesn't count.

Claude recently offered to show me how smart it is by designing the architecture of a project of my choosing. When I asked it to come up with a Zephyr based IoT device that captured GPS location and pushed it to a web service, it happily drew an SVG diagram that was nothing short of hilarious. It had queues, ring buffers, and all kinds of random stuff. It was fully buzzword compliant, but nothing in the diagram made any sense because nothing was connected in any meaningful way. It was like a failing undergraduate student's project submission after they pulled an all-nighter.

About the only things I have found LLMs useful for, other than copying other people's work without knowing who the original authors were, is to summarize other information such as search results, and to pad things out with narrative fluff. Of course, the person reading my AI fluff is probably using an LLM to summarize it so they don't have to waste their time reading it.
This is a particularly bad use of AI. LLMs are absolutely terrible at anything visual. Hopefully you learned your lesson there. Gemini 3 was the first step towards an AI that can "see", but it's still a ways from being usable for technical diagrams. Claude is still blind to the point of being ridiculous.

It would be nice to have an open conversation here about what exactly people do with LLMs and why. Where LLMs succeed and where they fail.

I suspect we would agree on a few things, and (imho) that would be nice. One thing we might all agree on is that LLMs are (currently) terrible at making diagrams.
 
Upvote
-7 (3 / -10)