"The vast majority of Codex is built by Codex," OpenAI told us about its new AI coding agent.
See full article...
Part of my profession recently has been recovering vibecoded projects and turning them into real products. The people who built these prototypes thought they were nearly finished, when they were not. These things are nowhere near replacing talented engineers and you don't have to use them.

Been a dev for 18 years. I've always picked up the new tools first on my team because I enjoy test driving the latest stuff and for some reason I have patience for not getting shit done while I learn new tooling. Probably because I love the efficiency win when you do learn it.
So I was on point with eval'ing MS copilot for the last couple years. Just like any other new tool you need to know how to use the tool to exploit it effectively. I've been iterating my approach to using copilot. I work in .NET almost exclusively.
TLDR - initially a garbage time-waster, hallucinating constantly, couldn't trust it with more than a couple lines, or some UX cleanup. Then with GPT 4, could trust it more. Claude Sonnet was the real eye-opener. Now GPT-5-Codex-Max has been good enough to handle a fairly complex refactor on 6000 lines of code.
Everyone saying this is hype has probably not been on the ground using this stuff.
It's coming for my job, but IMHO the way to stave that off longest will be to learn how to wield the AI sword before it cuts off my head.
This last year I switched jobs and am working in a semi-unfamiliar framework (Blazor on .NET). For a greenfield project that is basically doing things by the book, AI's been awesome and allowed increasing the feature scope and decreasing the timeline. There are features I wouldn't have even tried to pick up because I knew it would take too long that I was able to implement utilizing Copilot.
Tips:
- I think you have to be an experienced dev to really use it effectively because you know what it should look like and your BS detector can shut down the sycophantic aspects of the LLM.
- You have to be in the loop and learn how to use it effectively.
- Don't ask it to do too much, it will waste your time and slop out garbage.
- Sometimes they will grind on problems and just get screwed up and make things worse. You need to be able to recognize when they're grinding down the wrong path and just solve the problem yourself.
- Spec out how you would approach the problem first, and compare its approach.
- Use different models for different tasks. They'll all try to do what you want but using GPT 4 for something you should be using Sonnet 4.5 for will result in wasted time and garbage.
- Ask the same question to different models for variations on architecture and approach.
- Provide as much focused context as you can. e.g. linking files by using the #foo.cs syntax.
- Use copilot-instructions.md to describe your project architecture and quirks so you don't have to tell it the same thing over and over.
- Visual Studio Code gets the latest stuff first.
- Agent mode was a game changer, especially with Claude Sonnet 4 and 4.5
- Plan mode + agent mode has been a game changer for bigger projects because it keeps the LLM on track.
- Can it knock out stuff in 5 minutes that would take a junior dev a couple days? Yes. "This page looks old and tired. Use the latest version of bootstrap and UI/UX best practices to make this page accessible". That's a 1 minute job to take junk html to "probably better than I could do". Not 100% but certainly 95%.
- It's good at simple powershell problems and speeds up script writing so much that I'll use it to write scripts to automate repetitive tasks that I never would have done because I'm not a powershell expert and it can be incredibly fiddly.
- Along with that, working outside your known languages or frameworks becomes much easier. (and dangerous because you don't know what you don't know.)
- Agent mode makes it tempting not to, but you must code review their stuff. The better models put out more subtle errors. Use your critical thinking.
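The copilot-instructions.md tip above can be as simple as a short markdown brief checked into the repo. A hypothetical sketch (the project details here are invented for illustration, not from any comment in this thread):

```markdown
# Copilot instructions

This is a Blazor Server app on .NET 8.

- Data access goes through repository classes in `Data/`; never call the
  EF Core DbContext directly from components.
- UI uses Bootstrap 5; prefer the shared components in `Shared/` over
  writing new markup.
- All public methods need XML doc comments.
```

The point is that anything you find yourself repeating in prompts belongs in this file instead.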
Are they any better than they were 6 mo. ago in your opinion? How much have you used the tools yourself?

Part of my profession recently has been recovering vibecoded projects and turning them into real products. The people who built these prototypes thought they were nearly finished, when they were not. These things are nowhere near replacing talented engineers and you don't have to use them.
I'm not OP but the comments on these articles largely mirror what devs in my company were saying around a year ago. But the split between then and now is growing at an accelerating rate. At work, people around me are experimenting, seeing what works and what doesn't. Mind you, due to the risk-averse nature of my company, we don't have access to the latest and greatest LLMs. What's been happening is the devs are adapting and learning how to work with what we do have access to (generally, Claude 3.x (3.7?) and, just last month, GPT 5).

The critical question is "how do you differentiate an early disruptive technology from an early dead-end technology?". Coming swiftly on the heels of metaverse and blockchain hype, it is a particularly critical question.
If you've got an answer for that, at least some people here will listen. If you just assert "it's not going away, it's going to change everything" without explaining why you believe it's more likely that the technology is early s-curve than late s-curve, people will not take you seriously because you're not making a serious argument.
IMO a lot of Ars commenters are very principled people, like using Linux because it's open source. So they see companies like OpenAI as criminals that stole art, ideas, comments, etc from the entire internet. Plus they're usually tech literate and educated, so the built-in failure rate (no matter how small it may get) means the thing is a total write-off.

EDIT: A bunch of TDs, but no replies. If you disagree, I'd love to hear about WHY you do.
I use LLMs as a replacement for Stack Overflow and don't waste time vibecoding. It's a better search experience overall, but information on software libraries is drifting further out of date.

Are they any better than they were 6 mo. ago in your opinion? How much have you used the tools yourself?
As a dev myself, I feel like in many ways I have pushed current AI models to their absolute limits in their ability to "help" developers.

This is an awfully uncritical article about tech that has not delivered proven productivity improvements in real-world analysis.
Of course, it's quite effective at wasting the time of actual developers.
I hate that we can do this on seemingly every AI story Ars has published recently. What has happened? Where have the science and the skepticism (original meaning) gone?

I wonder if anyone's looked at how well AI-supported coding does, especially in complex environments where high-quality output is important? Oh look, there's an article about that on this cool site I sometimes visit.
The thing I value Ars for is not skepticism, it's curiosity. It's the articles where (I assume) the writer has kept asking "what really happened?" or "what does that mean?" until their own curiosity is satisfied, because that results in articles where my curiosity is also satisfied. The example I'd give by default is that whenever there's a big internet security issue, I'm made aware of it by regular news sites writing articles saying "omg Heartbleed", and I expect there to be a long Ars article a few days later that explains it thoroughly, because an Ars writer who knows their beat has gone and dug around and talked to people until they have an explanation they're satisfied with.

I hate that we can do this on seemingly every AI story ars has published recently. What has happened? Where have the science and the skepticism (original meaning) gone?
If I want to read a regurgitated press release I can do that anywhere. It's not what I come to ars for.
See now, I'm the opposite in some ways. I spend a lot of time reviewing even my own code, tending to it like one would prune a bonsai. What the agent does for me is get me to that initial tending state more quickly, and then it lets me turn my thoughts on how to improve the code into reality in no time.

1. Your claim that reviewing the code takes as much time as writing it is baseless.
2. This year I stopped reviewing the code generated for me by GH Copilot. There is simply no need for that. Obviously, for some projects code reviews are needed more than for others, but consider this:
- you can use a different LLM to review the code generated by your LLM of choice
- an LLM automatically (and quickly) generates more tests than you ever could
- the best LLMs now create throwaway tests and use them in the process of code generation to test the code/algorithms they are working on
- in many cases, all you need is for the generated code to pass your own tests. Code quality has always been a somewhat subjective topic anyway (with lots of tradeoffs involved). If the code passes all your tests and the performance is adequate, why do you care about the code quality? Sure, better code is good from the support/maintenance perspective. That part is critical for long-term projects (like, say, MS Word), but there are relatively few software projects like that. Also, keep in mind that the software upgrade will be performed by an LLM anyway.
It's a good question. It's definitely a mix.

It's interesting that despite there being countless different jobs that require composing text, the one job that appears to have most widely adopted LLMs to increase productivity is developers. Can't be a coincidence. Possible explanations:
- Developers are the knowledge workers that are the most adaptable to new ways of doing things.
- Developers are the most inefficient knowledge workers.
- Something about development makes LLMs particularly useful and effective.
- LLMs don't actually provide enough advantages to justify the incredibly wide adoption they have seen and developers are particularly vulnerable to believing that a given technology improves output when it doesn't actually do so.
- Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
That's where MCP comes into play. There are multiple public MCP servers that tell the LLM where to look for documentation for the version of the API/libraries used by your project.

I use LLMs as a replacement for Stack Overflow and don't waste time vibecoding. It's a better search experience overall, but information on software libraries is drifting further out of date.
The hive mind has spoken. I’ve shared a similar anecdote but we’re getting downvoted by people who probably haven’t tried it (at least one person actually admitted they hadn’t tried any of the latest generation of LLMs) or are in an area where it doesn’t help them. And for some reason they are unwilling to entertain the possibility that their experiences don’t apply to everyone and there are some real use cases that benefit.

I use them every day. GitHub Copilot (Claude, ChatGPT, etc.). If you're getting AI slop then you need to change what you're doing. I'm looking like a superstar with fairly typical human-in-the-middle stuff. Python scripts, Java, JavaScript, etc. Just fine. Also great for Terraform and DevOps stuff.
Hell, I threw it at a bunch of COBOL and it did really well.
AI slop is a problem, but it's one addressable by understanding prompting techniques, context compression, and session management.
My neighbours ship a bag of trash off every single week. That's twice as fast!

"we’ve been shipping models almost every week or every other week.”
Since when was "time spent" EVER an indicator of quality or real productivity?!

"the company has seen the model work independently for 24 hours on complex tasks."
Built-in failure rate? Is this some sort of indication, in your mind, that humans write less error-prone code than the most recent AI tools? I mean, I get it when the AI output is supposed to be the final output. You wouldn't want that "built-in" failure rate writing legal briefs, making final medical diagnoses, navigating our highways. But code ALWAYS has errors in it when it is written, whether it's by a human or AI agent, and code is NOT the desired output, it's whatever that code is supposed to do that is the desired output. So, as long as you have a way to find errors, again human or AI created, as long as you have tests, code reviews, strict compiler checks...basically all of the infrastructure in place to find HUMAN generated errors, why would a 'built-in' error rate for AI code generation matter in the slightest?

IMO a lot of Ars commenters are very principled people, like using Linux because it's open source. So they see companies like OpenAI as criminals that stole art, ideas, comments, etc from the entire internet. Plus they're usually tech literate and educated, so the built-in failure rate (no matter how small it may get) means the thing is a total write-off.
I wonder if anyone's looked at how well AI-supported coding does, especially in complex environments where high quality output is important? Oh look, there's an article about that on this cool site I sometimes visit.
That story was published last July. Condé Nast has since announced its partnership with OpenAI in August. That might explain some things.

I hate that we can do this on seemingly every AI story Ars has published recently. What has happened? Where have the science and the skepticism (original meaning) gone?
If I want to read a regurgitated press release I can do that anywhere. It's not what I come to ars for.
Also-- this probably convinces some people not to use LLMs in their jobs. Other people probably do use LLMs, secretly, and simply get away with it.

And some things where it feels like cheating or otherwise improper to use an LLM.
Because over 60 years of study of quality management, the development of which drove the Japanese economic miracle and has guided large-scale industry ever since, says that systems based on adding and checking error are never as reliable as systems that identify and remove sources of error.

Built in failure rate? Is this some sort of indication, in your mind, that humans write less error prone code than the most recent AI tools? I mean, I get it when the AI output is supposed to be the final output. You wouldn't want that "built-in" failure rate writing legal briefs, making final medical diagnoses, navigating our highways. But code ALWAYS has errors in it when it is written, whether it's by a human or AI agent, and code is NOT the desired output, it's whatever that code is supposed to do that is the desired output. So, as long as you have a way to find errors, again human or AI created, as long as you have tests, code reviews, strict compiler checks...basically all of the infrastructure in place to find HUMAN generated errors, why would a 'built-in' error rate for AI code generation matter in the slightest?
Maybe that should make you re-evaluate your position. But no, it's everyone else who is wrong.

The fact that comments like this get downvoted so aggressively makes me fear for the audience in this forum. Somehow Ars readers, whom I used to consider so astute and technical that they’d add significant content to most articles, have become devout AI deniers?
The thing is though, it’s somewhat unique to Ars. Devs at my work used to sound like this last year but that’s changed a lot in the past few months. A few other commenters here have said similar things. I still read the comments here, but now it’s mostly as a check on unbridled hype.

Maybe that should make you re-evaluate your position. But no, it's everyone else who is wrong.
Speaking as a tech enthusiast...

The critical question is "how do you differentiate an early disruptive technology from an early dead-end technology?". Coming swiftly on the heels of metaverse and blockchain hype, it is a particularly critical question.
Developers can analyse the code and complain, with outsiders considering them educated enough on the issue to listen to. Replace them and no one can answer questions about how the software works.

It's interesting that despite there being countless different jobs that require composing text, the one job that appears to have most widely adopted LLMs to increase productivity, is developers. Can't be a coincidence. Possible explanations:
- Developers are the knowledge workers that are the most adaptable to new ways of doing things.
- Developers are the most inefficient knowledge workers.
- Something about development makes LLMs particularly useful and effective.
- LLMs don't actually provide enough advantages to justify the incredibly wide adoption they have seen and developers are particularly vulnerable to believing that a given technology improves output when it doesn't actually do so.
- Developers aren't using LLMs as widely as it seems and instead they are just the loudest about it.
A lot of the AI hype I see could be almost directly copied from metaverse/blockchain/NFT equivalents, with just the relevant keywords changed. Often it comes from the exact same people.

Speaking as a tech enthusiast...
I don't know of anyone who took the metaverse seriously.
The blockchain-- well, Bitcoin-- had a lot of early followers, but they were more like "lol, one day we won't pay taxes", not "we are going to change the world with this technology". Blockchain was always a neat idea, but nobody put forward a serious case for it.
AI has been an open ended question, since the days of Turing, who basically tossed out the notion that Moore's Law would lead us to AGI. For decades we've been asking if and how and when AI will hit a wall instead of forever climbing. And the answer has basically always been "maybe"? Unlike the other two examples.
However there’s quite a large gap between “I plagiarised seven lines of boilerplate scripting code when I couldn’t be bothered to RTFM” and “our IDE is so smart it writes itself! Please send money or we might accidentally make Roko’s Basilisk”.

One recent example: I wanted to code some JavaScript to fill out a form on a website. On many websites, it's one line of code, like document.getElementById('formField').value='FirstName'
But now most websites use fancy frameworks, so if you do that, the field is updated visually, but doesn't trigger an interaction that the website uses. A year ago, I spent hours trying to get it to work on my own, even searching sites like Stack Overflow, but was unsuccessful. Last week I decided to revisit the problem and asked ChatGPT how to do it, and it gave me 7 lines of code that worked the first time I tried it. I then asked ChatGPT for some code so I could click a button in Excel and have that data sent through an AutoHotKey script that would send the javascript code to Chrome to fill out the form. One part of the code didn't work, but I was able to replace it with some code from a different project.
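The commenter doesn't show the 7 lines ChatGPT produced, but the usual fix for framework-controlled inputs looks something like the sketch below. This is a hedged reconstruction of the well-known technique, not the commenter's actual code: frameworks like React wrap an input's `value` property, so assigning to it directly updates the DOM without notifying the framework; calling the prototype's native setter and then dispatching an `input` event makes the change visible to the framework's listeners.

```javascript
// Set a form field's value so framework change-tracking notices the update.
function setNativeValue(input, value) {
  // Grab the native value setter from the element's prototype, bypassing
  // any instance-level wrapper the framework may have installed.
  const setter = Object.getOwnPropertyDescriptor(
    Object.getPrototypeOf(input), 'value').set;
  setter.call(input, value);
  // Fire a bubbling 'input' event so the framework sees the change.
  input.dispatchEvent(new Event('input', { bubbles: true }));
}

// In a browser console (the element id is hypothetical):
// setNativeValue(document.getElementById('formField'), 'FirstName');
```

Exactly which event to dispatch (`input` vs `change`) depends on the framework, which is why this is painful to discover by trial and error.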
[....]
This last year I switched jobs and am working in an semi-unfamiliar framework (Blazor on .NET). [...]
- I think you have to be an experienced dev to really use it effectively because you know what it should look like and your BS detector can shut down the sycophantic aspects of the LLM. [...]
Well, you shouldn't listen to those people. You should ignore them.

a lot of the ai hype I see could be almost directly copied from metaverse/blockchain/nft... often it comes from the exact same people.
Thanks for the feedback. I have updated the piece to specifically mention the METR study.
We plan to compare the performance of these agentic coding tools (Codex, Claude Code, Gemini CLI, maybe Mistral Vibe) in a future piece very soon, so stay tuned.
Yes! I discovered that "large" is pretty small when I tried to refactor some code and had to revert from GitHub. Grumble.

What I would personally NEVER use them for? Large refactor tasks -- their context lengths just aren't big enough to handle too much content at once and they WILL get confused, hallucinate, or worse, sometimes commit outright fraud (and then lie to you about it). Anything related to science or research -- if you're working on something new, by definition, they're probabilistic so they won't be able to help you with creativity. Also just avoid ambiguous prompts in general; as much as people want to, you can't outsource the actual thinking itself.
It was provided multiple times in this thread.

Citation needed.
This says so much about your management style, and very little about the capability of your employees or AI.

But even though I'm the CEO of my company, I don't force a single employee to use AI because they're absolutely crap at it.
I have noticed in trial runs we have had that it actually hurts their performance. I'm still holding educational webinars and paying for llm tools but no one is required to use them.
Oh the irony. No, there've been plenty of people in this thread talking about their use of AI, and how it's comically bad at so many of the things y'all claim it's good at.

And for some reason are unwilling to entertain the possibility that their experiences don’t apply to everyone and there are some real use cases that benefit.
If it can spend more time on it and still get it wrong, it's literally less efficient.

Increasing the time that an LLM can work without losing the context is a very big deal. It means that the models can now be used for solving much more complex problems than before.
Because it's not. And it's very telling that you have no actual response to the many detailed explanations. Y'all are high on your own supply.

It is genuinely hilarious to see the "wall" crowd in this thread trying to hand-wave away the most significant event in the history of software engineering.
"Admitted" is a very telling bit of word choice in terms of believing your own hype. OpenAI wants you to believe that for the very reason that you choose to: it makes you believe more in the capability of the system. Even though, as pointed out, again, in the comments, it's utter bullshit.

OpenAI just admitted that Codex is building Codex. We are officially in the recursive loop.
The one that even the author of the piece, who is transparently on your side, admitted was a good point?

To the guy quoting the METR study from July:
My guy, a single dev can spin up an app for Android in an afternoon. It's really not hard to build an app today. That they wrapped a slick GUI around it is not meaningfully impressive across 18 days.

The fact that four engineers shipped Sora for Android in 18 days from scratch should be a siren for anyone still arguing that this is just "fancy IntelliSense."
But it can't.

No, dude, if AI can write unit tests, it can self-correct errors. I've seen it happen.
Because it does. Demonstrably. As was referenced multiple times in the thread.

You are still operating under the assumption that AI code has MORE errors than human engineers' code, and/or that these are qualitatively different errors than human-generated errors.
This is a particularly bad use of AI. LLMs are absolutely terrible at anything visual. Hopefully you learned your lesson there. Gemini 3 was the first step towards an AI that can "see", but it's still a ways from being usable for technical diagrams. Claude is still blind to the point of ridiculousness.

Yeah, ok. But try to give it a task that isn't asking it to write the same code everyone else has already written. If you can find an example of the code you're asking for online, it doesn't count; you're asking the AI to regurgitate some (likely broken) form of it.
Claude recently offered to show me how smart it is by designing the architecture of a project of my choosing. When I asked it to come up with a Zephyr based IoT device that captured GPS location and pushed it to a web service, it happily drew an SVG diagram that was nothing short of hilarious. It had queues, ring buffers, and all kinds of random stuff. It was fully buzzword compliant, but nothing in the diagram made any sense because nothing was connected in any meaningful way. It was like a failing undergraduate student's project submission after they pulled an all-nighter.
About the only things I have found LLMs useful for, other than copying other people's work without knowing who the original authors were, is to summarize other information such as search results, and to pad things out with narrative fluff. Of course, the person reading my AI fluff is probably using an LLM to summarize it so they don't have to waste their time reading it.