How do AI coding agents work? We look under the hood.

AlbatrossMoss · Dec 26, 2025

SraCet said:
A lot of people on this thread think it follows that if an LLM can't write perfect production code 100% of the time, that means they're completely useless. But that's a glaring logical fallacy.

Not sure "people in this thread" are entertaining that belief. It's too black-and-white.

I hope we can agree on this, though: LLMs are definitely very easy to abuse because they give the illusion of "someone who knows their shit". And we, humans, are too easy to trust the machine that wrote the really good-seeming text on the screen.

I would also argue that, in quite specific and narrow situations, using an LLMs is very, very counterproductive. Here's a few:

You're a student: this is the worst time in your life to use LLMs, because relying on one will prevent you from honing your own mental skills (I won't tolerate contradiction on this specific point, I've seen it clearly); as a student, you REALLY have to work through all the problems you're facing ("The work is the shortcut" and all that).
You're using it instead of a search engine: not a good use at all, because you're literally expecting to see what other people wrote on the Web (you're searching for info), but instead you're given an out-of-date probabilistic regurgitation of the Web. Searching with LLMs is just weird...
You're using it as if it's a psychotherapist: remember that commercial LLMs like ChatGPT are configured to pursue engagement (the "sycophancy" problem). This is very intentional. There were recent articles on Ars about this. I know we're discussing LLMs in software development here, but this one is really bad and must be mentioned everywhere (I hope you agree).
If you're using an LLM as a "soundboard" to bounce ideas off of, kindly read this comment for a much better alternative: https://meincmagazine.com/civis/threads/.1509894/page-3#post-44037947 and do not limit yourself to LLMs. Your brain is far better.
You're using an LLM to vibe-code a software "prototype": please remember that prototypes are not just one-off pieces of demonstrative throw-away code. A prototype is supposed to help you explore the project, the issues you might encounter and the viability of the software stack for the project. Working on the prototype is also supposed to make you think about scaling the project in whatever dimension, based on the issues you encountered with the prototype (performance, complexity etc). If you vibe-code the prototype, you'll have a demonstrative throw-away piece of code, but none of these insights. Also, an LLM will not tell you the software stack is inappropriate in one way or another; it will just generate code for whatever language/framework/library you've already chosen.

Shit, this comment got long... apologies.

SraCet · Dec 26, 2025

AlbatrossMoss said:
Not sure "people in this thread" are entertaining that belief. It's too black-and-white.

I hope we can agree on this, though: LLMs are definitely very easy to abuse because they give the illusion of "someone who knows their shit". And we, humans, are too easy to trust the machine that wrote the really good-seeming text on the screen.

I don't know what can be done about this. Many people overestimate the abilities of AIs. The abilities are also oversold to us by the companies developing said AIs, and the companies investing in said AIs.

As a result, many other people swing too far in the opposite direction and underestimate the usefulness of AIs. This is an equally stupid point of view.

The goal should be to get people to be honest and objective about the abilities and usefulness of AIs. That describes most of the professional developers I know, i.e., people who don't spend all their time arguing with each other on message boards. I don't know how to get everybody else to be honest and objective.

AlbatrossMoss said:
...

You're using it instead of a search engine: not a good use at all, because you're literally expecting to see what other people wrote on the Web (you're searching for info), but instead you're given an out-of-date probabilistic regurgitation of the Web. Searching with LLMs is just weird...

Not really. You ask an LLM to search for something and it can instantly run many searches on many combinations of phrases that you may not have thought of, collate all the results and present them to you in a nice way. This is pretty great.

Actually I was just using an LLM to search for something yesterday and saw that it was doing web searches on terms in foreign languages, which is something I definitely wouldn't have done (or thought to do) myself. Pretty great.

AlbatrossMoss · Dec 26, 2025

SraCet said:
Actually I was just using an LLM to search for something yesterday and saw that it was doing web searches on terms in foreign languages, which is something I definitely wouldn't have done (or thought to do) myself. Pretty great.

Huh... That's not a bad idea at all. Regular search engines should provide this feature, it's very basic and doesn't need the LLM overkill.

Just checked in DuckDuckGo settings and there's nothing about it. Yes, you can set one specific region for results, but not even a specific language.
I remember seeing multilingual results popping up for searches in Google ~20 years (!!) ago, but nothing afterwards (??).

WAIT, just checked Google Search Settings. You can search in multiple languages at once, but you have to manually choose the languages first. Only after setting your preferred languages in the Search Settings you'll see multilingual results. This is very interesting.

AlbatrossMoss · Dec 26, 2025

SraCet said:
Not really. You ask an LLM to search for something and it can instantly run many searches on many combinations of phrases that you may not have thought of, collate all the results and present them to you in a nice way. This is pretty great.

Aren't regular search engines already doing this? They always transform the query you enter and, at the very least, they search for synonyms and usual related phrases as well.

Of course, they can't match the rephrasing functionality of an LLM, but in this case, even basic rephrasing is an improvement, and should already be available. I'd be surprised if it doesn't happen in Google or DDG.

SraCet · Dec 26, 2025

AlbatrossMoss said:
Aren't regular search engines already doing this? They always transform the query you enter and, at the very least, they search for synonyms and usual related phrases as well.

Of course, they can't match the rephrasing functionality of an LLM, but in this case, even basic rephrasing is an improvement, and should already be available. I'd be surprised if it doesn't happen in Google or DDG.

Of course search engines do a lot of processing on whatever you've entered in order to give you good search results.

I'm saying that what LLMs do (or at least what ChatGPT does) goes far beyond this in a way that is often useful. And they will collate the results to give you a summary that's immediately useful and good, rather than you having to click through a bunch of links to figure everything out for yourself. (Not to mention the tedium of ignoring sponsored links and whatever other superfluous nonsense the search engine is trying to sell you.)

It's a lot better. But there's no point in speculating or arguing with me about this. Try it out and find out for yourself.

SraCet · Dec 26, 2025

AlbatrossMoss said:
Huh... That's not a bad idea at all. Regular search engines should provide this feature, it's very basic and doesn't need the LLM overkill.

Just checked in DuckDuckGo settings and there's nothing about it. Yes, you can set one specific region for results, but not even a specific language.
I remember seeing multilingual results popping up for searches in Google ~20 years (!!) ago, but nothing afterwards (??).

WAIT, just checked Google Search Settings. You can search in multiple languages at once, but you have to manually choose the languages first. Only after setting your preferred languages in the Search Settings you'll see multilingual results. This is very interesting.

I'm not sure we're talking about the same thing here.

Translating search terms to a different language and searching for that is different from setting a search region or allowing multilingual results to be displayed.

cerberusTI · Dec 26, 2025

In terms of search, if it is the kind of thing I ideally want to describe in a paragraph or so rather than a few words, it goes to AI to do the search and is not a direct one.

For news and similar I can give it how much (or little) I currently know, and want to know, about whatever topic it is. It also links its sources, so I can go read those if I want, and often I do. I use it for both general summaries, and to get more information than a basic news article I have already read.

Finding papers which cover something specific is also a good use. There are even a few cases of it telling me to read something in full where I had dismissed it after reading the abstract, and being correct that what I wanted was in there.

I also gave the general instruction long ago to clearly separate things it can back up directly, from speculation. It mostly gets that right these days as well, and flags when it is guessing or putting together information from more than one source on its own.

For programming it is not something I would want to do without these days though. The moment it could take an excel sheet or an email and produce a JSON object of the contents with parameters as in an example it was useful, but these days it can do far more than that. My favorite use is having it sort out extremely messy and verbose code grown by a number of people over a lot of time, without a lot of ownership or cleanup, and rewrite it into something I am willing to read.

CommunityDoc · Dec 26, 2025

I think my post was misunderstood as vibe coded system. Its not one shoting a feature, but rather a structured approach. Just to clarify its not an EMR connected system in any way. And The whole point of the system is to ensure anonymisation of images can be done prior to attrbute grading using common ophthalmology techniques to study fundus images. A whole lot of security features are bullt in. And a blackbox pentest is planned.

CommunityDoc said:
I am a doctor with lots of hobbyist enthusiasm. My programming was typically done in Stata for data analysis. Additionally I used to study code written by others to understand how it was working for our research studies. However I understood the basics of web development, concepts of databases , ORMs, routing, proxying, setting up Wordpress blogs etc etc. Lots of good ideas, no support to implement them.
Now with the help of LLMs and coding agents, I feel so empowered. I have developed a fairly complex system of managing eye/ fundus images, done a custom fork of OpenDataKit Central and collect apps that implements short term logins, guided development of medical Research Publication repository, a school eye survey platform, set up a mail server and so on. Personally subscribing to three frontier models and have GLM plan as well. Used Codex, Qwen, Gemini, KiloCode, Opencode and recently AntiGravity. Last 6 months has seen all of this action.
So programmers may be taking more time but semi programmers like me are able to massively enhance productivity. Frankly a bad programmer would have written really bad code earlier will now actually be writing better code.
I continue to be amazed by the strides being taken in this space every week. FYI, my personal blog is at epidemiology dot tech and github handle is drguptavivek. Check fundus_image_xtract repo and Collect and Central repos in GitHub for some of the work done using the LLMs by a medical faculty.
Cheers
Vivek

hisnyc · Dec 26, 2025

VoterFrog said:
Now, the agents are leaps and bounds better and I can let them make changes, write tests, build and run the tests, and self-correct without intervention. I just come in afterwards and approve or reject or suggest improvements to the changes. It's easily an order of magnitude improvement in the AI's velocity. You need a new study.

I find this comment intriguing and a bit disturbing. I'm beginning to incorporate AI tools into my workflow. Finding where they help, where they are useless, etc.

Your comment implies that the tools may be getting better so fast that the decisions I've made recently have to be revisited and reevaluated way more frequently than ever before. I can't think of a precedent in my career where tools changed so fast that you might want to retool on a six month cycle.

Probably unnecessary, but I will pull out an example of how radical this seems to me. We might evaluate an IDE or a server configuration every few years. You would put some time into it, make a choice, and know it was 'good enough' for another few years---getting back to work. If your underlying tools are shifting faster...

richardbartonbrown · Dec 26, 2025

Great overview article -- just the right amount of tech details and a lot of process description, which is very important as these new tools are rolled out. ArsTech should do one of these every 6-12 months.

I really enjoyed the "here's how we're using coding agents" comments, and it was great to see more comments of the type "here's how to get good stuff while avoiding pitfalls". These are real-world experiences from a wide range of people...it gives a good feel for the state of the technology.

And it's always interesting to watch the haters versus true-believers in the comments. The hater numbers seem to be declining. Much of the hate was reaction to the ridiculous hype from Sam Altman et al, and as that noise has declined with the news cycle and bubble worries, the hater drumbeat has softened. It's still hype-ridden technology but millions of users are gradually sussing out the reality of its usefulness.

RoryEjinn · Dec 26, 2025

I think, regardless of your position on AI, there a few things people need to come to terms with that are showing up in this back and forth.

1. Even if those LLMs can make "working" code, it's not really "working" code because LLMs still don't understand access control and security concerns. Github recently highlighted this in their State of the Octoverse report where Access Control issues went up 173%.

2. Microsoft, Google, OpenAI, and Anthropic all say the best way to work with LLMs in regards to software engineering is to be able to describe what you want it to make, what technologies you want it to use, to have a back and forth conversation with it, and to then review it yourself. None of them suggest you simply tell it to make an app and take what it spits out wholesale though.

3. Even if you think the LLMs are good to have (I personally use them myself), there's definitely a conversation to be had about the feasibility of building infrastructure that has a major impact on the environment and our power grid at the detriment of everyone else.

My personal opinion is that LLMs are great at autocorrection, code suggestion, and working through problems. They are also great at exposing you to viewpoints you might not have since they consume so much media. When you give them an entire code base to pour through, they learn your style, pick up on common themes, and can provide much better completion suggestions than the traditional version.

That being said, I'd never take actual functions that AI wrote and implement them into my code. I'd take a suggestion and then review/rewrite it, but never copy and paste code. While I like them for autocompletion, I'm not sure that is worth destroying the environment, upending the consumer hardware market, and absolutely eviscerating entire jobs.

Bluck Mutter · Dec 26, 2025

SraCet said:
LOL, why in the world would anybody hire you if you're giving them that response.

LLMs are invaluable e.g.

As a replacement for documentation of programming languages, APIs

For anything you used to look up on Stack Overflow

To give you ideas for what might be causing (and how to fix) any particular bug or compiler/linker error

To help explain or give a quick overview of some unfamiliar code

To translate some code between programming languages (e.g. if you find a function for something you want to accomplish but it's written in a language that you're not very familiar with)

Frankly, if you're a developer who's not using LLMs for anything these days, I question whether or not you can be considered a professional developer. Or even a good one.

If I interviewed a developer who said that he refused to read any documentation because he already knew what he's doing, I wouldn't hire that person either.

I am retired (5 years) and despite programming everyday in my retirement, I won't be using AI for this... no interest as I am an old skool grey beard.

Maybe my situation is different but I spent the last 25 years of my career working on my own commercial software (used in mission critical deployments in large govt/corp orgs). While the application had a web based front end (in plain old HTML 4), the critical code was in 'C' for Unix and Linux platforms.

I would have program modules with 10,000+ lines of code.

Whenever I read of dev's talking about using AI, it tends to reference creating a small function or algorithm (i.e. short snippets of code).

So my question is: How would you prompt/use AI to generate 10,000 lines of mission critical code (total code base might be > 2 million lines of code), how do you get it to understand what code needs to be written when the product's use case is unique or not well understood (i.e. example code from it's training doesn't exit)?

If you generate 1000 lines of new code in a new module , do you need to retrain it on these lines so that it understands how to write the next 1000's in the same module?

Can you get it to test iteratively as you layer in new logic into the currently incomplete module (this was standard practice for me... test each new part as I went which might include creating tables and data as part of the sub-code test)?

How do I ensure code consistency (say in structure of variable names, favoured 'C' constructs etc)?

Can you use it to create the test plans, create test databases and data inside the databases to test the product end to end such that all edge cases are covered, to design the code and test the code for robustness, scalability and recoverability?

In the case of robustness, how do I prompt it to wrote code that validates the results of every action performed (especially database result sets)?. Being mission critical means a higher level of standard than say a shopping cart or a social media message.

Again, I am old skool so maybe in today's world, programs are devolved into 1000's of small snippets that are glued together into sequence and maybe AI works for this but I grew up in a "top down" world (but obviously my code was modular...it wasn't a single big arsed slab of code)

Thanks,

Bluck

Ozy · Dec 26, 2025

Bluck Mutter said:
I am retired (5 years) and despite programming everyday in my retirement, I won't be using AI for this... no interest as I am an old skool grey beard.

Maybe my situation is different but I spent the last 25 years of my career working on my own commercial software (used in mission critical deployments in large govt/corp orgs). While the application had a web based front end (in plain old HTML 4), the critical code was in 'C' for Unix and Linux platforms.

I would have program modules with 10,000+ lines of code.

Whenever I read of dev's talking about using AI, it tends to reference creating a small function or algorithm (i.e. short snippets of code).

So my question is: How would you prompt/use AI to generate 10,000 lines of mission critical code (total code base might be > 2 million lines of code), how do you get it to understand what code needs to be written when the product's use case is unique or not well understood (i.e. example code from it's training doesn't exit)?

If you generate 1000 lines of new code in a new module , do you need to retrain it on these lines so that it understands how to write the next 1000's in the same module?

Can you get it to test iteratively as you layer in new logic into the currently incomplete module (this was standard practice for me... test each new part as I went which might include creating tables and data as part of the sub-code test)?

How do I ensure code consistency (say in structure of variable names, favoured 'C' constructs etc)?

Can you use it to create the test plans, create test databases and data inside the databases to test the product end to end such that all edge cases are covered, to design the code and test the code for robustness, scalability and recoverability?

In the case of robustness, how do I prompt it to wrote code that validates the results of every action performed (especially database result sets)?. Being mission critical means a higher level of standard than say a shopping cart or a social media message.

Again, I am old skool so maybe in today's world, programs are devolved into 1000's of small snippets that are glued together into sequence and maybe AI works for this but I grew up in a "top down" world (but obviously my code was modular...it wasn't a single big arsed slab of code)

Thanks,

Bluck

There are a few ways to tackle this. Some agent-focused IDEs will have them essentally 'train' on your codebase to extract styles and patterns that they continue to emulate when generating new code.

Personally, I rely on style and architecture plans and guides. You can have agents refer to these documents before they generate any new code. Any code they write, stays in their context window for the session, so the next 1000 lines follows the logical flow and patterns of the first. The trick is what to do when it runs out of context windows, the choices being compaction->summarization->continue, or what I do is have it write handoff documents, with the critical concepts and filenames included for the next session to evaluate before it continues. This works quite well.

You can ask it to write tests just like any other code, and it often inserts debug statements, extracting intermediate values, etc... to do troubleshooting if it's having trouble identifying errors.

It reads code before it generates new code, so it does a pretty decent, though not perfect job of maintaining variable consistency. I've had issues when codes reach the multi-10k lines of code that you have to be pretty careful about what you feed it.

I've had it do test code, test databases, all of that. Those sorts of access patterns are not unusual, so it does a good job of all the usual stuff.

If you want to focus on testing every unit operation, just write that into the plan, it will do it. Even old-school code was 1000's of small snippets, we just called them functions instead of modules. I try to keep lines/file <500 in the belief that it's easier for the agents to ingest without overwhelming their context window. Everything they 'read' that doesn't contribute to the solution (irrelevant functions/lines of code) just clutters their context window. For reference, the largest project I've done is ~30k of backend code, ~100 files, ~300 lines/file on average with ~10k of frontend code ~500 lines/file.

By default, it won't necessarily focus on 'secure' code, but if you ask it to, it can do it. I've had it spin up both internal auth solutions as well as interacting with oauth2 clients. You can ask it to do a security audit for the code it writes, and it should point out the insecure portions of the code. If you do write a plan to do things like test, security, architecture patterns, etc...some agents are better at 'sticking to the plan' than others. The agent I use, Claude, tends to jump the rails on a semi-regular basis, so I have to slap it around sometimes. I just find it's use of tooling (command line scripts, bash commands, python script snippets) to accomplish the goals superior to its competitors.

hisnyc · Dec 26, 2025

Bluck Mutter said:
I am retired (5 years) and despite programming everyday in my retirement, I won't be using AI for this... no interest as I am an old skool grey beard.

I'm probably close in age, although I haven't worked in C for some time.

I presume, when you say modules, you mean groups of subroutines with exported functions available to use at the library level.

If you wanted to complete a project faster and move quickly through the parts you have less interest in, I'd figure out how to create some structure for the LLM and ask it to do smaller, well defined tasks. Review what it does, and move on.

If I were in your position, I like thinking about general structure and also enjoy the 'fiddly bits' of how things get implemented. I do not like dealing with test rigs or documentation. I would provide some guidance about what I wanted for both of those, but ask the LLM to fill in those things for me. A number of IDEs are also increasingly good at injecting boilerplate 'almost like I would write it'.

I mentioned earlier leaning on the LLM for code reviews. Essentially, a much better lint. I could see, if I was doing a pet project, leaning on that to try to find any stitches I might have missed. I'm not saying syntax (which the compiler will get), but actual issues. Recently, I ran a check on my code and it told me that it found what it thought was a cut and paste error. Sure enough, one of the switch statements I had yy'ed in VI and p, but not changed the text (probably distracted by something). It was a bug I would have eventually caught, but asking for a review got it sooner.

RoryEjinn · Dec 26, 2025

Ozy said:
By default, it won't necessarily focus on 'secure' code, but if you ask it to, it can do it.

You can ask it to do lots of things. That doesn't mean it can do them. Report after report has shown LLMs can't generate truly secure code. From github's octoverse report showing a 173% increase in access control issues to Coderabbit's own whitepaper showing that LLMs make 1.7x more mistakes across all aspects of the software engineering process - including security. This makes sense overall when you consider that LLMs don't actually understand anything about what they're writing, they're just making a guess based on weighting.

That doesn't make LLMs bad tools. It just makes them tools. Understanding how those tools work is part of using them correctly.

Ozy · Dec 26, 2025

RoryEjinn said:
You can ask it to do lots of things. That doesn't mean it can do them. Report after report has shown LLMs can't generate truly secure code. From github's octoverse report showing a 173% increase in access control issues to Coderabbit's own whitepaper showing that LLMs make 1.7x more mistakes across all aspects of the software engineering process - including security. This makes sense overall when you consider that LLMs don't actually understand anything about what they're writing, they're just making a guess based on weighting.

That doesn't make LLMs bad tools. It just makes them tools. Understanding how those tools work is part of using them correctly.

What's the most recent report you're referring to?

picklefactory · Dec 26, 2025

SraCet said:
Okay, I'm willing to learn. In what way have I misread your post?

Having read your posts and witnessed your ridiculous behavior in this and similar threads: no, you are not willing to learn, and I have so little respect for you that I am not going to waste any time explaining. Read it again real slow, I guess.

SraCet · Dec 26, 2025

picklefactory said:
Having read your posts and witnessed your ridiculous behavior in this and similar threads: no, you are not willing to learn, and I have so little respect for you that I am not going to waste any time explaining. Read it again real slow, I guess.

I dunno, man. I've read the post like 5 times.

You wrote "It will be your job, developers." and yet you're claiming that you somehow weren't warning of lost jobs.

Maybe somebody else on the thread can chime in and explain how I'm misunderstanding your post, but I think it's far more likely that everybody else thinks you're confused and wrong too.

pseudobscura · Dec 27, 2025

GaidinBDJ said:
Same thing with code. You just have to realize you're not getting code, you're getting something code-shaped.

Of course you're getting actual code. It may not always be bug-free if you're trying to one-shot something complex. But, if you're refactoring or adding to a mature project, and if you provide clear instructions in your prompt (just as you would to a human programmer) a decent model will almost always produce excellent code.

GaidinBDJ said:
As for me, I probably get the most use out of LLMs by dropping Excel formulas into CoPilot.

I'm not sure anyone should be talking authoritatively about either code or LLMs, when their tools of choice are Excel and Copilot ...

Trondal · Dec 27, 2025

holmes6 said:
I have been doing a bunch of sandbox experimenting with Claude code and it's incredibly capable. It's also very likely to have badly flawed and bugged result as the prompts or codebase get larger and/or less organized. I think the concept of this being powerful has been proven but implementation is kind of a can of worms still.

I ran over the below linked article recently, which sounds like a very solid approach to maximizing productivity from Claude Code.

The claims re: output and productivity sound impressive and legitimate to me.

The caveat is that I’m not a CS pro and my coding skills are largely limited to R (and my projects rarely go past 1,000 LOC).

In fact I’d be interested to hear reactions to this article from anyone that has read it and codes for a living.

https://www.cometapi.com/how-to-use-claude-opus-4-5-via-cursor-and-claude-code/

pokrface · Dec 27, 2025

RoryEjinn said:
1. Even if those LLMs can make "working" code, it's not really "working" code because LLMs still don't understand access control and security concerns. Github recently highlighted this in their State of the Octoverse report where Access Control issues went up 173%.

RoryEjinn said:
You can ask it to do lots of things. That doesn't mean it can do them. Report after report has shown LLMs can't generate truly secure code.

I mean, you can't just say "LLMs don't understand access control and security concerns" and "LLMs can't generate truly secure code" and have what you're saying be axiomatically correct, without first narrowing your definitions. (Let's ignore the semantic minefield of using the word "understand" in the context of anything an LLM does, because LLMs don't and can't "understand" anything.)

With human review, with appropriately scoped direction, and without trying to do more than can fit within the context window, the current crop of agentic coding LLMs (Claude Code, ChatGPT Codex, etc) are generally very aware of whatever constraints you're smart enough to give them, including and especially security constraints. They're quite capable of making sure your python binary read/write function is safe and doesn't contain a buffer overflow or whatever. They're also generally quite good at constructing and running test suites, with some oversight and guidance and checking. They are—at least anecdotally and IME—also quite smart about untangling complex piles of nested permissions, evaluating big overlapping sets of access control rules, and following authentication flows between layers and pointing out issues. (These are things that I understand even if the LLM doesn't, and I can and do check its work. It has been—again, IME only—generally excellent.)

Maybe by LLMs not being able to make "working" code, you meant that nothing an LLM puts out should be treated as production-ready (or even runnable at all) without first putting human eyes on it. Or that vibe-coding an entire application all at once without a proper spec and without taking stock of the security side of things is insane. If either of those is in fact the case, we're totally in agreement! But a big huge broad contention that "code LLMs make isn't 'working' because LLMs don't understand access controls and security concerns" simply doesn't appear to be supported by reality.

edited to add - Wanted to clarify that I'm not at all doubting the github data, and I absolutely believe their State of the Octoverse report. But at the risk of sounding like I'm making a steve jobs-esque "You're holding it wrong!" accusation, one must not blame the bow for the missed shot. One must blame the archer. (Especially if the archer is standing next to the bow, trying to get it to shoot itself with only vague shouted instructions.)

Trondal · Dec 27, 2025

flunk said:
I'm currently leaning towards telling everyone I know not to use LLMs for anything you're not personally an expert in, because otherwise how do you judge the validily of the generated output?

That’s a logical extreme. But I’d argue that it’s needlessly extreme when applied universally.

Practical example: I am not an expert at writing shell scripts of any sort. But I have enough working knowledge to tell if that PowerShell snippet is going to delete the targeted files instead of finding them and moving them.

Low stakes questions are fine too; if I ask it who Sidney Sweeney is dating and it’s wrong, then who cares.

But in other areas, where I have little or no knowledge and the stakes are high, I would never start with an LLM. I’d either pay an expert or at least find a soup to nuts tutorial from a credible human.

Astra Architect · Dec 27, 2025

Meh. I'm old enough that this thread reads a lot like the teeth gnashing of Unix/Linux greybeards as GUIs became the default. I was initially schooled by a number of the old "if you can't/won't/don't do it in command line, then you should never be allowed admin privileges" types. Yes, even today the CLI is better for some things if you know what you're doing. But there's plenty that's doable with GUI and in ways more intuitive for most people unless you trained from day one on CLI. So far as I could tell, a lot of the greybeard disdain for GUI was simply rooted in hatred of Microsoft/Bill Gates and him taking something they believed should be free and charging money for it.

So it will be with LLMs. LLMs right now are maybe the conceptual equivalent of Windows 3.1 - a GUI running on top of DOS rather than a standalone OS.

The 'trick' to coding more than individual functions with LLMs right now is you can't just wing it. You can't just dive in and start coding. You absolutely have to think through your architecture, invariants, etc., first - at least good enough to do a first dev version. You have to go Intent --> Mission Engineering/Analysis --> Stakeholder Needs & Requirements --> Logical Architecture --> System Requirements --> Test Plans. Then you can start coding with a LLM. And those system requirements have to be well written...which, historically, is a massive weakness of many people. You can iterate afterwards, but you have to start from a strong foundation.

If a requirement isn't written in a way that it can be unambiguously verified yes/no with no room for misinterpretation, using one or more verification methods of inspection/demonstration/analysis/test/similarity/sampling, and each requirement covers one and only one operational, functional, or design characteristic/constraint, if the requirement isn't consistent with others, feasible, traceable, and unique...then it's a poor requirement. Generally speaking, most people aren't trained in writing good requirements. Requirements management is one of our bread and butter elements in systems engineering, along with verification and validation, but most people haven't needed to internalize it like we do.

Context limitations are significant given the stateless nature of current LLMs. But as others have noted, careful management of handoffs between threads, using a System_Intent.md and other artifacts to provide the constraints/guiderails within threads, careful use of versioning, and other processes will greatly lessen the pitfalls.

However, integrating all of that into DevSecOps will take a while longer - like I said, I view current state as somewhat analogous to Win 3.1. LLMs seem to be optimized for being pretty, flashy, and looking useful rather than being designed from the ground up to say 'no' and prevent mistakes/errors. To be useful as infrastructure we need AIs that are stateful instead of stateless and can take user inputs and say "I can't do that - it violates X, Y, and Z" when you're trying to do something that can have genuine safety and/or security concerns. More like the industrial systems that we entrust to fly aircraft, manage power grids, etc., although that analogy has its flaws, too.

Astra Architect · Dec 27, 2025

pseudobscura said:
Of course you're getting actual code. It may not always be bug-free if you're trying to one-shot something complex. But, if you're refactoring or adding to a mature project, and if you provide clear instructions in your prompt (just as you would to a human programmer) a decent model will almost always produce excellent code.

I'm not sure anyone should be talking authoritatively about either code or LLMs, when their tools of choice are Excel and Copilot ...

Hey...I don't always do rocket science, but when I do, I do it in Excel. Or, sometimes, Matlab.

Trondal · Dec 27, 2025

Pino90 said:
we use Claude with API access,

Pino90 said:
Of course, it required A LOT of training on how to extract value from the tool.

I’m curious as to whether you’ve read the below and if it roughly aligns with what it looks like to maximize productivity from the tool:

https://www.cometapi.com/how-to-use-claude-opus-4-5-via-cursor-and-claude-code/

itsaunixsystem · Dec 27, 2025

The worst thing about AI is people who don't know what they are doing, think they're they are getting marvelous results. You are better off just learning how to code yourself, before you attempt to use these tools. The same is probably true for getting it to write fiction. If you don't know the basics of writing a story, you probably think you're getting great results too.

Trondal · Dec 27, 2025

UnicornsRule said:
No, they're not, and it's just as easy to simply go look at the docs for the language. That's actually better since you don't know when that data was last scraped and you might learn something new if the documentation has been updated.

I know how to code so I look up stuff on Stack Overflow maybe 1-2 times a day and those lookups take 3-5 min. There is no time saving by me asking the AI.

I know how to code so I know how to fix things and I know how to deal with errors.

Again I know how to code so I can simply read it and be done.

I know how to code and it all works basically the same. I also started in C/Java and since everything is based on C/C++ I've never had an issue porting things. The only language that reads different is assembly, otherwise it's all for/if/else, variables, and maybe classes if the language is OO. As I've been saying for many, many years for the most part it all fucks the same.

I never said that I don't read documentation. I have said that I do not want to become a copy paste bot that will eventually be replace. If you are letting the machine do your work and you're simply babysitting you are on the replacement chain, it's just a matter of time.

If you can look at your situation objectively, the response you’re quoting from seems like a good list of answers to draw from when interviewing.

I’d wager there are still jobs where you’re not just baby sitting Claude, and where you can do your job in whatever way works. None of us have guarantees for what the future brings, but such is life, unfortunately.

Now an employer might monitor LLM use to “prove” you’re maximizing your productivity, in which case you job now involves dealing with an additional layer of activity that you find unnecessary. But in my experience that’s par for the course; it’s just a matter of degree.

As you seem to allude to elsewhere, maybe you’d be happier somewhere other than tech, but I’d also say that getting any job is easier when you already have one, so getting another tech job for the time being is probably your best platform from which to make a career change.

Of course all of the above is just the opinion of an internet rando, but I’m genuinely trying to be helpful. Take it or leave it as you see fit, and best of luck.

CommunityDoc · Dec 27, 2025

AdamWill said:
So I guess the repo is https://github.com/drguptavivek/fundus_img_xtract ?

My immediate thought on that is "Good lord, I hope you ran it through the correct review processes". It appears to process and store patient information; I don't know where you are, but pretty much anywhere in the world, there's a ton of regulations about that, and I'm not at all sure you can rely on LLM-generated code meeting them. It also boasts of "a sophisticated hybrid access control model combining both Role-Based and Attribute-Based Access Control", which is the kind of thing you really need a professional to review for security.

Thanks for taking time out and visiting the
Repo. I absolutely agree that securing PHI/PII is critical. I have actually a few colleagues in security domain who are examining this aspect and doing an audit.

RoryEjinn · Dec 27, 2025

pokrface said:
Maybe by LLMs not being able to make "working" code, you meant that nothing an LLM puts out should be treated as production-ready (or even runnable at all) without first putting human eyes on it. Or that vibe-coding an entire application all at once without a proper spec and without taking stock of the security side of things is insane. If either of those is in fact the case, we're totally in agreement!

That is the overall point I'm making. I use LLMs myself when I code, but not without oversight. I think they have a particular value when it comes to large documentation sets/codebases and sifting through them. I just have other reasons (Mostly Environmental/Power Grid/Privacy Related) for not necessarily wanting them to exist/keep expanding.

pokrface said:
edited to add - Wanted to clarify that I'm not at all doubting the github data, and I absolutely believe their State of the Octoverse report. But at the risk of sounding like I'm making a steve jobs-esque "You're holding it wrong!" accusation, one must not blame the bow for the missed shot. One must blame the archer. (Especially if the archer is standing next to the bow, trying to get it to shoot itself with only vague shouted instructions.)

I think 100% that the github report only truly shows how blindly people trust the output. I also think that it's a reminder that the default output for an LLM is not necessarily secure or feature complete.

danan · Dec 27, 2025

VoterFrog said:
I might also recommend that your team elevates its planning documentation game. This was always good engineering practice but it's also extremely useful context for LLMs. At the very least, make sure you have detailed descriptions of the expectations in your stories or task breakdowns.

And for anything that's going to take a couple weeks or more of implementation work, a document describing the planned changes. Start solving some of the high level problems and revealing the unknown unknowns earlier in the process. Like I said, this is valuable - LLM or no - but LLMs are particularly clueless of these sorts of challenges so writing these things down well really help them.

Context, as described in the article, is a precious resource for LLMs but good context is extremely valuable to them. For me, it's often the difference between an agent's output being a useless waste of time and being functionally what I wanted.

So the same issue we’ve always had with computers: they will do exactly what you tell them to do. But you have to convert your thoughts into their language. In the AI case, it’s a higher level language that looks like natural language; the output is code, not machine code/assembly/intermediate tokens; and there’s a randomness element thrown in to make the process more difficult to predict.

danan · Dec 27, 2025

AlbatrossMoss said:
You're using it instead of a search engine: not a good use at all, because you're literally expecting to see what other people wrote on the Web (you're searching for info), but instead you're given an out-of-date probabilistic regurgitation of the Web. Searching with LLMs is just weird...

This is one area I’ve found AIs to be somewhat helpful. At times. Specifically, the AI summary at the head of DuckDuckGo results when searching for how to use an API I’m not familiar with has saved me time over sifting through the list of search result links, when the API looks similar to other, unrelated APIs on different platforms. DDG Search assistant seems to keep the context of which platform/library I’m looking for better than the classic search engine does.

Or maybe I’m just bad at writing search queries.

VoterFrog · Dec 27, 2025

danan said:
So the same issue we’ve always had with computers: they will do exactly what you tell them to do. But you have to convert your thoughts into their language. In the AI case, it’s a higher level language that looks like natural language; the output is code, not machine code/assembly/intermediate tokens; and there’s a randomness element thrown in to make the process more difficult to predict.

This is not a problem limited to computers. As I said, it's just following good planning and documentation practices that were already established but, let's be honest, most folks don't really follow.

But historically it wasn't at all about talking to a computer. It was about aligning the development team and the stakeholders.

danan · Dec 27, 2025

VoterFrog said:
This is not a problem limited to computers. As I said, it's just following good planning and documentation practices that were already established but, let's be honest, most folks don't really follow.

But historically it wasn't at all about talking to a computer. It was about aligning the development team and the stakeholders.

A big reason good planning and documentation processes aren’t followed is that humans are “good enough” at dealing with sparse information, either filling in something reasonable (experienced people), asking for clarification (junior people), or catching when less reasonable things are filled in (processes that expect humans to create errors). AI code generators sound like they mainly plow ahead producing garbage in the face of insufficient info, which means more precise specifications are needed than for humans, which equates to the old problem of doing more precisely what they’re told than humans. I’m not saying humans wouldn’t work better with precise specs, but that gap in the “good enough” aspect is still similar to every other major breakthrough in programming. At least it’s sufficiently similar to remark on.

Side note: I haven’t heard much at all about AI being decent at asking clarifying questions in anything but pretty trivial details. It’s still going to require an experienced human to devise specs when there are anything more than broad, general requirements (at least for a while). I also get the impression it’s harder to get people to review AI generated code as critically as they do junior developers’s code. The mindset of “this person doesn’t know what they’re doing, but will learn if I spend some time giving feedback,” isn’t there.

bugabuga · Dec 27, 2025

One of the common examples of schizophrenic nature of AI code agents is Cursor Agents versus Cursor Code Review Bot. One would presume that if you ask Cursor to do some coding task (make a patch, add some functionality, logic update, etc) then the result would certainly pass the Cursor Bot Code Review. Duh, left hand, right hand, same Cursor AI tooling, right? Wrong.

In a lot of cases Cursor agent would make a code change and then Cursor Code Review bot would scream at the result about missed edge cases, wrong logic, other bugs.

You wrote it, Cursor. YOU. Yes, you.

Why on Earth would you write code that doesn't pass your own code review tool? But happens all the time.

And this is on top of many models preferring very old versions of, say, Node. They're still trying to use Node 18 "by default" (unless you explicitly tell them not to do it) even though it reached end of life in April of 2025. Other funny date-related things are when Cursor complains during review that "this version isn't scheduled to be released until October 2025" and it's already December and sure enough, that's the current version

DDopson · Dec 27, 2025

Bluck Mutter said:
I am retired (5 years) and despite programming everyday in my retirement, I won't be using AI for this... no interest as I am an old skool grey beard.

Maybe my situation is different but I spent the last 25 years of my career working on my own commercial software (used in mission critical deployments in large govt/corp orgs). While the application had a web based front end (in plain old HTML 4), the critical code was in 'C' for Unix and Linux platforms.

I would have program modules with 10,000+ lines of code.

Whenever I read of dev's talking about using AI, it tends to reference creating a small function or algorithm (i.e. short snippets of code).

So my question is: How would you prompt/use AI to generate 10,000 lines of mission critical code (total code base might be > 2 million lines of code), how do you get it to understand what code needs to be written when the product's use case is unique or not well understood (i.e. example code from it's training doesn't exit)?

If you generate 1000 lines of new code in a new module , do you need to retrain it on these lines so that it understands how to write the next 1000's in the same module?

Can you get it to test iteratively as you layer in new logic into the currently incomplete module (this was standard practice for me... test each new part as I went which might include creating tables and data as part of the sub-code test)?

How do I ensure code consistency (say in structure of variable names, favoured 'C' constructs etc)?

Can you use it to create the test plans, create test databases and data inside the databases to test the product end to end such that all edge cases are covered, to design the code and test the code for robustness, scalability and recoverability?

In the case of robustness, how do I prompt it to wrote code that validates the results of every action performed (especially database result sets)?. Being mission critical means a higher level of standard than say a shopping cart or a social media message.

Again, I am old skool so maybe in today's world, programs are devolved into 1000's of small snippets that are glued together into sequence and maybe AI works for this but I grew up in a "top down" world (but obviously my code was modular...it wasn't a single big arsed slab of code)

Thanks,

Bluck

I used AI to help prepare refactoring edits in a fairly complicated library that's about 10,000 lines of compile-time C++ template magic to dynamically construct an adapter class based on reflection over the arguments for a supplied C++ function. Which sounds a lot like the sort of complicated module you are describing.

This is a library that has literally 1000's of downstream clients, so backwards-compatibility risks and maintainability are existential considerations, so reviewing the changes often takes more time than drafting them out. Problem was, I wasn't entirely clear what the cleanest implementation strategy would be. I needed to effect a behavior change that wasn't organic to how the code's state management was structured, so the existing implementation needed to be refactored in one of several ways. I tried one strategy, then changed my mind and tried a second, then finally settled on a third. It was helpful that the AI could sprint forward and make coherent syntactic changes at multiple different places in the code faster than I could have navigated to all of them. This let me quickly explore the design space and discover the gotcha sticking points of my first two approaches, which were mostly around long-term maintainability concerns. I understood every line of change that I proposed for submission, and rewrote all of the comments to be clearer. The AI was also pretty handy for updating several dozen unit tests that were impacted due to testing one of the internal abstractions that was being refactored.

The thing about a tool is that it's all about how you use it. There are probably people out there submitting code that they haven't read. Bad programmers have been screwing up software projects for a long time, and having access to AI that makes it easy to spew out superficially credible code does make it slightly harder to tell the competent programmers from the incompetent based on shallow signals alone. But suffice to say that the utility of AI coding isn't limited to toy problems. It's finding more and more use-cases in large-scale professional software work, by thoughtful professionals who think very deeply about how to keep systems maintainable, and very much aren't vibe-submitting the machine's suggestions. Even in thoughtful coding, it's handy to have a rapid prototyping tool to try out refactoring approaches.

itsaunixsystem · Dec 28, 2025

RoryEjinn said:
That is the overall point I'm making. I use LLMs myself when I code, but not without oversight. I think they have a particular value when it comes to large documentation sets/codebases and sifting through them. I just have other reasons (Mostly Environmental/Power Grid/Privacy Related) for not necessarily wanting them to exist/keep expanding.

I think 100% that the github report only truly shows how blindly people trust the output. I also think that it's a reminder that the default output for an LLM is not necessarily secure or feature complete.

This is how everyone gets hacked with some stupid Chinese/Russian virus because people are installing some backdoor from some library that doesn't exist, but AI thinks it does.

SpecTP · Dec 28, 2025

My issue is all the AI coding tools I used, including the ones mentioned here in the article keeps hallucinating and making up function calls that don't exist. And I have to tell it to scrutinize all of the syntax it uses and then it will at least flag those function calls as 'questionable'.

Also, what people call 'vibe coding' today, we used to call user functional description. I would write up a skeleton framework of functional and logic specifications and then turn it over to my programmers to actually code into the language I need. AI just does the second part of it for me.

pokrface · Dec 28, 2025

bugabuga said:
Why on Earth would you write code that doesn't pass your own code review tool? But happens all the time.

I think you're asking two different "whys" here — a technical "why," and then a much more heartfelt "why the fuck would they do this" kind of why.

To the first, it's fundamental: this happens because none of the agentic LLMs involved are possessed of anything like a mind, and their outputs are nondeterministic. The only way to reliably, deterministically "couple" them such that the code-generating agent's output is always good enough to pass the code-auditing agent's evaluation would be to do something like including "...and run your outputs through the code-auditing agent and accept its changes and repeat until the code passes the other LLM's review, and only then present your results to the user" in the code-generating agent's base prompt. Or, spend much more time on RAG-type training, using the auditing LLM's outputs to shape the generator's output. (The problem, at least AIUI, is that down that way lies the danger of over-fitting.)

To the second, it's even easier: because that would cost a lot more.

twilightomni · Dec 28, 2025

SraCet said:
I dunno, man. I've read the post like 5 times.

You wrote "It will be your job, developers." and yet you're claiming that you somehow weren't warning of lost jobs.

Maybe somebody else on the thread can chime in and explain how I'm misunderstanding your post, but I think it's far more likely that everybody else thinks you're confused and wrong too.

I’m not willing to write off SraCet, so I’ll take a shot at this swirling pit of pedantric disagreement.

The underlying reading of the post is not that developers will lose their jobs. It’s that their “job” (meaning responsibliity space) will increase because LLMs will replace all competent technical writing with slop that developers will have to triple-check.

I read it as “it will be your job (to be a technical writer as well), developers”.

Yes this is all pedantic circles but I just want to write out loud the exact phrasing of the disagreement.

How do AI coding agents work? We look under the hood.

Smack-Fu Master, in training

Ars Legatus Legionis

Smack-Fu Master, in training

Smack-Fu Master, in training

Ars Legatus Legionis

Ars Legatus Legionis

Ars Tribunus Angusticlavius

Smack-Fu Master, in training

Smack-Fu Master, in training

Wise, Aged Ars Veteran

Smack-Fu Master, in training

Smack-Fu Master, in training

Ars Tribunus Angusticlavius

Smack-Fu Master, in training

Smack-Fu Master, in training

Ars Tribunus Angusticlavius

Ars Praetorian

Ars Legatus Legionis

Ars Praetorian

Ars Scholae Palatinae

Senior Technology Editor

Ars Scholae Palatinae

Wise, Aged Ars Veteran

Wise, Aged Ars Veteran

Ars Scholae Palatinae

Account Banned

Ars Scholae Palatinae

Smack-Fu Master, in training

Smack-Fu Master, in training

Ars Scholae Palatinae

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Scholae Palatinae

Ars Centurion

Ars Tribunus Militum

Account Banned

Ars Praefectus

Senior Technology Editor

Ars Centurion