Thanks for the feedback. I have updated the piece to specifically mention the METR study.
We plan to compare the performance of these agentic coding tools (Codex, Claude Code, Gemini CLI, maybe Mistral Vibe) in a future piece, so stay tuned.
Yeah, ok. But try to give it a task that isn't asking it to write the same code everyone else has already written. If you can find an example of the code you're asking for online, asking the AI to regurgitate some (likely broken) form of it doesn't count.
Claude recently offered to show me how smart it is by designing the architecture of a project of my choosing. When I asked it to come up with a Zephyr-based IoT device that captured GPS location and pushed it to a web service, it happily drew an SVG diagram that was nothing short of hilarious. It had queues, ring buffers, and all kinds of random stuff. It was fully buzzword compliant, but nothing in the diagram made any sense because nothing was connected in any meaningful way. It was like a failing undergraduate student's project submission after an all-nighter.
About the only things I have found LLMs useful for, other than copying other people's work without knowing who the original authors were, are summarizing other information such as search results, and padding things out with narrative fluff. Of course, the person reading my AI fluff is probably using an LLM to summarize it so they don't have to waste their time reading it.