Adventures coding with AI

benwiggy

Ars Scholae Palatinae
1,254
I've recently been using ChatGPT's "Code Copilot" to produce Swift scripts, and it's worked rather well.

I started off by just asking for a simple function, largely as a means to get a working example of some of Apple's APIs; but by gradually building it up, in the manner of a teacher ("Now let's add this to what we've already done"), I was able to create a really useful command-line utility, without a single line of my own code.

The generated code wasn't always error-free, but when I fed the error messages back, it was always able to work out what was wrong and what needed to be changed. You do have to check that its logic is what you wanted -- there was one point where a completion handler always returned true, no matter the result. (But just by saying "it's reporting success here when it shouldn't", I got it to fix the problem correctly.)

All in all, I found the experience really enjoyable, instructing my virtual apprentice to produce work for me to inspect. Certainly much more enjoyable, less frustrating -- and faster -- than trawling through Apple's documentation and trying to knock something up myself.

Of course, this is exactly where AI ought to excel -- a large amount of data, easily absorbed, with logical rules to its application. I've also given it some code that I've written, which it improved and corrected. I'm looking forward to the planned addition of AI features to Xcode.

There's another thread somewhere asking what the perfect programming language would be. For me, it would be "Computer, run a simulation to improve warp core efficiency."
 
Last edited:
  • Like
Reactions: continuum
Despite the name "artificial intelligence", the currently hyped crop of large language models is not intelligent at all. Instead, it is a very large heap of statistical information about language, i.e. a set of numbers detailing correlations between words (or word-lets). The specific correlations have been "measured" by observing a huge body of existing text, and effectively counting occurrences.

The LLMs are indeed remarkable in that they are using clever tricks to bring this intractable problem down to manageable sizes (both in terms of storage and computation). In a very real sense, the LLMs are compressing and interpolating a ridiculously large set of input data, and they can decompress it with affordable compute power in more or less real time.

Nonetheless, the only "understanding" that LLMs can have is in terms of correlations between words. That is, some terms and phrases are so-and-so likely to occur with such-and-such distance in a text. LLMs are great for remixing commonly used text fragments, or for reproduction of commonly used structured text types. They are probably pretty good as a convenient interface to widely known encyclopedic knowledge.
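As a toy illustration of what "correlations between words" means at its crudest, here is a bigram counter -- nowhere near what a transformer actually does, but the statistical flavor is the same:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

# "Generation" is then just picking a likely continuation.
def most_likely_next(word):
    return bigrams[word].most_common(1)[0][0]
```

With this corpus, `most_likely_next("the")` returns "cat", because "the cat" occurs twice and "the mat" once. Real models work over learned embeddings and long contexts, but the training signal is still co-occurrence statistics.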

LLMs contain no mechanism whatsoever to judge whether a creative combination of words is a new insight (as in "not present in the training data") or a "hallucination" with no use or value. In fact, the term "hallucination" is a fairly euphemistic way to admit that the LLM has no clue what it was talking about.


So why does programming with LLMs work relatively well? Because programming languages are much more strictly formalized. And are used with a much narrower scope. And most programs consist mostly of "boilerplate code", i.e. very repetitive chores like interfacing with widely used OSes or APIs or services. The correlations here are very strong and clear. Only a very small percentage of program code implements the actually new and interesting features; an even smaller percentage reflects actual ingenuity ...

By and large, software engineering is more craft than art. Not inherently, but there is a lot more money to be made with boring software than with programming as an art form. Those boundary conditions impose their very own set of correlations. :)
 

cburn11

Ars Praetorian
445
Subscriptor
I played with the gpt-3.5-turbo and gpt-4-turbo models through OpenAI's REST API a few weeks ago. Played, meaning I had no idea how to efficiently ask it to write code. But it's impressive that you can prompt it in a colloquial manner with "open a file passed on the command line," or "rewrite the open function to use istreambuf_iterator instead of read," and both models return actual code. Or ask it "is that an expensive operation," and it returns an answer that "understands" that expensive in this context means "avoid unnecessary mallocs."

I incrementally asked both models to read a file into memory and then send that memory buffer as POST data to a specific url using libsoup. gpt-3.5 mixed code from versions 2 and 3 of the library. After several attempts, I gave up trying to untangle it. gpt-4 initially mixed calls from the library versions, but with a little prompting got it correct.

The 3.5 conversation didn't cost enough to register on the billing statement. It cost 45 cents for the 4.0 model to successfully write the moral equivalent of curl -d @filepath http://example.com/cgi-bin/upload
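That "moral equivalent" task, sketched with Python's stdlib instead of libsoup (the URL and filename are placeholders, and this is my illustration of the task, not the generated code):

```python
import urllib.request

def build_upload_request(path, url):
    """Read a file into memory and wrap it as a POST request,
    the moral equivalent of `curl -d @path url`."""
    with open(path, "rb") as f:
        data = f.read()
    return urllib.request.Request(url, data=data, method="POST")

# Sending it is one more line (requires a live server):
# urllib.request.urlopen(build_upload_request("file.txt", "http://example.com/cgi-bin/upload"))
```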

While it initially impressed me, I suspect the end result is that I have $9.55 in credits at OpenAI that will go unused for the foreseeable future.
 
  • Like
Reactions: educated_foo

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
Looks like it's time to revive this thread; the Boardroom one went off topic, from the economics of AI to Claude Code tips and tricks.


I still feel that Claude Code is like a beta product; some of the things it does, like grepping to find a function, feel sooooo outdated. It doesn't have a full language server, so it can't jump to a class/function/symbol the way I do with ctrl-shift-r/t or double shift (search anything) in my IDE.

Junie is better integrated. It wrote a test, showed me the diff in the native IDEA diff window, ran tests via JUnit / test runner tab, it feels so much more polished.

Gemini is the classic chat interface only, or in editor prompting via ctrl-\, but damn that context window is large. You never need to think about what to add, it just adds half your project, it feels. And it seems to have a language server running in the background giving it a good idea of what classes and methods are available.


It's a three way race and I think I'll keep using them all? It feels like I have a junior developer with me to handle dumb things like writing html, and I need to learn how to give it clear tasks to execute, and supervise it so it doesn't commit a monstrosity / tech debt.

And even if it can't generate what I want, it's basically rubber duck debugging, extreme edition. Describing a problem, possible solution and getting some feedback gets me most of the way to solving issues.
 

Dmytry

Ars Legatus Legionis
11,408
Despite the name "artificial intelligence", the currently hyped crop of large language models is not intelligent at all. Instead, it is a very large heap of statistical information about language, i.e. a set of numbers detailing correlations between words (or word-lets). The specific correlations have been "measured" by observing a huge body of existing text, and effectively counting occurrences.

I actually had been getting increasingly dubious that there's all that much learning of underlying patterns behind the text as opposed to learning instances of the text itself.

For example, there was the whole thing about (multiple different chatbots) miscalculating 9.11 - 9.9. The excuse from AI enthusiasts was that 11 is encoded as its own token, and blah blah blah rationalizations about how that failure is not in any way indicative of anything whatsoever.

So let's think this through. I am supposed to simultaneously believe that this system has learned some essence of "how to code" from code, instead of just plagiarizing snippets with variables renamed.

But if I see a concrete real example of it failing, this is because, with all the examples of arithmetic, it never learned that the single token 11 could be equivalent to 1 and 1. Or that 9 in this case could be equivalent to 90. All the gazillion examples of that happening in the training data couldn't make it learn some simple equivalences.
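For reference, the sum in question, done with exact decimal arithmetic:

```python
from decimal import Decimal

# The subtraction several chatbots famously got wrong:
# aligning 9.11 and 9.90 digit by digit gives -0.79.
print(Decimal("9.11") - Decimal("9.9"))  # -0.79
```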

Then there's people arguing that if they ask an LLM to implement a complicated algorithm it may be implementing it from a paper about it. That same thing couldn't apply children's books to subtraction.

edit: also, sometimes AI boosters are their own worst enemy. I had a conversation with one, where, to prove that Claude can reason, they posted a transcript in which it had done 4 arithmetic operations, two of them being "12042342 can be written as 120423 + 19, and 43542341 as 435423 + 18".

Prior to this I would have thought that an LLM would treat "X can be written as Y + Z" as equivalent to "X = Y + Z" . And I would have thought that it would have learned the pattern that [N digit number]+[M digit number] = [maybe 1][max(N,M) digit number] , and would have learned to re-create it, even if it uses multi digit tokens for some reason.

edit: Basically I would have expected it to have very low probability of emitting a space rather than a digit, until it gets to a plausible point for there to be a space. Because there's a very simple pattern to when instead of a digit there would be a space. And if it is encoding several digits into one token, then that token would bring ahead the appearance of the space, by appropriate amount.

Now I am thinking even though I was always very skeptical of claims about LLMs, I was not skeptical enough. I think LLMs usually fail to learn the sort of generalized patterns that we know small neural networks do learn (I myself did some neural network experiments a while back). Possibly it is an over-parametrization issue.

Another thing is that the way humans program is the way we do anything else: we model the computer inside our head and we make that model do what we want it to do. LLMs that can't do math cannot possibly replicate that, and hence any illusion of that happening should be taken with suspicion of plagiarism -- kind of like if, back in the day when books of sine and cosine tables were a thing, someone who couldn't perform any calculations published a table of sines and cosines.
 
Last edited:

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
You're using it wrong. Asking it for math answers is like asking an untrained human to do long multiplication without tools. I would just use a calculator (app).

An LLM grasps the essence of text: what words mean in relation to each other. With that, you can add instructions like: to find a class/function/symbol in a Python project, use grep on *.py.

You can't fit all of GitHub, Stack Overflow, Wikipedia, all of the internet, in multiple languages, into 7 to 50B FP8 numbers. But the essence of it remains, and what you can do is give it tools: if you're not sure, do a search. Voila, Kagi Assistant: ask it using a dumb fast model and it will give you an answer with sources, and skip over all the blog spam and SEO nonsense.


Coding is about the same. Especially for simple stuff: you can tell it to read a POJO Java class sprinkled with Moshi's @Json annotations and turn it into a few examples to speed up test writing. This is a Discount class; give me an example for a 25% lunch deal, and it will autofill the name, description, and percentage, and make a best guess at start and end times based on the word "lunch". Translating from one written language to another is also very strong, as it has a much better intuition for the whole text, not just word by word, sentence by sentence. Same for code: it can easily turn Java into Kotlin into Python, and use the right libraries for each language. If you prefer something, like Double.toString(d) vs String.valueOf(d), tell it in CLAUDE.md and it remembers it.
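A rough Python rendering of the kind of fixture generation described -- the Discount fields here are my guesses, not the actual class:

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class Discount:
    name: str
    description: str
    percentage: int
    start: time
    end: time

# The kind of example a model fills in from the phrase "25% lunch deal":
lunch_deal = Discount(
    name="Lunch deal",
    description="25% off during lunch hours",
    percentage=25,
    start=time(11, 30),  # guessed from the word "lunch"
    end=time(14, 0),
)
```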


It's fun to see the tool use in real time. Grep, ls, find, running a JUnit test, mvn clean build, python manage.py help: it uses a lot of external tools to help it understand and build context while "coding", turning a Jira ticket into a commit. So, back to "LLMs are bad at math": I can easily write a Model Context Protocol server tying Wolfram Alpha into a tool Claude can use. Problem solved.


And often enough, it does something unexpected at first glance but hey, since when does this API exist? Neat.

It also saves a lot of time. How do you make a k8s CronJob again? Write one for me that calls manage.py cleanup_apilogs daily at 02:00, thanks. k8s is so much boilerplate; just copy-paste it from the other deployment and replace a bunch of names. Oh wait, what incantation do you use to spawn a job right now for testing, instead of waiting for it to trigger? Ah yes, thanks.
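For reference, the CronJob in question looks roughly like this (the metadata names and image are my assumptions, not the actual manifest):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-apilogs
spec:
  schedule: "0 2 * * *"        # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: myapp:latest   # assumed image name
              command: ["python", "manage.py", "cleanup_apilogs"]
```

And the incantation to spawn it right now for testing: `kubectl create job --from=cronjob/cleanup-apilogs cleanup-now`.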

I care less about memorizing details and dumb typing, and can focus more on getting things done and thinking about the overall structure and architecture. Writing these prompts forces you to plan a bit more ahead, in digestible steps, so it's clear for you, for Claude/Junie/Gemini, but also for whoever reads the Jira ticket in a few years.
 

mailseth

Ars Scholae Palatinae
1,398
Subscriptor
I’ve been on the AI train since January, after spending the previous year(s) very skeptical. It definitely won’t replace people, but people + AI definitely will replace people. I started with a ChatGPT subscription and just discovered Claude Code a few weeks ago.

My post from the ill-fated Boardroom thread is probably more appropriate here:
I have to say I'm impressed by Opus 4. I've been working on a personal project for years that's basically an automatic sprinkler system for my cabin to help guard against wildfires. It's way out in the national forest, and it's unlikely anyone will ever come to save it from an oncoming wildfire. There are a lot of moving parts to this project, starting from "We need a reliable supply of power that can run some sort of network of computer systems off-line and off-the-grid." It's the sort of place with no reliable internet, where the trees and mountains may be too tall for reliable Starlink. It's an 80-year-old cabin, but we have had several close calls with wildfire over the past few years (and it's only going to get worse). Even if it were connected to the internet, the way wildfire moves, it is almost impossible to manually monitor and control the sprinkler system that would need to be in place.

I've been attempting to assemble the embedded software component of this without really knowing anything about embedded software. I have a few decades of coding experience, but not in this. I'd like to run an object detection model on some dedicated hardware (a Hailo-8 chip or Nvidia GPU, depending on how testing goes) and then have a pump controller. Make things as low-power and resilient as possible; probably use a Raspberry Pi 5 or a SFF PC. I finally found myself up against the limit of what ChatGPT o4-mini-high is capable of while writing the pump controller: something that accepts 'fire detected' messages and turns on the pump through all of the appropriate states, while being thread-safe and guarding against situations that could physically destroy the pump (or the entire garage), such as running the engine for hours with the main valve closed. The sort of thing that would be obvious to a human on site, but if there's an approaching wildfire, it would be irresponsible for me to be anywhere nearby.

So. ChatGPT o4-mini-high was getting to the point where it was just writing slop. It's not a large code base, but it still needed to develop a coherent picture of physical reality, multithreading, accepting MQTT messages from the camera object detection software, and whatever needed doing next on the GPIO pins. And then writing tests for it. I've been working with ChatGPT for months now for various projects and was trying to craft a way to get some better code out of it, but anything I tried was hitting the point where instead of writing code, the output looked like a dream of writing code. You know what I mean. And I've heard some good hype about Opus 4.

So, I asked ChatGPT 4o to make a detailed summary of everything we've discussed so far, signed up for a $20 Claude subscription, and pasted as much context as I could into Opus 4 asked it basically the same query I had given ChatGPT:

Attached is the description of a project I'm working on to build an automated wildfire detection sprinkler system. I would like to write the logic that controls the GPIO pins. I've attached the initial attempt at the code. Please rework it to ensure that we can never be in an inconsistent state, even if additional fire events arrive at any point in the process or timers trigger at any point in the process. Pay particular attention to ensure that we 'fail safe'. For example: the pump never runs dry due to running out of water, running without valves being open, and we are never in an inconsistent state where the pump cannot start when needed. You may need to refactor with best practices to keep the code organized and readable while also following the above rules.
Also add tests to ensure correct operation of this script. Please add extensive testing for both basic operation and all possible contingencies and edge cases of various timing, pin, and signal combinations. Ensure that we always 'fail safe' and recover back to a good and sensible state if possible. I've attached an initial attempt at tests, but they will likely need to be rewritten to match the updated trigger.py script. Please do so.

I had to ask Opus to 'Continue' once while it was writing the tests, but I'm very impressed by the output. It's something I'd be willing to debug (whereas it wouldn't have been worth the effort with o4-mini-high's output). I'd be spending a significant amount of time just reformatting ChatGPT's code in order to review it and that was only at 200-300 lines. The Opus 4 code looks much cleaner as I'm reviewing it and it made some good innovations such as a state machine class to keep the internal knowledge consistent (ChatGPT had been simply reading GPIO pin values). These Opus 4 tests look pretty comprehensive, which I've found essential for working with and fixing any code coming out of ChatGPT. o4-mini-high's unit tests were basically useless no matter how I tried to create them last night, even attempting to put out a few tests at a time (< 50 lines of Python. Frustrating when there is a rate limit of 50 queries per week.)
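For a flavor of what such a state machine class looks like, here is a minimal sketch (lock-guarded transitions plus a fail-safe watchdog); the states, names, and timings are invented for illustration and this is not the code Opus produced:

```python
import threading
import time

class PumpController:
    """Toy fail-safe pump state machine (illustrative only)."""
    IDLE, RUNNING, FAULT = "IDLE", "RUNNING", "FAULT"

    def __init__(self, max_run_seconds=3600):
        self._lock = threading.Lock()   # serializes every transition
        self.state = self.IDLE
        self.valve_open = False
        self.max_run_seconds = max_run_seconds
        self._started_at = None

    def on_fire_detected(self):
        """Handle a 'fire detected' message (e.g. arriving over MQTT)."""
        with self._lock:
            if self.state != self.IDLE:
                return                  # duplicate/late events are ignored
            self.valve_open = True      # open the valve BEFORE the pump runs
            self._started_at = time.monotonic()
            self.state = self.RUNNING

    def tick(self):
        """Periodic watchdog: drop to a safe state if anything looks wrong."""
        with self._lock:
            if self.state != self.RUNNING:
                return
            if not self.valve_open:
                self._shut_down(self.FAULT)  # never pump against a closed valve
            elif time.monotonic() - self._started_at > self.max_run_seconds:
                self._shut_down(self.IDLE)   # don't run the engine for hours

    def _shut_down(self, next_state):
        self.valve_open = False
        self.state = next_state
```

The point of keeping state internally (rather than re-reading GPIO pins) is that every transition happens under one lock, so a timer and an MQTT message can never race each other into an inconsistent state.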

Going from copy-and-pasting out of a ChatGPT window to Claude Code, where it closes the plan-write-test coding loop itself, seems like an emergent property on top of whatever LLMs provide. After spending a few weeks with it, I'm still not convinced that AI will replace people, but people + AI will. Refusing to use AI-assisted coding will be roughly equivalent to refusing to use Google or Stack Exchange or Open Source.
 
Now I am thinking even though I was always very skeptical of claims about LLMs, I was not skeptical enough. I think LLMs usually fail to learn the sort of generalized patterns that we know small neural networks do learn (I myself did some neural network experiments a while back). Possibly it is an over-parametrization issue.
I don't have a good number for how many tokens the latest models train on, but fitting a trillion-parameter function to pretty much anything is likely over-fitting in the normal sense. Are the big models even trained with a test set to diagnose when they're starting to overfit their training sets? I'm shamefully ignorant about the state of the art.

My impression of traditional neural nets was "throw an over-parametrized model at your data, then try to figure out what it fit." Deep learning seems to ask "what if we over-parametrized it harder, and shoved more data into it?" Is it encoding general rules? Memorizing some of the web pages it was fed? Some inscrutable mixture of both? Who knows?! The recent Apple paper about Towers of Hanoi and such seems to suggest that the models more or less lossily encode the internet, which is useful for questions like "how decode JPEG in Rust" or "how is babby formed," but they don't generalize outside the training set.
 

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
I don't have a good number for how many tokens the latest models train on, but fitting a trillion-parameter function to pretty much anything is likely over-fitting in the normal sense. Are the big models even trained with a test set to diagnose when they're starting to overfit their training sets? I'm shamefully ignorant about the state of the art.
The latest big models have around 500B weights, if each is a byte, then that's 500 GB worth of data. In reality, lots of the weights are zero, the matrix is sparse, so it's a lot less data.

That has to fit a ton of text and code, but also sound, images, and videos. So overfitting? I don't think so: the original data is many, many times the size of the resulting model.

Next-gen accelerators are focusing on FP4 and FP6 performance, that's 4 to 6 bits per weight for inference, which is gonna squish the model into 250 to 375 GB for a summary of all of the world's knowledge. And a summary it will stay, as you don't want it to reproduce things verbatim; the key is giving it just enough space to store what's important and ditch duplicate / unneeded details. That's neural networks 101 and a key part of Stanford's intro machine learning course, where you make your own little neural networks in Matlab/Octave code and hand-multiply vectors and matrices.
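The storage arithmetic above is easy to check:

```python
weights = 500e9  # ~500B parameters, per the post above

# Bytes per model = (bits per weight / 8) * number of weights.
for bits in (8, 6, 4):
    print(f"FP{bits}: {weights * bits / 8 / 1e9:.0f} GB")
# FP8: 500 GB, FP6: 375 GB, FP4: 250 GB
```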


As for testing: that's why first and third parties write new tests for LLMs, or feed them the very latest standardized exams that haven't been published yet, to see how they react to novel queries.

And I may well be the only one on this planet who has asked an LLM to write code for a Point of Sale system, in Java, using Swing and JGoodies Forms as the layout manager. That combination of tech stack and field of application is uber rare. And yet, we developers make truly novel things happen all the time: if something needs to be done often, it's already a library on Maven Central / NPM / PyPI.

And Junie, Claude Code, and Gemini all three seem quite capable of adjusting themselves to whatever I threw at them. Android is next; I wonder if it will choke on that craptastic API and backwards-compatibility hell.
 

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
These Opus 4 tests look pretty comprehensive, which I've found essential for working with and fixing any code coming out of ChatGPT. o4-mini-high's unit tests were basically useless no matter how I tried to create them last night, even attempting to put out a few tests at a time (< 50 lines of Python. Frustrating when there is a rate limit of 50 queries per week.)
Test-driven development is a superpower for AI-assisted development. I used to eyeball most of my changes: a print/debug log here, attach a debugger, step in and check a variable there.

But with a junior / noob helping out, I usually flip it around now: write the documentation/spec in a .md file first, generate test and data classes, run it (everything fails), then ask one of the noob AIs to fix it bit by bit. This goes super fast for things that don't touch a UI or hardware that is hard to test, like printers. And on those new bits of code, I now have regression tests and proper code coverage.

But today I had to prompt a few times for "hey, center-align the checkbox with the text vertically", "remove the blue border when focused", and "the checkbox overlaps the text, please fix it". And fix it it did. Swing isn't the most user-friendly toolkit; I'm glad I didn't have to write that code... Too bad this is near impossible to test automatically: I can't feed in a mock screenshot and have it automagically generate the UI. Yet. Who knows what happens in a year.

Oh and I LOLed when it went "fuck this bullshit layout" and stuffed "<html>" + d.name + "<br>" etc into a JLabel. Hah, yeah, I feel you but no, please don't. Been there done that not proud of it, you can do better.
 
  • Like
Reactions: Pino90

Lt_Storm

Ars Tribunus Angusticlavius
20,100
Subscriptor++
In reality, lots of the weights are zero, the matrix is sparse, so it's a lot less data.
This is one of the biggest things to understand about these models: most of the weights only exist to fill out the grid of potential connections, in a kind of n² fashion. So if the model can detect/process a few hundred thousand features, you need billions of parameters to cover the pairs, with the vast majority of them being zero because most of those features are unrelated.
 
  • Like
Reactions: educated_foo

curih

Ars Scholae Palatinae
1,118
Subscriptor
I'm just getting started with AI-enhanced coding; I got access to Copilot at work yesterday. So far it's a bit of a mixed bag. Some of that is me learning how to properly leverage it, but I am getting some very good results.

The first thing I threw at it was a function I had that returned all non-looping paths between two points in a graph. It was written depth-first. I recently ran into corner cases where ALL paths was WAAAAY too many, so I wanted the N shortest paths instead. I highlighted the function and asked it to rewrite it for the N shortest paths. First it spat back mostly the same thing, just sorting by length and truncating to N. I told it that wasn't a good answer, because it still wasted time exploring all paths. On the second attempt it gave me a modified Dijkstra's algorithm (breadth-first) which properly terminated early once it had found the N shortest paths.
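For illustration, the general technique -- best-first search over loop-free paths that stops after N results -- can be sketched like this (my sketch of the idea, not Copilot's output):

```python
import heapq

def k_shortest_paths(graph, start, goal, k):
    """Return the k cheapest loop-free paths from start to goal.

    graph: dict mapping node -> {neighbor: edge_weight}, weights non-negative.
    Returns a list of (cost, path) tuples, cheapest first.
    """
    results = []
    # Priority queue of (cost_so_far, path); cheapest partial path expands first.
    heap = [(0, [start])]
    while heap and len(results) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == goal:
            results.append((cost, path))   # popped in cost order, so this is next-shortest
            continue
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in path:       # skip anything that would form a loop
                heapq.heappush(heap, (cost + weight, path + [neighbor]))
    return results
```

Because paths come off the heap in cost order, the search can stop as soon as the goal has been reached k times, without enumerating every path.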

Now that's a well-known algorithm, but with me being a decidedly mediocre coder, it would have taken me at least a few hours to do it myself. This was under 5 minutes. The code used all my variable names, and it was written in a way I could follow (sometimes a problem when I find examples online).

It's not going to replace me any time soon. But it might mean I don't need to request a junior dev to offload some of the grunt work to.
 
  • Like
Reactions: Pino90

Dmytry

Ars Legatus Legionis
11,408
The latest big models have around 500B weights, if each is a byte, then that's 500 GB worth of data. In reality, lots of the weights are zero, the matrix is sparse, so it's a lot less data.
I thought the sparsity was something that you had to specifically set up to train for. Typically ("by default"), deeper layers are dense.
That has to fit a ton of text and code, but also sound, images, and videos. So overfitting? I don't think so: the original data is many, many times the size of the resulting model.
Over-fitting is not just about data size. Traditionally you would split your data into a training and a test dataset (without overlap between the two), and you would train on the training dataset, but not the test dataset, until the error on the test dataset starts growing. Which it will, eventually, even if your training dataset is far larger than the model. edit: over-fitting is specifically this phenomenon of error increasing on the test dataset while it continues to decrease on the training dataset. Analogous to over-fitting a polynomial to a subset of data points.
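The polynomial analogy is easy to reproduce (a toy sketch, assuming numpy is available):

```python
import numpy as np

# Noisy samples of a sine wave, split into train and held-out test points.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)

x_train, y_train = x[::3], y[::3]          # 10 points used for fitting
mask = np.ones(len(x), dtype=bool)
mask[::3] = False
x_test, y_test = x[mask], y[mask]          # 20 held-out points

def mse(degree):
    """Fit a degree-N polynomial to the training points; return (train, test) MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

for degree in (1, 3, 9):
    train, test = mse(degree)
    print(f"degree {degree}: train={train:.4f} test={test:.4f}")
```

A degree-9 polynomial through 10 training points drives the training error to essentially zero while fitting the noise, so the held-out error tells the real story -- exactly the divergence described above.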

The other thing is that not all data is created equal in the world of LLMs. As we know from their earlier papers (before "open" started to mean "super duper closed") and the NYT lawsuit, they weight different subsets of the data differently, giving very high weight to the NYT, Wikipedia, etc.

Anyways, I am now extremely wary of all the "it learns patterns from the data" claims.

If it was indeed the case that the reason they were so incredibly bad at 9.11 - 9.9 type questions was that 11 tokenizes to a single token, then their ability to learn anything other than the data itself was next to nonexistent, and likely hasn't improved all that much.

We look at how it can re-word text and we assume that it learned patterns of equivalence, but perhaps it didn't, and it can do re-wording etc in some completely different way. Perhaps it is just a slight high-dimensional rotation in a different space.

Anyways to loop back to coding, I outright do not believe that this software can implement any sort of remotely non trivial algorithm by any means other than plagiarism. I could have believed the same statement about some hypothetical AI, but not about LLMs.

One thing about the code is that the number of implementations of something that isn't a common coding exercise - in the whole world - is nowhere as high as you would think, and those implementations are heterogeneous such that you can't mash them together into one.

Let's suppose, for an intellectual exercise, that the algorithm is parsing a string into an IEEE float correctly (with correct rounding). I'm supposed to believe that this system, which couldn't figure out (as patterns) any high-level principles governing numbers from a huge number of training samples, has figured out some underlying principles of how to parse numbers from (much fewer) samples of float-parsing code?

And yes, I am supposed to simply believe it: there is no information on how the training dataset was processed or augmented. They probably do something very special with variable names, because tokenizing those into LLM tokens the same as regular text would be extremely problematic, as they would bring up "associations" to the regular text, act as commands, etc.

So we don't know if the reason it's difficult to find the original material is perhaps simply that they laundered off all the variable names.

Another thing that could quite easily be done is training a neural network to change code while keeping the intermediate representation it compiles to constant. So it is not out of the question that the training dataset is augmented with variations of code that compile to the same thing.

In computer vision, it is extremely common to augment training data with rotations, scaling, etc. Without it, it just doesn't learn rotational invariance.

The unique thing about code is that there's a lot less (open source) code than other literary works, and yet there's a very large number of little different logical rules that have to be learned, far more robustly than for normal text. By a system that can't even learn the most obvious rules in other contexts with more training data.
 
Last edited:

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
Anyways to loop back to coding, I outright do not believe that this software can implement any sort of remotely non trivial algorithm by any means other than plagiarism. I could have believed the same statement about some hypothetical AI, but not about LLMs.

One thing about the code is that the number of implementations of something that isn't a common coding exercise - in the whole world - is nowhere as high as you would think, and those implementations are heterogeneous such that you can't mash them together into one.
And that's where you went wrong. Don't see code as anything special; it's just words and sentences. It's not much different from text: whether it is in English or any other language, there are functionally equivalent texts, and you have the freedom to be more verbose, use different words, different styles (haiku, limerick).

The output can be rather unique, tailored to your requirements. Here you are with a standard float parser, but ask it for a fictional world, parsing base-14 hieroglyphics, 𓍢 = 1, 𓍣 = 2, 𓍤 = 3, 𓍥 = 4, 𓎇 = 12, and check what it spits out. There: your very own code, equivalent to a statistically improbable phrase.
 

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
Here's your statistically improbable code. Prompts:

Help me write some Python code. For a fictional world where they count in base 14, write a parser for hirogliphs, 𓍢 =1, 𓍣=2, 𓍤=3, 𓍥=4, 𓎇=12, and a "." as floating point separator.
Can you change the test cases to check for expected value? Compare floats with an epsilon. For missing hieroglyph mapping, assume the next unicode codepoint is for the next number

Python:
def parse_hieroglyph_number(hieroglyph_str):
    """
    Parse a hieroglyph string representing a base-14 number.
  
    Hieroglyph mappings:
    𓍢 = 1, 𓍣 = 2, 𓍤 = 3, 𓍥 = 4, 𓎇 = 12
    Missing values filled using sequential unicode codepoints
    . = decimal point
  
    Args:
        hieroglyph_str (str): String containing hieroglyphs and optional decimal point
      
    Returns:
        float: The decimal equivalent of the hieroglyph number
      
    Raises:
        ValueError: If string contains invalid characters
    """
  
    # Base mapping of known hieroglyphs
    base_hieroglyph_map = {
        '𓍢': 1,  # U+1337A2
        '𓍣': 2,  # U+1337A3
        '𓍤': 3,  # U+1337A4
        '𓍥': 4,  # U+1337A5
        '𓎇': 12  # U+13387
    }
  
    # Generate complete mapping by filling missing values with sequential unicode
    hieroglyph_map = {}
  
    # Add known mappings
    for char, value in base_hieroglyph_map.items():
        hieroglyph_map[char] = value
  
    # Fill in missing values (0, 5-11, 13) using sequential unicode from base characters
    # Start from 𓍢 (U+1337A2) and generate sequential characters
    base_codepoint = ord('𓍢')  # U+1337A2
  
    # Generate missing characters for values 0, 5-11, 13
    missing_values = [0, 5, 6, 7, 8, 9, 10, 11, 13]
    used_codepoints = {ord(char) for char in base_hieroglyph_map.keys()}
  
    current_codepoint = base_codepoint
    for value in missing_values:
        # Find next unused codepoint
        while current_codepoint in used_codepoints:
            current_codepoint += 1
      
        char = chr(current_codepoint)
        hieroglyph_map[char] = value
        used_codepoints.add(current_codepoint)
        current_codepoint += 1
  
    # Check for decimal point
    if '.' in hieroglyph_str:
        if hieroglyph_str.count('.') > 1:
            raise ValueError("Multiple decimal points found")
      
        integer_part, fractional_part = hieroglyph_str.split('.')
    else:
        integer_part = hieroglyph_str
        fractional_part = ""
  
    # Parse integer part
    integer_value = 0
    for i, char in enumerate(reversed(integer_part)):
        if char not in hieroglyph_map:
            raise ValueError(f"Invalid hieroglyph: {char}")
      
        digit_value = hieroglyph_map[char]
        integer_value += digit_value * (14 ** i)
  
    # Parse fractional part
    fractional_value = 0.0
    for i, char in enumerate(fractional_part):
        if char not in hieroglyph_map:
            raise ValueError(f"Invalid hieroglyph: {char}")
      
        digit_value = hieroglyph_map[char]
        fractional_value += digit_value * (14 ** -(i + 1))
  
    return integer_value + fractional_value


def decimal_to_hieroglyph(decimal_num, precision=6):
    """
    Convert a decimal number to hieroglyph representation in base-14.
  
    Args:
        decimal_num (float): Decimal number to convert
        precision (int): Number of decimal places for fractional part
      
    Returns:
        str: Hieroglyph representation
    """
  
    # Generate reverse mapping
    base_hieroglyph_map = {
        '𓍢': 1,  # U+1337A2
        '𓍣': 2,  # U+1337A3
        '𓍤': 3,  # U+1337A4
        '𓍥': 4,  # U+1337A5
        '𓎇': 12  # U+13387
    }
  
    # Create complete value-to-hieroglyph mapping
    value_to_hieroglyph = {}
  
    # Add known mappings
    for char, value in base_hieroglyph_map.items():
        value_to_hieroglyph[value] = char
  
    # Fill in missing values using sequential unicode
    base_codepoint = ord('𓍢')
    missing_values = [0, 5, 6, 7, 8, 9, 10, 11, 13]
    used_codepoints = {ord(char) for char in base_hieroglyph_map.keys()}
  
    current_codepoint = base_codepoint
    for value in missing_values:
        while current_codepoint in used_codepoints:
            current_codepoint += 1
      
        char = chr(current_codepoint)
        value_to_hieroglyph[value] = char
        used_codepoints.add(current_codepoint)
        current_codepoint += 1
  
    # Handle negative numbers
    if decimal_num < 0:
        return "-" + decimal_to_hieroglyph(-decimal_num, precision)
  
    # Separate integer and fractional parts
    integer_part = int(decimal_num)
    fractional_part = decimal_num - integer_part
  
    # Convert integer part
    if integer_part == 0:
        integer_hieroglyphs = value_to_hieroglyph[0]
    else:
        integer_hieroglyphs = ""
        temp = integer_part
      
        while temp > 0:
            remainder = temp % 14
            temp = temp // 14
            integer_hieroglyphs = value_to_hieroglyph[remainder] + integer_hieroglyphs
  
    # Convert fractional part
    fractional_hieroglyphs = ""
    if fractional_part > 0:
        for _ in range(precision):
            fractional_part *= 14
            digit = int(fractional_part)
            fractional_part -= digit
          
            fractional_hieroglyphs += value_to_hieroglyph[digit]
          
            if fractional_part == 0:
                break
  
    # Combine parts
    result = integer_hieroglyphs
    if fractional_hieroglyphs:
        result += "." + fractional_hieroglyphs
  
    return result


def float_equals(a, b, epsilon=1e-9):
    """Compare two floats with epsilon tolerance."""
    return abs(a - b) < epsilon


# Example usage and tests
if __name__ == "__main__":
    # Test parsing with expected values
    test_cases = [
        ("𓍢", 1.0),                           # 1
        ("𓍣", 2.0),                           # 2
        ("𓍢𓍢", 15.0),                        # 1*14 + 1 = 15
        ("𓍢𓍣", 16.0),                        # 1*14 + 2 = 16
        ("𓎇", 12.0),                          # 12
        ("𓍢.𓍢", 1 + 1/14),                   # 1.1 in base 14 = 1 + 1/14 ≈ 1.071428571
        ("𓍣.𓍤", 2 + 3/14),                   # 2.3 in base 14 = 2 + 3/14 ≈ 2.214285714
        ("𓍢𓍢.𓍢𓍣", 15 + 1/14 + 2/196),      # 15 + 1/14 + 2/196 ≈ 15.081632653
    ]
  
    print("Testing hieroglyph parsing:")
    all_passed = True
  
    for hieroglyph, expected in test_cases:
        try:
            result = parse_hieroglyph_number(hieroglyph)
            if float_equals(result, expected):
                print(f"✓ {hieroglyph} = {result} (expected {expected})")
            else:
                print(f"✗ {hieroglyph} = {result} (expected {expected}) - FAILED")
                all_passed = False
        except ValueError as e:
            print(f"✗ {hieroglyph} -> Error: {e} - FAILED")
            all_passed = False
  
    print(f"\nParsing tests: {'ALL PASSED' if all_passed else 'SOME FAILED'}")
  
    # Test decimal to hieroglyph conversion
    print("\nTesting decimal to hieroglyph conversion:")
    decimal_tests = [
        (0.0, None),    # Will show what 0 looks like
        (1.0, None),    # Will show what 1 looks like
        (2.0, None),    # Will show what 2 looks like
        (15.0, None),   # Will show what 15 looks like
        (16.0, None),   # Will show what 16 looks like
        (12.0, None),   # Will show what 12 looks like
    ]
  
    conversion_passed = True
    for num, expected_hieroglyph in decimal_tests:
        try:
            hieroglyph = decimal_to_hieroglyph(num)
            # Test round-trip conversion
            back_to_decimal = parse_hieroglyph_number(hieroglyph)
          
            if float_equals(back_to_decimal, num):
                print(f"✓ {num} -> {hieroglyph} -> {back_to_decimal}")
            else:
                print(f"✗ {num} -> {hieroglyph} -> {back_to_decimal} - ROUND-TRIP FAILED")
                conversion_passed = False
              
        except ValueError as e:
            print(f"✗ {num} -> Error: {e} - FAILED")
            conversion_passed = False
  
    print(f"\nConversion tests: {'ALL PASSED' if conversion_passed else 'SOME FAILED'}")
  
    # Show the complete hieroglyph mapping
    print(f"\nGenerated hieroglyph mapping for base-14:")
    hieroglyph_map = {}
    base_hieroglyph_map = {
        '𓍢': 1, '𓍣': 2, '𓍤': 3, '𓍥': 4, '𓎇': 12
    }
  
    for char, value in base_hieroglyph_map.items():
        hieroglyph_map[char] = value
  
    base_codepoint = ord('𓍢')
    missing_values = [0, 5, 6, 7, 8, 9, 10, 11, 13]
    used_codepoints = {ord(char) for char in base_hieroglyph_map.keys()}
  
    current_codepoint = base_codepoint
    for value in missing_values:
        while current_codepoint in used_codepoints:
            current_codepoint += 1
        char = chr(current_codepoint)
        hieroglyph_map[char] = value
        used_codepoints.add(current_codepoint)
        current_codepoint += 1
  
    for value in range(14):
        for char, char_value in hieroglyph_map.items():
            if char_value == value:
                print(f"{char} = {value}")
                break

Is it hard to write? No, any 1st-year student should be able to. But it's a more than good enough attempt.

And ask it to translate, like English to Dutch, or Python to Java + JUnit, and you see how it "gets" the concept and can apply it to output anything to spec.



Oh, I couldn't be bothered to wait forever, so this is a quick attempt from Claude 4 Sonnet, not Opus; Opus is a waste of tokens and compute time. Usually Sonnet or Gemini Flash is good enough for what I ask.
 
Last edited:

Dmytry

Ars Legatus Legionis
11,408
Well, the code is quite obviously utter shite, for lack of any higher-level insight whatsoever (duplicated code to build the mapping, no constant for the mapping, tests that never exercise any of the missing characters, etc.).

And the choice of how to fill in the missing characters is rather silly: 5's slot is used for 0, etc.
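The duplication complaint is easy to illustrate: the same fill loop appears three times in the generated code, where a single module-level constant would do. A sketch only, keeping the generated code's fill order (so the questionable 0-after-4 assignment is unchanged):

```python
# Sketch: build the base-14 digit map once, at import time, instead of
# re-deriving it inside every function as the generated code does.
KNOWN_DIGITS = {'𓍢': 1, '𓍣': 2, '𓍤': 3, '𓍥': 4, '𓎇': 12}

def _fill_missing(known):
    """Assign the remaining digit values (0, 5-11, 13) to the next free codepoints."""
    digit_map = dict(known)
    used = {ord(c) for c in known}
    cp = ord('𓍢')
    for value in (0, 5, 6, 7, 8, 9, 10, 11, 13):
        while cp in used:
            cp += 1
        digit_map[chr(cp)] = value
        used.add(cp)
        cp += 1
    return digit_map

HIEROGLYPH_MAP = _fill_missing(KNOWN_DIGITS)                  # char -> value
VALUE_TO_HIEROGLYPH = {v: k for k, v in HIEROGLYPH_MAP.items()}  # value -> char
```

Both the parser and the converter could then import these two constants instead of rebuilding them.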

It is blind, so it doesn't know that 1 and 15 are indistinguishable; that's fine, but a lot of real code is UI for the sighted and requires sight to develop.

For a relatively simple thing like this it is of course not possible to identify where the pieces came from.

But people had been claiming it is great at things like hundreds of lines' worth of a Delaunay triangulation implementation.

Also, in your digit conversion you have it do a bunch of arithmetic to write that code, both symbolically and otherwise. At the same time, any time LLMs are shit at arithmetic, we are told that they are intrinsically shit at arithmetic, that this is a very special issue with arithmetic and arithmetic only, and that it won't affect coding or language tasks.

Edit: also, just consider for a second how much open-source code there is, vs non-code prose, in the world. Just do a napkin estimate. Obviously there's an orders-of-magnitude difference in the amount of training data.

Edit: I personally tried Gemini 2.5 Pro at implementing a special relativity visualizer, and it of course fucked that up completely. It could have succeeded rather easily if its purpose were anything other than being a tech demo of how they're gonna get us all fired: it could have identified working code with a suitable license and used that.

There is zero metacognition, so it can't go, "OK, I can't write a special relativity visualizer, I have to borrow one."

It plagiarized little pieces that couldn't fit together due to mismatched conventions, when it could have exactly copied something with the right license.
 
Last edited:

Pino90

Ars Scholae Palatinae
1,393
Subscriptor
Once you get used to feeding it bite-sized, well-defined problems in chunks, it works impressively well. I'm struggling with whether I actually enjoy it - but it almost certainly makes me more productive for specific classes of feature work.
Not only this, but often I find that it is very useful to give it the context. With Claude, I simply upload a few files and tell it to focus on a specific class or interaction. I describe the use case I want to support and the specific feature/method/whatever that I need. Usually results are very good. It is a little bit boring to create a new prompt every time, but it works very well.
 

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
Not only this, but often I find that it is very useful to give it the context. With Claude, I simply upload a few files and tell it to focus on a specific class or interaction. I describe the use case I want to support and the specific feature/method/whatever that I need. Usually results are very good. It is a little bit boring to create a new prompt every time, but it works very well.
People don't use this enough.

I was asking about the specific latent heat of fusion, and it replied that it didn't know about freezing chicken. Feed it a PDF, and it suddenly knows the value is based on the water % of the food; chicken and turkey were in the PDF, so it knows those. Next, ask it about beef and prawn and it will use search to find the water %, and apply that to estimate the specific latent heat of fusion.

Not everything has to be reproduced from training data. From my code repo, it can infer style (I use camelCase in Python, which is rare), indentation, line length, and where to break after parentheses, braces, etc.

Gemini aggressively adds a ton of context, 1M-token buffer ftw.
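The estimate described here is simple enough to sketch. The 334 kJ/kg figure is the latent heat of fusion of pure water; the water fraction is an illustrative number, not one from the PDF:

```python
# Sketch of the estimate: a food's specific latent heat of fusion is roughly
# the latent heat of fusion of water, scaled by the food's water fraction.
WATER_LATENT_HEAT = 334.0  # kJ/kg, latent heat of fusion of pure water

def latent_heat_estimate(water_fraction):
    """Approximate specific latent heat of fusion of a food, in kJ/kg."""
    return water_fraction * WATER_LATENT_HEAT

# Chicken is roughly 70% water (illustrative figure)
print(latent_heat_estimate(0.70))  # ≈ 234 kJ/kg
```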
 
  • Like
Reactions: Pino90

Pino90

Ars Scholae Palatinae
1,393
Subscriptor
People don't use this enough.

I was asking about the specific latent heat of fusion, and it replied that it didn't know about freezing chicken. Feed it a PDF, and it suddenly knows the value is based on the water % of the food; chicken and turkey were in the PDF, so it knows those. Next, ask it about beef and prawn and it will use search to find the water %, and apply that to estimate the specific latent heat of fusion.

Not everything has to be reproduced from training data. From my code repo, it can infer style (I use camelCase in Python, which is rare), indentation, line length, and where to break after parentheses, braces, etc.

Gemini aggressively adds a ton of context, 1M-token buffer ftw.
I completely agree, often people do not understand the importance of a good context. You need to provide it and it must be very detailed, exactly as you would do with a recent graduate. You can't just tell them to "do something" and expect that they will actually do what you are asking. You need to be very precise in what you want, provide context, goals and then offer guidance.

To me it is exactly like having a young assistant 24/7: it doesn't have any experience in what I am actually trying to accomplish, but it studied a lot - way more than any actual person could study at university. So I need to provide context, review results and guide it towards the actual result that I need: who would completely trust a recent graduate with something critical? Even with this added overhead, the productivity increase is brutal.

Beyond basic context-setting, I adjusted my workflow slightly for more complex stuff, e.g. things that are a little more than a simple refactor or a replacement for a SO search. Since context is very important, I use instances to create it for me and to actually implement TDD with LLMs (specifically Claude). Using separate instances prevents context contamination and keeps each step focused; even if I have noticed a huge improvement in how conversations are handled, newer models still struggle with large texts and rapidly lose precision. The high level flow is mostly: specification writing → real-world data validation → test creation → implementation, each with its own instance.

First things first, I ask Claude to write a specification for me. With another instance, I try to provide real world data to update the spec, to ensure that it describes the program I actually want - here it is important to mention that Claude must know that the result will be fed to another instance of Claude. Then, I feed the new, refined spec to another instance and work with it to create a small validation set with toy examples. Afterwards, I ask another instance to create tests for the spec based on the examples. Then I feed an instance with the spec, the context in which the result must be used, and the validation tests and ask for an implementation.

This workflow usually produces very very good results in a few minutes. As my skill at designing high level software architectures and prompting improves, I see that the results improve as well. Note that this works spectacularly well even with the less "powerful" Claude Sonnet model.

While there are edge cases it consistently struggles with, recognizing those patterns helps me work around them. For instance, a pretty common task that I automated with Claude is creating class generators starting from a custom YAML definition. The main objective is to actually create the modules, so that autocompletion can work in the IDE. The "cardinality" field for objects is usually either 1, or "1..n", or "0..n". Every single time it gets this wrong, as it assumes that the cardinality will always be an integer. I can't get Claude to provide code that actually works in this specific use case, but it is just this one use case; in general the results are amazing.
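The cardinality case is concrete enough to sketch what the correct handling looks like; the function name and the exact range syntax below are my guesses at the use case, not the poster's actual generator code:

```python
# Sketch: cardinality is either an int (exactly-one) or a range string,
# "1..n" / "0..n", where "n" means unbounded. This is the branch the model
# reportedly keeps getting wrong by assuming an int.
def parse_cardinality(value):
    """Return (min, max), with max=None meaning unbounded."""
    if isinstance(value, int):
        return (value, value)
    lo, _, hi = str(value).partition("..")
    return (int(lo), None if hi == "n" else int(hi))

print(parse_cardinality(1))       # (1, 1)
print(parse_cardinality("1..n"))  # (1, None)
print(parse_cardinality("0..n"))  # (0, None)
```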
 

Dmytry

Ars Legatus Legionis
11,408
That discussion does leave me wondering if you even actually get ahead of a qualified software engineer who knows what they're doing, when getting some non-trivial product completed all the way, to the same quality level. Multiple rounds of re-prompting and so on and so forth, combined with not having as much knowledge of the code for when things go wrong...

I mean, I'm not pooh-poohing the use case where "I couldn't do this" is transformed into "I could do this", that is obviously valuable, or the use case of "I need to throw together some matplotlib stuff".

Other thing that I noticed is that something really strange is going on at companies using AI heavily.

Google (which I assume heavily dogfoods the AI) took "Gemini 2.5 pro" from "experimental" to "stable" a few days ago. It is described as "Reasoning, math & code" in the selector.

The first thing I do is ask it to multiply 2 6-digit numbers, it outputs an incorrect result and in its "reasoning" log it says it used a code interpreter and a direct calculator to double check. This is 2 out of 3 claimed features completely broken, on the most elementary test you can perform, whereby people had been pointing and laughing at this failure for the entirety of their "beta" period as well. (They fixed this 2 days after).

Of course, their rabid fanbase has no problem with this whatsoever because it is the wrong tool (for what it says on the box), but this is not normal. QA should have caught it instantly, before even the "experimental" release.
 
Last edited:
  • Like
Reactions: educated_foo

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
You're still using it wrong. What's the obsession with A x B? Just type it in a calculator.

If you can train a model to use grep, rg, you can train it to recognize calculations and feed it to calc.exe. That Gemini failed to do so doesn't mean it is impossible, it's just not something they are focussing on right now. I wouldn't either, the use case is stupid, don't waste tokens on something this silly.
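The routing idea is small enough to sketch. This assumes the simplest possible dispatch, a regex for bare multiplications with everything else passed through; real tool-calling is model-side, not a regex, so treat this as an illustration of the principle only:

```python
import re

# Sketch: recognize "a x b" style queries and compute them exactly in code,
# rather than letting the language model guess the digits.
MUL = re.compile(r"^\s*(\d+)\s*[x×*]\s*(\d+)\s*$")

def route(query, fallback=lambda q: "(ask the model)"):
    m = MUL.match(query)
    if m:
        return int(m.group(1)) * int(m.group(2))  # exact arithmetic
    return fallback(query)

print(route("382738 x 827381"))  # 316670149178
```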


And just like with coding, giving feedback helps, even though in this case, god knows wtf is happening in that massive neural network. Maybe it's like a human and goes, "sigh, here's an estimate, good enough."

Gemini Flash:

Calculate 382738 x 827381. Use a calculator.
382738 × 827381 = 316538908758
The answer seems wrong. Are you sure?
You are absolutely correct to question it! I apologize. I made a mistake by not using a calculator tool for the second problem as requested, and instead just stated an incorrect answer.
Let me correct that immediately using the calculator:
The correct answer to 382738 × 827381 is 316,670,149,178.
Thank you for catching my error!

Claude / Junie / Gemini rarely scores a goal in one. Making sure the code is right, readable and decently structured is the job of the architect and reviewer, treat it like a junior that needs guidance. For permanent instructions, add it to CLAUDE.md.



It wouldn't surprise me if, as we get more experience with LLMs, we get detailed guides on how to get the most out of them. Like those prompt hacks for Stable Diffusion: "highly detailed", "photorealistic", "unreal engine" have a crazy effect on the result. My "use a calculator" + "are you sure" somehow makes Gemini not fuck up A x B.

Like: https://blog.weights.com/post/stable-diffusion-prompt-guide
 
Last edited:

Pino90

Ars Scholae Palatinae
1,393
Subscriptor
How do you do that, multiple root folders / checkouts of the same project or can you give each their own CLAUDE.md?

Or just start 3 copies to keep their short term memory separate?

I actually use two different Claude interfaces depending on the task:

For the multi-instance workflow I described, I use the classic Claude chat interface, not Claude Code. Usually with sonnet for the spec and tests, sometimes I switch to opus for the implementation. Each step gets its own separate chat conversation. This gives me full control over the context for each instance. I can feed each one exactly the information it needs without contamination from previous steps, and it gives me a lot of control on the result that I want to achieve at each step. The more I use it the more I get convinced that it is super important to keep the context small and accurate, and this is the solution I came up with.

I use Claude Code only for tasks that really need editor integration, or where I can trust the instance to get it right. Things like refactoring existing code, applying changes across multiple files, or when I need the tool to directly interact with my filesystem and development environment.

This lets me be very deliberate about what context each Claude sees. The spec-writing instance focuses purely on requirements, the test-writing instance only sees the spec and examples (not the implementation), and so on. This prevents each step from being influenced by decisions made in earlier steps, which helps maintain the integrity of the validation process and, more importantly, keeps the quality of the output consistently high.

I think that the single thing I do most often is ask Claude to create a prompt that lets me continue the conversation with another instance, because IME the quality of the output gets way worse very quickly as the conversation becomes longer.

Of course it is a little bit annoying to switch to the chat, but it is the workflow that works best for me at this time.

IMHO the integration with the editor is not quite there yet; probably it will be in a couple of versions. The fact is that it is designed for a continuous, iterative workflow, but since there is no explicit way to reset the context, the quality of the results gets diluted after a short while.
 
Last edited:
  • Like
Reactions: w00key

Dmytry

Ars Legatus Legionis
11,408
You're still using it wrong. What's the obsession with A x B? Just type it in a calculator.
Ever tested a product before releasing it?

The first step is that you take a look at what the claimed features are ("Reasoning, math & code"), and you check if they appear to work, starting simple.

A real use case may have some query where e.g. the tool looks up the density of mild steel and multiplies it with some other numbers to get a weight, saving user a lot of time comparing to googling up all the numbers and using a calculator.

Imagine TI releases an advanced scientific calculator, but if you try to multiply two numbers, the result is wrong.

Can you step back from the "I think AI is great" part of the argument and recognize how pathological it is to, in response to such an issue, point out that you can use a regular dollar-store calculator to compute the multiplication instead?


If you can train a model to use grep, rg, you can train it to recognize calculations and feed it to calc.exe. That Gemini failed to do so doesn't mean it is impossible, it's just not something they are focussing on right now. I wouldn't either, the use case is stupid, don't waste tokens on something this silly.
They did fix this in ~2 days, so it must have been a relatively easy fix. For me, at least right now, it shows a fold-out box for the code (it previously didn't), and uses the calculator even when not asked to, even for 2-digit operations.

Anyhow, the product description is "reasoning, math & code", not just "code". If you are doing "reasoning", the need to use a calculator may organically come up. With code, the best practice is usually to just write the expression straight into the source, so there is a need to not perform approximate calculations.

Claude / Junie / Gemini rarely scores a goal in one. Making sure the code is right, readable and decently structured is the job of the architect and reviewer, treat it like a junior that needs guidance. For permanent instructions, add it to CLAUDE.md.
Juniors learn on the job, though. True, an AI company can learn from your tricks and deploy them to its other customers, but that means I can just leave developing all these tricks to you and check back a bit later.

In any case, I'm currently working on very math-heavy code where it is thoroughly useless (I'm writing a linkage synthesis tool at the moment). I may try using it on the visualizer, or at least to write throwaway matplotlib code.

It wouldn't surprise me if we get more experience with LLMs we get detailed guides on how to get the most out of it. Like those prompt hacks for Stable Diffusion, "highly detailed", "photorealistic", "unreal engine" have crazy effect on the result. My "use a calculator" + "are you sure" somehow makes Gemini not fuck up A x B.

Like: https://blog.weights.com/post/stable-diffusion-prompt-guide
I'd think the need for such guides would be short-lived, since it is trivial for an AI company to incorporate all of that into the prompt.

Also, for fun there's the response (pre-fix):

Screenshot from 2025-06-17 14-38-51.png
edit: note that the result is wrong.

I get it, the wrongness works, it makes some people think that they are on the verge of AGI or something. But try to look at it objectively. I got a box that says "Hammer, screwdriver, and a saw". I open it and a psychopath jumps out, armed with the above. (Upon closer examination the psychopath is just an inflatable dummy.)
 
Last edited:
My "use a calculator" + "are you sure" somehow makes Gemini not fuck up A x B.
Your example is hilarious!

Q: Calculate $blah. Use a calculator.
A: $something
Q: That seems wrong. Are you sure?
A: I absolutely agree with you, and apologize profusely for my stupidity! Let me use a calculator. The result is $something_else.

Why would you trust that $something_else is any closer to the answer than $something? If $something had been correct and you had told it "that seems wrong," would it give you a different answer to please you? If you repeat the "that seems wrong" prompt, does it give you yet a third number?

It feels unfair to dunk on LLMs for being confident-sounding sycophants that are bad at math, but it's just an example that demonstrates they (1) are limited in unpredictable ways, and (2) give you no indication when you have exceeded those limits. Type an invalid program into Python, and it will complain about syntax errors. Type one into an LLM, and it will probably give you a made-up result to please you (unless it has been special-cased to shell out to Python). That's not something I want to build on top of.

A tiny little quantized model I have locally wrote me correct code for computing the Great Circle Distance (Haversine function) in a couple seconds, probably because it memorized it from Stack Overflow or Github. Sweet! That's faster than I could have done it. Then I asked it "what is 9.11 - 9.9?" and it gave me utterly confident but insane text salad:
To calculate 9.11 - 9.9, you subtract the two numbers directly. 9.11 -9.9 --------- 1.22 So, 9.11 - 9.9 equals 1.22.
I... don't even know how it ended up with that. So we have a fairly simple rule: whenever math is happening, call out to a real calculator. But what's the next place it will be confidently wrong? Are we sure we caught all the math? What if we said "optimize the following C function" and it did its own constant-folding or loop-unrolling?

I just worry that the "prompt guide" will be extensive, weird, and never quite enough to corral the stupidity.
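For reference, the Haversine great-circle distance mentioned above is short enough to write out. This is the standard formula (mean Earth radius 6371 km), not the local model's actual output:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# ...and the subtraction the model flubbed, done the boring way:
print(round(9.11 - 9.9, 2))  # -0.79
```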
 
  • Like
Reactions: Dmytry

Dmytry

Ars Legatus Legionis
11,408
I agree it's unfair to dunk on LLMs for being bad at math. I'm dunking on this product:

Screenshot from 2025-06-19 13-03-43.png

I can't even comprehend all that has to go wrong at Google, culturally, to take a tool like this from "experimental" to "stable". And it's not some minor thing: it's the better version of what they put as the leftmost tab on their search page.

edit: and to loop back to "coding with AI", Google is doing everything with AI and it does leave you wondering if there's some brain-rotting effect they got, company-wide, because leaving aside all the philosophical questions of machine sentience, ethical questions of plagiarism, and so on, this is objectively very pathological.

edit: also, in between the release and the apparent fix, I had my own trick for it: "use an external calculator" worked and "use an actual calculator" did not. It takes a lot of time to test whether something like this works, though, or whether it's just sampling error & placebo effect. Any trick like this would require dozens of prompts just to obtain a statistically significant result that it works.
 
Last edited:
  • Like
Reactions: educated_foo

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
I agree it's unfair to dunk on LLMs for being bad at math. I'm dunking on this product:

Screenshot from 2025-06-19 13-03-43.png

I can't even comprehend all that has to go wrong at Google, culturally, to take a tool like this from "experimental" to "stable". And it's not some minor thing: it's the better version of what they put as the leftmost tab on their search page.
"Reasoning, math & code" is rather broad, though. But when I think of math, I don't think of a calc.exe replacement, but more like what I did last:


The unit of thermal conductivity is W/(m·K), like 0.039 W/(m·K). It's a bit of a confusing unit, as you need to fit an area in m² (1 m²) and a thickness in m (15 mm) into it. Throw it into an LLM and it will sort of know the context (it read Wikipedia) and do C = kA/L (multiplying by m² and dividing by m to cancel the meters) => C = 2.6 W/K.
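That unit bookkeeping is a one-liner in code; a sketch of the same C = k·A/L calculation, with the post's numbers:

```python
# Sketch: thermal conductance of a flat slab, C = k * A / L.
def slab_conductance(k_w_per_mk, area_m2, thickness_m):
    """Conductance in W/K from conductivity, area, and thickness."""
    return k_w_per_mk * area_m2 / thickness_m

# 0.039 W/(m·K) insulation, 1 m² area, 15 mm thick
print(round(slab_conductance(0.039, 1.0, 0.015), 1))  # 2.6 W/K
```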


Or converting SEER from EU to US, W/W to BTU/Wh.


I could do it by hand, sure, but this gives me a result in like a second. When you're busy, and this isn't the main point, and you just need to quickly check something, this is fine.

What it can't do, it seems, is work with long numbers, 6 or more significant digits, at least for a x b. No idea why. But my numbers are all short, and "3.14" is fine, so eh, whatever.
 

Dmytry

Ars Legatus Legionis
11,408
"Reasoning, math & code" is rather broad, though. But when I think of math, I don't think of a calc.exe replacement, but more like what I did last:


The unit of thermal conductivity is W/(m·K), like 0.039 W/(m·K). It's a bit of a confusing unit, as you need to fit an area in m² (1 m²) and a thickness in m (15 mm) into it. Throw it into an LLM and it will sort of know the context (it read Wikipedia) and do C = kA/L (multiplying by m² and dividing by m to cancel the meters) => C = 2.6 W/K.


Or converting SEER from EU to US, W/W to BTU/Wh.


I could do it by hand, sure, but this gives me a result in like a second. When you're busy, and this isn't the main point, and you just need to quickly check something, this is fine.
That's the sort of test I did a year ago when the push for AI happened at work. OpenAI's stuff, with a similar "math" claim.

I asked it to calculate the weight of a piece of copper wire; I don't remember the thickness, but it wasn't anything I wouldn't have in a drawer.

It fucked that up by a factor of 10 000 or something, because it had trouble with regular vs scientific notation.

What it can't do, it seems, is working with long numbers 6 or more significant digits, at least with a x b. No idea why. But my numbers are all short and "3.14" is fine so eh, whatever.
Then you get some units with m^-3 and you apply it to a centimeter sized object.
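
A concrete version of that trap, sketched with an illustrative 2.4 MJ/(m³·K) volumetric heat capacity: a quantity quoted per m³ applied to a cm-scale object is off by a factor of 10⁶ if the volume conversion gets dropped.

```python
# Volumetric quantity quoted per cubic metre, object measured in cm.
c_vol = 2.4e6            # J/(m^3*K), illustrative value
volume_cm3 = 1.0         # a 1 cm^3 part
volume_m3 = volume_cm3 * 1e-6       # 1 cm^3 = 1e-6 m^3 -- the easy-to-drop factor
heat_capacity = c_vol * volume_m3   # J/K
print(f"{heat_capacity:.1f} J/K")   # 2.4 J/K, not 2.4e6 J/K
```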

The simpler test with 2 numbers came after another person pointed out that my tests were way more complicated than they needed to be.

The tool integration helps up to a point; it still needs to get the numbers in there without fucking up the units.

Also, it's not like it just spat out an incorrect result and left it at that. It did correctly infer that accuracy was important to the user, and it converted that request into assurances that it had checked and double-checked the result. (Rather than converting it into a tool-run command, and converting the presence of the tool's output into such assurances.)

edit: also, I was hoping that we could all agree that multiplication is math. Because I know that if we can't agree on that, then in the case of any more elaborate failure, we won't be able to agree that dealing with unit names is math, or that parsing the right unit out of the heading of a table it googled up is math. In your use case, let's say it finds a table of material properties with the units in the first row, and this is genuinely difficult for an LLM to parse (or rather, to parse reliably, considering all the malformed HTML out there).

Honestly, I feel that Google describing a chatbot as "reasoning, math &" is sort of like the "guy proposes to his chatbot" news. There is absolutely no way their product could pass even the most cursory traditional QA, unless they just fire anyone who raises any kind of critique.
 
Last edited:

w00key

Ars Tribunus Angusticlavius
8,822
Subscriptor
Then you get some units with m^-3 and you apply it to a centimeter sized object.
Yeah, older LLMs definitely made these mistakes in the past. But do you have a more recent example?

For fun, I made it calculate something that I would find horrible, mixing imperial and metric units, and it works? Gemini Flash:

What is the weight of 1000 ft of 10 AWG bare copper wire?
1000 feet of 10 AWG bare copper wire weighs approximately 31.4 to 31.5 pounds.
This value can vary slightly depending on whether it's solid or stranded wire, and the specific manufacturer's specifications. However, the general range is consistent across various sources.
What is the density of copper (weight per m3)?
The density of pure copper is approximately 8,960 kg/m³ (kilograms per cubic meter) at room temperature.
This value is widely accepted and used in various scientific and engineering applications. It can slightly vary depending on factors like temperature and the specific alloy composition (if it's not pure copper).
What is the volume of 1000 ft of 10 AWG copper wire?
To calculate the volume of 1000 ft of 10 AWG copper wire, we can use the formula for the volume of a cylinder:
Volume = \pi \times r^2 \times h
Where:
* \pi (pi) is approximately 3.14159
* r is the radius of the wire
* h is the length of the wire (height of the cylinder)
First, let's find the diameter of 10 AWG copper wire. From various sources, the diameter of 10 AWG solid copper wire is approximately 0.1019 inches.
Now, let's calculate the radius:
Radius (r) = Diameter / 2 = 0.1019 inches / 2 = 0.05095 inches
Next, we need to convert the length from feet to inches, as the diameter is in inches:
Length (h) = 1000 ft * 12 inches/ft = 12000 inches
Now, plug the values into the volume formula:
Volume = \pi \times (0.05095 \text{ inches})^2 \times 12000 \text{ inches}
Volume = 3.14159 \times 0.0025959025 \text{ inches}^2 \times 12000 \text{ inches}
Volume = 97.83 \text{ cubic inches}
To convert this to cubic meters (since density is given in kg/m³):
1 inch = 0.0254 meters
1 cubic inch = (0.0254 \text{ meters})^3 = 0.000016387 \text{ m}^3
Volume in cubic meters = 97.83 \text{ cubic inches} \times 0.000016387 \text{ m}^3/\text{cubic inch}
Volume = 0.0016029 \text{ m}^3
So, the volume of 1000 ft of 10 AWG copper wire is approximately 97.83 cubic inches or 0.00160 cubic meters.
Use the calculated volume and the previous density to calculate the weight
Sure, let's calculate the weight using the volume and density.
We have:
* Volume = 0.0016029 \text{ m}^3
* Density of copper = 8,960 \text{ kg/m}^3
The formula for weight (mass) is:
Weight = Density \times Volume
Weight = 8,960 \text{ kg/m}^3 \times 0.0016029 \text{ m}^3
Weight = 14.364 \text{ kg}
To convert this to pounds:
1 kg \approx 2.20462 pounds
Weight in pounds = 14.364 \text{ kg} \times 2.20462 \text{ pounds/kg}
Weight in pounds \approx 31.66 \text{ pounds}
So, based on the calculated volume and the standard density of copper, the weight of 1000 ft of 10 AWG copper wire is approximately 14.36 kg or 31.66 pounds.
This aligns very closely with the initial answer provided (31.4 to 31.5 pounds), which is a good cross-validation of the calculations. The minor difference is likely due to rounding in the intermediate steps (e.g., diameter of the wire, value of pi) and using a precise density value.

That went just as well as expected. A bit too wordy, but I don't have any system prompts configured for styling.
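
For what it's worth, the whole chain is easy to cross-check in a few lines, using the transcript's own inputs (0.1019 in diameter for solid 10 AWG, 8,960 kg/m³ for copper):

```python
import math

IN_TO_M = 0.0254
FT_TO_M = 0.3048
KG_TO_LB = 2.20462

d = 0.1019 * IN_TO_M      # 10 AWG solid wire diameter, m
length = 1000 * FT_TO_M   # 1000 ft in m
density = 8960            # copper, kg/m^3

volume = math.pi * (d / 2) ** 2 * length   # cylinder volume, m^3
mass_kg = density * volume
print(f"{mass_kg:.2f} kg = {mass_kg * KG_TO_LB:.1f} lb")  # 14.37 kg = 31.7 lb
```

Same ~31.7 lb answer as the transcript, so at least in this run the model's step-by-step arithmetic holds up.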


What I just realized: during Calculus and Introduction to Differential Equations, there were times when I read and reread a chapter and it just didn't make sense. It could have been the Greek-letter overload, or me not practicing enough, but having an LLM back then would have made it possible to have it dumb things down, step by step, to ELI5 level, with as many examples as I needed, until a concept clicked.

I need to explore this idea more; this may be a breakthrough for learning something in depth in record time. Feed it a chapter of a textbook, or a paper, or a treatment protocol, and go through it together step by step.
 
I've caught ChatGPT screwing up units of measure a number of times; you can't trust it to get them right through multiple steps.

I don't obsess over that, though; I just verify the formulas it uses, validate the results, and move on.

Yes, if you get complacent, treat it like the expert, and don't validate anything, it'll bite you in the ass. But if you're working that way, you aren't the expert either; you've deferred all responsibility.
 

Dmytry

Ars Legatus Legionis
11,408
Manufacturers literally tell you how much 1000 ft of #10 AWG wire weighs. Here's my best attempt at recreating the question from a year ago, first attempt:

[Screenshot of the chatbot's answer, 2025-06-21]

"Show code" does show a reasonably formatted answer, though, so I guess it's not entirely unusable, and it didn't just make something up. It just wrote some sort of a haiku.

I only started seeing "Show code" 2 days after their "stable" release, possibly in response to people poking fun at its "math".

Anyhow, my point is that the most basic attempt at QA fails instantly. It's like clicking the first button in a dialog and having it crash. Yeah, yeah, people who are dogfooding it may just never click the first button in a dialog; that's why you QA stuff.

edit: in particular I've seen that phenomenon with extremely buggy apps before. Developers learn not to trigger bugs.
 
Last edited:

Dmytry

Ars Legatus Legionis
11,408
Yeah, formatting is wonky; it left some LaTeX commands / junk in your output.

QA? Release to prod and let users deal with it, that's what staged rollouts are for.

Everyone got so fucking lazy in the software industry; it's crazy. It's not just "AI", it's everywhere.
To a broader point, it's just unusable for calculations because it's untrustworthy. It messed up the LaTeX this time; maybe it'll mess up the number or the units next time, who knows. It's unpredictable.

The issue with relying on external tools is that it also randomly regurgitates phrases like "A can be expressed as a sum of B and ", and good luck getting the next token to be [run tool] rather than one of the digits.