A large part of the reason that a mathematical LLM like this one, a protein LLM, or any other science-based LLM works is that its training data set has been scrupulously cleaned and QC'd. For example, if someone had slipped π=3 into the training data, the output would have had quite a few errors in it.
In contrast, the average LLM is trained on all sorts of nonsensical data (see: the internet), and so it outputs all sorts of nonsense (GIGO, garbage in, garbage out, as we used to say back in the day when we carved the symbols by hand on clay tablets).
And, unlike a person, an LLM is incapable of deleting training data that is erroneous. As a result, those bad inputs end up creating bad outputs, sometimes in obvious ways, sometimes in not so obvious ones.
And that is why LLMs are good as research tools but not for much more: in research, the user is usually smart enough to know the limitations of the LLM and wise enough not to take its advice about using glue to hold the cheese on pizza. But in general use, those two qualities are more the exception than the rule.