LLMs believe false statements even after explicit warnings that they’re false

SraCet · 2026-06-01T15:56:11-0400

graylshaped said:
Who said that matters, true or otherwise? I'm fascinated watching the strange maneuvering you are doing to pretend these models are not affected by probability and statistical methods.

Well, I listed some stuff that would definitely be statistics, and LLMs don't do that stuff.

So that seems pretty relevant to an argument about whether or not LLMs "do statistics," wouldn't you think?

If you think they are "affected by probability and statistical methods" then you're welcome to state your case.

graylshaped · 2026-06-01T16:02:19-0400

SraCet said:
Your text from post #439:

As long as we're waving degrees around, my undergraduate degree was in semiotics, and as one of those whose motives in expressing our point of view you call into question in your post, can tell you that "LLMs simply could not be as good as they are at language without at least close analogues of 'concepts' and 'understanding' " says more about your own superficial understanding of the relationships between sign and signifier and how that affects language and communication than it does about the capabilities of LLMs.

Saying that somebody is wrong about something seems like just as much of a claim as anything else?

I didn't say they were wrong. I said the statement expressed a superficial understanding of the relationship between parsing a language and what that implies about "understanding" and "concepts." Would you like to get into a discussion about the role of implicature and pragmatics without you googling to know what those terms mean in this context? Language involves far more than syntax and vocabulary. I'll even tell you up front that current LLMs do apply some of the principles I'm discussing, and continue to improve. The imitation improves incrementally as every day passes and every investment round funds.

Hey, aren't you one of the ones saying "understanding" wasn't binary?

SraCet · 2026-06-01T16:04:17-0400

JohnDeL said:
Yes, I have. I even wrote a couple of papers on how to integrate their results with other methods to give more realistic geological simulations.

If your probability distribution doesn't relate to the actual distribution, then you've fucked up. That is axiomatic to statistics.

Sure, it's going to end up being an approximation, influenced by your learning rate, batch size, and what training data you've trained on most recently. And also how well your model can identify the training data specifically.

But even if you correlate the output probability distribution with the distribution of your training data and there's good correlation, aren't you the entity that's doing the "probability" and the "statistics" in that case?

SraCet · 2026-06-01T16:06:39-0400

graylshaped said:
I didn't say they were wrong. ...

It's rare that I see a post that says "you have a superficial understanding of what you're talking about, and you're right!"

JohnDeL · 2026-06-01T16:09:40-0400

SraCet said:
Sure, it's going to end up being an approximation, influenced by your learning rate, batch size, and what training data you've trained on most recently. And also how well your model can identify the training data specifically.

So your argument is that instead of training on the entire population, it only used a subpopulation? And you don't see how that doesn't help you?

SraCet said:
But even if you correlate the output probability distribution with the distribution of your training data and there's good correlation, aren't you the entity that's doing the "probability" and the "statistics" in that case?

No. I am the entity that is using the LLM's output to see if the model is giving reliable results. And an LLM that says a given token should be present 75% of the time in a given set of circumstances and then doesn't present that token 75% of the time in that set of circumstances is a bad LLM that should not be used.

wildsman · 2026-06-01T16:24:04-0400

graylshaped said:
You are really, really struggling with the reality that I am not the one making a claim here.

No I completely understand - just wanted to be sure...

We have been discussing 'understanding' this entire thread and when asked: 'Well let us nail down what 'understanding' means - can you name something that can understand? How about humans? Do you think humans understand?'

You respond: 'I'm not the one making a claim here'

You take potshots at people instead of putting forward a positive position because god-forbid you have to actually defend it. Boy I gotta admire the game. You're an expert level troll.

graylshaped · 2026-06-01T16:28:27-0400

SraCet said:
Sure, it's going to end up being an approximation, influenced by your learning rate, batch size, and what training data you've trained on most recently. And also how well your model can identify the training data specifically.

But even if you correlate the output probability distribution with the distribution of your training data and there's good correlation, aren't you the entity that's doing the "probability" and the "statistics" in that case?

Let me stipulate, for the sake of this discussion, your implication is correct. Using methods prosaic or arcane, these models contain within themselves a representation of Knowledge and the World that allows them to Understand Concepts.

We have had this specific discussion before, but I'll point out again: The map is not the territory. Before, you wanted to insist that a map that is good enough makes that irrelevant, and I even agree with that up to recognizing that "good enough" isn't, as it were, a good enough specification for many, many tasks these tools are being promoted to perform.

Let's stipulate all that as having been covered. How is that internal map created and represented if statistical techniques are not carrying a big part of that load?

SraCet · 2026-06-01T16:28:44-0400

JohnDeL said:
So your argument is that instead of training on the entire population, it only used a subpopulation? And you don't see how that doesn't help you?

Nowhere did I say that you only use part of your training data, it's just that you don't train on all your data at the same time.

But, sure, if your DNN is big enough, your batch size is big enough, and you train long enough, then your output probability distribution for any particular item of training data should match the frequency of outputs in the training data.

I can understand why you would want to think of those as probabilities, but all the neural network is trained to do, and is "trying" to do, is get the right answer for any particular given input.
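A toy sketch of the claim above, not how any production LLM is trained: for a single fixed context, full-batch gradient descent on the usual cross-entropy loss pushes the softmax output toward the frequency of each next token in the training data. The counts (75/20/5), the function names, and the learning rate are all made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fit_logits(counts, steps=2000, lr=0.5):
    """Full-batch gradient descent on cross-entropy for one fixed context."""
    n = sum(counts)
    freqs = [c / n for c in counts]
    logits = [0.0] * len(counts)
    for _ in range(steps):
        probs = softmax(logits)
        # The gradient of the mean cross-entropy w.r.t. each logit is (prob - freq).
        logits = [z - lr * (p - f) for z, p, f in zip(logits, probs, freqs)]
    return softmax(logits)

# Hypothetical training data: after some context, token A follows 75 times,
# token B 20 times, token C 5 times.
fitted = fit_logits([75, 20, 5])
print([round(p, 3) for p in fitted])
```

The fitted probabilities come out at roughly 0.75, 0.20 and 0.05. Nothing in the loop counts anything at inference time; the frequencies only enter through the loss during training. Whether that makes the result "statistics" is the question being argued here.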

SraCet · 2026-06-01T16:30:49-0400

graylshaped said:
...
Let's stipulate all that as having been covered. How is that internal map created and represented if statistical techniques are not carrying a big part of that load?

What statistical technique(s)? It'd be cool if you named one.

graylshaped · 2026-06-01T16:32:29-0400

SraCet said:
It's rare that I see a post that says "you have a superficial understanding of what you're talking about, and you're right!"

And it isn't rare that I see a post from you that doesn't cherry-pick quotes and twist data to suit your whim.

DeeplyUnconcerned · 2026-06-01T17:05:33-0400

SraCet said:
Another equally valid way of looking at LLMs is that they're functions that approximate intelligence.

They're a representation of the entire space of intelligent things that could be said in any possible situation.

Of course, we don't have a corpus of training data that only includes intelligent stuff, so we train them with all the text that we do have and hope for the best.

But, intelligence is what we've been aiming at.

That's at least arguably true, sure. I think I'd replace "intelligent" with "intelligible" in your second sentence, but "LLMs are an attempt to create a function that approximates intelligence" isn't totally off the mark. "Artificial intelligence is the goal of AI research" seems like a reasonable statement.

DeeplyUnconcerned · 2026-06-01T17:06:49-0400

JohnDeL said:
If ChatGPT's model gives a probability distribution value of 75% for a particular token in a given set of circumstances and that token doesn't show up 75% of the time in those circumstances, then that means that ChatGPT is borked. Because a PDV of 75% literally means that the event should happen 75% of the time given a specific set of circumstances.

Ah, that's not strictly true, because of temperature (which is in-framework but post-model).

JohnDeL · 2026-06-01T17:07:37-0400

SraCet said:
Nowhere did I say that you only use part of your training data, it's just that you don't train on all your data at the same time.

Yes, that is a standard training method which allows you to QC the results by comparing them to the data that wasn't used.

But both the training data and the reserved data should have the same characteristics. If they don't, then you don't get reliable or useful results.

SraCet said:
But, sure, if your DNN is big enough, your batch size is big enough, and you train long enough, then your output probability distribution for any particular item of training data should match the frequency of outputs in the training data.

And if it doesn't, then you haven't trained enough or your training data was not representative of the data overall.

In short, your LLM is borked and should not be used.

SraCet said:
I can understand why you would want to think of those as probabilities, but all the neural network is trained to do, and is "trying" to do, is get the right answer for any particular given input.

And "right" is defined by a fitting function that is based on those statistics.

JohnDeL · 2026-06-01T17:09:06-0400

DeeplyUnconcerned said:
Ah, that's not strictly true, because of temperature (which is in-framework but post-model).

Yes, you can adjust the biases. But that just means you are screwing with the weights on an ad hoc basis which is best done with great reluctance.

DeeplyUnconcerned · 2026-06-01T17:09:08-0400

SraCet said:
Nowhere did I say that you only use part of your training data, it's just that you don't train on all your data at the same time.

I mean, you don't train on all of your training data - typically you keep part of it back for validation (which yes is explicitly not part of the "training" dataset).

DeeplyUnconcerned · 2026-06-01T17:16:56-0400

JohnDeL said:
Yes, you can adjust the biases. But that just means you are screwing with the weights on an ad hoc basis which is best done with great reluctance.

My understanding is that temperature isn't to do with the in-model biases, it's to do with the way the probabilities output by the model are used to select a token, and it's pretty standard practice.

Like, GPT2 implementation, in body starting at line 62, we go up to step at 50 which gets the model output at 51, and then divide by temperature at 64. It's not in the model itself, it's a post-model modification to the output probabilities to (typically) add more variety to the output of the framework by biasing token selection slightly towards lower-probability tokens.

It's a dumb technical point that doesn't really matter, but if you want to be strictly accurate, GPT is the model, ChatGPT is the framework, and it's expected that the probabilities of the framework don't exactly match the probabilities of the model except where temperature=1 (which it is by default in the code here, but isn't typically called that way).
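A minimal sketch of that post-model step, assuming the model hands back a list of logits (one raw score per token). The function names and the three-token example are hypothetical and not taken from the GPT-2 code referenced above.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide the logits by the temperature, then normalise as usual."""
    return softmax([z / temperature for z in logits])

# Hypothetical model output whose own probabilities are 75% / 20% / 5%.
logits = [math.log(0.75), math.log(0.20), math.log(0.05)]

for t in (0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 1 the sampling probabilities are exactly the model's. Above 1 the distribution flattens, so the 75% token is picked less often than 75% of the time; below 1 it sharpens, so the token is picked more often. The weights and biases inside the model are untouched either way.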

JohnDeL · 2026-06-01T17:35:32-0400

DeeplyUnconcerned said:
My understanding is that temperature isn't to do with the in-model biases, it's to do with the way the probabilities output by the model are used to select a token, and it's pretty standard practice.

Which is another type of bias. Think of it as adjusting the phase or the color of a TV; it changes the signal to make it "better" (for some value of better).

I'm pretty sure we're in agreement on the idea but our different terminologies may be getting in the way.

graylshaped · 2026-06-01T18:09:21-0400

SraCet said:
Well, I listed some stuff that would definitely be statistics, and LLMs don't do that stuff.

What, exactly, methods to assess options in predicting the next token do you think they employ in doing that other stuff?

DeeplyUnconcerned · 2026-06-01T18:11:11-0400

JohnDeL said:
Which is another type of bias. Think of it as adjusting the phase or the color of a TV; it changes the signal to make it "better" (for some value of better).

I'm pretty sure we're in agreement on the idea but our different terminologies may be getting in the way.

Possibly yeah. “Bias” in ANN terminology is a modifier to each parameter inside the model (if memory serves).

Edit: weights are also values internal to the model, and I’m not sure you’re talking about those either, which is likely the origin of the confusion.

dadsfolk · 2026-06-01T18:24:45-0400

crmarvin42 said:
The first thing that comes to mind is the writing of a business case, for which my boss leaned heavily on an LLM to do the writing to match his outline. No person who was at all knowledgeable about the market segment and application would have gotten so many fundamental things wrong.

That's a mistake a lot of people make - assuming a language engine is a knowledge engine.

First, why in the world would your boss assume that a general LLM (chatbot) would be "knowledgeable about the market segment and application"? Or recognize that without that, the business case couldn't be effective, regardless of how good it looked.

And does your boss know enough about the market segment and application to note that the output was garbage? This result should have been an eye-opener for him.

crmarvin42 said:
The prose it generated looked convincing, so long as you were not actually familiar with the subject matter. Of the 12 citations it made that I bothered to check, all of them were wrong in some fundamental way. The cited works did not, in fact, support the assertions they were being linked to. Often times, a specific number or phrase was indicated, then a citation provided, and that specific phrase or number did not appear once at the location indicated. Ironically, I usually knew where that number or phrase did originate, but the LLM did not proceed from reference article to specific citation from that article. Instead it pulled the specific citation from one location, and then dropped in a largely unrelated reference article to use as evidence. The structure of cause and effect, reference and citation, was there, but the two halves were not connected to each other in any meaningful way that might indicate it understood the point of that relationship.

Sounds like Chat-GPT, although I've no doubt others can have similar problems. This sort of writing is simply not something it's trained to do, and if you're going to ask an LLM to do something like this, you should select an LLM that's been trained in this sort of writing, and primed with market data, market reports, business cases, and whatever else it needs to accomplish the task. You wouldn't ask a totally untrained intern to perform such a task. They wouldn't necessarily make the same mistakes, but the result would be equally unusable.

Chatbots being chatbots, of course it's going to do its best to slap something together which appears to meet the requirements.

SraCet · 2026-06-01T18:28:26-0400

DeeplyUnconcerned said:
That's at least arguably true, sure. I think I'd replace "intelligent" with "intelligible" in your second sentence, but "LLMs are an attempt to create a function that approximates intelligence" isn't totally off the mark. "Artificial intelligence is the goal of AI research" seems like a reasonable statement.

The point being, that LLMs aren't intended to produce an approximation of their training data.

Yes, training data is used to train them, but, you know, duh.

SraCet · 2026-06-01T18:35:52-0400

JohnDeL said:
Yes, that is a standard training method which allows you to QC the results by comparing them to the data that wasn't used.
...

No, you're not understanding what I'm saying. You don't train on all your training data all simultaneously. You train on batches, in steps. Whatever you trained on last is going to have an outsized impact on the output that the model generates.

SraCet · 2026-06-01T18:38:34-0400

graylshaped said:
What, exactly, methods to assess options in predicting the next token do you think they employ in doing that other stuff?

Uhh, do you really need me to explain how LLMs work from first principles... ?

Briefly, they perform a bunch of additions, multiplications, and nonlinear functions (activation functions) in a particular sequence.

From the words in that sentence, where are you seeing "statistics"?

SraCet · 2026-06-01T18:39:44-0400

dadsfolk said:
That's a mistake a lot of people make - assuming a language engine is a knowledge engine.
...

Do you think LLMs are "language engines" because of the words in the acronym... ?

dadsfolk · 2026-06-01T19:03:56-0400

arsisloam said:
Counting letters in words, displaying empathy and compassion when a person is in crisis, and not hallucinating sources.

Alleged tests of "Intelligence/understanding".

Counting letters in words: I used to get periodic emails, often titled "IQ tests", with a paragraph of text - you were supposed to count the number of times the letter 's' (or some other) occurred in the paragraph. Most people got it wrong at least initially, and sometimes it was a struggle to find that last instance they claimed was there. The problem was that we don't, as Gemini suggested, parse words letter by letter; we use other cues such as the shape of words or components of them or expectations based on context as short cuts, and often bleep articles and conjunctions - it can be difficult to slow all the way down and force yourself to work letter by letter.

displaying empathy and compassion when a person is in crisis: So Trump, who doesn't seem to have an ounce of empathy, is not intelligent?

not hallucinating sources: Fox News.

dadsfolk · 2026-06-01T19:08:25-0400

SraCet said:
Do you think LLMs are "language engines" because of the words in the acronym... ?

Ummm... yes? What reason do you give for them being called Large Language Models?

Wikipedia:

A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation.

That's what they're trained on, and that's the design principle behind them.

Was that a serious question?

Sgtkeebler · 2026-06-01T19:15:56-0400

This is what forced me to stop using Google and switch to DDG. Google's AI began making up things in a subject that I was knowledgeable about. I kept telling Google's AI that it was completely wrong, and why. Instead of admitting it was wrong... it doubled down. It doubled down and began making up even more incorrect nonsense. It told me I was wrong, and that the data I was looking at was from 5yrs ago and it had changed or was related to an old exploit. I kept telling it how wrong it was, it kept doubling down, and once I said "well I am currently doing what you told me I couldn't do successfully," it then answered "I have been aggressively hallucinating false information".

crmarvin42 · 2026-06-01T19:57:06-0400

dadsfolk said:
That's a mistake a lot of people make - assuming a language engine is a knowledge engine.

First, why in the world would your boss assume that a general LLM (chatbot) would be "knowledgeable about the market segment and application"? Or recognize that without that, the business case couldn't be effective, regardless of how good it looked.

And does your boss know enough about the market segment and application to note that the output was garbage? This result should have been an eye-opener for him.

Sounds like Chat-GPT, although I've no doubt others can have similar problems. This sort of writing is simply not something it's trained to do, and if you're going to ask an LLM to do something like this, you should select an LLM that's been trained in this sort of writing, and primed with market data, market reports, business cases, and whatever else it needs to accomplish the task. You wouldn't ask a totally untrained intern to perform such a task. They wouldn't necessarily make the same mistakes, but the result would be equally unusable.

Chatbots being chatbots, of course it's going to do its best to slap something together which appears to meet the requirements.

He was using Microsoft CoPilot, because it came free with our corporate office subscription at the time. He knows the market in general terms, but not the technical stuff. He was leaning on the LLM to write the technical stuff, based on the assumption that the scientific papers in its training set would enable it to do a reasonable job, and save him from having to ask me to write that section. Of course, we know how badly that turned out.

wildsman · 2026-06-01T20:02:06-0400

crmarvin42 said:
He was using Microsoft CoPilot, because it came free with our corporate office subscription at the time. He knows the market in general terms, but not the technical stuff. He was leaning on the LLM to write the technical stuff, based on the assumption that the scientific papers in its training set would enable it to do a reasonable job, and save him from having to ask me to write that section. Of course, we know how badly that turned out.

Oh boy! Copilot is AWFUL!

My wife used to use copilot for work (to summarise research papers/articles/create presentations) and she told me 'AI is so dumb'.

Now I showed her Claude, and she's addicted.

graylshaped · 2026-06-01T20:03:11-0400

SraCet said:
What statistical technique(s)? It'd be cool if you named one.

I'm sorry. I thought you had a fucking clue what you were saying.

Toodles!

Voix des Airs · 2026-06-01T20:31:27-0400

SraCet said:
I haven't read every single word of every single post but so far it seems like wildsman is the only person here who's trying to pin anybody down to a concrete, formal definition of what "understanding" means.

Did you post a definition of "understanding"?

I haven't used the word at all. Mostly because I don't know wtf posters arguing that LLMs "understand" things mean by that... but you've used it at least 30 (I stopped counting at 30 in the search results because I got bored) times in this thread. I don't care how you (meaning anyone, not just you) define it as long as the definition is rigorous enough to actually be useful in a discussion.

dadsfolk · 2026-06-01T21:07:05-0400

crmarvin42 said:
He was using Microsoft CoPilot, because it came free with our corporate office subscription at the time. He knows the market in general terms, but not the technical stuff. He was leaning on the LLM to write the technical stuff, based on the assumption that the scientific papers in its training set would enable it to do a reasonable job, and save him from having to ask me to write that section. Of course, we know how badly that turned out.

Ah. My sole experience with Copilot was when I was discussing specs with a friend who was going to buy a tractor, and wanted to be able to use it to do timbering on the 90-acre farm he'd just bought (from my family). We were discussing hydraulic specs for lift capability of the loader, and I was showing him The Wood Database. He pulled out his phone and asked Copilot, "What's the weight of a green pine log 28 inches in diameter and 8 feet long?" Copilot cited the formula for the volume of a cylinder, derived the radius in feet from the diameter in inches, calculated the volume of the log, multiplied it by 25.2 lb./cubic foot, which it cited as the weight of pine, and replied, "The weight of a green pine log 28" in diameter and 8' long is 862 pounds."

Having seen some examples of Chat-GPT5 trying to do arithmetic, I was frankly impressed. It knew what it was doing, explained it accurately, did the math correctly (I checked), and produced a prompt response.

Unfortunately, it was wrong. I happened to have the page for White Pine open, and noted that 25 lb./cu.ft. is the Average dried weight, which is the weight at 12% moisture content - nominal air-dry moisture content (MC is the weight of water in the wood divided by the weight of oven-dried wood - nominal 8% moisture). MC of freshly-cut wood can vary from 35% to 200%, depending on the species and a few other factors. Since water is 62.4 lb./cu.ft., it has a disproportionate effect on the weight of a green log - so the answer was at least 25% low.

Also, there are 29 species of pine used in the US, most of which grow here. It picked the spec for White Pine or a similar soft pine, but my friend's land has more Scots Pine than White Pine, and the average dried weight of that is 34 lb./cu.ft. - 36% heavier, before correcting for moisture content (the Red Pine on my lot is also 34 lb./cu.ft.).

It's impossible to know what source it used for its figure, and whether the reference was simplistic, or if it didn't parse "green" correctly, or it had no referents to interpret the question, but it was a clear example of the dangers of an apparently authoritative answer if you don't know enough to evaluate it. It's possible I could have coaxed a correct answer out of it with successive prompts that took it through the above data, but it would have been a great deal faster to just run the numbers myself.
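For anyone who wants to rerun the numbers, here is a short sketch using only the figures quoted above (28 in diameter, 8 ft long, 25.2 and 34 lb./cu.ft.). The function and variable names are made up for illustration, and it deliberately stops at dried weight, since the green weight depends on a moisture content that isn't pinned down here.

```python
import math

def log_volume_cuft(diameter_in, length_ft):
    """Volume of a cylinder in cubic feet, with the diameter given in inches."""
    radius_ft = diameter_in / 2 / 12
    return math.pi * radius_ft ** 2 * length_ft

volume = log_volume_cuft(28, 8)

# Copilot's figure: volume times 25.2 lb./cu.ft., which is a dried weight.
copilot_weight = volume * 25.2

# Same log at the average dried weight quoted for Scots Pine.
scots_dry_weight = volume * 34

print(round(volume, 1), round(copilot_weight), round(scots_dry_weight))
# prints: 34.2 862 1163
```

So the 862 pounds is arithmetically correct for a roughly 34.2 cu.ft. log at 25.2 lb./cu.ft.; the problem is the density it was fed, not the math.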

wildsman · 2026-06-01T21:42:48-0400

JohnDeL said:
If ChatGPT's model gives a probability distribution value of 75% for a particular token in a given set of circumstances and that token doesn't show up 75% of the time in those circumstances, then that means that ChatGPT is borked. Because a PDV of 75% literally means that the event should happen 75% of the time given a specific set of circumstances.

Ooh I missed this thread. I didn't realise we got into an actual technical discussion for a change.

LLMs do not store empirical frequencies in a lookup table: the probabilities are emergent from the latent world model. In many cases they deviate substantially from raw training frequencies while still making better predictions.

So it's more accurate to say '75% means that across many equivalent circumstances a (well-calibrated) model would produce that outcome about 75% of the time'.

Why is this important? Because overfitting degrades generalisation. You have to remember that you're compressing the training data into a lower dimensional space so that the models generalise.
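One way to make "well-calibrated" concrete is to bucket predictions by stated probability and compare each bucket's average prediction with how often the outcome actually happened. This is a rough sketch on invented data, assuming you can collect (predicted probability, did it happen) pairs; the function name and the numbers are hypothetical.

```python
from collections import defaultdict

def calibration_table(predictions, n_bins=10):
    """predictions: iterable of (predicted_probability, happened) pairs.

    Returns one (mean predicted probability, observed frequency, count)
    row per non-empty bin, ordered from low to high probability.
    """
    bins = defaultdict(list)
    for prob, happened in predictions:
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, happened))
    table = []
    for index in sorted(bins):
        rows = bins[index]
        mean_prob = sum(p for p, _ in rows) / len(rows)
        observed = sum(1 for _, h in rows if h) / len(rows)
        table.append((mean_prob, observed, len(rows)))
    return table

# Invented data: 100 predictions of 75% where the outcome happened 75 times,
# and 100 predictions of 25% where it happened 40 times.
data = [(0.75, i < 75) for i in range(100)] + [(0.25, i < 40) for i in range(100)]

for mean_prob, observed, count in calibration_table(data):
    print(f"predicted {mean_prob:.2f}  observed {observed:.2f}  n={count}")
```

In this made-up data the 75% bucket is calibrated (predicted 0.75, observed 0.75) and the 25% bucket is not (predicted 0.25, observed 0.40). The check compares the model against outcomes, not against raw training frequencies.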

SraCet · 2026-06-01T23:07:38-0400

dadsfolk said:
Ummm... yes? What reason do you give for them being called Large Language Models?

Wikipedia:

That's what they're trained on, and that's the design principle behind them.

Was that a serious question?

It was a serious question.

Because, yes, sure, they're called "language models" and "language" data is used to train them.

But, that language data also contains a bunch of knowledge.

So if you're going to pick a new phrase to describe them, it's odd that you would declare that they're language "engines" but not knowledge "engines."

SraCet · 2026-06-01T23:10:36-0400

graylshaped said:
I'm sorry. I thought you had a fucking clue what you were saying.

Toodles!

Of course you're willing to write paragraphs and paragraphs about how LLMs rely on "statistical techniques" but when you're asked to name a single statistical technique that's used, you fold immediately.

Really inspires a lot of confidence in your position. LOL.

graylshaped · 2026-06-02T00:53:50-0400

SraCet said:
Of course you're willing to write paragraphs and paragraphs about how LLMs rely on "statistical techniques" but when you're asked to name a single statistical technique that's used, you fold immediately.

Really inspires a lot of confidence in your position. LOL.

I’m not here to inspire confidence. I’m fine with showing the cracks in your veneer. It rarely takes more than a few sentences, let alone “paragraphs and paragraphs.”

DeeplyUnconcerned · 2026-06-02T01:53:20-0400

SraCet said:
The point being, that LLMs aren't intended to produce an approximation of their training data.

Yes, training data is used to train them, but, you know, duh.

I mean… the intent is to produce an approximation of intelligence, but the method is to produce an approximation of the training data. You can tell because the training process iteratively adjusts the weights to produce outputs that match the training data. The implicit assumption is that a system which can perfectly encode and replicate the training data will approximate intelligence, but that is let’s say “unproven” at this point.

graylshaped · 2026-06-02T03:32:41-0400

DeeplyUnconcerned said:
I mean… the intent is to produce an approximation of intelligence, but the method is to produce an approximation of the training data. You can tell because the training process iteratively adjusts the weights to produce outputs that match the training data. The implicit assumption is that a system which can perfectly encode and replicate the training data will approximate intelligence, but that is let’s say “unproven” at this point.

I really wonder what a field of study that analyzes a population in order to describe and predict elements of it would be called, were such a thing to exist...

dadsfolk · 2026-06-02T09:08:10-0400

SraCet said:
It was a serious question.

Because, yes, sure, they're called "language models" and "language" data is used to train them.

But, that language data also contains a bunch of knowledge.

So if you're going to pick a new phrase to describe them, it's odd that you would declare that they're language "engines" but not knowledge "engines."

My point is that they're not trained to evaluate data, search out and associate related principles, resolve contradictions, develop knowledge structures with associated probabilities of being true, etc. etc. I can ask it to summarize a psychological study (easy, the authors have already done it), and it may do a good job, but it doesn't tell me whether the experiment actually tests the hypothesis, whether the sample selection is biased, whether the statistical methods used were appropriate for the sample and data, and whether the results actually justify the conclusions.

It can make associations and select examples, but that doesn't mean it's "knowledgeable about the market segment and application", as in crmarvin42's example.

Just because it has masses of printed data in its training set doesn't mean it knows how to evaluate and apply it to a given situation/request.

It can produce plausible answers - that's what it's trained to do. However, researchers are working furiously to figure out how to get it to give accurate and reliable answers - or to confess it doesn't know the answer.

Their training data contains facts, opinions, representations of knowledge, speculations, hypotheses, theories, parodies, sarcasm, etc. etc. Do we assume it reliably knows the difference?

dadsfolk · 2026-06-02T09:16:06-0400

dadsfolk said:
Ah. My sole experience with Copilot was when I was discussing specs with a friend who was going to buy a tractor, and wanted to be able to use it to do timbering on the 90-acre farm he'd just bought (from my family). We were discussing hydraulic specs for lift capability of the loader, and I was showing him The Wood Database. He pulled out his phone and asked Copilot, "What's the weight of a green pine log 28 inches in diameter and 8 feet long?" Copilot cited the formula for the volume of a cylinder, derived the radius in feet from the diameter in inches, calculated the volume of the log, multiplied it by 25.2 lb./cubic foot, which it cited as the weight of pine, and replied, "The weight of a green pine log 28" in diameter and 8' long is 862 pounds."

Having seen some examples of Chat-GPT5 trying to do arithmetic, I was frankly impressed. It knew what it was doing, explained it accurately, did the math correctly (I checked), and produced a prompt response.

Unfortunately, it was wrong. I happened to have the page for White Pine open, and noted that 25 lb./cu.ft. is the Average dried weight, which is the weight at 12% moisture content - nominal air-dry moisture content (MC is the weight of water in the wood divided by the weight of oven-dried wood - nominal 8% moisture). MC of freshly-cut wood can vary from 35% to 200%, depending on the species and a few other factors. Since water is 62.4 lb./cu.ft., it has a disproportionate effect on the weight of a green log - so the answer was at least 25% low.

Also, there are 29 species of pine used in the US, most of which grow here. It picked the spec for White Pine or a similar soft pine, but my friend's land has more Scots Pine than White Pine, and the average dried weight of that is 34 lb./cu.ft. - 36% heavier, before correcting for moisture content (the Red Pine on my lot is also 34 lb./cu.ft.).

It's impossible to know what source it used for its figure, and whether the reference was simplistic, or if it didn't parse "green" correctly, or it had no referents to interpret the question, but it was a clear example of the dangers of an apparently authoritative answer if you don't know enough to evaluate it. It's possible I could have coaxed a correct answer out of it with successive prompts that took it through the above data, but it would have been a great deal faster to just run the numbers myself.

Serious question: can you teach an LLM? If I had asked Copilot to summarize a page on "Wood and Moisture", then walked it through the process of determining and identifying a species, then worked to approximate a better answer, would that have gone into its reference material to make it better answer such questions in the future, or just ended up as notations in my friend's chat history?
