Fine-tuning tests show "bias ... toward confidently representing the claims as true."
See full article...
Well, I listed some stuff that would definitely be statistics, and LLMs don't do that stuff.
Who said that matters, true or otherwise? I'm fascinated watching the strange maneuvering you are doing to pretend these models are not affected by probability and statistical methods.
I didn't say they were wrong. I said the statement expressed a superficial understanding of the relationship between parsing a language and what that implies about "understanding" and "concepts." Would you like to get into a discussion about the role of implicature and pragmatics without you googling to know what those terms mean in this context? Language involves far more than syntax and vocabulary. I'll even tell you up front that current LLMs do apply some of the principles I'm discussing, and continue to improve. The imitation improves incrementally as every day passes and every investment round funds.
Your text from post #439:
As long as we're waving degrees around, my undergraduate degree was in semiotics, and as one of those whose motives in expressing our point of view you call into question in your post, can tell you that "LLMs simply could not be as good as they are at language without at least close analogues of 'concepts' and 'understanding' " says more about your own superficial understanding of the relationships between sign and signifier and how that affects language and communication than it does about the capabilities of LLMs.
Saying that somebody is wrong about something seems like just as much of a claim as anything else?
Sure, it's going to end up being an approximation, influenced by your learning rate, batch size, and what training data you've trained on most recently. And also how well your model can identify the training data specifically.
Yes, I have. I even wrote a couple of papers on how to integrate their results with other methods to give more realistic geological simulations.
If your probability distribution doesn't relate to the actual distribution, then you've fucked up. That is axiomatic to statistics.
It's rare that I see a post that says "you have a superficial understanding of what you're talking about, and you're right!"
I didn't say they were wrong. ...
Sure, it's going to end up being an approximation, influenced by your learning rate, batch size, and what training data you've trained on most recently. And also how well your model can identify the training data specifically.
But even if you correlate the output probability distribution with the distribution of your training data and there's good correlation, aren't you the entity that's doing the "probability" and the "statistics" in that case?
No I completely understand - just wanted to be sure...
You are really, really struggling with the reality that I am not the one making a claim here.
Let me stipulate, for the sake of this discussion, your implication is correct. Using methods prosaic or arcane, these models contain within themselves a representation of Knowledge and the World that allows them to Understand Concepts.
Sure, it's going to end up being an approximation, influenced by your learning rate, batch size, and what training data you've trained on most recently. And also how well your model can identify the training data specifically.
But even if you correlate the output probability distribution with the distribution of your training data and there's good correlation, aren't you the entity that's doing the "probability" and the "statistics" in that case?
Nowhere did I say that you only use part of your training data, it's just that you don't train on all your data at the same time.
So your argument is that instead of training on the entire population, it only used a subpopulation? And you don't see how that doesn't help you?
What statistical technique(s)? It'd be cool if you named one....
Let's stipulate all that as having been covered. How is that internal map created and represented if statistical techniques are not carrying a big part of that load?
And it isn't rare that I see a post from you that doesn't cherry-pick quotes and twist data to suit your whim.
It's rare that I see a post that says "you have a superficial understanding of what you're talking about, and you're right!"
That's at least arguably true, sure. I think I'd replace "intelligent" with "intelligible" in your second sentence, but "LLMs are an attempt to create a function that approximates intelligence" isn't totally off the mark. "Artificial intelligence is the goal of AI research" seems like a reasonable statement.
Another equally valid way of looking at LLMs is that they're functions that approximate intelligence.
They're a representation of the entire space of intelligent things that could be said in any possible situation.
Of course, we don't have a corpus of training data that only includes intelligent stuff, so we train them with all the text that we do have and hope for the best.
But, intelligence is what we've been aiming at.
Ah, that's not strictly true, because of temperature (which is in-framework but post-model).
If ChatGPT's model gives a probability distribution value of 75% for a particular token in a given set of circumstances and that token doesn't show up 75% of the time in those circumstances, then that means that ChatGPT is borked. Because a PDV of 75% literally means that the event should happen 75% of the time given a specific set of circumstances.
Nowhere did I say that you only use part of your training data, it's just that you don't train on all your data at the same time.
But, sure, if your DNN is big enough, your batch size is big enough, and you train long enough, then your output probability distribution for any particular item of training data should match the frequency of outputs in the training data.
I can understand why you would want to think of those as probabilities, but all the neural network is trained to do, and is "trying" to do, is get the right answer for any particular given input.
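For what it's worth, the claim that a well-trained model's output distribution ends up matching the frequencies in the training data can be checked on a toy case. This is only a sketch under loud assumptions: a single fixed context, a bare table of three logits instead of a DNN, full-batch gradient descent, and made-up counts (75/20/5). The names `fit_next_token`, `counts`, `steps`, and `lr` are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fit_next_token(counts, steps=5000, lr=0.5):
    """Minimise cross-entropy for ONE context by plain gradient descent.

    counts[i] is how often token i followed the context in the (made-up)
    training data. Returns (learned probabilities, empirical frequencies).
    """
    n = sum(counts)
    freqs = [c / n for c in counts]
    logits = [0.0] * len(counts)  # start from a uniform guess
    for _ in range(steps):
        probs = softmax(logits)
        # Gradient of the mean cross-entropy w.r.t. each logit is (prob - freq).
        logits = [z - lr * (p - f) for z, p, f in zip(logits, probs, freqs)]
    return softmax(logits), freqs

probs, freqs = fit_next_token([75, 20, 5])
for p, f in zip(probs, freqs):
    print(f"learned {p:.4f}   training frequency {f:.4f}")
```

The loss only ever rewards "the right answer" for each example, yet its minimum sits at the training frequencies. Whether a real LLM gets anywhere near that minimum is a separate question this toy does not answer.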
Yes, you can adjust the biases. But that just means you are screwing with the weights on an ad hoc basis which is best done with great reluctance.
Ah, that's not strictly true, because of temperature (which is in-framework but post-model).
I mean, you don't train on all of your training data - typically you keep part of it back for validation (which yes is explicitly not part of the "training" dataset).
Nowhere did I say that you only use part of your training data, it's just that you don't train on all your data at the same time.
My understanding is that temperature isn't to do with the in-model biases, it's to do with the way the probabilities output by the model are used to select a token, and it's pretty standard practice.
Yes, you can adjust the biases. But that just means you are screwing with the weights on an ad hoc basis which is best done with great reluctance.
body starting at line 62, we go up to step at 50 which gets the model output at 51, and then divide by temperature at 64. It's not in the model itself, it's a post-model modification to the output probabilities to (typically) add more variety to the output of the framework by biasing token selection slightly towards lower-probability tokens.
My understanding is that temperature isn't to do with the in-model biases, it's to do with the way the probabilities output by the model are used to select a token, and it's pretty standard practice.
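To make the temperature point concrete, here is a minimal sketch of the usual "divide the logits by the temperature, then softmax" step. It is not the code referred to in the post; the function name and the three example probabilities (0.75, 0.20, 0.05) are assumptions picked for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalise into probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical model output for one context: 75% / 20% / 5%.
logits = [math.log(0.75), math.log(0.20), math.log(0.05)]

p_t1 = softmax_with_temperature(logits, 1.0)    # unchanged distribution
p_hot = softmax_with_temperature(logits, 1.5)   # flatter: more variety
p_cold = softmax_with_temperature(logits, 0.5)  # sharper: less variety

for name, p in [("T=1.0", p_t1), ("T=1.5", p_hot), ("T=0.5", p_cold)]:
    print(name, [round(x, 3) for x in p])
```

So a token the model scores at 75% is only sampled 75% of the time when the temperature is 1. At any other setting the sampling frequency is shifted after the model has produced its output, with no change to the weights or biases.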
What, exactly, methods to assess options in predicting the next token do you think they employ in doing that other stuff?
Well, I listed some stuff that would definitely be statistics, and LLMs don't do that stuff.
Possibly yeah. “Bias” in ANN terminology is a modifier to each parameter inside the model (if memory serves).
Which is another type of bias. Think of it as adjusting the phase or the color of a TV; it changes the signal to make it "better" (for some value of better).
I'm pretty sure we're in agreement on the idea but our different terminologies may be getting in the way.
That's a mistake a lot of people make - assuming a language engine is a knowledge engine.
The first thing that comes to mind is the writing of a business case, for which my boss leaned heavily on an LLM to do the writing to match his outline. No person who was at all knowledgeable about the market segment and application would have gotten so many fundamental things wrong.
Sounds like Chat-GPT, although I've no doubt others can have similar problems. This sort of writing is simply not something it's trained to do, and if you're going to ask an LLM to do something like this, you should select an LLM that's been trained in this sort of writing, and primed with market data, market reports, business cases, and whatever else it needs to accomplish the task. You wouldn't ask a totally untrained intern to perform such a task. They wouldn't necessarily make the same mistakes, but the result would be equally unusable.
The prose it generated looked convincing, so long as you were not actually familiar with the subject matter. Of the 12 citations it made that I bothered to check, all of them were wrong in some fundamental way. The cited works did not, in fact, support the assertions they were being linked to. Oftentimes, a specific number or phrase was indicated, then a citation provided, and that specific phrase or number did not appear once at the location indicated. Ironically, I usually knew where that number or phrase did originate, but the LLM did not proceed from reference article to specific citation from that article. Instead it pulled the specific citation from one location, and then dropped in a largely unrelated reference article to use as evidence. The structure of cause and effect, reference and citation, was there, but the two halves were not connected to each other in any meaningful way that might indicate it understood the point of that relationship.
The point being, that LLMs aren't intended to produce an approximation of their training data.
That's at least arguably true, sure. I think I'd replace "intelligent" with "intelligible" in your second sentence, but "LLMs are an attempt to create a function that approximates intelligence" isn't totally off the mark. "Artificial intelligence is the goal of AI research" seems like a reasonable statement.
No, you're not understanding what I'm saying. You don't train on all your training data simultaneously. You train on batches, in steps. Whatever you trained on last is going to have an outsized impact on the output that the model generates.
Yes, that is a standard training method which allows you to QC the results by comparing them to the data that wasn't used.
...
Uhh, do you really need me to explain how LLMs work from first principles... ?
What, exactly, methods to assess options in predicting the next token do you think they employ in doing that other stuff?
Do you think LLMs are "language engines" because of the words in the acronym... ?
That's a mistake a lot of people make - assuming a language engine is a knowledge engine.
...
Alleged tests of "Intelligence/understanding".
Counting letters in words, displaying empathy and compassion when a person is in crisis, and not hallucinating sources.
Ummm... yes? What reason do you give for them being called Large Language Models?
Do you think LLMs are "language engines" because of the words in the acronym... ?
That's what they're trained on, and that's the design principle behind them.
A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation.
He was using Microsoft CoPilot, because it came free with our corporate office subscription at the time. He knows the market in general terms, but not the technical stuff. He was leaning on the LLM to write the technical stuff, based on the assumption that the scientific papers in its training set would enable it to do a reasonable job, and save him from having to ask me to write that section. Of course, we know how badly that turned out.
That's a mistake a lot of people make - assuming a language engine is a knowledge engine.
First, why in the world would your boss assume that a general LLM (chatbot) would be "knowledgeable about the market segment and application"? Or recognize that without that, the business case couldn't be effective, regardless of how good it looked.
And does your boss know enough about the market segment and application to note that the output was garbage? This result should have been an eye-opener for him.
Sounds like Chat-GPT, although I've no doubt others can have similar problems. This sort of writing is simply not something it's trained to do, and if you're going to ask an LLM to do something like this, you should select an LLM that's been trained in this sort of writing, and primed with market data, market reports, business cases, and whatever else it needs to accomplish the task. You wouldn't ask a totally untrained intern to perform such a task. They wouldn't necessarily make the same mistakes, but the result would be equally unusable.
Chatbots being chatbots, of course it's going to do its best to slap something together which appears to meet the requirements.
Oh boy! Copilot is AWFUL!
He was using Microsoft CoPilot, because it came free with our corporate office subscription at the time. He knows the market in general terms, but not the technical stuff. He was leaning on the LLM to write the technical stuff, based on the assumption that the scientific papers in its training set would enable it to do a reasonable job, and save him from having to ask me to write that section. Of course, we know how badly that turned out.
I'm sorry. I thought you had a fucking clue what you were saying.
What statistical technique(s)? It'd be cool if you named one.
I haven't read every single word of every single post but so far it seems like wildsman is the only person here who's trying to pin anybody down to a concrete, formal definition of what "understanding" means.
Did you post a definition of "understanding"?
Ah. My sole experience with Copilot was when I was discussing specs with a friend who was going to buy a tractor, and wanted to be able to use it to do timbering on the 90-acre farm he'd just bought (from my family). We were discussing hydraulic specs for lift capability of the loader, and I was showing him The Wood Database. He pulled out his phone and asked Copilot, "What's the weight of a green pine log 28 inches in diameter and 8 feet long?" Copilot cited the formula for the volume of a cylinder, derived the radius in feet from the diameter in inches, calculated the volume of the log, multiplied it by 25.2 lb./cubic foot, which it cited as the weight of pine, and replied, "The weight of a green pine log 28" in diameter and 8' long is 862 pounds."
He was using Microsoft CoPilot, because it came free with our corporate office subscription at the time. He knows the market in general terms, but not the technical stuff. He was leaning on the LLM to write the technical stuff, based on the assumption that the scientific papers in its training set would enable it to do a reasonable job, and save him from having to ask me to write that section. Of course, we know how badly that turned out.
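For anyone who wants to check the arithmetic Copilot is described as doing for the log, a quick sketch follows. It only reproduces the steps listed in the post (cylinder volume, radius converted from inches to feet, times 25.2 lb per cubic foot); the function name is made up, and whether 25.2 is the right density for a green log is a separate matter.

```python
import math

def log_weight_lb(diameter_in, length_ft, density_lb_per_ft3):
    """Weight of a log modelled as a perfect cylinder."""
    radius_ft = (diameter_in / 2) / 12           # inches -> feet
    volume_ft3 = math.pi * radius_ft ** 2 * length_ft
    return volume_ft3 * density_lb_per_ft3

weight = log_weight_lb(28, 8, 25.2)
print(f"{weight:.0f} lb")  # rounds to the 862 lb quoted in the post
```

The arithmetic itself checks out, so any error has to be in the inputs, not the calculation.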
Ooh I missed this thread. I didn't realise we got into an actual technical discussion for a change.
If ChatGPT's model gives a probability distribution value of 75% for a particular token in a given set of circumstances and that token doesn't show up 75% of the time in those circumstances, then that means that ChatGPT is borked. Because a PDV of 75% literally means that the event should happen 75% of the time given a specific set of circumstances.
It was a serious question.
Ummm... yes? What reason do you give for them being called Large Language Models?
Wikipedia:
That's what they're trained on, and that's the design principle behind them.
Was that a serious question?
Of course you're willing to write paragraphs and paragraphs about how LLMs rely on "statistical techniques" but when you're asked to name a single statistical technique that's used, you fold immediately.
I'm sorry. I thought you had a fucking clue what you were saying.
Toodles!
I’m not here to inspire confidence. I’m fine with showing the cracks in your veneer. It rarely takes more than a few sentences, let alone “paragraphs and paragraphs.”
Of course you're willing to write paragraphs and paragraphs about how LLMs rely on "statistical techniques" but when you're asked to name a single statistical technique that's used, you fold immediately.
Really inspires a lot of confidence in your position. LOL.
I mean… the intent is to produce an approximation of intelligence, but the method is to produce an approximation of the training data. You can tell because the training process iteratively adjusts the weights to produce outputs that match the training data. The implicit assumption is that a system which can perfectly encode and replicate the training data will approximate intelligence, but that is let’s say “unproven” at this point.
The point being, that LLMs aren't intended to produce an approximation of their training data.
Yes, training data is used to train them, but, you know, duh.
I really wonder what a field of study that analyzes a population in order to describe and predict elements of it would be called, were such a thing to exist...
I mean… the intent is to produce an approximation of intelligence, but the method is to produce an approximation of the training data. You can tell because the training process iteratively adjusts the weights to produce outputs that match the training data. The implicit assumption is that a system which can perfectly encode and replicate the training data will approximate intelligence, but that is let’s say “unproven” at this point.
My point is that they're not trained to evaluate data, search out and associate related principles, resolve contradictions, develop knowledge structures with associated probabilities of being true, etc. etc. I can ask it to summarize a psychological study (easy, the authors have already done it), and it may do a good job, but it doesn't tell me whether the experiment actually tests the hypothesis, whether the sample selection is biased, whether the statistical methods used were appropriate for the sample and data, and whether the results actually justify the conclusions.
It was a serious question.
Because, yes, sure, they're called "language models" and "language" data is used to train them.
But, that language data also contains a bunch of knowledge.
So if you're going to pick a new phrase to describe them, it's odd that you would declare that they're language "engines" but not knowledge "engines."
Serious question: can you teach an LLM? If I had asked Copilot to summarize a page on "Wood and Moisture", then walked it through the process of determining and identifying a species, then worked to approximate a better answer, would that have gone into its reference material to make it better answer such questions in the future, or just ended up as notations in my friend's chat history?
Ah. My sole experience with Copilot was when I was discussing specs with a friend who was going to buy a tractor, and wanted to be able to use it to do timbering on the 90-acre farm he'd just bought (from my family). We were discussing hydraulic specs for lift capability of the loader, and I was showing him The Wood Database. He pulled out his phone and asked Copilot, "What's the weight of a green pine log 28 inches in diameter and 8 feet long?" Copilot cited the formula for the volume of a cylinder, derived the radius in feet from the diameter in inches, calculated the volume of the log, multiplied it by 25.2 lb./cubic foot, which it cited as the weight of pine, and replied, "The weight of a green pine log 28" in diameter and 8' long is 862 pounds."
Having seen some examples of Chat-GPT5 trying to do arithmetic, I was frankly impressed. It knew what it was doing, explained it accurately, did the math correctly (I checked), and produced a prompt response.
Unfortunately, it was wrong. I happened to have the page for White Pine open, and noted that 25 lb./cu.ft. is the Average dried weight, which is the weight at 12% moisture content - nominal air-dry moisture content (MC is the weight of water in the wood divided by the weight of oven-dried wood - nominal 8% moisture). MC of freshly-cut wood can vary from 35% to 200%, depending on the species and a few other factors. Since water is 62.4 lb./cu.ft., it has a disproportionate effect on the weight of a green log - so the answer was at least 25% low.
Also, there are 29 species of pine used in the US, most of which grow here. It picked the spec for White Pine or a similar soft pine, but my friend's land has more Scots Pine than White Pine, and the average dried weight of that is 34 lb./cu.ft. - 36% heavier, before correcting for moisture content (the Red Pine on my lot is also 34 lb./cu.ft.).
It's impossible to know what source it used for its figure, and whether the reference was simplistic, or if it didn't parse "green" correctly, or it had no referents to interpret the question, but it was a clear example of the dangers of an apparently authoritative answer if you don't know enough to evaluate it. It's possible I could have coaxed a correct answer out of it with successive prompts that took it through the above data, but it would have been a great deal faster to just run the numbers myself.
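A rough sketch of "running the numbers" with moisture content taken into account, using the figures from the post (25 and 34 lb./cu.ft. average dried weight at 12% MC). Everything else is an assumption made for illustration: oven-dry wood is treated as 0% MC per the definition of MC given above, volume shrinkage between green and air-dry is ignored, and the 100% green MC is an arbitrary pick from inside the 35% to 200% range, not a measured value. The function names are hypothetical.

```python
import math

def log_volume_ft3(diameter_in, length_ft):
    """Volume of a log modelled as a perfect cylinder."""
    radius_ft = (diameter_in / 2) / 12
    return math.pi * radius_ft ** 2 * length_ft

def green_density(air_dry_density, green_mc, air_dry_mc=0.12):
    """Estimate green density from air-dry density.

    MC = weight of water / weight of oven-dry wood, so
    weight at a given MC = oven-dry weight * (1 + MC).
    Assumption: shrinkage is ignored, so volume is held constant.
    """
    oven_dry_density = air_dry_density / (1 + air_dry_mc)
    return oven_dry_density * (1 + green_mc)

ASSUMED_GREEN_MC = 1.0  # 100%: illustrative only, real values vary widely
AIR_DRY = {"white pine": 25.0, "Scots pine": 34.0}  # lb/cu.ft., from the post

volume = log_volume_ft3(28, 8)
estimates = {
    name: volume * green_density(density, ASSUMED_GREEN_MC)
    for name, density in AIR_DRY.items()
}
for name, weight in estimates.items():
    print(f"{name}: roughly {weight:.0f} lb at {ASSUMED_GREEN_MC:.0%} MC")
```

The exact figures depend entirely on the assumed moisture content and species, which is the point: the 862 lb answer silently fixed both.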