Hobbyist training AI on Victorian texts gets an unexpected history lesson from his own creation.
"Sure, you can get some information about how historical figures thought from their writing, but you could also get that information from just reading it?"

The problem is the amount you'd have to read and remember - the 6GB of text input into the LM is about 1.2 billion words. I read fairly quickly (about 900 wpm for technical material, slower for everything else), and it would take me at least 2.5 years of non-stop reading (and much longer if I actually work, eat, and sleep), and I doubt I would remember enough to come to any conclusions.
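For what it's worth, the commenter's arithmetic checks out. A quick back-of-the-envelope sketch, taking their own figures (1.2 billion words, 900 words per minute) at face value:

```python
# Sanity-check the reading-time claim: 1.2 billion words at 900 wpm,
# reading around the clock with no breaks. All inputs are the
# commenter's own estimates, not measured values.

WORDS = 1.2e9                       # estimated words in the 6GB corpus
WPM = 900                           # fast reading speed for technical text
MINUTES_PER_YEAR = 365.25 * 24 * 60 # minutes in an average year

minutes = WORDS / WPM
years = minutes / MINUTES_PER_YEAR
print(f"{years:.2f} years of non-stop reading")  # roughly 2.5 years
```

At a more realistic pace (say, a third of each day spent reading), the figure triples to about 7.6 years.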
" ...I’m not sure if just scaling the data up will ever result in reasoning but even now it kinda feels like digital time travel."
"If I train from scratch the language model won't pretend to be old, it just will be."
"This shows the model is beginning to remember things from the dataset."
"Training AI language models on period texts may allow for the creation of interactive period linguistic models that offer a researcher a chance to converse with a simulated speaker of an extinct vernacular or language of the past. The results would not necessarily be factually rigorous due to confabulations, but they could be stylistically illuminating for someone studying antique syntax or vocabulary in use."
"Knowing that a model can be trusted to dig out facts after being fed copious amounts of text makes it much easier to analyze said texts, especially if it can cite the excerpts it used to come up with a particular inference."

That's the problem. You can't "trust it to dig out facts." You can ask it to dig out facts, and then you have to go verify every single part of its answer. If you're looking for a specific topic or event and not "what happened in 1832," you're going to be able to search and vet your information faster and easier with traditional methods, by precisely the amount of effort you put into interacting with the LLM.
"There are some Victorian-sounding bits in there, but mostly the text is gobbledygook. It switches halfway down from past tense to future tense. Parts are nonsensical and ungrammatical ("was not bound in the way of private", "who first settled in the Gospel at Jerusalem", "a record of the prosperity and prosperity"), not to mention bad punctuation ("re counted", "be'known"). Other than the mention of Lord Palmerston, there is no indication of what happened in that year. It's word salad. The problem is that the remoteness of that era tempts one to excuse the nonsense as archaic. But no, "the day of law" is not Victorian English."

Thank you. It's infuriating that anyone is taking this output seriously. Fortunately, most of the people here at Ars are not.
It would probably appear coherent without saying anything.
"The problem is the amount you'd have to read and remember - the 6GB of text input into the LM is about 1.2 billion words. I read fairly quickly (about 900 wpm for technical material, slower for everything else), and it would take me at least 2.5 years of non-stop reading (and much longer if I actually work, eat, and sleep), and I doubt I would remember enough to come to any conclusions."

But what is the point of this? You aren't reading 6GB of text to understand a thought pattern. If you want to train your LLM to find you examples of a particular thought pattern then maybe you're on to something, but once this LLM is trained you can't exactly interrogate the weights to tease out a specific thought pattern. I don't mean to drop the "I'm a professional so I'm infallible" card, but I do sit here as a data scientist with a strong interest in historical linguistics wondering what on earth is the point of this?
"In defense of the article, I submit that the notable part is that an individual hobbyist trained, not just ran, an LLM on a specific and completely legit corpus and got an interesting result. That's usually considered the purview of huge corporations and institutions. I think that's a pretty big deal, actually, and I encourage more people to experiment with this tech in this way."

It is, because it highlights a case where these things can fill a useful niche. The next step is a network of like-minded people that, importantly, have some level of trust in the others and share ideas, etc.
"But what is the point of this? You aren't reading 6GB of text to understand a thought pattern. If you want to train your LLM to find you examples of a particular thought pattern then maybe you're on to something, but once this LLM is trained you can't exactly interrogate the weights to tease out a specific thought pattern. I don't mean to drop the "I'm a professional so I'm infallible" card, but I do sit here as a data scientist with a strong interest in historical linguistics wondering what on earth is the point of this?"

I'd think the point is to extend the technology to allow it to do useful things, which is the opposite of what the big AI companies are doing with brute-force scaling, and failing miserably at.
"See also: Large language models for the mental health community: framework for translating code to care [https://www.thelancet.com/journals/landig/article/PIIS2589-7500(24)00255-3/fulltext]"

"Mental health conditions are often both identified and treated through language, making them an ideal target for LLMs"

I understand why you might think something published in the Lancet would be a reliable source (Wakefield notwithstanding), but this is unreadable word salad that expends reams of text on saying nothing and was probably written by an LLM.
Christ on a bike.
"appears to have surprised him by reconstructing a coherent historical moment"

This struck me at once. It's exactly like Nostradamus. It tells us absolutely nothing about Palmerston or the nature of the protests, and in a garbled version of the newspaper syntax of the era. It's utterly useless.
Er, but it really didn't? What it wrote is nonsensical garbage. It contains the words "Palmerston" and "protest", yes. Everything beyond that is the reader doing a lot of interpretation. What's actually written is junk.
"This struck me at once. It's exactly like Nostradamus. It tells us absolutely nothing about Palmerston or the nature of the protests, and in a garbled version of the newspaper syntax of the era. It's utterly useless."

It's going to be stuck at basically this level of (non)functionality until someone develops a rudimentary memory or some other bio-inspired addition to the system. Users have to be aware that the output cannot be relied upon for factual answers, and some amount of human cognitive oversight is required. That's why I've been skeptical about their ability to offload any meaningful amount of work, on balance.
Modern history has been patiently uncovering more and more details of past events and linking them, often causing us to re-evaluate past events and identify the myths behind, say, the Spanish Armada or the conquest of what is now Texas. It looks as if "AI" is just going to come up with exactly the same kind of smearing of history because it does not have the ability to weight data.
This isn't strictly a news site.
"IMO one of the big issues with LLMs is that the corporate types that fund development for them use any content under the sun, including social media, fictional books, etc."

That assumes that the desire is to produce accurate answers. It's not that at all. Corporate overlords view LLMs as 1) a way to justify firing employees and using LLMs as half-ass replacements, and 2) engagement engines driven by the content of the Internet, probably the worst possible feedback loop imaginable.
My long-standing opinion is that this is a bad approach, and this just proves my point.
LLMs aren't bad. The training approach popular with big tech is the issue. Stop feeding LLMs irrelevant social media stuff. Start tailoring both the model and training toward specific areas. There are already a ton of projects that are doing this and seeing success. Because they aren't backed by VC, they just quietly slip by and do their thing.
"That assumes that the desire is to produce accurate answers. It's not that at all. Corporate overlords view LLMs as 1) a way to justify firing employees and using LLMs as half-ass replacements, and 2) engagement engines driven by the content of the Internet, probably the worst possible feedback loop imaginable."

It would be if they weren't or couldn't be supplanted with something better, and that's where these smaller systems come in. They are determined to fire people, and they likely will. People should boycott their shitty 'customer service' replacements.
"Why is this noteworthy or unexpected? An LLM got trained on a bunch of data, then regurgitated that data in response to queries. Isn't this exactly what LLMs are expected to do? I don't see any surprising "accident" here at all. The only mildly surprising bit is that the experimenter, with some alleged deep knowledge of the time period in question, was surprised by the output because he had never heard of it before, indicating that maybe his depth of knowledge was shallower than he presumed."

It's not even that. Just throw some random words into Google and you can likely find an event you didn't know about.
Or it may be surprising that an LLM actually managed to spit out a correct response, which isn't particularly common.
"It's not even that. Just throw some random words into Google and you can likely find an event you didn't know about.
Let's try this game... '1756 massacre'. Search for those words and ooh look there's a Wikipedia article about an event I'd never heard of. See, my 'LLM' (diceware) can teach me about history.
I'm afraid we're still at the monkeys with typewriters stage and looking for patterns in the clouds."

I think we're all really clear that LLMs do not produce factual output, except as a lucky dice roll. There are still two scenarios (at least) that this writeup provides a jumping-off point for:
"It's not even that. Just throw some random words into Google and you can likely find an event you didn't know about.
Let's try this game... '1756 massacre'. Search for those words and ooh look there's a Wikipedia article about an event I'd never heard of. See, my 'LLM' (diceware) can teach me about history.
I'm afraid we're still at the monkeys with typewriters stage and looking for patterns in the clouds."

As I've said before, LLMs are basically Magic Eight Balls that require terawatts of power to operate.
"...developer Hayk Grigorian trains his hobbyist AI models from scratch using exclusively Victorian-era sources—over 7,000 books, legal documents, and newspapers published in London between 1800 and 1875..."

This is completely unhinged. Completely false. It seems the writer hasn't noticed that Queen Victoria ascended in 1837... incidentally after the period the text is purportedly about (which was in the Regency era as William IV was on the throne).
"I have, through most extraordinary means, journeyed unto the year of our Lord two thousand and twenty-five, and beheld with mine own eyes the establishment of a most sublime and faultless system of public welfare in that distant land known as the United States of America. Verily, I say unto you, we are most urgently called upon to emulate and enact such noble reforms within our own age and realm..."
--- 19th-century time traveler and MP, upon his return from the future

What would that system be? Crypto-grift?
"It seems the writer hasn't noticed that Queen Victoria ascended in 1837... incidentally after the period the text is purportedly about (which was in the Regency era as William IV was on the throne)"

Err, no: the Regency lasted from 1811 till 1820. The future George IV acted as Regent for his father during George III's final mental illness.
But what makes this episode especially interesting is that a small hobbyist model trained by one man appears to have surprised him by reconstructing a coherent historical moment from scattered references across thousands of documents, connecting a specific year to actual events and figures without being explicitly taught these relationships.
"The problem is the amount you'd have to read and remember - the 6GB of text input into the LM is about 1.2 billion words. I read fairly quickly (about 900 wpm for technical material, slower for everything else), and it would take me at least 2.5 years of non-stop reading (and much longer if I actually work, eat, and sleep), and I doubt I would remember enough to come to any conclusions."

Well, it's not like this software could process 6GB in a few seconds and deliver something useful either.
"reconstructing a coherent historical moment from scattered references across thousands of documents, connecting a specific year to actual events and figures without being explicitly taught these relationships. Grigorian hadn't intentionally trained the model on 1834 protest documentation; the AI assembled these connections from the ambient patterns in 6.25GB of Victorian-era writing."

My understanding is that nowhere in the 6.25GB corpus was the protest ever described 'as a single thing' - there wasn't a news piece covering it, and no text ever described it in much detail; maybe it wasn't even mentioned by name. Rather, there were oblique references and "ambient patterns" in the way people wrote during that time period that the LLM 'knew' to combine into one single event.
"My understanding is that nowhere in the 6.25GB corpus was the protest ever described 'as a single thing' - there wasn't a news piece covering it, and no text ever described it in much detail; maybe it wasn't even mentioned by name. Rather, there were oblique references and "ambient patterns" in the way people wrote during that time period that the LLM 'knew' to combine into one single event."

Right. The article is written so poorly that it's pretty easy to misunderstand this point. And it's still not clear from the article how much information about the event there actually was or wasn't in the corpus.
"I can't tell from the article and didn't look into the references, but I think commenters here are misunderstanding what's been done.
It's not fact in => fact out; it's tangentially related facts in => fact out.
In visual terms, it's like training a model on a bunch of pictures taken near something but not of it and then having the model spit out an accurate picture of that thing based on, for example, people's reactions to it in other pictures.
I'm not saying the LLM 'figured out' anything but rather that the output was the most likely input to have caused the patterns in the data it was trained on. I expect it works better with text than other data because there's a lot of redundancy built into text."

No, it's a bunch of scattered facts creating a statistical likelihood that those similar facts will get regurgitated more than other, less frequently occurring facts.
"I can't tell from the article and didn't look into the references, but I think commenters here are misunderstanding what's been done.
It's not fact in => fact out; it's tangentially related facts in => fact out.
In visual terms, it's like training a model on a bunch of pictures taken near something but not of it and then having the model spit out an accurate picture of that thing based on, for example, people's reactions to it in other pictures.
I'm not saying the LLM 'figured out' anything but rather that the output was the most likely input to have caused the patterns in the data it was trained on. I expect it works better with text than other data because there's a lot of redundancy built into text."

Even so, "fact out" would be more of a fortunate occurrence; the quality and specificity of the training data still matters.
"It seems the writer hasn't noticed that Queen Victoria ascended in 1837... incidentally after the period the text is purportedly about (which was in the Regency era as William IV was on the throne)"

There is a tendency to refer to the whole post-Napoleon era as "Victorian"; there are a lot of overlapping threads, which makes it difficult to define a start point. Riots in 1834 were very much part of the post-Napoleon demands for more democracy, but the themes of the Victorian era - the Industrial Revolution, India, railways - had been developing fast since the 1820s.
"haven't read the comments yet, and yeah there's a lot to criticize here, but even I, an idiot in Norway, had to pause and point out that this is straight up factually effing wrong for effs sake what are we even doing here there are no truths anymore and here Ars is tilling more lies into the already lie-filled substrate that feeds the nothing-but-hallucination softwares - get a grip Ars, get a grip... take some time... re-examine your purpose..."

Just to make this easier for other readers who might not know the dates: the Victorian era is the common name for the period when Queen Victoria ruled the UK, from 1837 to 1901. So fully half of the period this guy used wasn't Victorian (and he's missing half of the era on the later end).