Microsoft removes guide on how to train LLMs on pirated Harry Potter books

WereCatf

Ars Tribunus Militum
2,830
Probably the most upsetting part about all this is the massive hypocrisy: all the companies fully knowingly violate copyright laws left, right and center in massive quantities while at the same time attacking consumers for even just one, single picture or a song or whatever. And the companies practically never face any consequences for any of it. No, rather they tend to be rewarded for it!
 
228 (230 / -2)

DNA_Doc

Ars Scholae Palatinae
904
“No one wants to write fan fiction about books that are in the public domain.”

Tell that to Virgil.
Or to Gregory Maguire (Wicked, based off of Wizard of Oz), or to Margaret Atwood (The Penelopiad, retelling of Homer's Odyssey), or to any of the authors that have been engaged in modern retellings of Grimms' fairy tales, or to the many authors that have expanded the Sherlock Holmes "universe", or...

(edited to fix typos as I noticed them)
 
129 (129 / 0)

MilanKraft

Ars Tribunus Angusticlavius
6,711
Or to Gregory Maguire (Wicked, based off of Wizard of Oz), or to Margaret Atwood (The Penelopiad, retelling from Homer's Odyssey), or to any of the authors that have been engaged in modern retellings of Grimm's fairy tales, or to the many authors that have expanded the Sherlock Holmes "universe", or...
Haven't you heard? At this "evolved" stage in human history, "all art works are derivative by definition" so shameless borrowing of characters, plots, imagery, components of ostensibly "original" music, et al are fair game. "Originality is a lie, bro." Everyone should just be able to make whatever they want, from whatever source material they want, and profit.
 
11 (21 / -10)

Mocker

Wise, Aged Ars Veteran
107
Man, as a professional technical writer, the idea that you would let people publish stuff without even a basic knowledge of copyrights and trademark is mind blowing. I recently used a stanza from Jabberwocky and, even though I was fairly certain it was in the public domain, I verified it just to make sure.
 
118 (119 / -1)

graylshaped

Ars Legatus Legionis
67,692
Subscriptor++
Harry meets a new friend on the Hogwarts Express train who tells him all about Microsoft’s Native Vector Support in SQL “in the Muggle world.”
And my daughter thought the originals were delightful at the time. If only we had "AI" to find even more exciting paths...
 
12 (14 / -2)

graylshaped

Ars Legatus Legionis
67,692
Subscriptor++
Man, as a professional technical writer, the idea that you would let people publish stuff without even a basic knowledge of copyrights and trademark is mind blowing. I recently used a stanza from Jabberwocky and, even though I was fairly certain it was in the public domain, I verified it just to make sure.
Them slithy toves gonna mess you up good! Make you all mimsy.
 
57 (58 / -1)

pokrface

Senior Technology Editor
21,512
Ars Staff
The archive link at the start of the article is redirecting to RT news
This appears to be a somewhat common problem—if you do a web search for "archive redirecting to RT" you'll see many reports of it on reddit and hackernews and other sites. Turning off your VPN reportedly fixes the issue; it's also possible that switching your DNS will fix it.

(FWIW, the link works fine for me—screenshot).
 
33 (33 / 0)
My mind often goes to Disney, how they've weaponized copyright laws, while making shorts and movies off of old fairy tales, some in the public domain.

However, this isn't it, and I find it wild that going "oh, I didn't know one of the most popular book series in the past 30 years isn't in the public domain!" flies. Plenty of well-known real public domain works to test as data sets.

Must be nice to be too big to touch.
 
85 (85 / 0)

Willie McBride

Smack-Fu Master, in training
57
Or to Gregory Maguire (Wicked, based off of Wizard of Oz), or to Margaret Atwood (The Penelopiad, retelling of Homer's Odyssey), or to any of the authors that have been engaged in modern retellings of Grimms' fairy tales, or to the many authors that have expanded the Sherlock Holmes "universe", or...

(edited to fix typos as I noticed them)

In 1897 Jules Verne published a sequel to Edgar Allan Poe's 1838 novel The Narrative of Arthur Gordon Pym of Nantucket. For some reason the idea of one of the most prolific and influential writers of the 19th century, whose works were and still are regularly adapted into movies and tv shows, continuing a story written by one of the giants of literature always amuses me.
 
41 (41 / 0)

Fatesrider

Ars Legatus Legionis
24,977
Subscriptor
What better way to show “engaging and relatable examples” of Microsoft’s new feature that would “resonate with a wide audience” than to “use a well-known dataset” like Harry Potter books, the blog said.

This woman should be struck repeatedly about the legal head and shoulders with a nice, thick tome of copyright laws. It isn't a fucking "data set". It's copyrighted material, and you have to pay for using it, you stupid fuck.
 
64 (67 / -3)

Steve austin

Ars Scholae Palatinae
1,752
Subscriptor
Haven't you heard? At this "evolved" stage in human history, "all art works are derivative by definition" so shameless borrowing of characters, plots, imagery, components of ostensibly "original" music, et al are fair game. "Originality is a lie, bro." Everyone should just be able to make whatever they want, from whatever source material they want, and profit.
The examples given (the originals of Wizard of Oz, Grimms’ fairy tales, Sherlock Holmes, Homer’s works, Alice in Wonderland, and others) really are public domain, and contain characters and stories that are still very popular, so could have been freely used for this. Using things that are so obviously still under copyright, and using them so blatantly (with instructions on how to download them) should make it hard for a court to buy the “derivative” argument, and deservedly so. It would be nice for the courts to stomp hard on all the AI companies - unfortunately for us, given the money involved, the current administration, and how much of the judiciary appears beholden to the administration, it seems unlikely to happen.
 
38 (40 / -2)
“if a company is risk averse, this would probably be flagged.”

Problem is (for the public), that risks for corporations, generally, are massively diminished with the current administration.
(having a convicted felon as a POTUS used to be an unfathomably ludicrous idea)
Copyright infringement is a civil law matter, not a criminal law one. It would be pursued by copyright holders and is not affected by the attorney general's competence or attitude.
 
-3 (5 / -8)
Copyright infringement is a civil law matter, not a criminal law one. It would be pursued by copyright holders and is not affected by attorney general's competence or attitude
But it is being impacted by the Trump packed SCOTUS and increasingly the federal courts that view the oligarchy and corporations as having more rights than those of us who create.
 
10 (14 / -4)

Cthel

Ars Tribunus Militum
9,639
Subscriptor
Just in case anyone is under the mistaken impression that copyright terms are reasonable, I’ll note for the record that the Harry Potter books will not enter the public domain for another CENTURY. If we’re lucky.
I thought US copyright terms were currently death-of-the-author-plus-70-years?
 
8 (8 / 0)

Wheels Of Confusion

Ars Legatus Legionis
75,398
Subscriptor
This appears to be a somewhat common problem—if you do a web search for "archive redirecting to RT" you'll see many reports of it on reddit and hackernews and other sites. Turning off your VPN reportedly fixes the issue; it's also possible that switching your DNS will fix it.

(FWIW, the link works fine for me—screenshot).
It's not a VPN issue. I've encountered it before without going through a VPN. In fact, it seems to happen a lot with links posted in Ars' own The Soapbox; more often than not I get the RT redirect rather than what the poster was trying to link to.
 
13 (13 / 0)

Fred Duck

Ars Tribunus Angusticlavius
7,166
Ashley Belanger said:
To do this, he likened it to having a spell that helps you find exactly what you need among thousands of options, instantly...
For example, which public domain works are available for use as examples.

Ashley Belanger said:
Hacker News commenters suggested the blog could be considered fair use, since the training guide was for “educational purposes,” and Smith said that Microsoft could raise some “good arguments” in its defense.
Ashley Belanger said:
On Hacker News, some commenters defended Kamath’s blog, urging that it should be considered fair use since nonprofits and educational institutions could do the same thing in a teaching context without issue.
They can post links to pirate data sets without issue as long as it's in a teaching context?

I remember when the hit software iTunes existed. There was a famous campaign, "Rip, Mix, Burn" which referred to the process of copying (or "ripping") tracks of music from small discs called CDs, mixing them into playlists via iTunes, and finally "burning" the data with high-powered laser beams onto other small discs called CD-Rs or CD-RWs, which you could then use as coasters.

Michael Eisner of the Walt Disney Company was not amused and claimed the computer industry were fostering piracy. (So this is a tradition that spans decades.)

However, in that case, normal users likely had some of those small musical discs lying about and sharing the output was limited to only a low amount, intended as fair use for friends/family, whereas now-a-days, normal users are unlikely to have DRM-free copies of their favourite eBooks ready for ingesting and there are no limits on output.

Now back to my fan fiction starring characters from the public domain Halo experiences.
 
13 (13 / 0)

pokrface

Senior Technology Editor
21,512
Ars Staff
It's not a VPN issue. I've encountered it before without going through a VPN. In fact, it seems to happen a lot with links posted in Ars' own The Soapbox; more often than not I get the RT redirect rather than what the poster was trying to link to.
Why is problem? Do you not want to read glorious news of Russia Today, tovarisch?! :D

(edit, more serious response: i've had issues with archive dot is and all its flavors before when I was using cloudflare DNS, and they went away when i switched to my own recursive resolver. I think the russian fellow running the site has beef with cloudflare—it wouldn't be the first time the site admin has done screwy things).
 
24 (24 / 0)
The blog, which is archived here, was written in November 2024 by a senior product manager, Pooja Kamath. According to her LinkedIn, Kamath has been at Microsoft for more than a decade and remains with the company.
A ‘Senior Product Manager’ would know that the HP books are not public. Hell, a ham sandwich would know that JKR is still around and still owns her books. That there were no repercussions for Pooja Kamath says a lot about Microsoft. This was deliberate. It was the product of a ‘steal first and if we get caught deny, delay, and excuse’ culture that is the core of Microsoft, the Tech Bro psyche, and all AI systems.
 
44 (44 / 0)

Abby Tangential

Smack-Fu Master, in training
47
A ‘Senior Product Manager’ would know that the HP books are not public. Hell, a ham sandwich would know that JKR is still around and still owns her books. That there were no repercussions for Pooja Kamath says a lot about Microsoft. This was deliberate. It was the product of a ‘steal first and if we get caught deny, delay, and excuse’ culture that is the core of Microsoft, the Tech Bro psyche, and all AI systems.

The problem is there aren't any ham sandwiches working at Microsoft. At least, none that I know of.
 
16 (16 / 0)

marsilies

Ars Legatus Legionis
24,392
Subscriptor++
It's not a VPN issue. I've encountered it before without going through a VPN. In fact, it seems to happen a lot with links posted in Ars' own The Soapbox; more often than not I get the RT redirect rather than what the poster was trying to link to.
It might be Archive.is / Archive.today doing the redirecting, since the link seems fine. I copied and pasted it into another window, and got a reCaptcha page. It may be some way to mitigate high traffic.

It's a bit interesting it was an Archive.is / Archive.today link in the first place, considering the news Ars posted a few days ago.

https://meincmagazine.com/tech-policy...ve-today-after-site-maintainer-ddosed-a-blog/
 
13 (14 / -1)