Why The New York Times might win its copyright lawsuit against OpenAI

ranthog

Ars Legatus Legionis
15,455
I don't see the difference. We research chemical and biological weapons of mass destruction. What they do we absolutely do need to match, not to be able to use it, but in order to be able to defend against it.
We absolutely do, but we put ethical and legal limitations on the research we do.

Does the fact that China is doing some things that would very much be unethical and illegal here in the name of medical research mean that the US should allow companies here to do so?

Literally no one is saying the US should shut down AI research.
 
Upvote
18 (18 / 0)
The DoD researches chemical and biological weapons. Not Pfizer. OpenAI isn't the DoD or a part of government.
No, OpenAI is arguably more capable than the DoD in this matter. The former has a chance of aiding the DoD in defending against automated foreign electoral interference, the latter, by itself, has none.
 
Upvote
-17 (2 / -19)

Sysosmaster

Smack-Fu Master, in training
6
Subscriptor++
Yes, but how? And how long can you predict? Right now we're up to 128k tokens with GPT-4 Turbo. That's a book. It doesn't matter if the machine is parroting if it's parroting coherent thought, and if it's parroting coherent thought, how is that different from thinking?
How long... depends on memory size and processing capacity. In theory it’s unlimited; vendors cap you, though. But the system could go on and on.

And to put it into perspective, for an image each pixel is “a token”….
 
Upvote
1 (2 / -1)

Aurich

Director of Many Things
41,441
Ars Staff
It reminds me of what Oppenheimer said in the recent biopic. He didn't know whether the allies could be trusted with the bomb, but he knew the Nazis could not be.

What I find compelling is what the Russians did in 2016 with troll farms. If they can automate that, and if we cannot detect it, because of the way our electoral system works, they will choose our leaders.

If they can choose our leaders they can destroy NATO. If they can destroy NATO they have more countries they can start to chop pieces off. If they go far enough, we'll have a new world war. That's what I find compelling. And that's ignoring the "cyber" warfare capabilities of AI.

Edit: And I shouldn't need to mention that if there is another world war, in an age of nuclear weapons, there is a good chance none of us survive. That's it. World over. Everything dead. That's compelling to me.
Troll farms are already automated, using technology from US companies, like OpenAI.

This nation state nuclear weapon analogy really doesn't hold up to any scrutiny.

I'm not ready to throw our laws out over FUD. That's how we get Trumps.
 
Upvote
22 (23 / -1)

Aurich

Director of Many Things
41,441
Ars Staff
I'm gonna start a new company called hellafreemovies.com where I will upload and distribute movies I torrent freely to everyone.

When the copyright holders come after me I will patiently explain that Russia and China don't give a fig about their copyrights, and thus in order for me to compete with them I don't have to follow the laws.

Checkmate Hollywood!
 
Upvote
31 (34 / -3)
So if you make a work that then looks a lot like a copyrighted work, then you are leaving fair use.
That's not the AI's fault; it is you, the user,
just as would be the case if you were drawing it or using a film clip, and I'd argue we already have rules for and against that...
BUT it allows me to locally and privately make up neat stuff that I don't share, for fun.
Once one distributes that stuff, you run afoul of copyright law.
End of story. Are we going to sue pencils and all the software that can clip things?

This is just Hollywood, and in this case a newspaper, being all whiny. They can't nickel-and-dime us all for anything and everything.

ONE also forgets that copyright is supposed to aid them in continuing to make art.
Is news art? I'd say no. It's factual stuff.
The comics part might be artful, but that's it.
Just look at Fox News getting sued for 700+ million because they tried the defense that they are not news but entertainment.

Is the New York Times NEWS or entertainment?
 

Upvote
-15 (0 / -15)

Aurich

Director of Many Things
41,441
Ars Staff
You would have a point if those movies decided elections, but they don't. This is reductio ad absurdum.
Your existential crisis isn't an argument.

Sorry, but "we must cede all moral, ethical, and legal decisions to giant commercial entities for the sake of our freedom" isn't the side I want to be on. That, to me, is absurd.
 
Upvote
42 (42 / 0)

Baumi

Ars Tribunus Militum
2,475
BUT it allows me to locally and privately make up neat stuff that I don't share, for fun.
If, as the lawsuit alleges, it regurgitates someone else’s copyrighted material, neither you nor it is “making up stuff”. And it wouldn’t matter whether you share it with anyone, because the problem would be that the AI company shared it with you without having permission to do so.

ONE also forgets that copyright is supposed to aid them in continuing to make art.
Is news art? I'd say no. It's factual stuff.
Facts can’t be copyrighted, but articles about them can. The authors of this article couldn’t successfully bring a copyright claim against someone repeating the facts of the case based on this article. They could, however, do so if someone copied a sufficiently large amount of text verbatim from their article without getting permission to do so.
 
Upvote
15 (16 / -1)

ranthog

Ars Legatus Legionis
15,455
The New York Times is. That may not be their intent but if they win, that will be the effect. There will no longer be enough data to train a language model. How would you "teach" a robot to understand modern society if it could not learn from copyrighted sources? You could not. Meanwhile the Russians and Chinese would not give a fuck, and we would suffer the consequences.
Easy. The commercial organizations we're dealing with such as Google, Facebook, and Microsoft all have the pockets to invest in this, if it is anywhere near as important as you've indicated.

People should both be compensated for the use of their copyrighted material and have to agree to the use of it.
 
Upvote
10 (13 / -3)

ranthog

Ars Legatus Legionis
15,455
Like I said before. The amount of data necessary to train a language model does not exist without copyrighted data, just like you could not raise a child in modern society to understand modern society without exposure to copyrighted material.
Then pay for the right to use it. Respect that not everyone will agree to let you use it.
 
Upvote
15 (17 / -2)

fenris_uy

Ars Tribunus Angusticlavius
9,298
I don't think it's ceding ethical decisions. If a human can learn from books, a machine should be able to as well. A human can memorize books and articles too if they read one enough times. In the case of AI, when that happens, it is legitimately an error. There is no malice.

I am not making that argument because I think the existential one is of more importance, and yes, survival must come before copyright infringement because while the latter might save Mickey Mouse, the former will save lives.

I don't think what I am arguing is an "existential crisis". It's reality. If NYT wins, AI will be effectively crippled if not illegal in the US, leaving us with no defense against our enemies. I don't like it, but I don't see good options here.

Yeah, no options, surely a company that got a $10B investment from MS and is supposedly valued at $80B has no options about how to access copyrighted material, no options at all.

If having access to copyrighted material is the difference between their $80B company existing or not, they can pay for that material. It's not a hard thing to do.

Also, if OpenAI goes belly up, I'm sure that $3T MS or $1.5T Alphabet (both make about $20B in profit per quarter) could acquire the company outright and have the deep pockets needed to license the training data.
 
Upvote
14 (15 / -1)

stdaro

Ars Scholae Palatinae
718
The New York Times is. That may not be their intent but if they win, that will be the effect. There will no longer be enough data to train a language model. How would you "teach" a robot to understand modern society if it could not learn from copyrighted sources? You could not. Meanwhile the Russians and Chinese would not give a fuck, and we would suffer the consequences.
How many billions of documents does it take for a child to learn a language? How many hours of audio?
The fact that these models 'need' the stolen creative output of millions of people should be a clue that they don't 'learn' at all. They are glorified compressed lookup tables, with no capability to analyze, learn or synthesize.
 
Upvote
23 (27 / -4)
Yeah, no options, surely a company that got a $10B investment from MS and is supposedly valued at $80B has no options about how to access copyrighted material, no options at all.

If having access to copyrighted material is the difference between their $80B company existing or not, they can pay for that material. It's not a hard thing to do.
It's fair use to create statistics from copyrighted material. What word is most likely next is what's recorded, not the text itself. When you predict word after word and get it right, sometimes it matches the original, but generally when that happens it's undesirable.

Besides which, the amount of data necessary to train a language model, licensed to do so, would be impossible for any single entity to purchase. It's also likely not possible to filter the volume of text necessary to remove all copyrighted works.

The training works by feeding significant chunks of the internet into a model until it can predict the next word accurately. It's not possible to filter significant portions of the internet for copyrighted works completely. The technology doesn't exist. If that's the standard you have, then you outlaw AI (but only for the US, not for our enemies). This would be like chopping off your nose to spite your face.
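
For illustration only, here is a toy sketch of what "recording which word is most likely next" could look like, and of how such statistics can still hand back the source text verbatim when the training text is small. This is a plain bigram counter, not the neural network that GPT-style models actually use; the function names `train_bigram` and `generate` and the sample corpus are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, max_words=10):
    """Greedily emit the most frequent next word, starting from `start`."""
    out = [start]
    for _ in range(max_words - 1):
        followers = counts.get(out[-1])  # None if the word was never followed
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


# Hypothetical one-sentence "training set" in which every word is unique.
corpus = "facts cannot be copyrighted but articles about them can"
model = train_bigram(corpus)

# Only word-to-word counts were stored, yet greedy prediction
# walks straight back through the original sentence.
print(generate(model, "facts"))
```

With a corpus this small, each word has exactly one recorded successor, so the "statistics" and the text are effectively the same thing. Whether anything comparable happens at the scale of a real model is exactly what the two sides here disagree about.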
 
Upvote
-18 (4 / -22)

ranthog

Ars Legatus Legionis
15,455
That is literally impossible. You are arguing impossible standards which would effectively outlaw AI, and at the worst possible time. Authors are not compensated when children read a library book. Nor should language models that learn from them. They do not copy them. Memorization indicates an error in training where it happens and I am not at all convinced that's what's happened here.
So you're saying a company worth 80 billion has no resources to pay for licenses for the material it is using?

I am certainly not suggesting that nonprofit research into AI that is going on in universities and the like be held to the same standard. Just those who are commercializing it.

A child is learning to read and understand. LLMs are not. Libraries also generally speaking pay publishers.
 
Upvote
19 (20 / -1)

fenris_uy

Ars Tribunus Angusticlavius
9,298
That is literally impossible. You are arguing impossible standards which would effectively outlaw AI, and at the worst possible time. Authors are not compensated when children read a library book. Nor should language models that learn from them. They do not copy them. Memorization indicates an error in training where it happens and I am not at all convinced that's what's happened here.

Last time I checked libraries don't steal books, they buy them. If you don't want your book to be available in a library put a claim in it saying so.
 
Upvote
17 (18 / -1)

Mr_AX

Smack-Fu Master, in training
60
I find the 'Italian plumber' example compelling, as an argument that generative AI shouldn't be allowed to use copyrighted materials for training without authorization. The generated images are clearly copies of the Nintendo Mario character. Yet there is no attribution, the AI company seems to want to profit by generating that image, and they got no prior authorization to use the original Mario images for training. All of that seems wrong and something that should be illegal. A human artist would not be allowed to do that, why should an AI company?

In a similar vein, if I were an author, I would want to have control over whether my works were used to train AI. The training materials could be used to give the AI the ability to generate new text in my style. Maybe that's ok if I have approved using my works for training, but it feels like theft if I did not. Personally I think I would exercise my control to disallow training on my works. But other possibilities are to allow it in exchange for compensation, or without. Anyway I think authors/rights holders should have that kind of control, instead of AI companies being allowed to hoover up anything copyrighted without consent and then profit from it in any way they like, including ones that could compete with or replace the author who actually created the 'training material'.
 
Upvote
22 (24 / -2)

veldrin

Ars Tribunus Militum
2,828
"Do whatever you want, ethically and legally, in a monomaniacal pursuit of intelligent machines" isn't a philosophy I find particularly compelling.

How about these companies, which are massive commercial ventures and not the saviors of humanity, follow the law?
As is often the case with copyright, it's not actually clear what the law is when your use doesn't involve mechanistically copying works verbatim. And even then, as noted by the article, it's still possible to copy millions of works verbatim and still fall under fair use.
 
Upvote
4 (5 / -1)

Aurich

Director of Many Things
41,441
Ars Staff
I don't think it's ceding ethical decisions. If a human can learn from books, a machine should be able to as well.
Machines aren't people. They're not children. I flat out reject any and all analogies that assume what's fine for one is fine for the other. And so does the law so far. A human can hold copyright, a machine cannot. Because they are not equivalent.

A human can memorize books and articles too if they read one enough times. In the case of AI, when that happens, it is legitimately an error. There is no malice.
There is no malice because machines are incapable of motives and emotions. Corporations, being made of people, are. I think it's abundantly clear that OpenAI is not an ethical organization, and you'd be as naive to trust them as to announce that Google is not going to do any evil and therefore should hold everyone's data.

I am not making that argument because I think the existential one is of more importance, and yes, survival must come before copyright infringement because while the latter might save Mickey Mouse, the former will save lives.
You've made no convincing argument that by allowing OpenAI to scrape the internet as they see fit, lives will be saved. "But Russia!" isn't actually an argument. It's FUD along the lines of saying "the only way to stop a bad guy with a gun is a good guy with a gun" when people are discussing sensible gun regulations.

I don't think what I am arguing is an "existential crisis". It's reality. If NYT wins, AI will be effectively crippled if not illegal in the US, leaving us with no defense against our enemies.
I'm sorry, but what?

Which part of "OpenAI is the shield against our enemies" exactly makes sense to you? They're a for-profit corporation making money. Along with every other big company scraping all our data to grind in their black boxes.

This isn't Red Dawn. The LLMs are not going to defend us.
 
Upvote
35 (36 / -1)

Thegs

Ars Scholae Palatinae
911
Subscriptor++
I promise I read the whole article, and I apologize for going off on a tangent, but I am really curious about mp3.com and a product that seemingly did similar things, Google Play Music. A service Google Play Music offered is that you could upload copies of your own music to their server and it would act like an online music player. So if I had a copy of Tunak_Tunak_Tun.mp3 on my computer, I could upload it and stream it anywhere I could log in with my Google account.

However, they were also smart about storage, and if they already had a copy of Tunak Tunak Tun, it would skip the upload part and just add the song to my account, saving me upload time and them storage (both more pressing concerns in 2011). So I guess I'm curious, why was mp3.com assailed with lawsuits to shut it down, but Google Play Music wasn't? Or maybe it was, and I'm just unaware, and that's why Google quietly shut down the upload service?
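
As a rough sketch of the "skip the upload if they already have a copy" behavior described above: one common way to do this is to key stored files by a content hash. This is an assumption about the general technique, not a description of how Google Play Music actually worked (it may well have matched on audio fingerprints or metadata instead), and the class and method names here are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy music locker that keeps one copy of each distinct file."""

    def __init__(self):
        self.blobs = {}      # content hash -> file bytes, stored once
        self.libraries = {}  # user name -> set of content hashes

    def add_track(self, user, data):
        """Add a track to the user's library.

        Returns True if an identical file was already stored, meaning
        the upload could have been skipped.
        """
        digest = hashlib.sha256(data).hexdigest()
        already_stored = digest in self.blobs
        if not already_stored:
            self.blobs[digest] = data
        self.libraries.setdefault(user, set()).add(digest)
        return already_stored


store = DedupStore()
song = b"placeholder bytes standing in for Tunak_Tunak_Tun.mp3"
first = store.add_track("alice", song)   # nothing stored yet
second = store.add_track("bob", song)    # identical bytes already present
print(first, second, len(store.blobs))
```

Either way, the storage trick means one server-side copy ends up serving many accounts, which is why the comparison with mp3.com seems worth asking about.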
 
Upvote
6 (6 / 0)

Xepherys

Ars Scholae Palatinae
962
Subscriptor
"Do whatever you want, ethically and legally, in a monomaniacal pursuit of intelligent machines" isn't a philosophy I find particularly compelling.

How about these companies, which are massive commercial ventures and not the saviors of humanity, follow the law?

legality != morality

morality != legality

While it's easy to throw ire at megacorps, the exact same argument is used by asshats when saying "why didn't they just follow the law?" in regards to young black men being sent to prison for 20 years for having four joints on them during an unlawful detention.

If "follow the law" is your answer to literally any question, then you're no better than the "Rule of Law" Republicans. I suspect you have much higher "real" moral ground, but the argument is a slippery slope, at best, and disingenuous at worst.

If we, as a society (Western society) actually adhered to the law on the whole, and those who created and enforced laws did so fairly and without bias, you might have a valid point. As it stands, in Western societies and doubly so in the United States, that's poor rhetoric.
 
Upvote
-10 (2 / -12)

Aurich

Director of Many Things
41,441
Ars Staff
As is often the case with copyright, it's not actually clear what the law is when your use doesn't involve mechanistically copying works verbatim. And even then, as noted by the article, it's still possible to copy millions of works verbatim and still fall under fair use.
Sounds like we should have a court case!
 
Upvote
9 (10 / -1)
So you're saying a company worth 80 billion has no resources to pay for licenses for the material it is using?
To do what you're asking they would need to license everything in the training which is a large part of the internet. Not everything is properly attributed so you'd need to factor in plagiarism as well. You'd need to accurately attribute everybody's words, contact everybody, negotiate with everybody, and only then begin training. The logistics alone of making all those requests make it impossible.

And let's assume we figure out a way that we need less data. Maybe only what a child needs to learn. Children learn from copyrighted books. You would have to raise your hypothetical robot away from copyrighted sources. In modern society this isn't possible. Kids read books, watch movies, listen to radio. They don't memorize unless you make them read things many times and when you do so you have wasted that child's valuable time. I don't see how it is any different here. You would kill this technology because you're upset about the economic effects it might have. I get that, and there will be consequences. OpenAI admits it. But also we do need to do this before others do because there are worse things than us losing our jobs, believe it or not.
 
Upvote
-19 (1 / -20)
To do what you're asking they would need to license everything in the training which is a large part of the internet. Not everything is properly attributed so you'd need to factor in plagiarism as well. You'd need to accurately attribute everybody's words, contact everybody, negotiate with everybody, and only then begin training. The logistics alone of making all those requests make it impossible.

And let's assume we figure out a way that we need less data. Maybe only what a child needs to learn. Children learn from copyrighted books. You would have to raise your hypothetical robot away from copyrighted sources. In modern society this isn't possible. Kids read books, watch movies, listen to radio. They don't memorize unless you make them read things many times and when you do so you have wasted that child's valuable time. I don't see how it is any different here. You would kill this technology because you're upset about the economic effects it might have. I get that, and there will be consequences. OpenAI admits it. But also we do need to do this before others do because there are worse things than us losing our jobs, believe it or not.
It’s weird that you think it’s bad to kill a technology that causes harm to people. Like you think technology is more important than actual human happiness.
 
Upvote
11 (13 / -2)

Aurich

Director of Many Things
41,441
Ars Staff
legality != morality

morality != legality

While it's easy to throw ire at megacorps, the exact same argument is used by asshats when saying "why didn't they just follow the law?" in regards to young black men being sent to prison for 20 years for having four joints on them during an unlawful detention.

If "follow the law" is your answer to literally any question, then you're no better than the "Rule of Law" Republicans. I suspect you have much higher "real" moral ground, but the argument is a slippery slope, at best, and disingenuous at worst.

If we, as a society (Western society) actually adhered to the law on the whole, and those who created and enforced laws did so fairly and without bias, you might have a valid point. As it stands, in Western societies and doubly so in the United States, that's poor rhetoric.
"Laws can be unjust, therefore we should not advocate that companies should follow them" isn't an argument.

I hope The NY Times wins their case, because I believe their argument has merit. And I think OpenAI should be compelled to follow the law if they lose. That's my stance. Absurd comparisons to drug sentencing laws have no bearing on anything.
 
Upvote
18 (20 / -2)

fenris_uy

Ars Tribunus Angusticlavius
9,298
It's fair use to create statistics from copyrighted material. What word is most likely next is what's recorded, not the text itself. When you predict word after word and get it right, sometimes it matches the original, but generally when that happens it's undesirable.

Besides which, the amount of data necessary to train a language model, licensed to do so, would be impossible for any single entity to purchase. It's also likely not possible to filter the volume of text necessary to remove all copyrighted works.

The training works by feeding significant chunks of the internet into a model until it can predict the next word accurately. It's not possible to filter significant portions of the internet for copyrighted works completely. The technology doesn't exist. If that's the standard you have, then you outlaw AI (but only for the US, not for our enemies). This would be like chopping off your nose to spite your face.

You severely underestimate the amount of money that big tech has available compared with license holders.

Apparently, all of Reddit's data was just 60 Million a year. If Twitter and Facebook weren't trying to do their own models, you could probably buy both of theirs for less than 500M a year.

I doubt that you would need more than a couple of billion a year to get Penguin, Condé Nast, Harper, Pearson, etc. to entice the authors that they publish to license their works to LLMs.

You don't need every author, you just need a lot of them. If GRRM doesn't agree, you don't care, that's only 5000 pages of work, that isn't going to break your LLM, you can probably get all of the authors that sell less than 10k copies for a few dollars.

It takes time, and it takes money and contracts. But it isn't something insurmountable for a big corporation.

And if the Feds think that it's imperative to have an LLM of their own, they can do their own without those limits that we put on companies.

If you think that a Russian troll farm can create an LLM better than the US government, you have too little faith in the US.
 
Upvote
15 (15 / 0)
This is a really fantastic and illustrative treatise on the issues involved. I particularly loved the focus on previous cases which could be applicable. As a copyright wonk of sorts myself, it's really nice to see these kinds of works which seem designed to help explain some of the messy nuance of copyright to folks, because copyright is rarely as straightforward as people want it to be.
 
Upvote
9 (9 / 0)
Apparently, all of Reddit's data was just 60 Million a year.
Because they're not paying the Redditors, duh. Imagine hunting down every author who has ever put words on the internet. That's not possible. Everybody, even you and me, would have cause to sue. It would outlaw AI.
 
Upvote
-18 (1 / -19)

fenris_uy

Ars Tribunus Angusticlavius
9,298
Because they're not paying the Redditors, duh. Imagine hunting down every author who has ever put words on the internet. That's not possible. Everybody, even you and me, would have cause to sue. It would outlaw AI.

No, the Ars TOS probably gives them rights over what we post here. And the Reddit TOS gives them rights over what we post there.

And if you don't know who the copyright holder of www.fenris_uy.com is, you just don't use that site in your training.
 
Upvote
12 (13 / -1)
It’s weird that you think it’s bad to kill a technology that causes harm to people. Like you think technology is more important than actual human happiness.
I see its potential as well as the harm it can cause. The harm we see right now exists because our society is not ready for the consequences, not because the technology itself is evil somehow. I see these predictive models as tools. They can be used as weapons or shields, and we don't want to be in a position where the other guy has a capable weapon and we have no shield, and there can't be any human happiness if there are no humans left.
 
Upvote
-18 (1 / -19)
No, the Ars TOS probably gives them rights over what we post here. And the Reddit TOS gives them rights over what we post there.

And if you don't know who the copyright holder of www.fenris_uy.com is, you just don't use that site in your training.
There isn't enough to allow-list. Maybe Meta can do it with everything that's posted on Facebook but if they did such a model would only be able to generate Facebook posts.
 
Upvote
-17 (1 / -18)