OpenAI says it’s “impossible” to create useful AI models without copyrighted material

Jorgon

Wise, Aged Ars Veteran
168
Subscriptor++
Google isn't creating new content based on what it crawls. Though it does do many illegal things, like stealing content from pages and showing it in search results - meaning people don't click on the link. Or the entire news thing. If this means Google gets punished and fined billions - well, it's about time.
Google's index is a new object. Not a trained Markov chain, mind you, but still creating data where none was before.
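To make the "new object" point concrete, a search index is a derived data structure mapping terms to documents, not a copy of the pages themselves. A minimal inverted-index sketch (toy documents, purely illustrative - not how Google's actual crawler or index works):

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids containing it.
    The index stores term -> ids, not the pages themselves."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical documents for illustration.
docs = {
    1: "AI models need training data",
    2: "Copyright law protects training material",
}
index = build_index(docs)
print(sorted(index["training"]))  # [1, 2] - both documents mention "training"
```

The index can answer "which pages mention X" without reproducing any page, which is the sense in which it is data that didn't exist before.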
 
Upvote
5 (5 / 0)

ambivalent

Smack-Fu Master, in training
96
My understanding of how LLMs work is that they do not ultimately store the original text at all - it's effectively a probability ranking. Right now original text can be reproduced if the original text is unique enough or the prompt is specific enough, certainly, and I can imagine OpenAI being forced to implement measures to prevent GPT from reproducing original copyrighted texts verbatim. But reading a public website and building a probability model from its contents doesn't seem like it would be legally any different from Google reading public websites to build indexes and search snippets. As long as GPT's output isn't verbatim and is based on countless other sources as well, it'd be difficult-to-impossible to directly attribute it to any specific website.

Basically, based on my admittedly limited knowledge of precedent, I can see OpenAI being forced to prevent reproduction of copyrighted text but I'm not sure they can be stopped from using it for training at all. I'd love to be wrong on this, to be clear!
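The "probability ranking" framing can be illustrated with a toy bigram model - a drastic simplification of an LLM, sketched here only to show that training stores transition counts rather than the source text verbatim (toy corpus, purely illustrative):

```python
from collections import defaultdict, Counter

def train_bigram(corpus):
    """Count word -> next-word transitions. The trained model holds
    frequencies, not the original sentences."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
model = train_bigram(corpus)

# After "sat", the only observed continuation is "on", seen twice.
print(model["sat"].most_common(1))  # [('on', 2)]
```

Note the caveat from the post applies even here: with a unique enough source and a specific enough prompt, such a model can still walk back through a memorized sequence, which is exactly the verbatim-reproduction risk described above.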
The problem with using this excuse is that it ultimately also precisely describes how many compression algorithms operate. Want to break copyright law? Just send me a JPEG of it! The fair use exemption usually requires that some creative or transformative step has taken place. Except this isn't happening with LLMs - all they're doing is combining multiple bits of different existing copyrighted stuff; there's no "original" bit. Is copyright infringement still copyright infringement if it happens on an industrial scale? It's classic salami slicing. I'm reminded of the 1983 film Superman III, where they're only taking a fraction of a cent off everyone's paycheck. Embezzlement isn't embezzlement in that case, right?
 
Upvote
-7 (2 / -9)
Without knowing the legalities, I would say it is already quite a different situation in that with AI you can do this on a superhuman scale: you can copy the trained AI, modify and adapt it, forever... If Joe Bloggs used a one-month subscription to read and learn the entire NYT archive, and then made a million NYT-educated clones of himself that he rents out for all sorts of jobs, then that might be a problem, too.
A problem? For whom?..
NYT is still in the business of writing new content.
 
Upvote
-2 (1 / -3)

GrimPloughman

Wise, Aged Ars Veteran
158
LLMs are a statistical method that uses data collected over the internet. But a human brain also collects data and then processes it; the processing method is only more sophisticated than in LLMs. You also need to consume and analyze art made by other artists to become a good artist yourself. Writers use similar cliches, musicians use similar techniques and tricks. For example, many fantasy books are similar, metalcore bands sound similar, Hollywood blockbusters are similar.

A technology can develop fast only if there is market demand and competition. This is why smartphones developed very fast while space rocket technology stagnated until the era of SpaceX and commercial flights. AI is a key technology these days, and the biggest problem with developing neural network models is finding a good data source to feed them. The technology can't progress fast without the enormous free data source that the internet is.
 
Upvote
-4 (4 / -8)

OOPMan

Ars Scholae Palatinae
1,412
I mean, the obvious answer is "pay people to use their work and let them opt-in for it". This isn't a hard problem, it just has a solution they don't like.

My business model of having a fast food restaurant where grocery stores give me ingredients for free and I have workers I don't have to pay also doesn't work without me being able to get free access to things other people have to pay for, but you don't see me crying to the EU about it.

But...but...but....

That means their ability to generate profit disappears!

Because, like all "good" start-ups, they basically only work because of VC capital being pumped in to shore up the money bleeding out.

If they have to start paying for all their stuff now, like some kind of "real" company, it won't work!

Oh wait...

You're telling me that's how it's supposed to go?

You're telling me that companies that can't find a balance between money in and money out should die?

Oooooooooooooooh!

XD
 
Upvote
11 (12 / -1)
So google books goes too then, right?

Because that’s the precedent OpenAI is alluding to when they say what they’re doing is fair use. To be sure, they’re overselling the amount of precedent, but also, it’s a serious argument.

It is possible to distinguish generative AI from google books, but you have to slice the salami pretty thin to do so, lean hard on brand new precedent from the Warhol case, and probably start with more than a smidge of motivated reasoning.

This forum was pretty pro-google and anti-authors guild when that all got litigated and I remain curious as to whether the zeitgeist has shifted away from defending Google’s open information pitch, or if tech communities genuinely support one but not the other. And if anyone is pro-google books and anti-OpenAI, are you comfortable with the apparent contradiction? Is it just results based?

(FWIW, I think both what google did with books and what OpenAI is doing with their model ought to be infringement.)
So basically the argument is a big whataboutism?

Well, Google does it, so why can't we?
 
Upvote
5 (7 / -2)

awelux

Ars Scholae Palatinae
828
Maybe it's time to reduce the copyright timeframe to something reasonable.
With copyright, the author gets protection for their work from society.
Similar to patents, it's reasonable to give the work back to society after some time.

The alternative to giving in a bit is an all-or-nothing decision on whether copyright covering AI use is a greater benefit overall.
And I don't see copyright winning that one.
 
Upvote
-1 (2 / -3)

steelcobra

Ars Tribunus Angusticlavius
9,772
But...but...but....

That means their ability to generate profit disappears!

Because, like all "good" start-ups, they basically only work because of VC capital being pumped in to shore up the money bleeding out.

If they have to start paying for all their stuff now, like some kind of "real" company, it won't work!

Oh wait...

You're telling me that's how it's supposed to go?

You're telling me that companies that can't find a balance between money in and money out should die?

Oooooooooooooooh!

XD
There's also the evil that people who are already rich are using it to screw over skilled workers. Language translators are being told that instead of standard rates, they can "do a faster job" just cleaning up machine-translated work at a reduced rate - a job that usually requires more time, to realize the autotranslation is junk, throw it out, and start from scratch, while getting paid half as much or less. One of the plans of the movie/TV studios was AI-generated scripts, so that writers would be paid just for editing, not at full writers' rates (with a byproduct being that now nobody is apprenticing in the writers'-room-to-showrunner pipeline).

Basically, people with no talent for actually creating anything, but who run the companies that do, see AI as a way to take even more for themselves while bleeding the people who do the work harder than ever.
 
Upvote
11 (12 / -1)
Minority opinion here. This push to have AI companies pay royalties to train will result in AI being owned by a handful of truly massive companies. There will be no small players. I think this is going to backfire.
This is what I think too. What are the possible outcomes?
1) Only the company with the most data gets to create a coherent AI. Meta, Apple, Amazon create mediocre AIs with their own data troves. Along with whatever they can buy from publishers. This creates a new incentive for Big Tech mergers.
Or
2) You sign away your data, and then it's shared/licensed/sold by major players to other major players. Again, AI is controlled by billionaires only.
Or
3) We mostly just use illegal and/or foreign-hosted LLMs, and responsibility falls on international law.

That's aside from the morality, philosophy, etc.
 
Upvote
-3 (1 / -4)

Stinkles

Ars Scholae Palatinae
812
I do not agree with most comments here.
There is a difference between training and distribution. Everyone reading the NYT's public content is allowed to learn from it and profit from what they have learned. If you did that, did you steal their content?.....
So if I read a NYT article and learn something, that is well within my rights (and the intention of the content). If I read a NYT article, copy it, and claim it as my own, that is plagiarism. You can use materials to inform and construct your own opinions, but you need to put those words or content in your own voice; AI can't do that, as it's literally designed to just analyze and replicate already existing content.
 
Upvote
4 (5 / -1)

WCR-790

Smack-Fu Master, in training
14
Copyright law has to be changed to catch up to modern times. For the past few decades, "intellectual property rights" has been the buzzword, and that has gotten out of hand. Copyright law has to reflect that, in order to progress, much material must be available in different ways than were possible before internet scraping, etc...
 
Upvote
-4 (2 / -6)

Gryphx

Ars Scholae Palatinae
715
I don't see a way that we shut down LLM technology because of copyright concerns. This horse has left the barn - LLM capabilities are too valuable for folks in power to walk away from. Do you really think the US Govt is going to say - OK, fair enough, let's pack this thing up - while China powers on full speed ahead? This is strategically significant technology that is potentially only the beginning of an exponential curve. And now that the technology to do this is open source, and scraping of public web content is fair use - do we really want to set up constraints so that the only people with the power of frontier LLMs are those with the power and money to do it in secret? Guess what - the NSA has all the training data they could ever want (https://nsa.gov1.info/utah-data-center/) - and I for one want to make sure that EVERYONE has access to the productivity increases made possible by generative AI, not just those with power and influence to do what they desire in secret.
 
Upvote
-4 (4 / -8)

Gryphx

Ars Scholae Palatinae
715
So if I read a NYT article and learn something, that is well within my rights (and the intention of the content). If I read a NYT article, copy it, and claim it as my own, that is plagiarism. You can use materials to inform and construct your own opinions, but you need to put those words or content in your own voice; AI can't do that, as it's literally designed to just analyze and replicate already existing content.
That's simply not true; they don't just regurgitate - these frontier models can and do make creative connections between learned data. I regularly hand one unique problems that no one else has ever faced exactly, and it generalizes and applies patterns it knows to help me solve my unique situation. I see the problem almost as the fact that it has a photographic memory, so at times it's repeating back exact training material rather than synthesizing it as expected.
 
Upvote
-6 (1 / -7)

krimhorn

Ars Legatus Legionis
39,865
Fair use generally covers excerpts, not entire copyrighted works, or has exceptions for non-profit, educational, etc. use cases. It doesn't cover a for-profit company making use of works in their entirety.
Well, they're highly likely to rely on the education exemption in their actual court battles (specifically in the US). Their argument is, basically, that they're not training models on copyrighted material to reproduce that material but to learn the styles associated with that material to create new material with the styles of the day. They'll argue that's the equivalent of showing a movie to a classroom of paying students to teach them about the stylistic makeup of that movie so they can then go out and make more movies with that knowledge.
 
Upvote
1 (2 / -1)

PaxTechnica

Wise, Aged Ars Veteran
158
I find myself in the unpopular position of thinking the correct application of copyright here is not on the consumption by OpenAI (since the information in question is freely viewable) but on the output. They already assign various scores to their output, such as how likely it is to be adult content, and then actively warn or filter on that. I see them eventually needing to do the same for how closely output resembles copyrighted material.

Regardless of what OpenAI does, the genie's out of the bottle on this one. In some industries, quality AI is going to be a huge factor in corporate competitiveness going forward. Regardless of how severely the US elects to slow it down inside its own borders, many other countries won't follow and will continue to use publicly accessible information in their AI training.
 
Upvote
-2 (1 / -3)

ip_what

Ars Tribunus Angusticlavius
6,181
So basically the argument is a big whataboutism?

Well, Google does it, so why can't we?

No, a court ruled that what Google did with its books project—scanning tons of copyrighted works, without author permission, to build a tool that they are monetizing—is fair use. I’m asking whether you think the Google Books precedent is the right decision. I outright said I don’t think it is. But I don’t wear a robe to work.

Also, a lot of the tech community was on Google’s side while that was playing out, which doesn’t seem consistent to me with what I’m seeing now. I’m wondering if there’s been a shift and people think they were wrong about google books, or if something else is going on here. And if it’s something else, what is it?
 
Upvote
2 (3 / -1)

LostFate

Ars Scholae Palatinae
972
I don't see a way that we shut down LLM technology because of copyright concerns. This horse has left the barn - LLM capabilities are too valuable for folks in power to walk away from. Do you really think the US Govt is going to say - OK, fair enough, let's pack this thing up - while China powers on full speed ahead? This is strategically significant technology that is potentially only the beginning of an exponential curve. And now that the technology to do this is open source, and scraping of public web content is fair use - do we really want to set up constraints so that the only people with the power of frontier LLMs are those with the power and money to do it in secret? Guess what - the NSA has all the training data they could ever want (https://nsa.gov1.info/utah-data-center/) - and I for one want to make sure that EVERYONE has access to the productivity increases made possible by generative AI, not just those with power and influence to do what they desire in secret.
I've got bad news for you, the only reason it's freely available to you right now is because they need you to get hooked in order to make it politically untenable to regulate it. Once the damage is done, they aren't going to share with you anymore and your model trained on an old dataset will get less and less relevant. This is a consolidation of power by the upper class, they aren't your friend.
 
Upvote
10 (12 / -2)
Leaving aside who should win or lose here, I have a fundamental problem with using copyright law in the way the NYT is using it. Training an AI on copyrighted works is no different (other than scale) from a person learning anything through reading copyrighted works. Copyright protects the expression of an idea, not the idea (or facts) itself. So if I train an AI on the idea of a tree by showing it thousands of copyrighted images of a tree, and the AI then draws a tree, it hasn't violated copyright law -- unless it simply takes one of those thousands of images and spits out a copy of it. Similarly, if I read 10 news stories about the Alaska Air door incident, I could then write my own news account, drawing from the facts of those stories -- even using quotes from eye witnesses -- without violating the copyright in those news stories.

Obviously, if the AI is simply spitting out copies of things it has "read", that would be a copyright violation, but that isn't what the NYT alleges here (though they do say that the AI stories are "virtually identical", but the issue then becomes how "virtual" are the AI-stories and that is paired with how thin a copyright there is in a news story).

As to the idea that DALL-E is copying an artist's "style" in the images it generates, again, this isn't a copyright issue. I can paint "in the style of" any number of artists without violating their copyright -- and assuming I'm not simply painting a slavish copy of one of their pieces. Where this would become a problem is if I were attempting to pass off my paintings as those of the original artist. However, that isn't a copyright issue either, it is a trademark issue (and fraud).

My concern here isn't who wins or loses, it's about the scope of copyright law and how it is being stretched to cover areas that it really shouldn't. Remember, copyright lasts for the life of the author plus 70 years. That is a long time to allow someone (or some corporation) to monopolize content. If we over-extend copyright we will stifle future creativity. The NYT and others have routes that will allow them to seek appropriate compensation from OpenAI without over-extending copyright law.
 
Upvote
-6 (2 / -8)
You are grossly oversimplifying the "stealing" part. We are talking about material that is made available to the public for free and that is being indexed by search engines, for example. What else you are allowed/not allowed to do with that content is the question here, and the legality of AI training is what is disputed.
How am I? Theft is pretty cut and dried. If I charge money to show a Disney picture, Disney's lawyers won't give me a pass or quibble. They will take my house as payment and throw me in jail. Why is this different because it's a tech company doing the stealing?
 
Upvote
7 (7 / 0)

Kjella

Ars Tribunus Militum
2,080
The problem with using this excuse is that it ultimately also precisely describes how many compression algorithms operate. Want to break copyright law? Just send me a jpg of it! The fair use exemption usually requires that some creative or transformative step has taken place. Except this isn't happening with LLMs - all they're doing is combining multiple bits of different existing copyrighted stuff, there's no "original" bit. Is copyright infringement still copyright infringement if it happens on an industrial scale? It's classic salami slicing. I'm reminded of the 1983 Superman III where they're only taking a fraction of a cent off everyones paycheck. Embezzelment isn't embezzelment in that case, right?
This "fraction of a cent" is more like taking a note from many different songs and putting together your own melody. The difference was a lot easier to understand when StyleGAN created human faces based on 70k photographs. If you generated 70k fake photographs you'd see that they weren't memorized but shared statistical properties, like if 32% of the real photos had blonde hair so did 32% of the generated photos. So many were old, so many had short hair, so many smiled and so on. You saw second order statistics like old people wear more glasses, women wear more make-up, men ofter go bald. That manifold of plausible faces is something transformative and way more than the 70k photos you started with. It's not a JPEG...
 
Upvote
2 (4 / -2)

LostFate

Ars Scholae Palatinae
972
This "fraction of a cent" is more like taking a note from many different songs and putting together your own melody. The difference was a lot easier to understand when StyleGAN created human faces based on 70k photographs. If you generated 70k fake photographs you'd see that they weren't memorized but shared statistical properties, like if 32% of the real photos had blonde hair so did 32% of the generated photos. So many were old, so many had short hair, so many smiled and so on. You saw second order statistics like old people wear more glasses, women wear more make-up, men ofter go bald. That manifold of plausible faces is something transformative and way more than the 70k photos you started with. It's not a JPEG...
Also, the way they generated the fraction of a cent was by taking OpenAI's revenue and dividing it by the content in the dataset. That presupposes that OpenAI should continue existing as a company and insulates it from the fact that its business model is just untenable.
 
Upvote
4 (4 / 0)
I've got bad news for you, the only reason it's freely available to you right now is because they need you to get hooked in order to make it politically untenable to regulate it. Once the damage is done, they aren't going to share with you anymore and your model trained on an old dataset will get less and less relevant. This is a consolidation of power by the upper class, they aren't your friend.
It is simple software that runs on commercially available hardware, with numerous competitors, open source models, and DIY models. Not sure how it could cost a lot of money.

I suppose the next version - GPT-5, ultraBing, etc. - will require a large initial outlay ($billions) to build, but that will also shrink in cost in the same way all tech does.
 
Upvote
-6 (0 / -6)

LostFate

Ars Scholae Palatinae
972
It is simple software that runs on commercially available hardware, with numerous competitors, open source models, and DIY models. Not sure how it could ever cost a lot of money.
Well, those open source models are certainly going to go away and you don't have access to the level of content to DIY your own and have it continue to be useful.
 
Upvote
3 (3 / 0)
Well, those open source models are certainly going to go away and you don't have access to the level of content to DIY your own and have it continue to be useful.
Llama is free to own, in the same way that the game Guardians of the Galaxy is free to own on Epic right now.

I guess you could argue it won't "be useful" anymore because you'll want the latest and greatest, but it's still useful in the sense that you will always be able to use it to help write an SQL query or translate Chinese or whatever.
 
Upvote
-4 (0 / -4)

ip_what

Ars Tribunus Angusticlavius
6,181
Leaving aside who should win or lose here, I have a fundamental problem with using copyright law in the way the NYT is using it. Training an AI on copyrighted works is no different (other than scale) from a person learning anything through reading copyrighted works. Copyright protects the expression of an idea, not the idea (or facts) itself. So if I train an AI on the idea of a tree by showing it thousands of copyrighted images of a tree, and the AI then draws a tree, it hasn't violated copyright law -- unless it simply takes one of those thousands of images and spits out a copy of it. Similarly, if I read 10 news stories about the Alaska Air door incident, I could then write my own news account, drawing from the facts of those stories -- even using quotes from eye witnesses -- without violating the copyright in those news stories.

Obviously, if the AI is simply spitting out copies of things it has "read", that would be a copyright violation, but that isn't what the NYT alleges here (though they do say that the AI stories are "virtually identical", but the issue then becomes how "virtual" are the AI-stories and that is paired with how thin a copyright there is in a news story).

As to the idea that DALL-E is copying an artist's "style" in the images it generates, again, this isn't a copyright issue. I can paint "in the style of" any number of artists without violating their copyright -- and assuming I'm not simply painting a slavish copy of one of their pieces. Where this would become a problem is if I were attempting to pass off my paintings as those of the original artist. However, that isn't a copyright issue either, it is a trademark issue (and fraud).

My concern here isn't who wins or loses, it's about the scope of copyright law and how it is being stretched to cover areas that it really shouldn't. Remember, copyright lasts for the life of the author plus 70 years. That is a long time to allow someone (or some corporation) to monopolize content. If we over-extend copyright we will stifle future creativity. The NYT and others have routes that will allow them to seek appropriate compensation from OpenAI without over-extending copyright law.

When you read a book, you don’t download a copy and store it in a database. That’s not how human learning works. It is how ML training operates though. And the thing about copyright is that it is very much intended to make the creation of unauthorized copies illegal.

Is it fair use though? Well, that’s the billion dollar question. I’ve said elsewhere in this thread that as a policy matter I don’t think it should be. But given existing precedent, I think you have to make an uncomfortable reach to get there.
 
Upvote
6 (8 / -2)
The problem with using this excuse is that it ultimately also precisely describes how many compression algorithms operate. Want to break copyright law? Just send me a jpg of it! The fair use exemption usually requires that some creative or transformative step has taken place. Except this isn't happening with LLMs - all they're doing is combining multiple bits of different existing copyrighted stuff, there's no "original" bit. Is copyright infringement still copyright infringement if it happens on an industrial scale? It's classic salami slicing. I'm reminded of the 1983 Superman III where they're only taking a fraction of a cent off everyones paycheck. Embezzelment isn't embezzelment in that case, right?
If I took 1,000 pictures and used Photoshop to slice, dice, bend, blend, blur, shade, and color them to make "Blade Runner in the style of da Vinci", nobody in a million years would say that I failed to "transform" things.

There's an argument to be made but "it's not original enough" is not it.
 
Upvote
0 (3 / -3)

GrimPloughman

Wise, Aged Ars Veteran
158
When you read a book, you don’t download a copy and store it in a database. That’s not how human learning works. It is how ML training operates though. And the thing about copyright is that it is very much intended to make the creation of unauthorized copies illegal.
When you read a book, you indeed download a copy and store it in a database (it's called an e-book), or store a paper copy of the book on your shelf.

When you have fed your ML model, you store the data in a weighted graph, and when you have read a book, you store the data in a neural network inside your brain.
 
Upvote
-4 (2 / -6)

Hmnhntr

Ars Scholae Palatinae
3,062
Tough.

.... or in a few more words: if your product depends on copyrighted material to make it useful then maybe you ought to be seeking permission from the copyright holders rather than just using it on a "deal with the consequences later" basis.
Yeah, not a complicated problem. My business can't operate without computers. Can I steal them from the local tech store?
 
Upvote
6 (7 / -1)