"Copyright today covers virtually every sort of human expression" and cannot be avoided.
Google's index is a new object. Not a trained Markov chain, mind you, but it is still creating data where none was before.

Google isn't creating new content based on what it crawls. Though it does do many illegal things, like stealing content from pages and showing it in search results - meaning people don't click on the link. Or the entire news thing. If this means Google gets punished and fined billions - well, it's about time.
The problem with using this excuse is that it also precisely describes how many compression algorithms operate. Want to break copyright law? Just send me a JPEG of it! The fair use exemption usually requires that some creative or transformative step has taken place. Except this isn't happening with LLMs - all they're doing is combining multiple bits of different existing copyrighted stuff; there's no "original" bit. Is copyright infringement still copyright infringement if it happens on an industrial scale? It's classic salami slicing. I'm reminded of the 1983 film Superman III, where they're only taking a fraction of a cent off everyone's paycheck. Embezzlement isn't embezzlement in that case, right?

My understanding of how LLMs work is that they do not ultimately store the original text at all - it's effectively a probability ranking. Right now, original text can be reproduced if that text is unique enough or the prompt is specific enough, certainly, and I can imagine OpenAI being forced to implement measures to prevent GPT from reproducing copyrighted texts verbatim. But reading a public website and building a probability model from its contents doesn't seem like it would be legally any different from Google reading public websites to build indexes and search snippets. As long as GPT's output isn't verbatim and is based on countless other sources as well, it'd be difficult to impossible to directly attribute it to any specific website.
Basically, based on my admittedly limited knowledge of precedent, I can see OpenAI being forced to prevent reproduction of copyrighted text but I'm not sure they can be stopped from using it for training at all. I'd love to be wrong on this, to be clear!
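To make the "probability ranking" description above concrete, here is a minimal editorial sketch of a next-token model built from simple counts. It is an illustration only: real LLMs learn neural-network weights rather than explicit count tables, and the corpus string is a made-up placeholder, not anything any vendor actually trains on.

```python
from collections import Counter, defaultdict
import random

# Hypothetical stand-in for crawled text; not a real training corpus.
corpus = "the cat sat on the mat the dog sat on the rug"
tokens = corpus.split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follows[prev][nxt] += 1

def next_word_distribution(word):
    """Return the 'probability ranking' over possible next words."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}

# Sampling from those statistics can still echo the source when a
# distribution is narrow enough, which is the memorization concern above.
word, out = "the", ["the"]
for _ in range(5):
    dist = next_word_distribution(word)
    if not dist:  # dead end (last word of the corpus)
        break
    word = random.choices(list(dist), weights=list(dist.values()))[0]
    out.append(word)
print(" ".join(out))
```

The stored object here is a table of statistics derived from the text, not the text itself; whether that distinction matters legally is what the thread is arguing about.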
A problem? For whom?

Without knowing the legalities, I would say it is already quite a different situation, in that with AI you can do this on a superhuman scale: you can copy the trained AI, modify and adapt it, forever... If Joe Bloggs used a one-month subscription to read and learn the entire NYT archive, and then made a million NYT-educated clones of himself that he rents out for all sorts of jobs, then that might be a problem, too.
I mean, the obvious answer is "pay people to use their work and let them opt-in for it". This isn't a hard problem, it just has a solution they don't like.
My business model of running a fast food restaurant where grocery stores give me ingredients for free and my workers don't get paid also doesn't work without me getting free access to things other people have to pay for, but you don't see me crying to the EU about it.
So basically the argument is a big whataboutism?

So Google Books goes too then, right?
Because that's the precedent OpenAI is alluding to when they say what they're doing is fair use. To be sure, they're overselling the amount of precedent, but it is also a serious argument.
It is possible to distinguish generative AI from Google Books, but you have to slice the salami pretty thin to do so, lean hard on brand-new precedent from the Warhol case, and probably start with more than a smidge of motivated reasoning.
This forum was pretty pro-Google and anti-Authors Guild when that all got litigated, and I remain curious whether the zeitgeist has shifted away from defending Google's open-information pitch, or whether tech communities genuinely support one but not the other. And if anyone is pro-Google Books and anti-OpenAI, are you comfortable with the apparent contradiction? Is it just results-based?
(FWIW, I think both what google did with books and what OpenAI is doing with their model ought to be infringement.)
There's also the evil of people who are already rich using it to screw over skilled workers. Language translators are being told that instead of standard rates, they can "do a faster job" just cleaning up machine-translated work at a reduced rate - work that usually takes more time, because you have to realize the autotranslation is junk, throw it out, and start from scratch, while getting paid half as much or less. One of the plans of the movie/TV studios was AI-generated scripts, so that writers would be paid just for editing, not at full writers' rates (with a byproduct being that nobody is now apprenticing in the writers'-room-to-showrunner pipeline).

But...but...but....
That means their ability to generate profit disappears!
Because, like all "good" start-ups, they basically only work because of VC capital being pumped in to shore up the money bleeding out.
If they have to start paying for all their stuff now, like some kind of "real" company, it won't work!
Oh wait...
You're telling me that's how it's supposed to go?
You're telling me that companies that can't find a balance between money in and money out should die?
Oooooooooooooooh!
XD
This is what I think too. What are the possible outcomes?

Minority opinion here. This push to have AI companies pay royalties to train will result in AI being owned by a handful of truly massive companies. There will be no small players. I think this is going to backfire.
So if I read a NYT article and learn something, that is well within my rights (and the intention of the content). If I read a NYT article and copy it and claim it as my own, that is plagiarism. You can use materials to inform and construct your own opinions, but you need to put those words or that content in your own voice; AI can't do that, as it's literally designed to just analyze and replicate already existing content.

I do not agree with most comments here.
There is a difference between training and distribution. Everyone reading the NYT's public content is allowed to learn from it and profit from what they have learned. If you did that, did you steal their content?
It's even better: you don't get paid for the time spent training the replacement.

AI and copyright is like when your employer asks you to train your replacement before they fire you.
That's simply not true; they don't just regurgitate - these frontier models can and do make creative connections between learned data. I regularly hand one unique problems that no one else has faced in exactly that form, and it generalizes and applies patterns it knows to help me solve my situation. I see the problem almost as the model having a photographic memory, so at times it repeats back exact training material rather than synthesizing it as expected.

So if I read a NYT article and learn something, that is well within my rights (and the intention of the content). If I read a NYT article and copy it and claim it as my own, that is plagiarism. You can use materials to inform and construct your own opinions, but you need to put those words or that content in your own voice; AI can't do that, as it's literally designed to just analyze and replicate already existing content.
Well, they're highly likely to rely on the education exemption in their actual court battles (specifically in the US). Their argument is, basically, that they're not training models on copyrighted material in order to reproduce that material, but to learn the styles associated with it and create new material in those styles. They'll argue that's the equivalent of showing a movie to a classroom of paying students to teach them about the stylistic makeup of that movie, so they can then go out and make more movies with that knowledge.

Fair use generally covers excerpts, not entire copyrighted works, or has exceptions for non-profit, educational, etc. use cases. It doesn't cover a for-profit company making use of works in their entirety.
Any decent programmer can write code leagues beyond the "AI".

And that's the real complaint against AI here: that the privileged creators might lose their jobs. Maybe they should learn to code!
AI and copyright is like when your employer asks you to train your replacement before they fire you.
So basically the argument is a big whataboutism?
Well, Google does it, so why can't we?
I've got bad news for you: the only reason it's freely available to you right now is that they need you to get hooked, in order to make it politically untenable to regulate it. Once the damage is done, they aren't going to share with you anymore, and your model trained on an old dataset will get less and less relevant. This is a consolidation of power by the upper class; they aren't your friend.

I don't see a way that we shut down LLM technology because of copyright concerns. This horse has left the barn - LLM capabilities are too valuable for folks in power to walk away from. Do you really think the US Govt is going to say, OK, fair enough, let's pack this thing up, while China powers on full speed ahead? This is strategically significant technology that is potentially only the beginning of an exponential curve. And now that the technology to do this is open source, and scraping of public web content is free use - do we really want to set up constraints so that the only people with the power of frontier LLMs are those with the power and money to do it in secret? Guess what: the NSA has all the training data they could ever want (https://nsa.gov1.info/utah-data-center/), and I for one want to make sure that EVERYONE has access to the productivity increases made possible by generative AI, not just those with the power and influence to do what they desire in secret.
How am I? Theft is pretty cut and dried. If I charge money to show a Disney picture, Disney's lawyers won't give me a pass or quibble. They will take my house as payment and throw me in jail. Why is this different because it's a tech company doing the stealing?

You are grossly oversimplifying the "stealing" part. We are talking about material that is made available to the public for free and that is being indexed by search engines, for example. What else are you allowed or not allowed to do with that content? That is the question here, and the legality of AI training is what is disputed.
This "fraction of a cent" is more like taking a note from many different songs and putting together your own melody. The difference was a lot easier to understand when StyleGAN created human faces based on 70k photographs. If you generated 70k fake photographs you'd see that they weren't memorized but shared statistical properties, like if 32% of the real photos had blonde hair so did 32% of the generated photos. So many were old, so many had short hair, so many smiled and so on. You saw second order statistics like old people wear more glasses, women wear more make-up, men ofter go bald. That manifold of plausible faces is something transformative and way more than the 70k photos you started with. It's not a JPEG...The problem with using this excuse is that it ultimately also precisely describes how many compression algorithms operate. Want to break copyright law? Just send me a jpg of it! The fair use exemption usually requires that some creative or transformative step has taken place. Except this isn't happening with LLMs - all they're doing is combining multiple bits of different existing copyrighted stuff, there's no "original" bit. Is copyright infringement still copyright infringement if it happens on an industrial scale? It's classic salami slicing. I'm reminded of the 1983 Superman III where they're only taking a fraction of a cent off everyones paycheck. Embezzelment isn't embezzelment in that case, right?
Also, the way they generated the fraction of a cent was by taking OpenAI's revenue and dividing it by the number of items in the dataset. That presupposes that OpenAI should continue existing as a company, and insulates it from the fact that its business model is just untenable.

This "fraction of a cent" is more like taking a note from many different songs and putting together your own melody. The difference was a lot easier to understand when StyleGAN created human faces from 70k photographs. If you generated 70k fake photographs, you'd see that they weren't memorized but shared statistical properties: if 32% of the real photos had blonde hair, so did 32% of the generated photos. So many were old, so many had short hair, so many smiled, and so on. You also saw second-order statistics: old people wear more glasses, women wear more make-up, men more often go bald. That manifold of plausible faces is something transformative, and way more than the 70k photos you started with. It's not a JPEG...
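For what it's worth, the per-item arithmetic being criticized looks roughly like the sketch below. The revenue and dataset-size figures are invented placeholders, not OpenAI's actual numbers; only the shape of the calculation matters.

```python
# Hypothetical figures for illustration only.
annual_revenue_usd = 1_000_000_000        # pretend: $1B/year in revenue
items_in_training_set = 500_000_000_000   # pretend: 500B documents/snippets

per_item_payout = annual_revenue_usd / items_in_training_set
print(f"${per_item_payout:.6f} per item")  # $0.002000 -> about a fifth of a cent

# The baked-in assumption: dividing *current* revenue across the dataset treats
# today's business as the right yardstick, which is exactly the presupposition
# the reply above objects to.
```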
It is simple software that runs on commercially available hardware, with numerous competitors, open source models, and DIY models. Not sure how it could cost a lot of money.

I've got bad news for you: the only reason it's freely available to you right now is that they need you to get hooked, in order to make it politically untenable to regulate it. Once the damage is done, they aren't going to share with you anymore, and your model trained on an old dataset will get less and less relevant. This is a consolidation of power by the upper class; they aren't your friend.
Well, those open source models are certainly going to go away, and you won't have access to the level of content needed to DIY your own and have it continue to be useful.

It is simple software that runs on commercially available hardware, with numerous competitors, open source models, and DIY models. Not sure how it could ever cost a lot of money.
Llama is free to own in the same way that the game Guardians of the Galaxy is free to own on Epic right now.

Well, those open source models are certainly going to go away, and you won't have access to the level of content needed to DIY your own and have it continue to be useful.
Leaving aside who should win or lose here, I have a fundamental problem with using copyright law in the way the NYT is using it. Training an AI on copyrighted works is no different (other than in scale) from a person learning something by reading copyrighted works. Copyright protects the expression of an idea, not the idea (or the facts) itself. So if I train an AI on the idea of a tree by showing it thousands of copyrighted images of trees, and the AI then draws a tree, it hasn't violated copyright law -- unless it simply takes one of those thousands of images and spits out a copy of it. Similarly, if I read 10 news stories about the Alaska Air door incident, I could then write my own news account, drawing on the facts of those stories -- even using quotes from eyewitnesses -- without violating the copyright in those news stories.
Obviously, if the AI is simply spitting out copies of things it has "read", that would be a copyright violation, but that isn't what the NYT alleges here (though they do say that the AI stories are "virtually identical"; the issue then becomes how "virtual" the AI stories really are, paired with the question of how thin the copyright in a news story is).
As to the idea that DALL-E is copying an artist's "style" in the images it generates, again, this isn't a copyright issue. I can paint "in the style of" any number of artists without violating their copyright -- and assuming I'm not simply painting a slavish copy of one of their pieces. Where this would become a problem is if I were attempting to pass off my paintings as those of the original artist. However, that isn't a copyright issue either, it is a trademark issue (and fraud).
My concern here isn't who wins or loses; it's about the scope of copyright law and how it is being stretched to cover areas that it really shouldn't. Remember, copyright lasts for the life of the author plus 70 years. That is a long time to allow someone (or some corporation) to monopolize content. If we over-extend copyright, we will stifle future creativity. The NYT and others have routes that will allow them to seek appropriate compensation from OpenAI without over-extending copyright law.
If I took 1,000 pictures and used Photoshop to slice, dice, bend, blend, blur, shade, color them and so forth to make "Blade Runner in the style of da Vinci", nobody in a million years would say that I failed to "transform" things.

The problem with using this excuse is that it also precisely describes how many compression algorithms operate. Want to break copyright law? Just send me a JPEG of it! The fair use exemption usually requires that some creative or transformative step has taken place. Except this isn't happening with LLMs - all they're doing is combining multiple bits of different existing copyrighted stuff; there's no "original" bit. Is copyright infringement still copyright infringement if it happens on an industrial scale? It's classic salami slicing. I'm reminded of the 1983 film Superman III, where they're only taking a fraction of a cent off everyone's paycheck. Embezzlement isn't embezzlement in that case, right?
When you read a book, you do indeed download a copy and store it in a database (it's called an e-book), or you store a paper copy of the book on your shelf.

When you read a book, you don't download a copy and store it in a database. That's not how human learning works. It is how ML training operates, though. And the thing about copyright is that it is very much intended to make the creation of unauthorized copies illegal.
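As a rough editorial sketch of the distinction the two comments above are arguing over, compare literally storing a text with folding it into a small, fixed-size set of parameters. This is a toy illustration with invented data, not a description of how any real e-book reader or training pipeline works.

```python
import numpy as np

book = "call me ishmael some years ago never mind how long precisely"  # placeholder text

# Storing a copy: the text itself is kept and can be returned verbatim.
database = [book]
print(database[0])

# "Training": fold the text into 16 fixed-size weights, then discard it.
# Only statistical traces remain; the original wording cannot, in general,
# be read back out of 16 numbers.
weights = np.zeros(16)
for word in book.split():
    weights[hash(word) % 16] += 1.0
weights /= weights.sum()
print(weights)

# Whether such a lossy statistical summary still counts as an unauthorized
# "copy" is the legal question the thread is debating, not something a sketch
# can settle -- and large models demonstrably can regurgitate some passages.
```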
Yeah, not a complicated problem. My business can't operate without computers. Can I steal them from the local tech store?

Tough.
.... or in a few more words: if your product depends on copyrighted material to make it useful then maybe you ought to be seeking permission from the copyright holders rather than just using it on a "deal with the consequences later" basis.