OpenAI says it’s “impossible” to create useful AI models without copyrighted material

Status
You're currently viewing only deadman12-4's posts. Click here to go back to viewing the entire thread.
This runs completely counter to the "move fast and break things" philosophy they all believe in.

The whole point is to "disrupt" social norms and/or the regulatory structure to profit before regulations can catch up with them. LLMs may have moved too slowly though - creator backlash has been quick and LLMs are still a novel toy rather than a necessary tool for life. It would be trivially easy to tell these companies "So what? Pay them." and move on. They haven't actually disrupted enough to make it painful for us to do that yet (mostly because LLMs look impressive but actually have fewer use cases than people think when you start trying to use them).
Them waxing on about "disrupting" is just libertarian code. Their real philosophy is make money now, consequences be damned.
Besides, consequences are usually for people who don't have money. Especially in America. For example, the CEO of Microsoft will never see a single personal penalty for this, no matter what happens.
 
Upvote 128 (136 / -8)
My fear is that the way ChatGPT crawls websites and pulls data from them will be seen as no technically different from, say, Google indexing, which has always been opt-out and indexed basically everything publicly available on a website. Ethically I agree it's a completely different matter given how ChatGPT is actually using the material, but on a legal level I'm worried it'll be hard to stop them.
Google isn't creating new content based on what it crawls. Though it does do many illegal things, like stealing content from pages and showing it in search results, meaning people don't click on the link. Or the entire news thing. If this means Google gets punished and fined billions, well, it's about time.
 
Upvote 20 (31 / -11)
NYT owner has a market cap of around 7 billion USD. MSFT should just buy them with some of the loose change in couch cushions in Redmond. Heck, MSFT could buy out every newspaper and book publisher and not break a sweat.
Why bother? Google and Facebook have demonstrated that stealing every news article while not paying for them works.
 
Upvote -3 (14 / -17)
The genie is already out of the bottle, it’s too late to stop it, in five years these arguments will all look like Metallica complaining about Napster
Do you also think nothing can be done about gun violence? Guns exist, the genie is out of the bottle. Too bad there is "nothing" that can be done to prevent gun violence.
 
Upvote 38 (50 / -12)
Using it for training is fair use, sure, but when it regurgitates that training data, that's infringement.
Can I use your car to drive to work then give it back when I'm done? Same idea - I'm taking something of value from you.
The idea is that artists are willing to SELL their works. Copyright exists to create a marketplace of works and ideas. AI companies training on these works are stealing them, leaving the artists unable to sell them.

Then it will turn around and spit out the work for free, though now it'll be in a "legally distinct but still identical" form. Even if it doesn't, they still took your work and are profiting from it without compensating you.
You're not using all that money in your bank, so I'll take some, and if you notice and sue, I'll give it back. However, I'll keep all the profit I made from using your initial "investment".
Same thing again.
 
Upvote 15 (35 / -20)
ChatGPT's training corpus is on the order of 10^12 words. OpenAI's annual revenue is on the order of 10^9 dollars. That works out to a tenth of a cent per word, so roughly $1k for your million words. This, of course, neglects any kind of operating costs, which I've seen estimated at 10^7-10^8 dollars/year just to support the infrastructure for ChatGPT.

So, even if they do end up paying you, it's not going to end up being some huge windfall. In fact, all the big players (NYT, etc) are gonna take all the biggest slices of cake, and you'll get to have the crumbs. Possibly it will be enough to buy you and a date a nice dinner. Enjoy.
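The back-of-envelope math above can be sketched as follows. The corpus and revenue figures are the commenter's rough order-of-magnitude assumptions, not official numbers:

```python
# Hypothetical per-word royalty, using the commenter's rough figures.
corpus_words = 1e12        # assumed training corpus size, in words
annual_revenue = 1e9       # assumed OpenAI annual revenue, in USD

# If the entire revenue were split evenly across every training word:
per_word = annual_revenue / corpus_words
print(f"${per_word:.4f} per word")          # $0.0010, i.e. a tenth of a cent

# Payout for an author who contributed a million words:
author_words = 1_000_000
print(f"${per_word * author_words:,.0f}")   # $1,000
```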
NYT doesn't want to get paid. Their terms are that anything trained on their data should be deleted. It's not for sale.
This is also their right to demand: AI that has stolen their data must be deleted, because you cannot extract their copyrighted work from it. I kinda hope they succeed. That's the ONLY way companies will learn. OpenAI literally gets deleted and they have to start over.
 
Upvote 26 (32 / -6)
IP law does not grant you any protections as a human consumer of content. It grants a monopoly to copyright holders to create incentives for creating content. One could definitely argue that blocking AI training is contrary to the goal of incentivizing content creation.
Are you getting paid to post this bullshit?
 
Upvote 14 (20 / -6)
Licensing training data for input is a non-starter. If that is required it is the end of LLMs since there will be no accessible, broad training data. Just consider training off from the Internet -- 133 million active domains. That's 133 million licenses you would need to acquire. That's just never going to happen.

Sure you could declare some statutory royalty for training but places like the NYT will fight that forever. And statutory royalties worked out so, so well in the music industry where the labels run off with 90% of it. For sure the same thing would happen with training royalties and a few firms will run off with the lion's share of the money collected.
You are literally saying that since it's hard/impossible without breaking the law, then it's OK. The ends justify the means, eh?
 
Upvote 26 (32 / -6)
I initially had the same gut response as many others here, but, the same logic applies to humans.
We grow up exposed to all sorts of copyrighted content and our culture and taste changes over the course of our lives as a result of this content.
The content any creative human generates is undoubtedly influenced by this exposure to copyrighted content, and we acknowledge this in our (extremely generous, IMHO) copyright laws.
Why is the content generated by a computer exposed to the same material (albeit likely a vastly larger subset) any different?
Because one of them is a human. The other is a business. Why do people pretend they are the same?
Humans eat food. Dogs eat food. Yet dogs don't have the legal rights of humans. This is the same. Just because both an AI and a human can "consume" a copyrighted work doesn't mean they are the same thing.

But let's go deeper, because "consuming the work" is totally NOT the same thing. One is reading it; the other is copying it and using it to generate profit. Copyright law EXPRESSLY states that you cannot use a copyrighted work to generate profit without permission. However, that's hard if you gotta pay for it. Hence this litany of schadenfreude in the comments.
 
Upvote 17 (28 / -11)
I consider using it for training to be fair use. Reproducing large portions verbatim should be licensed or eliminated from the models.
So if I steal textbooks and learn from them to pass the class, that's OK too? If I take the source code for some software you sell and tweak it enough to be legally distinct, that's OK too? If I read your diary and publish a book about a character whose life is identical to yours (but with an "e" on the end of the name), that's OK with you?

That's what OpenAI does. In the last case, the diary represents a paywall, because they often don't pay for subscriptions but hack their way around paywalls.
 
Upvote 5 (17 / -12)
It is not domain holders that you need licenses from. It is all content creators. Anyone that produces content (like this text), has copyright on that content. There is just no way to even find someone to request a license from for most content. Say you scrape Facebook content. That's like 2 billion anonymous users. How do you license that? How how would you even know if the posters actually own the content? It is not doable, I'd say, and the problem is much, much larger than finding 133 million domain holders.
Just because it's hard doesn't mean the laws can be ignored. That just means the business model is a failure. Hence the comments about organized crime.

I never said that they could. However, obviously, they can't give creators more money than they have. So, creators can either accept the pittance, or take their ball and go home. My point is that if you're a creator, and you're hoping to actually make reasonable money on such a deal, well, that pool of money doesn't look nearly as big when you consider how many swimmers there are.
I prefer the terms of the NY Times lawsuit: delete all AI trained on sets that stole their data. OpenAI deserves NOTHING. They have no bargaining position. Why should penalties be limited based upon their current income?
Make the investors pay. Why aren't they liable? Bankrupt Microsoft to pay for the laws they knew they were breaking.

Oh wait, now I'm talking about consequences for our billionaire overlords. I've gone too far.
 
Upvote 17 (22 / -5)
You know, anyone can read a book and write an analysis of it. They can even publish the analysis or sell access to it in its own right, without paying the author of the book they analyzed. That’s fair use, right?

And later, if someone uses that analysis to produce a new product or service, the author of the book that was analyzed doesn’t get paid, right?

Training a model is like analyzing many books in aggregate. The resulting analysis does not contain the text of the original books. It’s definitely fair use to produce a private analysis of copyrighted works.

Now imagine you are OpenAI. You have this big analysis you’ve done, and you can take user prompts and use them along with it to generate text. You aren’t selling the analysis (model) you built. You aren’t selling anything your users produce with the model. You sell access to the tool.

If the tool generates infringing content, it does so in response to user input. In fact, the tool’s output is partially derived from the user’s input. OpenAI doesn’t put infringing content out there in public.

The examples of being able to craft prompts in order to extract a facsimile of training data? That’s being engineered out more and more each day.

So it seems to me that under fair use, these companies are permitted to train on whatever they like, and people whose content is used in training have no way of getting paid.

And liability for generated infringing content will likely rest with users of these tools.
Once again, AI is not human and these are in no way alike, which you know.
As well, this one AI can then be sold as millions/billions of copies, which magnifies the problem.

One more time: a computer is not a person. Everyone knows this. We just have a lot of bullshitters (or imbeciles) who like to pretend there is no difference. Ends justify the means. That, and some of these posters must stand to profit from this massive theft.
 
Upvote 9 (20 / -11)
That is not how OpenAI works. Each stream from Spotify is at least one copy (probably several, if you count all the caches and stuff that you don't see).

You can't attribute the output from Chat GPT to any single source.

Open AI is using publicly available material to train a model in order to generate completely NEW content.

What NYT is arguing, is that using the publicly available content for AI training, say instead of indexing for search, or learning from or commenting on, should be illegal.

We're in new territory here, and I do not see this as open and shut, like most people here seem to do.
It's been exhaustively proven that they are not using only publicly available material. It's been proven they've stolen millions of NON-PUBLICLY available stories and images. And that's only the tip of the iceberg; that's just the very small 0.0001% of cases which had enough money/skill to prove it.

We all know how training sets work. It doesn't have to be proven ad nauseam. The story is about the head of OpenAI admitting they are stealing everything. Not sure what fantasy you are living in. The only thing I can imagine is that you have financial motivation to muddy the waters and trick people.
 
Upvote 15 (24 / -9)
What exactly is the problem that you see magnified here? It is not the copying as there are no millions of copies distributed. My feeling is that you see a problem with AI more than the copying. That people can be replaced by machines.
I too see issues with this and I do think new legal and social frameworks might be required to handle this, but I also see great potential for all of humanity. I only stand to profit in the sense that I do believe that the AI tools can be used for the greater good. You seem to have a much darker view of this.
AI steals the works and learns from them. Then millions of copies of the AI are sold. Since the AI stole that material and is then sold, that's millions of times the works are being sold without permission.
Also, I'm not sure about you trying to insinuate I have other "issues about AI". Every time the AI profits from a theft, that's basically another theft.
If you steal a billion copyrighted works, and then make a million sales with your AI? That's a quadrillion copyright infringements. It gets magnified and out of control instantly. And what about not just selling the AIs, but all the stuff each copy does that generates income on its own?

This entire framework is built on theft and abuse. I'm reminded of someone mentioning colonization: it's taking from those who cannot defend themselves. However, you say there "are issues" but then think that potential unnamed future benefits mean we must allow AI to do what it wants. And by "let AI do what it wants" you really mean let the rich tech companies do what they want. If people get paid a few pennies, that's OK too, but not required in your worldview.

I like to think we're a country of laws and that laws mean something. If you cannot pay full market price for EVERY SINGLE THING stolen, delete the AI and throw the execs in jail. Because that's EXACTLY what happens to individuals who steal copyrighted stuff from large companies like MS and Google that you are so eager to defend. But let's take it a step further.

The laws were already broken. Full market price isn't good enough; there are penalties now that they broke the law. Pay full market price and additional penalties to every rights holder, delete all training sets and all AI, and the execs go to jail. Investors are on the hook for any money the company cannot pay. Let's treat them like normal people are treated.

I'm fine with AI if it's done legally. Let them start over and do it right.
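As a quick sanity check on the "quadrillion" figure above, using the commenter's purely hypothetical counts and counting every work-per-sale pair as one infringement:

```python
# Hypothetical counts from the comment, not real data.
works_used = 10**9      # "a billion copyrighted works" in the training set
copies_sold = 10**6     # "a million sales" of the resulting AI

# If every (work, sale) pair counted as one infringement:
infringements = works_used * copies_sold
print(infringements)              # 1000000000000000
print(infringements == 10**15)    # True: one quadrillion
```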
 
Upvote -6 (10 / -16)
You are grossly over simplifying the "stealing" part. We are talking about material that is made available to the public for free and that are being indexed by search engines, for example. What else are you allowed/not allowed to do with that content, that is the question here and the legality of AI training is what is disputed.
How am I? Theft is pretty cut and dried. If I charge money to show a Disney picture, Disney's lawyers won't give me a pass or quibble. They will take my house as payment and throw me in jail. Why is this different when it's a tech company doing the stealing?
 
Upvote 7 (7 / 0)