NYT to start searching deleted ChatGPT logs after beating OpenAI in court

Lunakki

Smack-Fu Master, in training
97
Subscriptor
The coverage of this case on Ars has done a lot of water carrying for OpenAI and a poor job of explaining the routine legal context and mechanics of litigation holds.

Please interview a dispassionate civil lawyer instead of repeating OpenAI's PR spin.
Agreed. I was initially taken in by OpenAI's representation of what's going on, because it does seem reasonable if you're unfamiliar with how confidential data is dealt with in the courts. This article even implies that lawyers have terrible security and leak sensitive data all the time. I'm sure it does happen sometimes, but surely it's the sort of thing lawyers would lose their license over, right? It's weird this case is being covered the way it is here.
 
Upvote
25 (29 / -4)

Mordac

Ars Tribunus Militum
2,787
Subscriptor++
Society has a vested interest in knowing how badly AI sucks and how dangerous it is. Having the companies who control it be the only ones who see this data at scale is a massive net negative for us. I'm glad to hear that there's going to be third parties with anonymized data that can be analyzed.
I honestly don't think much of society gives the tiniest of shits how good/bad/otherwise AI is, and couldn't possibly care less even if you tried to explain it to them. It's a tool that makes their lives easier, and they're accustomed to two decades of just clicking 'I agree' when installing apps and having ads mainlined into their eyes by search. This is just more of the same, playing with the latest thing.
 
Upvote
14 (15 / -1)

Megahedron

Smack-Fu Master, in training
91
A cynic might wonder whose payroll Mr Edelson is on. One also might wonder what kind of idiot is required to voluntarily upload plenty of sensitive personal data into a giant data-hoovering company.
Glad to see I'm not the only one side-eyeing how literally 60% of TFA was dedicated to the personal opinion of an unrelated lawyer who uses an AI-generated avatar on their Xitter account (where they post about their experiences using ChatGPT).
 
Upvote
30 (32 / -2)

graylshaped

Ars Legatus Legionis
67,692
Subscriptor++
I don't think it's in this article specifically, but the point of plaintiffs getting this dataset is so that they can look and see whether users are prompting the LLM to specifically output copyrighted material, which they will then use as an argument in the case (that OpenAI is enabling this infringement and that it's an inherent part of GPT's feature set, which I'm inclined to agree with). So they're trying to obtain a massive volume of data so that they can go fishing for evidence of something that users may be doing.
What are you smoking? They aren't fishing when that is exactly what their lawsuit alleges.

You seem to really be irritated that the judge ruled, very quickly and decisively, that your entire argument is irrelevant FUD.
 
Upvote
15 (21 / -6)

graylshaped

Ars Legatus Legionis
67,692
Subscriptor++
Agreed. I was initially taken in by OpenAI's representation of what's going on, because it does seem reasonable if you're unfamiliar with how confidential data is dealt with in the courts. This article even implies that lawyers have terrible security and leak sensitive data all the time. I'm sure it does happen sometimes, but surely it's the sort of thing lawyers would lose their license over, right? It's weird this case is being covered the way it is here.
Ars also reported the judge's take on OpenAI's malarkey.

Most Ars readers are smart enough to read between the lines.
 
Upvote
2 (9 / -7)

trekker473

Wise, Aged Ars Veteran
162
But see, if you write a book report and spit out passages of that book or another book verbatim in that report and represent it as your own material, then you get in trouble for plagiarism. OpenAI has the ability to spit out large sections of copyrighted material and represent it as its own material. That's the difference. That's why the book report analogy is flawed.
Yes, and in those cases where OpenAI spits out passages that go beyond what is commonly agreed upon to be fair use (like the way bits of articles are referenced in any number of online venues, including this site), they should be liable for plagiarism or even copyright infringement.
This still has nothing to do with the legality of using the legally obtained data for training.
 
Upvote
5 (9 / -4)

trekker473

Wise, Aged Ars Veteran
162
Also the book report is not a commercial product. If the child sells plagiarized book reports on the street corner, it's a copyright violation.
That is simply untrue. You can sell a book report so long as it does not copy verbatim large sections of the actual book. Otherwise book reviews, movie reviews, music reviews, and art reviews would all be in violation of copyright.
I really don't understand how we got to the ridiculous world where people are so eager to support large corporations' government-granted monopolies on the very foundations of our culture and shared knowledge.
 
Upvote
-13 (7 / -20)

graylshaped

Ars Legatus Legionis
67,692
Subscriptor++
I really don't understand how we got to the ridiculous world where people are so eager to support large corporations' government-granted monopolies on the very foundations of our culture and shared knowledge.
I really don't understand how we got to the ridiculous world where people think reductio ad absurdum is a good play here. The gulf between Fair Use and some mythical government granted monopoly on knowledge is substantial.
 
Upvote
14 (18 / -4)

SubWoofer2

Ars Tribunus Militum
2,550
Thanks Ars. I needed a laugh over my morning coffee and Mr Edelson provided it.

He further warned that ChatGPT users turning to OpenAI rival services like Anthropic's Claude or Google's Gemini could suggest that Wang's order is improperly influencing market forces, which also seems "crazy."
Judges' decisions and actions affect markets all the fucking time.

It doesn't require a verdict. Even a query to look at something will naturally result in capital taking flight. It's called risk management, and people with real money devote real effort to it.

Perhaps a qualified lawyer would have noticed this process and the careers, businesses, and share markets built on it.

I don't know who he is working for, but my goodness they are getting their money's worth.
 
Upvote
34 (34 / 0)

gerbal

Wise, Aged Ars Veteran
197
Subscriptor++
That is simply untrue. You can sell a book report so long as it does not copy verbatim large sections of the actual book. Otherwise book reviews, movie reviews, music reviews, and art reviews would all be in violation of copyright.
I really don't understand how we got to the ridiculous world where people are so eager to support large corporations' government-granted monopolies on the very foundations of our culture and shared knowledge.
In this analogy, the book report contains verbatim substantial sections outside the scope of fair use. You know, plagiarism.
 
Upvote
17 (19 / -2)
I get a strong "pot calls kettle black" vibe from this, considering how most AI companies appear to be digesting so much data without asking permission.
They're all doing it in some way. If they're not using user data for training future models, they're using it for evals to improve system prompts and to check quantized model performance.

The only way out is to run open weight LLMs on your own hardware with a vetted inference stack, but there are few people out there who know how. It's also a lot slower.
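For what it's worth, llama.cpp's `llama-server` is one example of such a vetted local inference stack. A hypothetical helper to assemble a localhost-only invocation might look like this (the flags shown are llama.cpp's standard ones, but check them against your installed version):

```python
def local_llm_command(model_path: str, ctx_size: int = 4096, port: int = 8080) -> list:
    """Build a llama.cpp `llama-server` command that serves a local GGUF
    model and binds only to loopback, so prompts never leave the machine."""
    return [
        "llama-server",
        "-m", model_path,        # path to the GGUF model weights
        "-c", str(ctx_size),     # context window size
        "--host", "127.0.0.1",   # loopback only: no remote access
        "--port", str(port),
    ]
```

Binding to 127.0.0.1 rather than 0.0.0.0 is the key privacy choice here: the model answers only processes on the same box, and nothing is logged by a third party.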
 
Upvote
4 (4 / 0)
This article seems majorly slanted towards OpenAI despite the fact that we know they’ve stolen tons of data for training, are accused of stealing current revenue-generating content, and are definitely NOT (in the long run) motivated by altruism or likely to protect user privacy at the expense of potential profits.
 
Upvote
11 (16 / -5)

justsomebytes

Wise, Aged Ars Veteran
178
Subscriptor
Getting every single ChatGPT log just to try and find something that's reproducing NYT content, given the likely numbers of chats that have nothing to do with NYT content, seems like an incredible fishing expedition?
Who said the NYT is getting every single chat log?
 
Upvote
17 (19 / -2)

cbfvn

Smack-Fu Master, in training
23
I'll be interested to see what the NYT actually finds. One of my clients jumped to Google's AI offerings due to the fear of retained chat logs being leaked. That's why I don't totally buy the "they're scared because they know" argument, though it may be true, since I've seen firsthand companies balk over this sudden change.

And props to the author for being pretty level on this issue. A lot of people want to read the cynicism from editorial staff, which I get, but it takes a lot of restraint to just report the situation as it is.
 
Upvote
-10 (4 / -14)

DonutBreak

Smack-Fu Master, in training
3
How is this different than any other time the court requires private data to be sent to them and safeguarded during a trial? These (anonymized) chat logs aren't really any different than other evidence sent during trials (financial documents, internal company memos). Discovery and evidence retention during that and the trial is a key part of the American judicial system.

OpenAI is saying "we want our data (to hide evidence and train our product), but no, you can't see it. That's private to the justice system, no matter what crime we may commit with it."

I also doubt that OpenAI has, as Edelson suggested, "some of the most sensitive data on the planet". They have SSNs, personal health data, financial records? If so, then holy fuck, OpenAI is more despicable than I knew.

Oh yes - and don't forget, OpenAI has a commercial version of this service they sell to enterprises (ChatGPT Enterprise). Certainly some of these businesses are using data with a variety of sensitivities/classifications... They also offer a government-based service now too. I wonder if this ruling means that NYT can access that data trove as well?
 
Upvote
-9 (4 / -13)
In this analogy, the book report contains verbatim substantial sections outside the scope of fair use. You know, plagiarism.
Plagiarism isn't the same thing as violating copyright.

Copying a protected thing without permission or a fair use exception is a copyright violation.
Plagiarism is passing off someone else's work as your own (i.e., without credit). You can plagiarise public domain works, and you can plagiarise the hell out of an answer someone has given you for your homework or coursework.
 
Upvote
9 (9 / 0)

SeanJW

Ars Legatus Legionis
11,769
Subscriptor++
I'd argue that any data you've willingly handed over to a third party isn't really private anymore.

You don't have to argue it. That's already been done, and the Supreme Court decided third-party doctrine already. It's not your data any more, though they may have a duty of care depending on other legislation (HIPAA, for example).
 
Upvote
16 (19 / -3)

SeanJW

Ars Legatus Legionis
11,769
Subscriptor++
Oh yes - and don't forget, OpenAI has a commercial version of this service they sell to enterprises (ChatGPT Enterprise). Certainly some of these businesses are using data with a variety of sensitivities/classifications... They also offer a government-based service now too. I wonder if this ruling means that NYT can access that data trove as well?

You can wonder, or you can read the fine article, which explicitly says it's excluded. Your choice, of course.
 
Upvote
15 (17 / -2)

markgo

Ars Praefectus
3,776
Subscriptor++
But "lawyers have notoriously been pretty bad about securing data," Edelson suggested, so "the idea that you've got a bunch of lawyers who are going to be doing whatever they are" with "some of the most sensitive data on the planet" and "they're the ones protecting it against hackers should make everyone uneasy."

This quote caused me to lose all respect for any views Edelson expresses.

ChatGPT chats are “some of the most sensitive data on the planet”? What planet? Compared to medical records? Sealed court files? (also handled by lawyers, note).

And as far as breaches go, tech firms have had a lot more than the New York Times.
 
Upvote
18 (23 / -5)
Which is the whole point to OpenAI screaming and whining and throwing a fit about this order.

They know damn well that huge chunks of verbatim copyrighted material were vomited up in users' faces. They know damn well they violated all kinds of copyright laws scraping that shit without permission or payment. They also know damn well that they're not making a fucking dime on it, because unlike a traditional business, which can scale up to become more efficient, AI doesn't scale AT ALL. You have to add exactly the same capacity, and costs, for every user. So the more users they proportionally get, the more money they have to proportionally spend.

Paying for copyright access is just another cost, and they don't have the funds to spare to lay that out, or to add to their costs per user when their database is accessed by each one.

I don't make any predictions about when AI will become a smoking crater, mostly because there seems to be an infinite supply of idiots who keep trying it out, and even throwing more money at it. But it's not sustainable as a business as it's currently implemented. And I'd guess sooner or later, after lots and lots more good money is thrown after bad, they'll figure out that the way they're doing it now is not a viable way to do business.

I don't know exactly how it could become viable, given the way it works, but that's well hidden behind a SEP field, buried in an IDGAS hole.

I'm surprised you think that AI doesn't scale well. The costs of creating a model are huge and (after creation) fixed. Running a single query on that model, which gives the marginal cost, is much lower. This blog post assumes that it costs 3Wh to run a single query, but warns that the value is likely an overestimate. Even so, 3Wh is only 10x the energy cost of a Google search query, and while that is a lot more expensive, I don't think people generally think of Google search as wasteful of energy, or as scaling badly.

I mean, sure, that's a worse scaling than say factory production of a physical item, where if you know you'll produce a lot of something you can spend more on tooling to get a lower marginal cost of production, but it's the same scaling relationship as any software - linear after creation - even if the marginal cost is a lot higher for querying an LLM than say the cost of uploading a copy of your game to your client via steam.
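That linear relationship is easy to make concrete. With purely illustrative numbers (not real OpenAI figures), the amortized cost per query is the one-time training cost spread over query volume plus a constant marginal inference cost, so it falls toward the marginal cost as volume grows:

```python
def cost_per_query(fixed_training_cost: float, marginal_cost: float, num_queries: float) -> float:
    """Amortized cost per query: the one-time training cost spread over
    total query volume, plus the constant per-query inference cost."""
    return fixed_training_cost / num_queries + marginal_cost

# With a made-up $100M training run and $0.003/query inference cost,
# the amortized cost approaches the marginal cost as volume grows:
for n in (1e8, 1e9, 1e10):
    print(f"{int(n):>14,} queries: ${cost_per_query(100e6, 0.003, n):.4f}/query")
```

The fixed cost dominates at low volume and vanishes at high volume, which is exactly the "linear after creation" scaling of any software product, just with a higher marginal term.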

It's also not that they don't have money to pay copyright holders for using their works.

Firstly, they don't want to pay them if they don't need to, so they'll of course do their best not to.

Second, the LLM isn't hitting up a database of works in a query; the answer is generated from the weights of the LLM, and there isn't a separate set of parameters corresponding to each copyrighted work it ingested in training. You also can't instruct an LLM not to break copyright, because the LLM doesn't have a copy of what was in the training data either. It's therefore not possible to tell whether a response has infringed copyright except by running a very slow fuzzy search against all the training data, which obliterates the usefulness of the LLM: you now need to both keep a copy of all the training data and access it all every time you run a query (yes, you can do clever things with indices, but there's still a lot of text to check). It's then even harder to determine whether that use was fair use and thus whether you'd need to pay the copyright holder at all.
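To sketch why that fuzzy search is so costly: detecting verbatim reproduction typically means shingling the output into word n-grams and testing them against the corpus. A toy illustration (the names and the 8-word shingle size are illustrative, not anything OpenAI actually runs):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Split text into overlapping word n-grams ("shingles"); 8 words is
    a common shingle size for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in any
    corpus document: 0.0 = no verbatim overlap, 1.0 = fully copied."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    corpus = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(out & corpus) / len(out)
```

Even with a clever inverted index over the shingles, the corpus in question is terabytes of text consulted on every query, and a high overlap score still wouldn't tell you whether the copying was fair use.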

Thirdly, they would have to have a licensing agreement with every copyright holder! You don't need to do this to have a public library where people can look up facts, assuming you acquired the books legally, which is good, because it would be impossible to actually achieve! LLM creators are hoping the same reliance on transformative use will hold for them, so that they don't need to license with all the copyright holders of their training data. Whether the use of the data is deemed fair use in the end is still an open question, however.
 
Upvote
-11 (8 / -19)

GrimR3

Smack-Fu Master, in training
96
Subscriptor++
How is this different than any other time the court requires private data to be sent to them and safeguarded during a trial? These (anonymized) chat logs aren't really any different than other evidence sent during trials (financial documents, internal company memos). Discovery and evidence retention during that and the trial is a key part of the American judicial system.

OpenAI is saying "we want our data (to hide evidence and train our product), but no, you can't see it. That's private to the justice system, no matter what crime we may commit with it."

I also doubt that OpenAI has, as Edelson suggested, "some of the most sensitive data on the planet". They have SSNs, personal health data, financial records? If so, then holy fuck, OpenAI is more despicable than I knew.
You may have never used ChatGPT, but I imagine that they have tax records, bank records, medical records, SSNs, and more in the chats that the court ordered preserved. Users can upload PDFs, zip files, and images and then ask questions about the uploaded data. Even if the uploaded data is not retained by this order (unclear to me if it is), any responses from the AI answering questions about the data or analysis will be retained under this preservation order.
 
Upvote
-4 (3 / -7)

bigjoec

Wise, Aged Ars Veteran
199
There's an easy way for OpenAI to protect its users' data if it wants to: fall on its sword. All it needs to do is stipulate to the facts that the NYT has alleged that are motivating this discovery (which already appear to center on the prevalence of users pulling up NYT content); then the judge couldn't/wouldn't allow the NYT to have the data. That's always and forever how you kneecap something you fear is a fishing expedition.

NYT's allegations of what's been happening appear well-founded, based on the judge granting the discovery. That OpenAI won't take the hit on that fact in order to protect its users' data is OpenAI's choice, not the NYT's. OpenAI is trying to blame the NYT, but it's the one with the relationship with users, it's the one that collected the data; if anyone bears any responsibility for what's going on here, it's OpenAI for failing to make its users sufficiently aware of this risk before they gave it their data.

The NYT needs the opportunity to pursue its rights to protect itself. OpenAI pretending the risks posed to its users here aren't its own fault is pure theater.
 
Upvote
13 (14 / -1)
You may have never used ChatGPT, but I imagine that they have tax records, bank records, medical records, SSNs, and more in the chats that the court ordered preserved. Users can upload PDFs, zip files, and images and then ask questions about the uploaded data. Even if the uploaded data is not retained by this order (unclear to me if it is), any responses from the AI answering questions about the data or analysis will be retained under this preservation order.
For sure, OpenAI will just provide them dirty, probably anonymized data while keeping the highly processed personal data of its users to monetize them, right?
 
Upvote
-5 (0 / -5)
Thirdly, they would have to have a licensing agreement with every copyright holder! You don't need to do this to have a public library where people can look up facts, assuming you acquired the books legally, which is good, because it would be impossible to actually achieve! LLM creators are hoping the same reliance on transformative use will hold for them, so that they don't need to license with all the copyright holders of their training data. Whether the use of the data is deemed fair use in the end is still an open question, however.
I suggest you go and talk to an actual librarian. Trust me, they’ll be delighted to tell you all about publisher licensing agreements. I hope you have several hours to spare.
 
Upvote
24 (24 / 0)