Ars OpenForum

lucubratory

How is this different than any other time the court requires private data to be sent to them and safeguarded during a trial?

Scale and risk of harm: these are hundreds of millions of user logs. Imagine the NYT sues Google and alleges "Your AI search summaries are violations of our copyright, and as a result your AI outputs are being used to bypass our paywall by your users." When Google gives them access to the user search data that Google stores to check for themselves, the NYT doesn't find anything. I don't find this particularly troubling from a privacy perspective, because Google was already storing that data & it's a relatively minor increase in risk for the NYT's lawyers to be able to search it too.

But now, having found nothing in the regular logs that Google already keeps, the NYT alleges "Users would know it's illegal to violate copyright by searching for paywalled NYT articles to read our AI-regurgitated articles from them, so they would have done those searches in Incognito mode to prevent the evidence of their using the system to bypass paywalls from showing up in discovery. You need to turn over to us all Incognito mode searches and associated AI summaries."

Now Google is upset, & that is a massive privacy risk, because Google wasn't storing that data. If they comply they now have to build a whole system for storing data that they promised their customers they wouldn't store (and quickly too, under court order), then provide that data to the NYT's lawyers. Storing that data at all creates a massive risk of a data breach, and it's likely to be particularly sensitive data because it's the things people are searching for in Incognito mode. It also means they have to tell everyone using Incognito mode "Hey, you know how we said we don't keep data on what you search for? Sorry, we're under court order, so we need to start storing it", which is going to inflict potentially permanent reputational damage on Incognito mode as a privacy-preserving tool.
This is all to help NYT with their extremely speculative argument that users would both use these services to bypass the NYT paywall, and that users would be legally sophisticated enough to hide all infringing conduct in Incognito mode to prevent it being found in discovery in a future copyright trial, so the NYT must have access to everyone's incognito searches in order to see the data that would prove their theory correct. That is bonkers.

These (anonymized) chat logs aren't really any different than other evidence sent during trials (financial documents, internal company memos).

First of all, "anonymised" is fake. If the dataset does leak, nothing will stop those chat logs from being associated with specific people; privacy researchers have shown, in papers going back decades, that de-anonymisation is possible in nearly every realistic case. "Anonymisation" stops technically unsophisticated NYT lawyers from seeing a name at the top of the chat; it won't actually protect individual users' privacy if the dataset leaks.
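To make the linkage-attack point concrete, here's a toy sketch. All field names and records are hypothetical; the point is only that stripping names still leaves quasi-identifiers (location, habits, timing) that can be joined against an outside dataset to recover identities:

```python
# Toy linkage attack: "anonymised" logs still carry quasi-identifiers that
# an attacker can join against auxiliary data (social media, data brokers).
# All records and field names below are invented for illustration.

# Leaked "anonymised" chat logs: user IDs replaced with opaque pseudonyms.
leaked_logs = [
    {"pseudonym": "a9f3", "city": "Boise",  "active_hour": 23, "topic": "divorce"},
    {"pseudonym": "7c21", "city": "Austin", "active_hour": 9,  "topic": "recipes"},
]

# Auxiliary dataset the attacker already has, with real names attached.
public_profiles = [
    {"name": "Alice", "city": "Boise",  "active_hour": 23},
    {"name": "Bob",   "city": "Austin", "active_hour": 9},
]

def reidentify(logs, profiles):
    """Match pseudonymous records to named profiles on shared quasi-identifiers."""
    matches = {}
    for log in logs:
        candidates = [p for p in profiles
                      if p["city"] == log["city"]
                      and p["active_hour"] == log["active_hour"]]
        if len(candidates) == 1:  # a unique combination re-identifies the user
            matches[log["pseudonym"]] = candidates[0]["name"]
    return matches

print(reidentify(leaked_logs, public_profiles))
# {'a9f3': 'Alice', '7c21': 'Bob'}
```

With only two quasi-identifiers each pseudonym maps to exactly one person, and every "anonymised" chat topic is now attached to a name. Real attacks work the same way, just with richer auxiliary data.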

OpenAI is saying "we want our data (to hide evidence and train our product), but no, you can't see it. That's private to the justice system, no matter what crime we may commit with it."

No, again, this is explicitly data that OpenAI is not training on & cannot train on, because they are deleting it regularly at the request of users. All of the data on user queries & responses that OpenAI does store, for training & other purposes, has already been "given" to the NYT, and they didn't find anything that would support the insane idea that users are typing partial articles into ChatGPT in order to bypass the NYT paywall. The data the NYT is demanding now, & that the court is granting access to, is the logs from OpenAI's equivalent of Incognito mode: a "temporary chat" where they promise they won't train on your data & they'll delete it within 30 days.

I also doubt that OpenAI has "as Edelson suggested", "some of the most sensitive data on the planet". They have SSNs, personal health data, financial records?

"I don't know if I love my wife anymore", "My son is smoking meth & he's dropped out of school, I feel like I've failed as a parent", "My wife is undocumented and I'm worried about the way things are going, what can I do to protect her?", "My state has made it illegal for my transgender child to continue to receive medical care and I'm worried about being prosecuted for child abuse if I try to move somewhere more accepting. How the hell do I uproot my entire life without it looking suspicious or a neighbour snitching on me?", "I've been feeling suicidal. I can't let anyone know because of my job, but I need someone to talk to", "My husband always yelled at me but now he's started hitting me as well. I don't know what to do", "I have a crush on my yoga teacher and I feel guilty because I love my husband" and a million more like it.

Whether you like it or not, people are talking about these topics with ChatGPT, and if you think that's bad I would submit that the way to stop it is not to risk a massive privacy leak of everyone's sensitive chats; it's to try to make these people feel comfortable talking about these topics with humans IRL.
 