> The coverage of this case on Ars has done a lot of water carrying for OpenAI and a poor job of explaining the routine legal context and litigation holds. Please interview a dispassionate civil lawyer instead of repeating OpenAI's PR spin.

Agreed. I was initially taken in by OpenAI's representation of what's going on, because it does seem reasonable if you're unfamiliar with how confidential data is dealt with in the courts. This article even implies that lawyers have terrible security and leak sensitive data all the time. I'm sure it does happen sometimes, but surely it's the sort of thing lawyers would lose their license over, right? It's weird this case is being covered the way it is here.
> Society has a vested interest in knowing how badly AI sucks and how dangerous it is. Having the companies who control it be the only ones who see this data at scale is a massive net negative for us. I'm glad to hear that there's going to be third parties with anonymized data that can be analyzed.

I honestly don't think much of society gives the tiniest of shits how good/bad/otherwise AI is, and couldn't possibly care less even if you tried to explain it to them. It's a tool that makes their lives easier, and they're accustomed to two decades of just clicking "I agree" when installing apps and having ads mainlined into their eyes by search. This is just more of the same, playing with the latest thing.
> A cynic might wonder whose payroll Mr Edelson is on. One also might wonder at what kind of an idiot is required to voluntarily upload plenty of sensitive personal data into a giant data-hoovering company.

Glad to see I'm not the only one side-eyeing how literally 60% of TFA was dedicated to the personal opinion of an unrelated lawyer who uses an AI-generated avatar on their Xitter account (where they post about their experiences using ChatGPT).
> I don't know exactly how it could become viable, given the way it works, but that's well hidden behind a SEP field, buried in an IDGAS hole.

I'm nominating this one for the Fatesrider Post Hall of Fame.
> this is certainly a complicated mess from a privacy standpoint

I'd argue that any data you've willingly handed over to a third party isn't really private anymore.
> I don't think it's in this article specifically, but the point of plaintiffs getting this dataset is so that they can look and see whether users are prompting the LLM to specifically output copyrighted material, which they will then use as an argument in the case (that OpenAI is enabling this infringement and that it's an inherent part of GPT's feature set, which I'm inclined to agree with). So they're trying to obtain a massive volume of data so that they can go fishing for evidence of something that users may be doing.

What are you smoking? They aren't fishing when that is exactly what their lawsuit alleges.
> I just.. really don't think people are using AI to get around NYT paywall.

Really? I literally just watched a video earlier today where a guy claimed he uses Grok for all his searches because it's up to date on current events.
> Agreed. I was initially taken in by OpenAI's representation of what's going on, because it does seem reasonable if you're unfamiliar with how confidential data is dealt with in the courts. This article even implies that lawyers have terrible security and leak sensitive data all the time. I'm sure it does happen sometimes, but surely it's the sort of thing lawyers would lose their license over, right? It's weird this case is being covered the way it is here.

Ars also reported the judge's take on OpenAI's malarkey.
> But see, if you write a book report and spit out passages of that book or another book verbatim in that report and represent it as your own material, then you get in trouble for plagiarism. OpenAI has the ability to spit out large sections of copyrighted material and represent it as its own material. That's the difference. That's why the book report analogy is flawed.

Yes, and in those cases where OpenAI spits out passages that go beyond what is commonly agreed upon to be fair use (like the way bits of articles are referenced in any number of online venues, including this site), they should be liable for plagiarism or even copyright infringement.
> Also the book report is not a commercial product. If the child sells plagiarized book reports on the street corner, it's a copyright violation.

That is simply untrue. You can sell a book report so long as it does not copy verbatim large sections of the actual book. Otherwise book reviews, movie reviews, music reviews, and art reviews would all be in violation of copyright.
> I really don't understand how we got to the ridiculous world where people are so eager to support large corporations' government-granted monopolies on the very foundations of our culture and shared knowledge.

I really don't understand how we got to the ridiculous world where people think reductio ad absurdum is a good play here. The gulf between fair use and some mythical government-granted monopoly on knowledge is substantial.
> alleging that ChatGPT users are likely to delete chats where they attempted to use the chatbot to skirt paywalls to access news content.

I mean, maybe? But this seems incredibly far-fetched and unlikely.
> He further warned that ChatGPT users turning to OpenAI rival services like Anthropic's Claude or Google's Gemini could suggest that Wang's order is improperly influencing market forces, which also seems "crazy."

Judges' decisions and actions affect markets all the fucking time.
> That is simply untrue. You can sell a book report so long as it does not copy verbatim large sections of the actual book. Otherwise book reviews, movie reviews, music reviews, and art reviews would all be in violation of copyright.

In this analogy, the book report contains verbatim substantial sections outside the scope of fair use. You know, plagiarism.
I really don't understand how we got to the ridiculous world where people are so eager to support large corporations' government-granted monopolies on the very foundations of our culture and shared knowledge.
> I get a strong "pot calls kettle black" vibe from this, considering how most AI companies appear to be digesting so much data without asking permission.

They're all doing it in some way. If they're not using user data for training future models, they're using it for evals to improve system prompts and to check quantized model performance.
> Getting every single ChatGPT log just to try and find something that's reproducing NYT content, given the likely numbers of chats that have nothing to do with NYT content, seems like an incredible fishing expedition?

Who said the NYT is getting every single chat log?
How is this different than any other time the court requires private data to be sent to them and safeguarded during a trial? These (anonymized) chat logs aren't really any different than other evidence sent during trials (financial documents, internal company memos). Discovery and evidence retention during that and the trial is a key part of the American judicial system.
OpenAI is saying "we want our data (to hide evidence and train our product), but no, you can't see it. That's private to the justice system, no matter what crime we may commit with it."
I also doubt that OpenAI has, as Edelson suggested, "some of the most sensitive data on the planet." They have SSNs, personal health data, financial records? If so, then holy fuck, OpenAI is more despicable than I knew.
> In this analogy, the book report contains verbatim substantial sections outside the scope of fair use. You know, plagiarism.

Plagiarism isn't the same thing as violating copyright.
I'd argue that any data you've willingly handed over to a third party isn't really private anymore.
Oh yes - and don't forget, OpenAI has a commercial version of this service they sell to enterprises (ChatGPT Enterprise). Certainly some of these businesses are using data with a variety of sensitivities/classifications... They also offer a government-based service now too. I wonder if this ruling means that NYT can access that data trove as well?
> But "lawyers have notoriously been pretty bad about securing data," Edelson suggested, so "the idea that you've got a bunch of lawyers who are going to be doing whatever they are" with "some of the most sensitive data on the planet" and "they're the ones protecting it against hackers should make everyone uneasy."
Which is the whole point of OpenAI screaming and whining and throwing a fit about this order.
They know damn well that huge chunks of verbatim copyrighted material were vomited up in users' faces. They know damn well they violated all kinds of copyright laws scraping that shit without permission or payment. They also know damn well that they're not making a fucking dime on it, because unlike a traditional business, which can scale up to become more efficient, AI doesn't scale AT ALL. You have to add exactly the same capacity, and costs, for every user. So the more users they get, the proportionally more money they have to spend.
Paying for copyright access is just another cost, and they don't have the funds to spare to lay that out, or to add to their per-user costs every time their database is accessed.
I don't make any predictions about when AI will become a smoking crater, mostly because there seems to be an infinite supply of idiots who keep trying it out and even throwing more money at it. But it's not sustainable as a business as it's currently implemented. And I'd guess sooner or later, after lots and lots more good money is thrown after bad, they'll figure out that the way they're doing it now is not a viable way to do business.
I don't know exactly how it could become viable, given the way it works, but that's well hidden behind a SEP field, buried in an IDGAS hole.
> You can wonder, or you can read the fine article, which explicitly says it's excluded. Your choice, of course.

Fair point, I missed the linked article in the first paragraph, which had that explicitly laid out. Thanks.
> You can wonder, or you can read the fine article, which explicitly says it's excluded. Your choice, of course.

I believe I've quoted the relevant section in more than one of the prior stories on this case.
> How is this different than any other time the court requires private data to be sent to them and safeguarded during a trial? These (anonymized) chat logs aren't really any different than other evidence sent during trials (financial documents, internal company memos). Discovery and evidence retention during that and the trial is a key part of the American judicial system.
>
> OpenAI is saying "we want our data (to hide evidence and train our product), but no, you can't see it. That's private to the justice system, no matter what crime we may commit with it."
>
> I also doubt that OpenAI has, as Edelson suggested, "some of the most sensitive data on the planet." They have SSNs, personal health data, financial records? If so, then holy fuck, OpenAI is more despicable than I knew.

You may have never used ChatGPT, but I imagine that they have tax records, bank records, medical records, SSNs, and more in the chats that the court ordered preserved. Users can upload PDFs, zip files, and images and then ask questions about the uploaded data. Even if the uploaded data is not retained by this order (unclear to me if it is), any responses from the AI answering questions about the data or analysis will be retained under this preservation order.
> You may have never used ChatGPT, but I imagine that they have tax records, bank records, medical records, SSNs, and more in the chats that the court ordered preserved. Users can upload PDFs, zip files, and images and then ask questions about the uploaded data. Even if the uploaded data is not retained by this order (unclear to me if it is), any responses from the AI answering questions about the data or analysis will be retained under this preservation order.

For sure OpenAI will just provide them dirty data, probably anonymized, while keeping very processed personal data of its users to monetize them, right?
> Thirdly, they would have to have a licensing agreement with every copyright holder! You don't need to do this to have a public library where people can look up facts, assuming you acquired the books legally, which is good because it would be impossible to actually achieve! LLM creators are hoping to have the same reliance on transformative use hold for them so that they don't need to license with all the copyright holders of their training data. Whether the use of the data is deemed fair use in the end is still an open question, however.

I suggest you go and talk to an actual librarian. Trust me, they'll be delighted to tell you all about publisher licensing agreements. I hope you have several hours to spare.