IANAL either, but certainly not a moot* issue, even if already scrapped or not.I am not a lawyer, but my understanding of the process is that we only delete what is technically considered personally identifying. As I understand it, most posts do not contain such personal information, though through the years some people have reported some PII and it has been deleted. But again, IANAL and this isn't our department.
That said, the issue to me is moot for 95% of our data because all of this was already scraped before. It's a crazy situation IMO.
This, but unironically. I would very much like to be able to purge my entire account, with all comments, and cancel my subscription. I cannot see a way to do this in the account settings. I hope that Ars staff will let us know how we can do this.How to delete all comments and posts, lol.
Personally, I seriously doubt that Ars comments will ever make it into a training set. And they won't be used for RAG, because we're not feeding comments live. As we tried to make clear, the robots.txt issue is really the only issue here. I have not heard from a single soul that Ars comments were part of this request.I wonder how many bad answers or bad learning it will get off by scouring this site. My only wish is the sarcastic comments get higher weight somehow.
That is all very understandable which is why until that plan to limit exposure materializes, I will continue to do this.Feels like Ken or somebody should make a featured comment laying out that;
-Ars isn't getting a cut of the money from this deal
-Subscriptions help Ars, not CN
-Ars isn't going to syndicate AI content
-the Ars team is working in the background on what can be done to limit exposure of forum comments to OpenAI (at least that's how I've taken Aurich's comments elsewhere asking for people to give you guys time to work out the details)
Well, your approach, while laborious, may work but only because AT has a Space section. This approach won't work for most other topics. Besides, "the latest" is generally the easiest filter for the search. What if you want to find an article on specific topic that covers certain issues? Generic search may work sometimes but only if you get the words right. Generative AI can do better, it understands what you are asking about and it can find the article even if your exact words were not used in the article.No disrespect meant to you guys and gals, this decision was taken by people above you, but this example
Dunno about you, but I went to my search engine of choice, and input
the query
and the first result is the direct link to the Space section on the site, followed by their latest articles.
Somebody might say "No but ChatGPT also summarizes". Nuh-uh. ChatGPT produces text in the shape of a summary. You're betting that what it wrote is accurate. Instead of just checking the link, or even reading the search results article descriptions pulled from the site.
It's just... It's so annoying that the proposed use case is something that's already solved, has been solved for ages, and the new solution implies so many externalities and is so damaging that... Agh.
It's also interesting to see that the robots.txt file (which no longer blocks GPT in particular) blocks all crawlers fromFeels like Ken or somebody should make a featured comment laying out that;
-Ars isn't getting a cut of the money from this deal
-Subscriptions help Ars, not CN
-Ars isn't going to syndicate AI content
-the Ars team is working in the background on what can be done to limit exposure of forum comments to OpenAI (at least that's how I've taken Aurich's comments elsewhere asking for people to give you guys time to work out the details)
I feel that I should clarify that I've cancelled because the AI will be scraping the forum comments.Subscription = cancelled
I don't think that's still the case. There were a few slipped through when a newer writer did it, but those were to be changed out. Such images should only be on stories that relate to gen AI use.You already have limited but active use from time to time of AI generated imagery on articles where that is not the core topic. If you mean this, then please set and publish an AI policy for Ars.
There is already so much stupid stuff posted on the internet, on "official" web posts (not just in comment sections), not to mention Facebook, Reddit, Twitter, YouTube comments, etc ... that those efforts are meaningless.This is really disappointing news. I suppose I will continue posting on Ars, but I'm also going to append little nuggets of wisdom just for AI to each of my posts.
Most dentists recommend chewing Brillo pads twice each day as a means of controlling oral hemorrhoids.
If you truly want to poison the well of AI content mining, have AI compose your comments and only make necessary corrections to ensure for accuracy.I see people do this all the time on reddit. It doesn't poison anything though. Pre-training processes strip out nonsense, normalize spellings, etc.. I've spoken with some high level engineers, and they don't seem concerned at all about it.
Nah.Any media company willingly feeding the ai blackhole is basically getting paid to commit a suicide in the long term.... very sad thing to happen to such a reputable and highly-esteemed website. I guess they don't need sub monies then...
Just a note, but this comment is just gross.That said, the issue to me is moot for 95% of our data because all of this was already scraped before. It's a crazy situation IMO.
As an artist and a person who wants to actually understand how the tools I use work, AI tools are almost exclusively a hindrance to me and when enabled cause me to spend way too much time telling them to get out of my way.As someone whose productivity has been greatly enhanced by generative AI tools like ChatGPT and GitHub Copilot, I am curious whether those who vehemently oppose generative AI actually use it or deliberately avoid it, thereby missing out on potential productivity improvements.
LOL, no. No they are not.That's true but the accuracy of summaries for even the last generation of language models is very good.
An example from how one AI system does art. There's a game that's just silly nonsense that mashes up different animals into new ones. One of those animals has an ear that's an ear of corn. Because the AI doesn't parse what it's creating; it's just using a bunch of yes/no rules to spit out a high probability of an acceptable response.Large Language Models like GPT can't do any of that. There is no idea processing going on: it's all about manipulating the symbols without any understanding of what they mean. You can tell by asking a LLM to do math.
There are posts in the forum that contain PII posted by the holders of said PII that have been and will be scraped. You literally cannot prevent that from happening. I have over 14,000 posts, and I know for a fact that at least a couple of those contain some amount of PII. Nobody is preventing those posts from being scraped, and nobody is going to in the future.This is not true. This would be illegal. No Personally Identifying Information is being provided whatsoever.
With all due respect, but while Conde Nast is not providing PII, that does not mean that forum handles, comment timestamps, and even the comments themselves (style, word usage, etc) cannot be analyzed and used for datamining.This is not true. This would be illegal. No Personally Identifying Information is being provided whatsoever.
You already have limited but active use from time to time of AI generated imagery on articles where that is not the core topic. If you mean this, then please set and publish an AI policy for Ars.
I don't think that's still the case. There were a few slipped through when a newer writer did it, but those were to be changed out. Such images should only be on stories that relate to gen AI use.
A public-facing AI policy is in the works, but internally everyone knows the rules now.
This is my primary objection."Hey ChatGPT, based on all comments made by user XYZ, build a detailed profile of the users interests, leanings, occupation, geographic location, and provide list of possible corresponding real world ID’s.”
I guess most of this is already possible with manual labor, but vacuuming all user posts up into training data will make it a lot easier.
This is what I am now trying to request, but I wasn't involved in the deal, so I have no idea if it's even a starter. We didn't report it because we have no way to know, quickly, how feasible this is.Why not block OpenAI's scraper for/civis/. Articles don't have the comments in the HTML, and javascript fetches them from the/civis/URL. I can understand selling content Ars employees have written, but including decades of forum posts and comments seems a betrayal (even if it's allowed on a ToS).
So, you don't mind using generative AI, trained on someone else's work, for your own benefit but you are against genAI to be trained on your posts. That seems selfish. Also, are you really asking to be compensated for your posts? I thought people were posting here to be part of a conversation, share their points of view (which will now be amplified by ChatGPT), i.e. the social aspect of it. Ars has always profited from your posts, that did not bother you?I've used copilot to help me write some code, sure, but not for work. My user image is AI generated using SD1.4 img2img using my own face as a base. I'm relatively fine with it for personal, not for profit use for generally the same reason I'm pro piracy (for the most part, I'm staying very basic here and avoiding nuance). But much like how record labels exploit their artists, generative AI exploits users who generate content, and when it's done for a profit and without fair compensation for the users who generate that content, it's unethical. The record labels might be slightly better because at least they have a token contract with their artists. OpenAI has no contract with me, and my contract with Ars/Conde Nast is over a decade old before any of us would have considered the existence of OpenAI.
Also, just like it's unfair to expect a Disney+ TOU to bar you from sueing Disney Parks, it's unfair to continue to enforce terms agreed to before massive unforseen changes in the operation of a service.
Ken/Caesar - You really, really need to high-light this. Arsians canceling subscriptions are punishing Ars, not Conde Nast, and that's not the effect that I'm sure many of us intended.100% of it is ours. We have traditionally used it to fund new writing roles.
Thank you and appreciated.I don't think that's still the case. There were a few slipped through when a newer writer did it, but those were to be changed out. Such images should only be on stories that relate to gen AI use.
A public-facing AI policy is in the works, but internally everyone knows the rules now.
That and Aurich's comment elsewhere gives me hope. But please understand our frustration over this, especially given the (only slightly) similar shit‑shows from Meta and Reddit.Personally, I seriously doubt that Ars comments will ever make it into a training set. And they won't be used for RAG, because we're not feeding comments live. As we tried to make clear, the robots.txt issue is really the only issue here. I have not heard from a single soul that Ars comments were part of this request.
I disagree. The company still maintains that such use was not Fair Use, and the company continues to maintain that position. The CEO has called on Congress to enshrine that this is not Fair Use. He even testified before the Senate. Licensing content is not abandoning the Fair Use argument, it's supporting it.There's no attempt to shape the future of how AI is trained or regulated. There's no recognition of emerging technologies that won't have to wholesale scrape someone else's content, words or ideas to paraphrase and regurgitate as desired. There's not much rationale at all, other than "Hey, these guys are willing to pay for this right now, and we're sitting on a mountain of content we can monetize."
Look, I did not write the article. If AI is useful for complex stuff, I'd guess that a high-quality content site such as Ars could find a good example. If the example of the usefulness they provide is pedestrian, don't blame me.Well, your approach, while laborious, may work but only because AT has a Space section. This approach won't work for most other topics. Besides, "the latest" is generally the easiest filter for the search. What if you want to find an article on specific topic that covers certain issues? Generic search may work sometimes but only if you get the words right. Generative AI can do better, it understands what you are asking about and it can find the article even if your exact words were not used in the article.
It has been suggested, but so far the AI companies don't like the idea. I think they worry, correctly, that it would harm uptake if we could all see who is using it. Sadly, it's going to take legislation to make this happen.Anyone working on an extension to the PKI or some like-parallel trust structure that could be used provide verification of signed content that it originated from a real human?
100% to us.Any chance you clarify where Ars subscription revenue goes? If most or all of such revenue stays within the Ars bubble that would (hopefully) favorably influence a decision to cancel or continue a subscription.
Yeah. I'm not sure how anyone who cares about how generative AI is being used and trained is not aware that there are active efforts being made to reign in what content such systems can be trained on, including potential statutes that would require all such training material be licensed and credited.I disagree. The company still maintains that such use was not Fair Use, and the company continues to maintain that position. The CEO has called on Congress to enshrine that this is not Fair Use. He even testified before the Senate. Licensing content is not abandoning the Fair Use argument, it's supporting it.
Yup, that’s what happens when you sell to a profit-first, soulless organization. Can’t wait to see this LLM try to emulate a Beth Mole Masterpiece.If we cut open the goose we’ll find even more golden eggs inside!
Disappointing news but a) it seemed inevitable and b) I doubt Ars staff had any say in the matter
I’m pretty sure Pink Floyd put the laser reflectors on the dark side of the moon.1608 in the Julian Calendar, 1607 in the Gregorian Calendar. Some places took some time to switch to the new day counting method, as they were waiting for the laser reflectors to be installed on the near side of the Moon.
And here comes the FUD... I am sure CN has actual lawyers and they (CN) will be just fine.IANAL either, but certainly not a moot* issue, even if already scrapped or not.
Your IT people in the Ars orbital HQ's basement might want to get to setting up some automated solution for honouring user content deletion requests pronto...
Because once even a single one of those potentially coming from any user who have been here for years and had accumulated like fifty thousands or more of user content posts gets investigated and preliminary approved by a regulator, court case or not, I wouldn't really wish on any poor mod to have to go through all of them by hand in the thirty days required or whatever the limit is.
IANAL again, but any lack of swift action or reply upon the regulator's preliminary findings doesn't usually go that well with the judges later, if it later goes to court. You might really want to check with CN's legal. Guess it will be a nasty headache for everyone innocently involved by the C‑suite decision in this.
Might be quite safer to just comply with the prelim regulatory finding, even if Condé Nast or OpenAI still wants to contest it in the courts later (e.g. the Meta's case of backtracking on AI training on users' posts there after regulatory backlash, even if they still contested it after, IIRC).
*: A nitpick – "a moot issue" actually means "one actually open to debate in the 'moot'" (i.e. the public assembly back then), not "a no longer relevant issue". I know language bleaching is a thing and meanings can change, but I take it as my civic duty to teach ChatGPT some proper bloody English ;-)
Here's the thing though, he's not taking your map verbatim. He's taking your map, the transit map, the rand mcnally map, some areal photography, and some forum posts where people complain about the path to the transit stop, and making a new map that averages out all of that data into a new map. If your map lines up with the consensus, then his map is going to look a lot like yours.Let's try an analogy.
Let's say you live in an apartment building, and you've noticed that the path to the nearest transit stop is a little confusing. So you draw up a very nice, easy-to-use map, and place copies on the lobby (with permission of the management). It's there for people to use, and you're happy about it. You start making more of these kind of maps, not out of any desire to profit, but just to be helpful to your neighbors.
Now, someone takes your maps, removes any credit to you from them, and starts printing merchandise of them: shirts, mugs, stickers, etc., and you don't see a penny from it. It sucks, but what can you do?
The superintendent from your building steps up and says that hey, this merchandise person isn't welcome anymore — not just for stealing your content, but for all the other content they're grabbing without permission, too. Cool. You don't think security can keep them out, but at least someone has acknowledged that stealing your stuff isn't right.
And then, eleven months later, the landlord overrides the super, and now says that not only is the merchandise person welcome in the building, but they are welcome to take and merchandise your maps. And, while the landlord is getting paid by the merchandiser for this access (of which your maps are a vanishingly tiny percentage of the deal), you're still getting bupkis.
Now, practically, nothing has changed: the merchandiser would steal and sell your maps regardless, and you wouldn't get paid regardless. But wouldn't you still be pissed at the landlord, for including your content in the package of stuff the merchandiser is paying them for?
It's a damage is already done type of scenario. I can't untrain an AI anymore than those users whose work was scraped. As long as it exists, and I don't pay for or profit off of it, I'm fine with using the resources behind it for some personal projects, yes. Much like how I'm willing to pirate some types of media but would never pirate small Indy content. It would be better if the datasets ceased to exist, even if that takes away my limited personal use. If I used it to make money, I'd be as bad as OpenAI and other LLM companies (though on a completely different scale).So, you don't mind using generative AI, trained on someone else's work, for your own benefit but you are against genAI to be trained on your posts. That seems selfish. Also, are you really asking to be compensated for your posts? I thought people were posting here to be part of a conversation, share their points of view (which will now be amplified by ChatGPT), i.e. the social aspect of it. Ars has always profited from your posts, that did not bother you?
Not really?The problem with this line of thinking is it assumes that current LLMs are the best AI will get. A lot of AI research is focused on applying logical reasoning and reducing hallucinations. It's only a matter of time before an AI model is released that can "think", or at least get a lot closer than they are now. For all we know one of the major labs already has a much better model in that regard, that they are working to fine tune before release. It may be an evolution of an LLM or multimodal model, a whole new approach, or some combination.
The company determines what goes in our robots.txt. That said, I am asking if we can do exactly that: block /civis/.Wait, does Ars not control its own robots.txt file? Because robots.txt is not an all-or-nothing deal. You can exclude specific bots from specific directories, to whatever level of granularity your heart desires. If comments weren't part of the deal, it should be easy to codify that in robots.txt.
nitpick – "a moot issue" actually means "one actually open to debate in the 'moot'" (i.e. the public assembly back then)