Musk can't convince judge public doesn’t care about where AI training data comes from.
Judging from Grok's output, I wonder if xAI is now mining preteen angst for training data. That would certainly give it an edge over OpenAI.

However, this information is precisely what makes xAI valuable, with its intensive data sourcing supposedly setting it apart from its biggest rivals, xAI argued.
“It strains credulity to essentially suggest that no consumer is capable of making a useful evaluation of Plaintiff’s AI models by reviewing information about the datasets used to train them and that therefore there is no substantial government interest advanced by this disclosure statute,” Bernal wrote.
Sadly, because it's Elon Musk saying it, we can be sure it's false.

Musk fails to block California data disclosure law he fears will ruin xAI
So "We have no special sauce at all," basically. At this point it's safe to assume that essentially all the model makers in play have access to all the same training data (e.g., everything ever put on the Internet).

However, this information is precisely what makes xAI valuable, with its intensive data sourcing supposedly setting it apart from its biggest rivals, xAI argued.
"Go fast and break things" does not seem like it's a natural fit for "keep extensive and careful documentation", does it?

I do suspect that AI companies don't actually know the source of all their training data, they just gobble up all the data they can. I also suspect that they like not knowing.
I was thinking, can we project this onto the side of every corporate HQ owned by Musk? lol

Can I get a tee shirt with this?
That reminds me. I'm still waiting on my DOGE rebate check.

Hmmm, let's shave with Occam's Razor:
Does Elon want to keep xAI's data sources secret because they are so much better at finding training materials than other AI companies?
Or does Elon want to keep xAI's data sources secret because many of them are copyright infringing or illegal, like the DOGE Social Security dataset that got copied by his henchmen?
Ha! I think Grok compiled that directly from various Ars posts/comments related to tesla/Xxx/skuM articles.
I can speed that up for a small "processing fee."

That reminds me. I'm still waiting on my DOGE rebate check.
They’d start datamining 4Chan.

If competitors could see the sources of all of xAI’s datasets…
It’s in the mail along with your tariff refund check.

That reminds me. I'm still waiting on my DOGE rebate check.
That's a lot of ad revenue to give up.

Would the ultimate solution to Musk’s problems simply be to not operate xAI in CA?
Allowing enforcement could be “economically devastating” to xAI, Musk’s company argued, effectively reducing “the value of xAI’s trade secrets to zero,” xAI’s complaint said.
Further, xAI insisted, these disclosures “cannot possibly be helpful to consumers” while supposedly posing a real risk of gutting the entire AI industry.
It’s like Pornhub being blocked in 23 states because of age ID laws. Yes, if you use a VPN you can bypass the blocks. But it adds another level of hoops to jump through.

I've wondered this about the various EU lawsuits as well. I don't see how CA or the EU can prevent people from using Grok, and having the actual data centers somewhere else is almost certainly cheaper. Can they prevent people from downloading xAI apps or something?
Yes. They could theoretically geofence people based on location. But then people could just use a VPN if they really wanted to use Grok.

I've wondered this about the various EU lawsuits as well. I don't see how CA or the EU can prevent people from using Grok, and having the actual data centers somewhere else is almost certainly cheaper. Can they prevent people from downloading xAI apps or something?
Vast copyright infringement is the entire basis for modern LLMs. Plus actually stealing the copyrighted source material rather than paying for it.

The only way they could gut the entire AI industry is if revealing the training data exposed vast copyright infringement or abuse of personal information. And if that happens? Good, let's get started.
That seems as clear a statement of "we have nothing original to offer here" as you could ask for.

[their training data] is what makes xAI valuable, with its intensive data sourcing supposedly setting it apart from its biggest rivals, xAI argued.
As to the cleaning or pre-processing of data prior to training, they could simply say "Yes, Data Set Z was cleaned to remove objectionable material." Of course, if they cleaned it to remove unfavorable references to Musk... or didn't clean the data set at all to remove objectionable material (while other AI model developers did...).

(a) A high-level summary of the datasets used in the development of the generative artificial intelligence system or service, including, but not limited to:
(1) The sources or owners of the datasets.
(2) A description of how the datasets further the intended purpose of the artificial intelligence system or service.
(3) The number of data points included in the datasets, which may be in general ranges, and with estimated figures for dynamic datasets.
(4) A description of the types of data points within the datasets. For purposes of this paragraph, the following definitions apply:
    (A) As applied to datasets that include labels, “types of data points” means the types of labels used.
    (B) As applied to datasets without labeling, “types of data points” refers to the general characteristics.
(5) Whether the datasets include any data protected by copyright, trademark, or patent, or whether the datasets are entirely in the public domain.
(6) Whether the datasets were purchased or licensed by the developer.
(7) Whether the datasets include personal information, as defined in subdivision (v) of Section 1798.140.
(8) Whether the datasets include aggregate consumer information, as defined in subdivision (b) of Section 1798.140.
(9) Whether there was any cleaning, processing, or other modification to the datasets by the developer, including the intended purpose of those efforts in relation to the artificial intelligence system or service.
(10) The time period during which the data in the datasets were collected, including a notice if the data collection is ongoing.
(11) The dates the datasets were first used during the development of the artificial intelligence system or service.
(12) Whether the generative artificial intelligence system or service used or continuously uses synthetic data generation in its development. A developer may include a description of the functional need or desired purpose of the synthetic data in relation to the intended purpose of the system or service.
Is the secret sauce of AI the data sources or the actual programming? I figure it'd be useful to have sources that aren't well known to give a model an edge, but ultimately the training should matter more than the dataset, all things being equal. I get that this is the wild-west starting point and tech companies have no morals if it makes them money, but eventually I would think there'd be some kind of standard or regulation of what training data can be used.

Specifically, xAI argued that its dataset sources, dataset sizes, and cleaning methods were all trade secrets.
“If competitors could see the sources of all of xAI’s datasets or even the size of its datasets, competitors could evaluate both what data xAI has and how much they lack,” xAI argued. In one hypothetical, xAI speculated that “if OpenAI (another leading AI company) were to discover that xAI was using an important dataset to train its models that OpenAI was not, OpenAI would almost certainly acquire that dataset to train its own model, and vice versa.”