Musk fails to block California data disclosure law he fears will ruin xAI

Lexus Lunar Lorry

Ars Scholae Palatinae
846
Subscriptor++
However, this information is precisely what makes xAI valuable, with its intensive data sourcing supposedly setting it apart from its biggest rivals, xAI argued.
Judging from Grok's output, I wonder if xAI is now mining preteen angst for training data. That would certainly give it an edge over OpenAI.
 
Upvote
88 (88 / 0)

Aurich

Director of Many Things
40,904
Ars Staff
“It strains credulity to essentially suggest that no consumer is capable of making a useful evaluation of Plaintiff’s AI models by reviewing information about the datasets used to train them and that therefore there is no substantial government interest advanced by this disclosure statute,” Bernal wrote.

[raises hand]

Hi, resident of California here. I care about where your training comes from.
 
Upvote
403 (405 / -2)

DrewW

Ars Tribunus Militum
1,928
Subscriptor++
Hmmm, let's shave with Occam's Razor:

Does Elon want to keep xAI's data sources secret because they are so much better at finding training materials than other AI companies?

Or does Elon want to keep xAI's data sources secret because many of them are copyright infringing or illegal, like the DOGE Social Security dataset that got copied by his henchmen?
 
Upvote
452 (452 / 0)

OldPhartReef

Ars Centurion
306
Subscriptor
Good. IMHO, anything that thwarts Musk's insanity is a good thing.

From a practical policy discussion standpoint, I'm of the opinion that all AI developers should be forced to divulge where their training data comes from. Rampant intellectual property theft was what allowed the rapid development of LLMs. IP owners deserve the transparency to determine for themselves whether they want to exercise their rights.
 
Upvote
208 (208 / 0)

Wheels Of Confusion

Ars Legatus Legionis
75,398
Subscriptor
Musk fails to block California data disclosure law he fears will ruin xAI
Sadly, because it's Elon Musk saying it we can be sure it's false.

However, this information is precisely what makes xAI valuable, with its intensive data sourcing supposedly setting it apart from its biggest rivals, xAI argued.
So "We have no special sauce at all," basically. At this point it's safe to assume that essentially all the model makers in play have access to all the same training data (e.g., everything ever put on the Internet).
In fact it's worse than that. If Musk is alleging that only the proportional contents of his training data set the company apart from its rivals and that this law will not harm its rivals in the same way it would harm his company, that points to a unique vulnerability in their approach.
 
Upvote
117 (118 / -1)

Frodo Douchebaggins

Ars Legatus Legionis
11,995
Subscriptor
 
Upvote
292 (301 / -9)

graylshaped

Ars Legatus Legionis
67,695
Subscriptor++
Publicly available content is not a trade secret by definition. Data developed internally might be. Declare both under penalty of perjury, redacting the internal data subject to independent audit confirming its proprietary nature.

Of course Musk can't prove OpenAI stole xAI's secret ingredient. There is no secret ingredient.

 
Upvote
94 (94 / 0)

Sarty

Ars Tribunus Angusticlavius
7,816
I do suspect that AI companies don't actually know the source of all their training data, they just gobble up all the data they can. I also suspect that they like not knowing.
"Move fast and break things" does not seem like a natural fit for "keep extensive and careful documentation", does it?

Doers versus checkers, as one corresponding author might put it.
 
Upvote
100 (101 / -1)

msawzall

Ars Tribunus Angusticlavius
7,354
Hmmm, let's shave with Occam's Razor:

Does Elon want to keep xAI's data sources secret because they are so much better at finding training materials than other AI companies?

Or does Elon want to keep xAI's data sources secret because many of them are copyright infringing or illegal, like the DOGE Social Security dataset that got copied by his henchmen?
That reminds me. I'm still waiting on my DOGE rebate check.
 
Upvote
122 (122 / 0)

Hypatia

Ars Centurion
202
Subscriptor
Remember when the claim was that this approach to technological development would give birth to a digital god that conjures infinite wealth and creates utopia…

…and now a simple demand for transparency will somehow undo the whole enterprise?

They clearly think we are the dumbest carbon-based life forms in existence.
 
Upvote
124 (124 / 0)

OtherSystemGuy

Ars Scholae Palatinae
1,284
Subscriptor++
I'm sorry, I must have missed something, or there's a part of the California law that wasn't mentioned. Musk kept bringing up data cleaning, but TFA didn't mention that as part of the law. So this must be a smoke screen to inflate the supposed devastation to xAI. Data cleaning could be IP if the company came up with unique ways to clean out things like CSAM. The fact that they mention cleaning multiple times tells me they don't actually have a plan.
 
Upvote
33 (33 / 0)
If Aaron Swartz was potentially liable (he took his own life due to the savagery of the prosecution) for collecting and making available the content of PACER, then I'd argue corporations should be held to an equal or higher standard and be transparent about their data sources.

At least Aaron wasn't looking for monetary profit from what he did. I don't see Elon Musk or a number of the other "AI bros" working "for the good of humanity".
 
Upvote
91 (91 / 0)

jdale

Ars Legatus Legionis
18,261
Subscriptor
Further, xAI insisted, these disclosures “cannot possibly be helpful to consumers” while supposedly posing a real risk of gutting the entire AI industry.

The only way they could gut the entire AI industry is if revealing the training data exposed vast copyright infringement or abuse of personal information. And if that happens? Good, let's get started.
 
Upvote
71 (71 / 0)

AusPeter

Ars Praefectus
5,086
Subscriptor
I've wondered this about the various EU lawsuits as well. I don't see how CA or the EU can prevent people from using Grok, and having the actual data centers somewhere else is almost certainly cheaper. Can they prevent people from downloading xAI apps or something?
It’s like Pornhub being blocked in 23 states because of age ID laws. Yes, if you use a VPN you can bypass the blocks. But it adds another level of hoops to jump through.

Ultimately however, you’re back at the old adage of the internet treating censorship as damage and routing around it.
 
Upvote
29 (29 / 0)

Uragan

Ars Legatus Legionis
11,176
I've wondered this about the various EU lawsuits as well. I don't see how CA or the EU can prevent people from using Grok, and having the actual data centers somewhere else is almost certainly cheaper. Can they prevent people from downloading xAI apps or something?
Yes. They could theoretically geofence people based on location. But then people could just use a VPN if they really wanted to use Grok.
 
Upvote
20 (20 / 0)

markgo

Ars Praefectus
3,776
Subscriptor++
The only way they could gut the entire AI industry is if revealing the training data exposed vast copyright infringement or abuse of personal information. And if that happens? Good, let's get started.
Vast copyright infringement is the entire basis for modern LLMs. Plus actually stealing the copyrighted source material rather than paying for it.
 
Upvote
68 (68 / 0)
The text of the bill is quite short, but the important portion includes:
(a) A high-level summary of the datasets used in the development of the generative artificial intelligence system or service, including, but not limited to:
  (1) The sources or owners of the datasets.
  (2) A description of how the datasets further the intended purpose of the artificial intelligence system or service.
  (3) The number of data points included in the datasets, which may be in general ranges, and with estimated figures for dynamic datasets.
  (4) A description of the types of data points within the datasets. For purposes of this paragraph, the following definitions apply:
    (A) As applied to datasets that include labels, “types of data points” means the types of labels used.
    (B) As applied to datasets without labeling, “types of data points” refers to the general characteristics.
  (5) Whether the datasets include any data protected by copyright, trademark, or patent, or whether the datasets are entirely in the public domain.
  (6) Whether the datasets were purchased or licensed by the developer.
  (7) Whether the datasets include personal information, as defined in subdivision (v) of Section 1798.140.
  (8) Whether the datasets include aggregate consumer information, as defined in subdivision (b) of Section 1798.140.
  (9) Whether there was any cleaning, processing, or other modification to the datasets by the developer, including the intended purpose of those efforts in relation to the artificial intelligence system or service.
  (10) The time period during which the data in the datasets were collected, including a notice if the data collection is ongoing.
  (11) The dates the datasets were first used during the development of the artificial intelligence system or service.
  (12) Whether the generative artificial intelligence system or service used or continuously uses synthetic data generation in its development. A developer may include a description of the functional need or desired purpose of the synthetic data in relation to the intended purpose of the system or service.
As to the cleaning or pre-processing of data prior to training, they could simply say, "Yes, Data Set Z was cleaned to remove objectionable material." Of course, if they cleaned it to remove unfavorable references to Musk.... or didn't clean the data set at all to remove objectionable material (while other AI model developers did...).

The requirements all seem reasonable to me. There are exceptions to the disclosure requirement for models related to security/integrity, operation of aircraft, and models exclusively provided to a federal entity for defense/national security/military.
 
Upvote
35 (35 / 0)

Granadico

Ars Scholae Palatinae
1,161
Specifically, xAI argued that its dataset sources, dataset sizes, and cleaning methods were all trade secrets.

“If competitors could see the sources of all of xAI’s datasets or even the size of its datasets, competitors could evaluate both what data xAI has and how much they lack,” xAI argued. In one hypothetical, xAI speculated that “if OpenAI (another leading AI company) were to discover that xAI was using an important dataset to train its models that OpenAI was not, OpenAI would almost certainly acquire that dataset to train its own model, and vice versa.”
Is the secret sauce of AI the data sources or the actual programming? I figure it'd be useful to have sources that aren't well known to give a model an edge, but ultimately the training should matter more than the dataset, all things being equal. I get that this is the Wild West starting point and tech companies have no morals if it makes them money, but eventually I'd think there'd be some kind of standard or regulation of what training data can be used.
 
Upvote
22 (22 / 0)