Musk fails to block California data disclosure law he fears will ruin xAI

Lexus Lunar Lorry

Ars Scholae Palatinae
846
Subscriptor++
However, this information is precisely what makes xAI valuable, with its intensive data sourcing supposedly setting it apart from its biggest rivals, xAI argued.
Judging from Grok's output, I wonder if xAI is now mining preteen angst for training data. That would certainly give it an edge over OpenAI.
 
Upvote
88 (88 / 0)

Aurich

Director of Many Things
40,904
Ars Staff
“It strains credulity to essentially suggest that no consumer is capable of making a useful evaluation of Plaintiff’s AI models by reviewing information about the datasets used to train them and that therefore there is no substantial government interest advanced by this disclosure statute,” Bernal wrote.

[raises hand]

Hi, resident of California here. I care about where your training comes from.
 
Upvote
403 (405 / -2)

DrewW

Ars Tribunus Militum
1,928
Subscriptor++
Hmmm, let's shave with Occam's Razor:

Does Elon want to keep xAI's data sources secret because they are so much better at finding training materials than other AI companies?

Or does Elon want to keep xAI's data sources secret because many of them are copyright infringing or illegal, like the DOGE Social Security dataset that got copied by his henchmen?
 
Upvote
452 (452 / 0)

OldPhartReef

Ars Centurion
306
Subscriptor
Good. IMHO, anything that thwarts Musk's insanity is a good thing.

From a practical policy discussion standpoint, I'm of the opinion that all AI developers should be forced to divulge where their training data comes from. Rampant intellectual property theft was what allowed the rapid development of LLMs. IP owners deserve the transparency to determine for themselves whether they want to exercise their rights.
 
Upvote
208 (208 / 0)

Wheels Of Confusion

Ars Legatus Legionis
75,398
Subscriptor
Musk fails to block California data disclosure law he fears will ruin xAI
Sadly, because it's Elon Musk saying it we can be sure it's false.

However, this information is precisely what makes xAI valuable, with its intensive data sourcing supposedly setting it apart from its biggest rivals, xAI argued.
So "We have no special sauce at all," basically. At this point it's safe to assume that essentially all the model makers in play have access to all the same training data (e.g., everything ever put on the Internet).
In fact it's worse than that. If Musk is alleging that only the proportional contents of his training data set the company apart from its rivals and that this law will not harm its rivals in the same way it would harm his company, that points to a unique vulnerability in their approach.
 
Upvote
117 (118 / -1)

Frodo Douchebaggins

Ars Legatus Legionis
11,995
Subscriptor
 
Upvote
292 (301 / -9)

graylshaped

Ars Legatus Legionis
67,695
Subscriptor++
Publicly available content is not a trade secret by definition. Data developed internally might be. Declare both under penalty of perjury, redacting the internal data subject to independent audit confirming its proprietary nature.

Of course Musk can't prove OpenAI stole xAI's secret ingredient. There is no secret ingredient.

 
Upvote
94 (94 / 0)

Sarty

Ars Tribunus Angusticlavius
7,816
I do suspect that AI companies don't actually know the source of all their training data, they just gobble up all the data they can. I also suspect that they like not knowing.
"Move fast and break things" does not seem like a natural fit for "keep extensive and careful documentation", does it?

Doers versus checkers, as one corresponding author might put it.
 
Upvote
100 (101 / -1)

msawzall

Ars Tribunus Angusticlavius
7,354
Hmmm, let's shave with Occam's Razor:

Does Elon want to keep xAI's data sources secret because they are so much better at finding training materials than other AI companies?

Or does Elon want to keep xAI's data sources secret because many of them are copyright infringing or illegal, like the DOGE Social Security dataset that got copied by his henchmen?
That reminds me. I'm still waiting on my DOGE rebate check.
 
Upvote
122 (122 / 0)

Hypatia

Ars Centurion
202
Subscriptor
Remember when the claim was that this approach to technological development would give birth to a digital god that conjures infinite wealth and creates utopia…

…and now a simple demand for transparency will somehow undo the whole enterprise?

They clearly think we are the dumbest carbon-based life forms in existence.
 
Upvote
124 (124 / 0)

OtherSystemGuy

Ars Scholae Palatinae
1,284
Subscriptor++
I'm sorry, I must have missed something, or there's a part of the California law that wasn't mentioned. Musk kept bringing up data cleaning, but TFA didn't mention that as part of the law. So this must be a smoke screen to inflate the supposed devastation to xAI. Data cleaning could be IP if the company came up with unique ways to clean out things like CSAM. The fact that they mention cleaning multiple times tells me they don't actually have a plan.
 
Upvote
33 (33 / 0)
If Aaron Swartz was potentially liable (he took his own life due to the savagery of the prosecution) for collecting and making available the content of PACER, then I'd argue corporations should be held to an equal or higher standard and be transparent about their data sources.

At least Aaron wasn't looking for monetary profit from what he did. I don't see Elon Musk or a number of the other "AI bros" working "for the good of humanity".
 
Upvote
91 (91 / 0)

jdale

Ars Legatus Legionis
18,261
Subscriptor
Further, xAI insisted, these disclosures “cannot possibly be helpful to consumers” while supposedly posing a real risk of gutting the entire AI industry.

The only way they could gut the entire AI industry is if revealing the training data exposed vast copyright infringement or abuse of personal information. And if that happens? Good, let's get started.
 
Upvote
71 (71 / 0)

AusPeter

Ars Praefectus
5,086
Subscriptor
I've wondered this about the various EU lawsuits as well. I don't see how CA or the EU can prevent people from using Grok, and having the actual data centers somewhere else is almost certainly cheaper. Can they prevent people from downloading xAI apps or something?
It’s like Pornhub being blocked in 23 states because of age ID laws. Yes, if you use a VPN you can bypass the blocks. But it adds another level of hoops to jump through.

Ultimately however, you’re back at the old adage of the internet treating censorship as damage and routing around it.
 
Upvote
29 (29 / 0)

Uragan

Ars Legatus Legionis
11,176
I've wondered this about the various EU lawsuits as well. I don't see how CA or the EU can prevent people from using Grok, and having the actual data centers somewhere else is almost certainly cheaper. Can they prevent people from downloading xAI apps or something?
Yes. They could theoretically geofence people based on location. But then people could just use a VPN if they really wanted to use Grok.
 
Upvote
20 (20 / 0)

markgo

Ars Praefectus
3,776
Subscriptor++
The only way they could gut the entire AI industry is if revealing the training data exposed vast copyright infringement or abuse of personal information. And if that happens? Good, let's get started.
Vast copyright infringement is the entire basis for modern LLMs. Plus actually stealing the copyrighted source material rather than paying for it.
 
Upvote
68 (68 / 0)
The text of the bill is quite short, but the important portion includes:
(a) A high-level summary of the datasets used in the development of the generative artificial intelligence system or service, including, but not limited to:
  (1) The sources or owners of the datasets.
  (2) A description of how the datasets further the intended purpose of the artificial intelligence system or service.
  (3) The number of data points included in the datasets, which may be in general ranges, and with estimated figures for dynamic datasets.
  (4) A description of the types of data points within the datasets. For purposes of this paragraph, the following definitions apply:
    (A) As applied to datasets that include labels, “types of data points” means the types of labels used.
    (B) As applied to datasets without labeling, “types of data points” refers to the general characteristics.
  (5) Whether the datasets include any data protected by copyright, trademark, or patent, or whether the datasets are entirely in the public domain.
  (6) Whether the datasets were purchased or licensed by the developer.
  (7) Whether the datasets include personal information, as defined in subdivision (v) of Section 1798.140.
  (8) Whether the datasets include aggregate consumer information, as defined in subdivision (b) of Section 1798.140.
  (9) Whether there was any cleaning, processing, or other modification to the datasets by the developer, including the intended purpose of those efforts in relation to the artificial intelligence system or service.
  (10) The time period during which the data in the datasets were collected, including a notice if the data collection is ongoing.
  (11) The dates the datasets were first used during the development of the artificial intelligence system or service.
  (12) Whether the generative artificial intelligence system or service used or continuously uses synthetic data generation in its development. A developer may include a description of the functional need or desired purpose of the synthetic data in relation to the intended purpose of the system or service.
As to the cleaning or pre-processing of data prior to training, they could simply say, "Yes, Data Set Z was cleaned to remove objectionable material." Of course, if they cleaned it to remove unfavorable references to Musk.... or didn't clean the data set at all to remove objectionable material (while other AI model developers did...).

The requirements all seem reasonable to me. There are exceptions to the disclosure requirement for models related to security/integrity, operation of aircraft, and models exclusively provided to a federal entity for defense/national security/military.
 
Upvote
35 (35 / 0)

Granadico

Ars Scholae Palatinae
1,161
Specifically, xAI argued that its dataset sources, dataset sizes, and cleaning methods were all trade secrets.

“If competitors could see the sources of all of xAI’s datasets or even the size of its datasets, competitors could evaluate both what data xAI has and how much they lack,” xAI argued. In one hypothetical, xAI speculated that “if OpenAI (another leading AI company) were to discover that xAI was using an important dataset to train its models that OpenAI was not, OpenAI would almost certainly acquire that dataset to train its own model, and vice versa.”
Is the secret sauce of AI the data sources or the actual programming? I figure it'd be useful to have sources that aren't well known to give a model an edge, but ultimately the training should matter more than the dataset, all things being equal. I get that this is the Wild West starting point and tech companies have no morals if it makes them money, but eventually I'd think there'd be some kind of standard or regulation of what training data can be used.
 
Upvote
22 (22 / 0)