This argument actually gets far worse. I had been taught how to do 32-bit math with 16-bit numbers in college out of a textbook. I turned around and did the exact same thing from memory the next year at a contract job I had. Did I violate the textbook’s copyright? I’m fairly sure that function was practically identical in the code I wrote.
Curious for any of our lawyers:
Suppose we accept the argument that because Claude is exposed to this LGPL code, it can't rewrite this code without LGPL.
Why wouldn't that argument apply to literally any code produced by Claude in any project? The code it writes is influenced by all of the code it has consumed.
Neither are corporations but apparently that hasn't stopped them from gaining all sorts of rights.
Society exists for people.
Machines aren't people.
Monkey selfie.
And therefore not copyrightable.
We know this. It's settled law. The clean room arguments are just nice to have as a backup argument.
Let's not start pretending that AI products are copyrightable, or we're going to end up with bad results in the future.
It's impossible to know for sure, but the fact that most of the major providers are offering copyright indemnification as table stakes in their contracts indicates that their paying customers sure don't trust that they'll be able to.
Can Claude or any other LLM reliably distinguish between code licensed under different open source (or, for that matter, proprietary) licenses?
Plenty of companies are juggling with this at the moment, and the conclusion is that you'd better make sure there's some human "transformative" input into the process.
If the AI can't hold copyrights, wouldn't that put all AI output in the public domain?
I seem to remember there being lots of court cases over Java APIs recreated for Android. Presumably a lot of the legal questions here have been decided by the courts already.
Does this case mean that any closed source API can be recreated by AI as open source?
If there's no human-written contribution to the code, you can't copyright it. If there are a lot of human-written contributions to the code, you can copyright it. If there is a small amount, you can look forward to any conflict about that code going to the courts because there is no well-defined cutoff.
How do you define created by a human?
Take for example any IDE that uses Rapid Application Development. The IDE in the background inserts a lot of code into your project automatically to handle things like the parts of the GUI. This, under the current law, is copyrightable when part of a bigger program.
At what point does the code venture into something you can't copyright? If it's anything written by an LLM, does that also include the same exact code that a RAD tool would put in there?
Neither are corporations but apparently that hasn't stopped them from gaining all sorts of rights.
"Big commercial software" -- this is how you know that outcome will NEVER happen.
I think the funny implication will be codebases for big commercial software products ironically losing their copyright into public domain from using LLM produced code. It's quite easy to prove it from version control history, LLM chat history, files, and, you know, openly bragging about replacing human devs with AI.
this makes me wonder what would happen if a human coder studied a particular open source project while they were in school. are they forever tainted by that knowledge and unable to do a clean room rewrite of that application? what if they were very un-studious and their grades sucked? what if they failed the class?
Exactly. The whole premise behind the original "clean room" copy was with Phoenix cloning the IBM PC BIOS. One team examined the code and wrote a detailed specification. Then a different team that had (supposedly) never seen the original BIOS used the specification alone to write a compatible, but legally "non-derivative" alternative. The key part of this was not just that the engineers writing it were not directly copying IBM, it was that they explicitly had no knowledge of the IBM code.
If the AI trains on the open source code, it absolutely should not be called a clean room implementation!
That isn't to say that there is no other way to make a compatible re-implementation of something. The clean room technique was important because given the limited nature of the original PC, there are only so many ways to do things. It was likely that many bits of code would end up substantially identical to the IBM version. So the clean room technique was key to demonstrating that any similarities were due to the constraints of the API rather than from engineers just copying the code.
i've been hearing this type of argument a lot, and i find it completely ridiculous.
Well, to test this, I asked Claude about it. And it first stated no, and then when I pointed out it had contradicted itself in its answers, stated yes.
So in other words, it's likely a sycophantic LLM that, on questions where it doesn't really have any training data (such as new areas - like using an LLM to reimplement copyleft code), tells the user what the user wants to hear. So most likely, when Kyle asked it, it looked at the design document and went "okey dokey, looks good to me, you're all in the clear with that plan". Then, when I fed Kyle's same design document into it and pointed out some problems, Claude went "oh yeah, I was fooled by the plan saying to do a clean room implementation and didn't pay attention to the bits that violate my understanding of what clean room implementation means". (I am paraphrasing here - Claude's output is multiple paragraphs long, because LLMs apparently like the sound of their own tokens).
Claude Code definitely can't reconcile contradictory instructions like this. LLMs cannot think. They provide a good illusion of thinking, but it's only an illusion. After a bit more probing, Claude gave a response which ended with this:
So yeah, Claude agrees it doesn't reason about licensing implications.
Dude did a complete rewrite with Claude only. Does that strike you as the sort of person who thinks deeply and then expends effort?
If licensing is the issue, why not have the Claude code in a new repository with a new name? Instead of going from 6.0 to 7.0, go from 6.0 to 1.0 of something new.
Then you can argue about a) whether a new project inherits the license of an OSS library it is functionally replacing and b) whether an AI project is clean room enough to avoid the issue or not.
Technically yes, but a lot of it is going to end up as derivative works with the copyright owned by whoever was asking the AI for its output.
If the AI can't hold copyrights, wouldn't that put all AI output in the public domain?
No. It can't "distinguish" anything. It doesn't know anything. It's a sparkling autocomplete.
Can Claude or any other LLM reliably distinguish
Of note, the instructions reference specific source code files in the LGPL Chardet, and include an instruction to download data from the LGPL Chardet GitHub repository.
**Era assignments match chardet 6.0.0** (`chardet/metadata/charsets.py`):
if you don't have time to improve it, you probably don't have time to judiciously review and test what some brainless LLM generated for you. and you would never release code you haven't reviewed and tested, right?
There is a bright side to this, too: maintainers can modernize and improve open source code that they wouldn't have had the time to do before.
I ported the littlefs C library to Rust in a few days, allowing me and anyone else to use it without the trouble of maintaining a C toolchain. I did not change the license.
If he made a new project under a different name, it wouldn't be a drop-in replacement for the old library's dependencies--as the new project absorbs the maintenance effort from the original and the original becomes deprecated, all those dependencies would eventually probably need to refactor to use the new project which, wrong or right, creates a lot of busywork for those dependency maintainers that could be avoided if the creators could sort out the licensing issues in some other way. There are some weird quirks of open source development, and this is definitely one.
The bigger issue for me is, why didn't he create a new project? AI aside, rewriting code AND making it more permissive doesn't seem right if you are not the original developer or the community.
This is incorrect, under current law as the courts have been interpreting it, if you don't meet a certain threshold of human involvement the copyright doesn't default to anyone, the material is just uncopyrightable. Anything AI generated is effectively public domain.
AI only changes that in that the AI can't make any claims, so only you end up with the copyright.
you can as long as you never looked at the original code and were given only the signature of the function. The moment you have the code in the same room/building and you happen to create similar enough implementation good luck defending it in court.
This seems totally unstoppable. I can write strcmp if I want, and it's my code. Doesn't matter in the slightest that other impls of strcmp exist. Same thing with any API like Kafka, etc. Redpanda exists because they did an alternate impl of Kafka (that's better). There's no way for anyone to know or prove whether I used an LLM or not, or how much I used an LLM, etc etc. It's moot anyway, since I can do it by hand. The stopper before was that it would take too long. But clearly that's not true now.
I think the idea of a software license just became a dead letter. Or, you know, we could just delete the models.
On the one hand, yes, but on the other hand, should we legally protect in any way the creations of these LLM models? Copyright ideally protects the interests of human creators, and making AI works ineligible for copyright protections, as the Supreme Court has already confirmed, arguably supports that goal.
Is this situation another argument for the fact that copyright law is no longer up to the task it was set up to do?
with enough reviewing and suggestions you can turn hello world into linux kernel.
100%. He said explicitly "I did not write the code by hand, but I was deeply involved in designing, reviewing, and iterating on every aspect of it." But writing is the step that merits copyright. Just like an editor does not take copyright of a written text from the author, reviewing does not make him the author. Iterating here just means running the LLM multiple times, that does not make him an author.
Either it's derivative, or it's public domain. There's no middle ground where he gets to pick a different license. To achieve that, he would need to do some writing.
i've been hearing this type of argument a lot, and i find it completely ridiculous.
ai is a tool, not a consciousness. it makes no sense to ask claude why it contradicted itself. or why it "caused harm" based on its answers in some thread. or, my personal favorite (not in OP) why it lied to you. might as well lecture a hammer on its moral culpability for hitting your thumbnail.
...
if a user's prompting results in claude output contradicting previous results, or "hallucinating its response" (a really silly way to spell incorrect answers imho) that is either a direct result of the user's lack of skill with this tool or a limitation of the tool itself. and i would argue that understanding the strengths and limitations of a tool is an important aspect of being a skilled user.
Nope. You're trying to force this situation into the prior framework. This is not about one case (where you are surely correct). This is about the total impossibility of litigating the hundreds of thousands of offenses that will just keep popping up, unstoppably.
you can as long as you never looked at the original code and were given only the signature of the function. The moment you have the code in the same room/building and you happen to create similar enough implementation good luck defending it in court.
Even here you give it too much credit - "it is a tool of language (and thus code). the details of its architecture and its training have resulted in it encoding relationships among concepts (yeah, i know, the term of art is tokens) directly into its structure". It is a stochastic word predictor where the probabilities for prediction have been encoded into a database formed by illegally and legally encoding data. An LLM doesn't know what a concept is, let alone relationships between them, but because it has so much data in it, and it has a convincing and sycophantic linguistic interface, it is capable of performing word prediction on a scale that it is able to mimic the form of what someone actually knowledgeable would say. Think of it as a very expensive way that AI companies can pirate lots of copyrighted material and mix it with some public domain material while distracting everyone with an interface that makes people feel more self-important and believe there is more thought, consideration and capability than there really is, thus most people won't see or understand the theft that has happened.
i've been hearing this type of argument a lot, and i find it completely ridiculous.
ai is a tool, not a consciousness. it makes no sense to ask claude why it contradicted itself. or why it "caused harm" based on its answers in some thread. or, my personal favorite (not in OP) why it lied to you. might as well lecture a hammer on its moral culpability for hitting your thumbnail.
today's ai has no "intentional stance". it completely lacks a sense of self, and it's incapable of performing moral reasoning. not in the sense of being unable to work in that logical framework but in the sense that it is not a moral actor. it cannot apply value judgments to its actions in any morally meaningful sense. and none of those things make it useless, stupid, or a fraud.
it is a tool of language (and thus code). the details of its architecture and its training have resulted in it encoding relationships among concepts (yeah, i know, the term of art is tokens) directly into its structure. other "specialty" networks are then "bolted" on to make the resulting system more functional.
based on its structure, the limitations of this tool are determined by how well its "constructed relationships among concepts" overlap with a human user's own idea mappings in terms of logical consistency and novelty. and so, the extent to which a human user can unambiguously represent desired tasks/outcomes in a manner that maps effectively to an ai's concept mappings, determines how skilled the user is with this tool.
if a user's prompting results in claude output contradicting previous results, or "hallucinating its response" (a really silly way to spell incorrect answers imho) that is either a direct result of the user's lack of skill with this tool or a limitation of the tool itself. and i would argue that understanding the strengths and limitations of a tool is an important aspect of being a skilled user.
Terms of service can't replace copyright. They bind the person who used the service to create the content, but if I -- as someone who has never used the service -- copy that output and do something against the TOS, they have no recourse against me. I'm not bound by the TOS, there's no copyright, from a legal standpoint I can republish it and reuse it any way I want.
This is incorrect, under current law as the courts have been interpreting it, if you don't meet a certain threshold of human involvement the copyright doesn't default to anyone, the material is just uncopyrightable. Anything AI generated is effectively public domain.
A caveat here actually points in the opposite direction: while AI works aren't copyrightable, a lot of AI services contain TOS language meant to restrict the use of AI generated works without permission, couched in contract law rather than copyright... so the AI (or at least the company that owns it) is making claims in a very roundabout way... though as far as I know this legal theory hasn't been tested very rigorously yet so who knows how the courts will treat it.
Only if the suggestions are implemented in code. Either you write it (in which case, good job) or AI writes it (in which case, public domain).
with enough reviewing and suggestions you can turn hello world into linux kernel.
I think this project is the perfect test case. The MIT license he is trying to apply to largely AI generated code doesn't apply, this is public domain code. The only question would be if the code he actually wrote should also be public domain given that he was openly contributing it to a public domain project, or if he can claim MIT license for those portions.
I think the funny implication will be codebases for big commercial software products ironically losing their copyright into public domain from using LLM produced code. It's quite easy to prove it from version control history, LLM chat history, files, and, you know, openly bragging about replacing human devs with AI.
The clean room method of replicating functionality is commonly used as it is relatively easy to set up and prove to a court's satisfaction. However, it is not the only way to show that an implementation is independent. Another way that courts have accepted is structural differences between implementations. This approach is a bit more problematic as it requires a certain level of technical competence of the judge, so clean room implementations are the easier route.
Is "clean room" more than a sort of legal insurance to easily prove no infringement in case of litigation?
My understanding (please correct me) is that not using a "clean room" setup would not automatically imply that the new work is a derived work as defined by copyright law.