AI can rewrite open source code—but can it rewrite the license, too?

Pdlove

Smack-Fu Master, in training
2
Curious for any of our lawyers:
Suppose we accept the argument that because Claude is exposed to this LGPL code, it can't rewrite this code without LGPL.
Why wouldn't that argument apply to literally any code produced by Claude in any project? The code it writes is influenced by all of the code it has consumed.
This argument actually gets far worse. I had been taught how to do 32 bit math with 16 bit numbers in college out of a textbook. I turned around and did the exact same thing from memory the next year at a contract job I had. Did I violate the textbook’s copyright? I’m fairly sure that function was practically identical in the code I wrote.
Similar with an LLM. It has no idea which weights were learned from the chardet project vs some blog site vs the python documentation. So if the LLM was trained on GitHub data, and I’m confident it was, it is literally impossible for it to ignore the weights learned from chardet. Weights that happen to do exactly what was requested and in the language requested. I have a similar issue having AI help with CSS. Without strict wording in the prompt, the results look a lot like bootstrap.
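For readers who have not met the textbook technique mentioned above, here is a minimal sketch of 32-bit addition built from 16-bit words. It is only an illustration of why such routines tend to look alike no matter who writes them; the names `add32`, `split32`, and `join32` are made up for this example and are not taken from any textbook or from chardet.

```python
MASK16 = 0xFFFF  # largest value that fits in one 16-bit word


def split32(x):
    """Split a 32-bit value into (high, low) 16-bit words."""
    return (x >> 16) & MASK16, x & MASK16


def join32(hi, lo):
    """Recombine (high, low) 16-bit words into one 32-bit value."""
    return (hi << 16) | lo


def add32(a_hi, a_lo, b_hi, b_lo):
    """Add two 32-bit values given as 16-bit word pairs, wrapping on overflow."""
    lo = a_lo + b_lo
    carry = lo >> 16  # 1 if the low words overflowed 16 bits, else 0
    hi = (a_hi + b_hi + carry) & MASK16  # discard overflow past 32 bits
    return hi, lo & MASK16
```

There is essentially one sensible way to do this (add the low words, propagate the carry, add the high words), which is the point the post is making about near-identical code.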
 
-16 (4 / -20)

norton_I

Ars Praefectus
5,929
Subscriptor++
Monkey selfie.

And therefore not copyrightable.

We know this. It's settled law. The clean room arguments are just nice to have as a backup argument.

Let's not start pretending that AI products are copyrightable, or we're going to end up with bad results in the future.

That's true but the clean room implementation is about something different.

A thing can be non-copyrightable and still infringe copyright. For instance, if the monkey selfie was a movie and someone had a radio in the background that was playing The Beatles. The selfie would still not be copyrightable (no significant human creative input) but distributing it would still infringe the Beatles' copyright.

Another example is text to speech. If I run an audio book through a text to speech translator there is no new copyright. But if I record a human reading a book that creates a new copyrighted work that is also derivative of the original.

It's the same way here: as far as I understand the law and what happened here, the "new" project is not entitled to copyright in its own right. However, that doesn't mean it isn't derivative of the previous version. The "clean room" claim is that it is sufficiently separated from the original to not be a copy. I don't know nearly enough to tell if this new version would successfully avoid the previous copyright, but I definitely don't think it is a clean room implementation in the sense that is normally used.
 
10 (11 / -1)

gkorper

Wise, Aged Ars Veteran
195
Subscriptor++
I am surprised there has not been any comment about at least some of the files that were referenced appearing to be facts (a list of the names or character sets for example). Facts, including in most cases compilations of them, are not copyrightable and therefore were likely never under LGPL. To me that adds an interesting twist to the story.
 
8 (8 / 0)

avilhelmo

Ars Scholae Palatinae
912
It's pretty obvious to anyone who has used Claude Code that it will readily reproduce code that was written under a restrictive license (GPL, etc.) verbatim. I recently tasked it to implement some part of an RFC in a given language and had the canonical example of it from an open source project in another screen. It had the same structure and function names, and the functions had basically the same logic. Little tweaks here and there, but obviously derivative code.
 
11 (11 / 0)

adamsc

Ars Praefectus
4,303
Subscriptor++
I’ve been thinking that open source as we knew it, especially as practiced by companies, is dead when someone can clone your code so quickly that it could be done on every release. Commercial open source was already struggling with questions over the business model in a cloud environment (Redis, etc.) and this effectively makes those “business” licenses irrelevant.

That leaves us with the code people are comfortable releasing effectively in the public domain or staying proprietary. There’s a chance that some projects could follow SQLite’s lead in having the code free but the test suite licensed to make forks harder, but even that seems like a gamble. If I was a business right now, I’d have to think long and hard about releasing anything which isn’t a client for a proprietary service—especially if it was a software development tool which a large company might benefit from Sherlock-ing—and if you’re a developer there’s a very literal “training your replacement” mood.

This seems likely to lead to an enormous hollowing out of the commons all LLMs were built on. The mindset of the people who cut down the last trees in Iceland or ate the last dodo no longer seems so foreign.
 
2 (4 / -2)
Can Claude or any other LLM reliably distinguish between code licensed under different open source (or for that matter, proprietary) licenses?
It's impossible to know for sure, but the fact that most of the major providers are offering copyright indemnification as table stakes in their contracts indicates that their paying customers sure don't trust that they'll be able to.
 
10 (10 / 0)

Marlor_AU

Ars Tribunus Angusticlavius
7,785
Subscriptor
If the AI can't hold copyrights, wouldn't that put all AI output in the public domain?
Plenty of companies are juggling with this at the moment, and the conclusion is that you'd better make sure there's some human "transformative" input into the process.

Usually, with a coding agent, there's a lot of back-and-forth between the user and the tool to create something that works. It's not just a case of ingesting specs and outputting code - it's iterative. There's likely a solid legal case that this back-and-forth iteration between the agent and user is transformative. If the user is selecting bits and pieces, reverting others, asking for changes and generally driving the process, then (from a legal standpoint) it can be considered a tool-assisted human work, rather than a pure AI creation.
 
1 (5 / -4)

adamsc

Ars Praefectus
4,303
Subscriptor++
I seem to remember there being lots of court cases over Java APIs recreated for Android. Presumably a lot of the legal questions here have been decided by the courts already.

Does this case mean that any closed source API can be recreated by AI as open source?

Oracle v. Google was the big case there, but interestingly enough the Supreme Court did not decide the question of whether an API itself can be copyrighted since they held that Google’s usage was covered by fair use in any case. That’s interesting in this case both for the question of whether the current court would rule differently for an exact reimplementation (Breyer noted that Google used less than a percent of the Java source) and for something which directly competes against the original (Android had almost no impact on existing Java usage).

I would not be shocked to see one of the companies with a big open source base (IBM, say, with stuff like Terraform, or one of the business source licenses which tries to restrict competitive reuse) end up making a second pass on that and getting a decision on the copyright question. I remember during the Oracle case a number of people were worried because they sounded receptive to Oracle’s arguments in that regard.

A really interesting scenario: what if someone blackbox-cloned improvements to something like WINE so it became even closer to a production-grade Windows replacement? Would Microsoft reverse their amicus position in Oracle v. Google if that started cutting into their corporate customer base?
 
3 (3 / 0)

jdale

Ars Legatus Legionis
18,438
Subscriptor
How do you define created by a human?

Take for example any IDE that uses Rapid Application Development. The IDE in the background inserts a lot of code into your project automatically to handle things like the parts of the GUI. This, under the current law, is copyrightable when part of a bigger program.

At what point does the code venture into something you can't copyright? If it's anything written by an LLM, does that also include the same exact code that a RAD would put in there?
If there's no human-written contribution to the code, you can't copyright it. If there is a lot of human-written contributions to the code, you can copyright it. If there is a small amount, you can look forward to any conflict about that code going to the courts because there is no well-defined cutoff.

The copyright office says (emphasis added):

• Copyright does not extend to purely AI-generated material, or material where there is insufficient human control over the expressive elements.
• Whether human contributions to AI-generated outputs are sufficient to constitute authorship must be analyzed on a case-by-case basis.
• Based on the functioning of current generally available technology, prompts do not alone provide sufficient control.
• Human authors are entitled to copyright in their works of authorship that are perceptible in AI-generated outputs, as well as the creative selection, coordination, or arrangement of material in the outputs, or creative modifications of the outputs.


And also:

A number of commenters also made the point that if a user edits, adapts, enhances, or modifies AI-generated output in a way that contributes new authorship, the output would be entitled to protection. They argued that these modifications “should be assessed in the same way as . . . editorial or other changes to a pre-existing work.” Although such works would not technically qualify as “derivative works,” derivative authorship provides a helpful analogy in identifying originality. Again, the copyright would extend to the material the human author contributed but would not extend to the underlying AI-generated content itself.


It's hard to imagine that an arrangement of code would be copyrightable in that sense, but even if it was, the actual AI-generated code would not be, only the arrangement.

https://www.copyright.gov/ai/
 
18 (18 / 0)

SeanJW

Ars Legatus Legionis
11,996
Subscriptor++
Neither are corporations but apparently that hasn't stopped them from gaining all sorts of rights.

Corporations are collectives of people (the shareholders), so they inherit some rights of those people - if the shareholders can do it individually, the company can too effectively (if this wasn't recognised, it would be ridiculous - an individual would have to say... purchase property... and make it available to the other shareholders under a contract)
 
2 (3 / -1)
I think the funny implication will be codebases for big commercial software products ironically losing their copyright into public domain from using LLM produced code. It's quite easy to prove it from version control history, LLM chat history, files, and, you know, openly bragging about replacing human devs with AI.
"Big commercial software" -- this is how you know that outcome will NEVER happen.
 
1 (1 / 0)

overtoad

Seniorius Lurkius
42
Exactly. The whole premise behind the original "clean room" copy was with Phoenix cloning the IBM PC BIOS. One team examined the code and wrote a detailed specification. Then a different team that had (supposedly) never seen the original BIOS used the specification alone to write a compatible, but legally "non-derivative" alternative. The key part of this was not just that the engineers writing it were not directly copying IBM, it was that they explicitly had no knowledge of the IBM code.

If the AI trains on the open source code, it absolutely should not be called a clean room implementation!

That isn't to say that there is no other way to make a compatible re-implementation of something. The clean room technique was important because given the limited nature of the original PC, there are only so many ways to do things. It was likely that many bits of code would end up substantially identical to the IBM version. So the clean room technique was key to demonstrating that any similarities were due to the constraints of the API rather than from engineers just copying the code.
this makes me wonder what would happen if a human coder studied a particular open source project while they were in school. are they forever tainted by that knowledge and unable to do a clean room rewrite of that application? what if they were very un-studious and their grades sucked? what if they failed the class?
this might sound snarky, but i think there's a meaningful point here.
i think it's a reasonable assumption that the level of expertise demonstrated by AI today (in coding) can be placed somewhere in the range between: "dropping out 1st semester freshman year having never shown up for class" all the way to "completely mastering all topics presented during your course of study including studying that *gpl-X licensed project".
i'm not claiming to know exactly where ai is on that scale, as of today or even in general, but i'm also not sure we should be disqualifying coders from rewrite work based on academic diligence and curriculum specifics. evaluating claude code's fundamental capacity to perform a clean-room rewrite based on how well it may have learned a specific set of code that might have been included in its training, seems equally problematic.
i also think it makes as much sense to say ai is only capable of derivative work as it does to say the same of human coders.

-edit- damn. totally ninja'd by dehildum
 
-10 (2 / -12)

overtoad

Seniorius Lurkius
42
Well, to test this, I asked Claude about it. And it first stated no, and then when I pointed out it had contradicted itself in its answers, stated yes.

So in other words, it's likely a sycophantic LLM that, on questions where it doesn't really have any training data (such as new areas - like using an LLM to reimplement copyleft code), tells the user what the user wants to hear. So most likely, when Kyle asked it, it looked at the design document and went "okey dokey, looks good to me, you're all in the clear with that plan". Then, when I fed Kyle's same design document into it and pointed out some problems, Claude went "oh yeah, I was fooled by the plan saying to do a clean room implementation and didn't pay attention to the bits that violate my understanding of what clean room implementation means". (I am paraphrasing here - Claude's output is multiple paragraphs long, because LLMs apparently like the sound of their own tokens).

Claude Code definitely can't reconcile contradictory instructions like this. LLMs cannot think. They provide a good illusion of thinking, but it's only an illusion. After a bit more probing, Claude gave a response which ended with this:



So yeah, Claude agrees it doesn't reason about licensing implications.
i've been hearing this type of argument a lot, and i find it completely ridiculous.

ai is a tool, not a consciousness. it makes no sense to ask claude why it contradicted itself. or why it "caused harm" based on its answers in some thread. or, my personal favorite (not in OP) why it lied to you. might as well lecture a hammer on its moral culpability for hitting your thumbnail.

today's ai has no "intentional stance". it completely lacks a sense of self, and it's incapable of performing moral reasoning. not in the sense of being unable to work in that logical framework but in the sense that it is not a moral actor. it cannot apply value judgments to its actions in any morally meaningful sense. and none of those things make it useless, stupid, or a fraud.

it is a tool of language (and thus code). the details of its architecture and its training have resulted in it encoding relationships among concepts (yeah, i know, the term of art is tokens) directly into its structure. other "specialty" networks are then "bolted" on to make the resulting system more functional.

based on its structure, the limitations of this tool are determined by how well its "constructed relationships among concepts" overlap with a human user's own idea mappings in terms of logical consistency and novelty. and so, the extent to which a human user can unambiguously represent desired tasks/outcomes in a manner that maps effectively to an ai's concept mappings, determines how skilled the user is with this tool.

if a user's prompting results in claude output contradicting previous results, or "hallucinating its response" (a really silly way to spell incorrect answers imho) that is either a direct result of the user's lack of skill with this tool or a limitation of the tool itself. and i would argue that understanding the strengths and limitations of a tool is an important aspect of being a skilled user.
 
17 (17 / 0)

arsisloam

Ars Scholae Palatinae
1,428
Subscriptor
If licensing is the issue why not have the Claude code in a new repository with a new name? Instead of going from 6.0 to 7.0 go from 6.0 to 1.0 of something new.

Then you can argue about whether a new project a) inherits the license of an OSS library it is functionally replacing and b) if an AI project is clean room enough to avoid the issue or not.
Dude did a complete rewrite with Claude only. Does that strike you as the sort of person who thinks deeply and then expends effort?
 
-3 (5 / -8)

xWidget

Ars Tribunus Militum
2,872
If the AI can't hold copyrights, wouldn't that put all AI output in the public domain?
Technically yes, but a lot of it is going to end up as derivative works with the copyright owned by whoever was asking the AI for its output.

Normally if you take someone else's copyrighted work (whatever it is) and organize or edit it, both of you end up with a claim to its copyright and it requires the coordination or licensing between all the parties to actually distribute.

AI only changes that in that the AI can't make any claims, so only you end up with the copyright.

If it was possible to see what the original AI content was, that part could be copied by anyone without repercussions. That's usually going to be impossible though.
 
-5 (1 / -6)
Of note, the instructions reference specific source code files in the LGPL Chardet, and include an instruction to download data from the LGPL Chardet Github repository.

Which references to source code files are you meaning?

When I look, I'm only seeing line 270:

**Era assignments match chardet 6.0.0** (`chardet/metadata/charsets.py`):

That one seems more like "ensure the new program's output is the same as the old program's output", which kinda seems like doing a comparison measurement. From a logical point of view, maybe that's ok? No idea from a legal point of view though. ;)
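To make "comparison measurement" concrete, here is a minimal sketch of an output-comparison harness: it runs two implementations over the same inputs and reports only where their results differ, without either implementation reading the other's source. Everything in it is hypothetical. `old_detect` and `new_detect` are toy stand-ins written for this example, not chardet's API or anything from the actual rewrite.

```python
def compare_outputs(old_fn, new_fn, samples):
    """Return (sample, old_result, new_result) for every input where the two disagree."""
    mismatches = []
    for sample in samples:
        old_result = old_fn(sample)
        new_result = new_fn(sample)
        if old_result != new_result:
            mismatches.append((sample, old_result, new_result))
    return mismatches


# Hypothetical stand-in for the "old" implementation.
def old_detect(data: bytes) -> str:
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"


# Hypothetical stand-in for the "new" implementation, written differently.
def new_detect(data: bytes) -> str:
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        return "utf-8"


samples = [b"hello", b"h\xc3\xa9llo", b""]
mismatches = compare_outputs(old_detect, new_detect, samples)
```

An empty `mismatches` list only says the two behave the same on those samples. Whether using the old program's behavior this way matters legally is the open question the post raises.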
 
4 (4 / 0)

cleek

Ars Scholae Palatinae
1,260
There is a bright side to this, too: maintainers can modernize and improve open source code that they wouldn't have had the time to do before.

I ported the littlefs C library to Rust in a few days, allowing me and anyone else to use it without the trouble of maintaining a C toolchain. I did not change the license.
if you don't have time to improve it, you probably don't have time to judiciously review and test what some brainless LLM generated for you. and you would never release code you haven't reviewed and tested, right?
 
11 (12 / -1)

sitmonkey

Ars Centurion
218
Subscriptor
If he never told us about using AI to code it, would we be any the wiser?
Would we be having this controversy if he had hidden this?

I go into this thinking that AI should not be legally allowed to do this and claim independence from the existing protections but a person putting out the same code would.
How much does method versus state change matter? Practically, none.
But licenses aren't just about practical impacts but also legal impacts, and that's where the legal fictions/lies come into play. How often will we see people and companies straight up lie about how they write new code and generate new copies of existing IP?
 
-1 (2 / -3)

GenericAnimeBoy

Ars Tribunus Militum
1,852
Subscriptor++
The bigger issue for me is, why didn't he create a new project? AI aside, rewriting code AND making it more permissive doesn't seem right if you are not the original developer or the community.
If he made a new project under a different name, it wouldn't be a drop-in replacement for the old library's dependencies--as the new project absorbs the maintenance effort from the original and the original becomes deprecated, all those dependencies would eventually probably need to refactor to use the new project which, wrong or right, creates a lot of busywork for those dependency maintainers that could be avoided if the creators could sort out the licensing issues in some other way. There are some weird quirks of open source development, and this is definitely one.
 
2 (3 / -1)
AI only changes that in that the AI can't make any claims, so only you end up with the copyright.
This is incorrect. Under current law as the courts have been interpreting it, if you don't meet a certain threshold of human involvement, the copyright doesn't default to anyone; the material is just uncopyrightable. Anything AI generated is effectively public domain.

A caveat here actually points in the opposite direction: while AI works aren't copyrightable, a lot of AI services contain TOS language meant to restrict the use of AI generated works without permission, couched in contract law rather than copyright... so the AI (or at least the company that owns it) is making claims in a very roundabout way... though as far as I know this legal theory hasn't been tested very rigorously yet so who knows how the courts will treat it.
 
4 (4 / 0)

unconcerned

Ars Scholae Palatinae
1,071
This seems totally unstoppable. I can write strcmp if I want, and it's my code. Doesn't matter in the slightest that other impls of strcmp exist. Same thing with any API like Kafka, etc. Redpanda exists because they did an alternate impl of Kafka (that's better). There's no way for anyone to know or prove whether I used an LLM or not, or how much I used an LLM, etc etc. It's moot anyway, since I can do it by hand. The stopper before was that it would take too long. But clearly that's not true now.

I think the idea of a software license just became a dead letter. Or, you know, we could just delete the models.
you can as long as you never looked at the original code and were given only the signature of the function. The moment you have the code in the same room/building and you happen to create a similar enough implementation, good luck defending it in court
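As a rough illustration of how little room a function like strcmp leaves for variation, here is a minimal sketch written only from the well-known contract (return a negative, zero, or positive number). It is in Python over `bytes` rather than C, the name `strcmp_like` is made up for this example, and it ignores C's NUL-terminator handling.

```python
def strcmp_like(a: bytes, b: bytes) -> int:
    """Return <0, 0, or >0 depending on how a orders against b."""
    # Walk both inputs in lockstep and stop at the first differing byte.
    for x, y in zip(a, b):
        if x != y:
            return x - y
    # Equal so far: one input is a prefix of the other, so the shorter sorts first.
    return len(a) - len(b)
```

Two people working from the same signature and contract would likely land on something very close to this, which is why the question of what the author had seen beforehand carries so much weight in the argument above.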
 
0 (0 / 0)
Is this situation another argument for the fact that copyright law is no longer up to the task it was set up to do?
On the one hand, yes, but on the other hand, should we legally protect in any way the creations of these LLM models? Copyright ideally protects the interests of human creators, and making AI works ineligible for copyright protections, as the Supreme Court has already confirmed, arguably supports that goal. Even if this "rewrite" were to survive a "clean room" test, which I don't believe is the case, by either the spirit or the letter of the concept, the output isn't copyrightable, so any kind of licensing that relies on copyright would be effectively null.

Editing to acknowledge that the whole clean room reverse engineering thing is about patents, not copyright, and I was typing before thoroughly reading all the previous comments (thanks other posters for being smarties). I guess if the result is almost completely different in how it is written it probably would survive a copyright infringement challenge, but couldn't itself be copyrightable, so it could not be effectively licensed in any way.
 
1 (1 / 0)

unconcerned

Ars Scholae Palatinae
1,071
100%. He said explicitly "I did not write the code by hand, but I was deeply involved in designing, reviewing, and iterating on every aspect of it.” But writing is the step that merits copyright. Just like an editor does not take copyright of a written text from the author, reviewing does not make him the author. Iterating here just means running the LLM multiple times, that does not make him an author.

Either it's derivative, or it's public domain. There's no middle ground where he gets to pick a different license. To achieve that, he would need to do some writing.
with enough reviewing and suggestions you can turn hello world into linux kernel.
 
0 (1 / -1)
i've been hearing this type of argument a lot, and i find it completely ridiculous.

ai is a tool, not a consciousness. it makes no sense to ask claude why it contradicted itself. or why it "caused harm" based on its answers in some thread. or, my personal favorite (not in OP) why it lied to you. might as well lecture a hammer on its moral culpability for hitting your thumbnail.
...
if a user's prompting results in claude output contradicting previous results, or "hallucinating its response" (a really silly way to spell incorrect answers imho) that is either a direct result of the user's lack of skill with this tool or a limitation of the tool itself. and i would argue that understanding the strengths and limitations of a tool is an important aspect of being a skilled user.

You concluded more-or-less exactly what Anthropic co-founder Jack Clark stated in an interview with Ezra Klein a few weeks ago.

https://www.nytimes.com/2026/02/24/opinion/ezra-klein-podcast-jack-clark.html

Clark is absolutely clear that people who get the best results out of Claude -- using the example of Anthropic's own coding team -- are those who are highly skilled at their job but who also have worked with Claude long enough to know what its strengths and limitations are and how to communicate with it.

Many people who I've seen complain about LLM output act like it's the Enterprise's ship computer where you can speak some vague-yet-complex demand to the air and Majel Barrett's voice will give you the correct answer every time. Then they are dismissive or angry when it doesn't give them what they expect.

Maybe someday we'll get there, but that's not how these things work currently.
 
2 (4 / -2)
you can as long as you never looked at the original code and were given only the signature of the function. The moment you have the code in the same room/building and you happen to create a similar enough implementation, good luck defending it in court
Nope. You're trying to force this situation into the prior framework. This is not about one case (where you are surely correct). This is about the total impossibility of litigating the hundreds of thousands of offenses that will just keep popping up, unstoppably.

How effective has the MPAA been at stopping movie piracy? Sure, they can win their case. If they can find someone big enough or dumb enough to sue, they do. But it's so easy for people to pirate movies, they just keep on doing it anyway. This is that. It used to be hard to dupe a whole library or codebase. Now it's easy. Sure, individual cases can and will be litigated and won. That doesn't change a thing for the 9999/10000 cases that just aren't economically practical to go to court over.

Interestingly, the movie piracy situation has not slowed investment in movies at all. It has grown enormously since the 1980s. It just fueled consolidation. It isn't that movies aren't being made, but the money is concentrated in fewer hands. Assuming the parallel is legit, I certainly don't predict the demise of software. I predict the demise of small scale operators on the 1-20 person scale, as they will be vulnerable without any of the compensating defensive advantages of scale.

In such an environment, why release open source code at all? And once that becomes the norm, what will the LLMs have to feed on?
Interesting times.
 
0 (2 / -2)

nray

Smack-Fu Master, in training
70
i've been hearing this type of argument a lot, and i find it completely ridiculous.

ai is a tool, not a consciousness. it makes no sense to ask claude why it contradicted itself. or why it "caused harm" based on its answers in some thread. or, my personal favorite (not in OP) why it lied to you. might as well lecture a hammer on its moral culpability for hitting your thumbnail.

today's ai has no "intentional stance". it completely lacks a sense of self, and it's incapable of performing moral reasoning. not in the sense of being unable to work in that logical framework but in the sense that it is not a moral actor. it cannot apply value judgments to its actions in any morally meaningful sense. and none of those things make it useless, stupid, or a fraud.

it is a tool of language (and thus code). the details of its architecture and its training have resulted in it encoding relationships among concepts (yeah, i know, the term of art is tokens) directly into its structure. other "specialty" networks are then "bolted" on to make the resulting system more functional.

based on its structure, the limitations of this tool are determined by how well its "constructed relationships among concepts" overlap with a human user's own idea mappings in terms of logical consistency and novelty. and so, the extent to which a human user can unambiguously represent desired tasks/outcomes in a manner that maps effectively to an ai's concept mappings, determines how skilled the user is with this tool.

if a user's prompting results in claude output contradicting previous results, or "hallucinating its response" (a really silly way to spell incorrect answers imho) that is either a direct result of the user's lack of skill with this tool or a limitation of the tool itself. and i would argue that understanding the strengths and limitations of a tool is an important aspect of being a skilled user.
Even here you give it too much credit - "it is a tool of language (and thus code). the details of its architecture and its training have resulted in it encoding relationships among concepts (yeah, i know, the term of art is tokens) directly into its structure". It is a stochastic word predictor where the probabilities for prediction have been encoded into a database formed by illegally and legally encoding data. An LLM doesn't know what a concept is let alone relationships between them but, because it has so much data in it, and it has a convincing and sycophantic linguistic interface, it is capable of performing word prediction on a scale that it is able to mimic the form of what someone actually knowledgeable would say. Think of it as a very expensive way that AI companies can pirate lots of copyrighted material and mix it with some public domain material while distracting everyone with an interface that makes people feel more self-important and believe there is more thought, consideration and capability than there really is, thus most people won't see or understand the theft that has happened.

I often think of, "Pay no attention to that man behind the curtain!" from the 1939 Wizard of Oz. The giant mechatronic fire breathing head and booming amplified voice is like the surface of the LLM. Except there isn't even a human operating the insides of an LLM. An LLM is less than the original Mechanical Turk, it is not even a human pretending to be a mechanical chess playing machine. It is a gargantuan pile of data and associated probabilities so large and an interface so beguiling and distracting that most people can't see it for what it is, something simple and devoid of intelligence.
 
-2 (3 / -5)
The re-write would probably at least partially qualify for copyright. Parts not qualifying for copyright would be public domain, which would not make a huge difference since MIT is one of the most permissive licences.

Also, the re-write would not automatically be considered a derivative work, even if a "clean-room" setup was not strictly followed, and even if the model was trained on the original source code.

In both cases it's not settled law, but if, as the maintainer claims, the new code has a significantly different design and architecture, was written from an empty repo, and no piece of logic matches the previous repo, it's probably fine from a legal standpoint.

Is it ethical?
The guy is the project maintainer, so this sort of thing is his decision to make. His intentions seem fine, as he seemed to truly care about not infringing the rights of previous authors.
 
Upvote
-6 (2 / -8)

jdale

Ars Legatus Legionis
18,438
Subscriptor
This is incorrect. Under current law as the courts have been interpreting it, if you don't meet a certain threshold of human involvement, the copyright doesn't default to anyone; the material is just uncopyrightable. Anything AI generated is effectively public domain.

A caveat here actually points in the opposite direction: while AI works aren't copyrightable, a lot of AI services contain TOS language meant to restrict the use of AI generated works without permission, couched in contract law rather than copyright... so the AI (or at least the company that owns it) is making claims in a very roundabout way... though as far as I know this legal theory hasn't been tested very rigorously yet so who knows how the courts will treat it.
Terms of service can't replace copyright. They bind the person who used the service to create the content, but if I -- as someone who has never used the service -- copy that output and do something against the TOS, they have no recourse against me. I'm not bound by the TOS and there's no copyright, so from a legal standpoint I can republish it and reuse it any way I want.

This is really the opposite of copyright, which does not bind the creator but instead binds everyone else.
 
Upvote
8 (8 / 0)

dehildum

Ars Scholae Palatinae
1,036
I think the funny implication will be codebases for big commercial software products ironically losing their copyright into public domain from using LLM produced code. It's quite easy to prove it from version control history, LLM chat history, files, and, you know, openly bragging about replacing human devs with AI.
I think this project is the perfect test case. The MIT license he is trying to apply to largely AI-generated code doesn't apply; this is public domain code. The only question would be whether the code he actually wrote should also be public domain, given that he was openly contributing it to a public domain project, or whether he can claim the MIT license for those portions.

I would hope the answer to that is no: if you contribute to a public domain project, then your contributions are also public domain. If he forked it, then maybe he could license his fork...

Personally, I have contributed to public domain projects and expected that my contributions would also be public domain.
 
Upvote
0 (3 / -3)

dehildum

Ars Scholae Palatinae
1,036
Is "clean room" more than a sort of legal insurance to easily prove no infringement in case of litigation?
My understanding (please correct me) is that not using a "clean room" setup would not automatically imply that the new work is a derived work as defined by copyright law.
The clean room method of replicating functionality is commonly used as it is relatively easy to set up and prove to a court's satisfaction. However, it is not the only way to show that an implementation is independent. Another way that courts have accepted is structural differences between implementations. This approach is a bit more problematic as it requires a certain level of technical competence of the judge, so clean room implementations are the easier route.

However, I suspect that with the increasing complexity of APIs, the structural method will be much easier to use as these complex APIs have many more implementation options than the simple APIs of 50 years ago when the clean room method was introduced.

That 48x performance boost and 1% codebase similarity goes a long way to show that this is an independent, superior implementation. It is not sufficient on its own, but comparing the source graphs should demonstrate sufficient differences to make the rest of the argument.
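
For anyone who wants to sanity-check a similarity figure like that on their own code, here is a rough sketch of one simple way to do it. To be clear, I don't know how the 1% number was actually computed; this just compares normalized source lines using Python's standard difflib, and the function names are my own. A line-level ratio also says nothing about structural similarity, so treat it as a first look rather than evidence.

```python
from difflib import SequenceMatcher


def normalized_lines(source: str) -> list[str]:
    """Strip whitespace and drop blank lines and comment-only lines."""
    lines = []
    for raw in source.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            lines.append(line)
    return lines


def line_similarity(old_source: str, new_source: str) -> float:
    """Return a 0.0-1.0 ratio of matching lines between two sources.

    Hypothetical helper for illustration: it only sees textual overlap
    at the line level, not shared design or architecture.
    """
    old_lines = normalized_lines(old_source)
    new_lines = normalized_lines(new_source)
    # autojunk=False avoids difflib's heuristic of ignoring very common lines.
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    old = "def f(x):\n    return x + 1\n"
    new = "def g(y):\n    # rewritten\n    return y * 2\n"
    print(f"similarity: {line_similarity(old, new):.2%}")
```

In practice you would read each file of both repos, run this per file or on the concatenated sources, and then look at what the matching lines actually are, since boilerplate like imports will match even between unrelated projects.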
 
Upvote
5 (7 / -2)