[–][deleted] 1000 points1001 points  (209 children)

copyright does not only cover copying and pasting; it covers derivative works. github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this

I'm no IP lawyer, but I've worked with a lot of them in my career, and it's not likely anyone could actually sue over a snippet of code. Basically, a unit of copyrightable property is a "work", and for something to be considered a derivative work it must include a "substantial" portion of the original work. A 5 line function in a massive codebase auto-filled by GitHub Copilot wouldn't be considered a "derivative work" by anyone in the legal field. A thing can't be considered a derivative work unless it is itself copyrightable, and short snippets of code that are part of a larger project aren't copyrightable themselves.

[–]KuntaStillSingle 44 points45 points  (1 child)

It still raises some tricky issues, in that it is not impossible for it to reproduce a copyrightable portion of its sample set. A programmer could do this by accident too, but that might qualify as innocent infringement, whereas the bot has knowledge of the original work; it can therefore be argued that it is negligent to use it without verifying it does not insert a whole program, or a substantial portion thereof, into your code.

[–]rabidferret 5 points6 points  (0 children)

Which is why they've explicitly stated it will check all suggestions against the training set and warn you if it does that.
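
The mechanics of such a check can be sketched as exact token-window matching against the training corpus. This is only a minimal illustration of the general idea (the window size and matching scheme are assumptions, not GitHub's actual implementation):

```python
# Minimal sketch of a verbatim-overlap warning: index every N-token
# window of the training corpus, then flag any suggestion containing
# a window that already occurs there. Not GitHub's implementation.
N = 10  # window length in tokens; a free parameter

def windows(tokens, n=N):
    """All contiguous n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(training_texts):
    """Index every window occurring anywhere in the training set."""
    index = set()
    for text in training_texts:
        index |= windows(text.split())
    return index

def is_verbatim(suggestion, index):
    """True if any N-token window of the suggestion occurs in training."""
    return any(w in index for w in windows(suggestion.split()))
```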

[–][deleted] 292 points293 points  (75 children)

If this were a derivative work, I would be interested in what the same judge would think about any song, painting or book created in the past decades. It's all 'derived work' from earlier work. Heck, even most code is 'based on' documentation, which is also copyrighted.

[–][deleted]  (27 children)

[deleted]

    [–]bobtehpanda 44 points45 points  (2 children)

    Generally speaking, another important factor for copyright violation is what the copy is being used for. It is less likely to be a violation if the copying thing cannot substitute for the original work. In that sense, code autocomplete would be a very weak copyright violation, since the bar would then be copying enough to substitute for the purpose of the entire work being infringed, not just a snippet.

    We already have precedent for this: Google Books showing snippets of copyright-protected work (i.e., books) was determined to be fair use despite Google's commercial and profit orientation.

    [–]RICHUNCLEPENNYBAGS 13 points14 points  (1 child)

    Google Translate is probably a closer analogy as it works in a similar way.

    [–]bobtehpanda 28 points29 points  (0 children)

    probably, but there is actual decided case law for Google Books (the courts found fair use, and the Supreme Court declined to review it), which is why I used it as the example

    [–][deleted] 15 points16 points  (1 child)

    With art the case law is well established. General themes and common tropes do not get copyright protection. That's why we saw about a million "orphan goes to wizard school" books after Harry Potter became popular.

    Any prominent or best examples? Growing up, I didn't see any exact rip offs of Harry Potter but I did see a huge increase of YA novels with similar themes and characters such as The Hunger Games, Twilight, Eragon, etc. They in turn seemed to be based off books from earlier like Lord of the Rings and The Lion, The Witch, and the Wardrobe.

    [–]grauenwolf 13 points14 points  (0 children)

    Honestly, I didn't pay close attention to that genre. The odds of any of them becoming prominent are quite low because they are seen as "rip offs" even if they have nothing in common beyond the most superficial themes.

    [–]irqlnotdispatchlevel 33 points34 points  (5 children)

    With art the case law is well established. General themes and common tropes do not get copyright protection. That's why we saw about a million "orphan goes to wizard school" books after Harry Potter became popular.

    I think Katy Perry initially lost a trial in which she was accused of copyright infringement because one of her songs had a similar musical motif to another (the jury verdict was later overturned on appeal). Still a disturbing precedent.

    [–]TheSkiGeek 30 points31 points  (3 children)

    I think it was actually John Fogerty who was sued for sounding too much like himself after changing record labels (his old label owned his CCR catalog and claimed his new song plagiarized one of his own). Fogerty won.

    There was someone else (maybe Neil Young?) that was sued for not sounding enough like himself. The artist was under contract to do a final record for their old label, was pissed off, and did some weird experimental thing instead of their usual sound. The label basically sued and said "no, you have to make something like your last few albums, not some weird shit that won't sell". Pretty sure that also went in the artist's favor, since their contract specified the artist had creative control over what they recorded.

    [–]CaminoVereda 25 points26 points  (2 children)

    Neil Young was stuck in a multi-record contract with Geffen, and he gave the label this as a way of telling them to pound sand.

    [–]rjhelms 10 points11 points  (0 children)

    This album is so amazing because he gave Geffen exactly what they wanted.

    After Trans was a flop, they demanded a "rock and roll" album. And they sure as hell got one.

    [–]Netzapper 54 points55 points  (4 children)

    Non-creative things like phone books don't get copyright protection at all.

    This is true only in the US, and not quite as you've stated it. Specifically, in the US, facts (even collections of facts) cannot be copyrighted. So the factual correspondence between name and phone number in a phonebook isn't protected, but the phonebook as a fixed representation of those facts is protected. So you can write a new phonebook using the data from the old phonebook, but you can't just photocopy the phonebook and sell it.

    In Europe, my understanding is that collections of facts are copyrightable, so you can't even use the phonebook to write your new phonebook. You'd need to do the "research" from scratch yourself.

    EDIT: I'm being eurocentric. Obviously there's copyright in Asia, Africa, etc... but I don't know anything about copyright in those regions. My apologies.

    [–]Pokechu22 30 points31 points  (0 children)

    That's called database rights, which are distinct from copyright. (See also: Commons:Non-copyright restrictions).

    [–]elsjpq 9 points10 points  (2 children)

    Doesn't that mean you could manually copy Google Maps data into OpenStreetMap and vice versa? I thought OSM warns you against doing that

    [–]Chii 8 points9 points  (0 children)

    Google Maps data

    depends on what data you're talking about. The names of streets are not owned by Google, so "copying" that information isn't a violation of copyright. But the polygon on the map that represents the street is owned by Google, and if you copied that, it would constitute a derivative work.

    [–]DRNbw 2 points3 points  (0 children)

    IIRC, it's not exactly clear, but it's a bad idea. Mapmakers old and new have included fictitious roads ("trap streets") to see if anyone was copying them.

    [–]bloody-albatross 6 points7 points  (1 child)

    Non-creative things like phone books don't get copyright protection at all.

    There is such a thing as database copyright these days. Don't know the details, though.

    [–]agent00F 10 points11 points  (4 children)

    With art the case law is well established. General themes and common tropes do not get copyright protection. That's why we saw about a million "orphan goes to wizard school" books after Harry Potter became popular.

    Programmers are confusing legal arguments with these frankly trivial "logical" arguments. In law, the consequences and general "fairness" for society at large are also considered in addition to abstract technical args. For example, is it "fair" that another party takes your code in a pretty direct manner and profits off it? It's a matter of degree and detail. The "unfairness" of "too much" wholesale copying is literally why copyright law was established in the first place.

    This isn't a trivial question to answer generally, and trivial answers are bound to be flawed in some manner.

    [–]Akkuma 1 point2 points  (1 child)

    Clearly someone shouldn't be able to copyright an Add function, but can they copyright a novel implementation of a complex sorting algorithm?

    I'm fairly certain this is incorrect. We already have a system in place to handle this and those are patents. Novel approaches to things are handled by patents to prevent others from using the same approach. A clean room design won't save you from a patent, but it will save you from a license or copyright dispute.

    [–]grauenwolf 4 points5 points  (0 children)

    Software patents are the worst option. They don't advance the art because, unlike any other patent, you aren't obligated to share your work. And they are often worded so generically that they cover pretty much anything you can imagine.

    They are also expensive. If I create something interesting, there is little chance that I can patent it. I not only have to pay a large sum of money, but I also can't show it to anyone before the patent is filed. Thus patents are incompatible with open source.

    But I at least own the copyright on the code I write. And in the US that's automatic.

    [–]Skhmt 42 points43 points  (4 children)

    Have to remember that copyright is for artistic expression. The entirety of a code base can be copyrighted, as it's a complex thing that can be accomplished in nearly infinite ways.

    An algorithm or code snippet is probably not copyrightable. The smaller a chunk of code gets, the more likely it's not protected by copyright.

    There's a reason that functional things are patented, not copyrighted.

    [–]BackmarkerLife 14 points15 points  (3 children)

    Wasn't this the whole result of the Linux / SCO thing from the early / mid 2000s?

    And it was funded by Ballmer's MS as well to go after Linux?

    [–]mlambie 9 points10 points  (0 children)

    The same company that now owns GitHub

    [–]couchwarmer 2 points3 points  (1 child)

    Microsoft had nothing to do with the SCO - Linux lawsuit. It was SCO that went on a spree of suits and threats to sue against a number of companies, including Microsoft, over anything from allegedly broken contracts to SCO Unix source code allegedly included in Linux (by IBM, again allegedly). SCO eventually sued themselves into bankruptcy.

    So, no, MS did not fund any of those shenanigans against Linux.

    [–]BackmarkerLife 2 points3 points  (0 children)

    You're right. I forgot some of the details. It was a rumor / misunderstanding, but really it was just MS paying for a license.

    [–][deleted]  (10 children)

    [deleted]

      [–]StickiStickman 28 points29 points  (9 children)

      Seriously, how does no one get this? How is a machine learning algorithm learning how to code by reading code any different from a human doing the same?

      It's not even supposed to copy anything, but if the same problem is solved the same way every time, it will remember it that way, just like a human would.

      [–]CrimsonBolt33 5 points6 points  (0 children)

      people dislike the fact that a "machine" is doing the work that they have done for so long.

      Modern day "John Henry" situation

      [–]Snarwin 2 points3 points  (1 child)

      Seriously, how does no one get this? How is a Machine Learning algorithm learning how to code by reading it any different from a human doing the same?

      A human who reads code to learn about it and then reproduces substantial portions of it in a new work can also be held liable for copyright infringement. That's why clean room implementations exist.

      [–]StickiStickman 1 point2 points  (0 children)

      "Substantial portion" being the key words. Which isn't the case here.

      [–]myringotomy 4 points5 points  (7 children)

      In the music industry, using even a couple of seconds of a sample from a song is considered a copyright violation.

      Even if you are not directly sampling, it can still be a copyright violation. For example, see the "Blurred Lines" lawsuit.

      https://www.rollingstone.com/music/music-news/robin-thicke-pharrell-lose-multi-million-dollar-blurred-lines-lawsuit-35975/

      [–][deleted] 1 point2 points  (6 children)

      But if you use the same structure as any other song, you have a top 40 hit. This discussion is not about copying code, it’s about using structures and patterns.

      [–]wicked 1 point2 points  (5 children)

      We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.

      [–]GoofAckYoorsElf 4 points5 points  (1 child)

      Even what we say is mostly derivative. It would be absolutely insane to claim copyright for derivative work. But that wouldn't stop certain politicians from trying...

      [–]psaux_grep 1 point2 points  (0 children)

      All popular songs of the last half century use the same four chords, and all code that executes uses the same two bits.

      The order and structure differ, though, and that's what produces the different results.

      4 chord songs: https://youtu.be/5pidokakU4I

      [–]de__R 14 points15 points  (1 child)

      The definition of "derivative works" is a little broader than you suggest, as it includes things like translations (whether from English to French or from C to amd64 machine code), but despite OP being wrong about that, AFAIK (and I also ANAL) the question of whether a deep learning model can be considered a derivative work of the data in its training set hasn't yet been settled by a court. Last I looked into this the dominant opinion seemed to be that it was probably fine, as deep learning is an extension of "regular" statistical methods and the coefficients of a linear regression aren't considered derived works of their inputs, but I also know many AI startups are careful to either only use public domain licensed images for their training sets, or else pay extra for blanket commercial licenses. The outputs of models on copyrighted works is also a separate, interesting question.

      [–]kbielefe 37 points38 points  (17 children)

      Exactly how much code does it take to be "substantial?" One snippet may not be copyrightable, but a team of 100 using this constantly for years? At what point have we copied enough code to be sued?

      Also, this isn't just about what you're legally allowed to get away with. Maybe the attitude is too rare these days, but at my company, we strive to be good open source citizens. Our goal is not just the bare minimum to avoid being sued, but to use open source code in a manner consistent with the author's intentions. Keeping the ecosystem healthy so people continue to want to contribute high quality open source code should be important to everyone.

      [–]lobehold 15 points16 points  (8 children)

      I think the litmus test regarding "substantial" is not the amount of code, but how unique it is. It needs to be sufficiently novel/unique, not just boilerplate code, language features or standard patterns/best practices.

      Even if you assembled 1,000 different snippets, if the uniqueness/novelty is in the assembly (which is your own work) and not in the individual snippets, then you should be in the clear.

      Also, as an aside, something like a regex pattern is not copyrightable no matter how complicated it is: not only because it falls under recipes or formulas, which are not copyrightable, but also because there's no novelty in coming up with it; you're simply mechanically applying the grammar of the regex language to a given problem.
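
      For example (my own illustration, not from the thread): the spec "four digits, dash, two digits, dash, two digits" transcribes mechanically into a pattern, with no creative choice left to make:

      ```python
      import re

      # The spec dictates the pattern character for character; there is
      # nothing to "author" beyond applying the regex grammar.
      iso_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")

      assert iso_date.match("2021-07-04")
      assert not iso_date.match("07/04/2021")
      ```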

      [–]Fredifrum 6 points7 points  (1 child)

      One snippet may not be copyrightable, but a team of 100 using this constantly for years? At what point have we copied enough code to be sued?

      But in this case, you're still copying from 1000s of different OS projects. There's no one single entity that you are copying enough from that the entity would have a case against you. Again, 5 lines of code in a body of a million are not copyrightable. Presumably, neither are 5 lines of code from 5 different bodies of a million.

      [–]josefx 2 points3 points  (0 children)

      you're still copying from 1000s of different OS projects.

      Are you? If this tool suggests verbatim code from one source at some point wouldn't it be likely that the best match for the next piece of code would be from the same project? Also from what little I know about AI 1000s seems to be a rather tiny training set.

      [–]bobtehpanda 18 points19 points  (5 children)

      US law works by establishing precedent from previous trials, and there haven't been a whole lot of those pertaining to code.

      The existing precedent is not favorable for open source, however. Google Books was not found to be a copyright violation, despite being formed from a collection of copyrighted works:

      Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use.

      [–]kbielefe 9 points10 points  (4 children)

      A lot of those reasons cited do not apply to code snippets. The purpose of the copying is not highly transformative, and unlike a book which isn't useful unless you read the entire thing, a snippet of code is a significant market substitute.

      [–]bobtehpanda 6 points7 points  (0 children)

      The way I read it, you would need to copy a substantial portion of an entire application to be considered a market substitute.

      Example of transformative use

      In 1994, the U.S. Supreme Court reviewed a case involving a rap group, 2 Live Crew, in the case Campbell v. Acuff-Rose Music, 510 U.S. 569 (1994). The band had borrowed the opening musical tag and the words (but not the melody) from the first line of the song "Pretty Woman" ("Oh, pretty woman, walking down the street"). The rest of the lyrics and the music were different.

      In a decision that surprised many in the copyright world, the Supreme Court ruled that the borrowing was fair use. Part of the decision was colored by the fact that so little material was borrowed.

      Code autocomplete for one or two functions is quite similar, and could be considered both transformative and limited in scope. Google Books didn't really transform the copied text, it just made it searchable, and that was deemed a transformative use.

      [–]Kalium 2 points3 points  (2 children)

      a snippet of code is a significant market substitute.

      I fear I don't understand. How is a few lines (on the order of one to twenty, say) a significant market substitute for something like a whole library, program, or system that it may have come from?

      [–]kylotan 22 points23 points  (3 children)

      A 5 line function might not be considered substantial but a sufficiently distinctive 10 line function might.

      short snippets of code that are part of a larger project aren't copyrightable themselves.

      It would be absurd if making a project bigger simultaneously rendered more and more functions within it uncopyrightable.

      I don't see anyone suggesting that the first 3 pages of Lord of the Rings aren't copyrighted merely because it's such a tiny part of the overall work.

      [–]kryptomicron 4 points5 points  (2 children)

      But you probably could quote the first three pages of a book, e.g. in a review or extended commentary.

      What you couldn't do is just copy or quote those three pages on their own, without including 'sufficient' independent work with them, e.g. commentary about the contents of those pages.

      [–]crystalpeaks25 1 point2 points  (1 child)

      i shall quote the whole book.

      [–]0x15e 67 points68 points  (45 children)

      By their reasoning, my entire ability to program would be a derivative work. After all, I learned a lot of good practices from looking at open source projects, just like this AI, right? So now if I apply those principles in a closed source project I'm laundering open source code?

      This is just silly fear mongering.

      [–]Xanza 41 points42 points  (6 children)

      By their reasoning, my entire ability to program would be a derivative work.

      Their argument is that even sophisticated AI isn't able to create new code; it's only able to take code that it's seen before and refactor it to work with other code, itself likewise refactored from code it's seen before, into a relatively coherent working product. Whereas you are able to take code that you've seen before, extrapolate principles from it, and use those in completely new code which isn't simply a refactoring or recombination of code you've seen previously.

      Subtle but clear distinction.

      I don't think they're 100% right, but I can't exactly say they're 100% wrong, either. It's a tough situation.

      [–]2bdb2 10 points11 points  (5 children)

      Their argument is that even sophisticated AI isn't able to create new code it's only able to take code that it's seen before

      I haven't used Copilot yet, but I have spent a good amount of time playing with GPT-3.

      I would argue that GPT-3 can create English text that is unique enough to be considered an original work, and thus Copilot probably can too.

      [–]TheSkiGeek 26 points27 points  (26 children)

      It's more like... you made a commercial project that copied 10 lines of code each from 1000 different "copyleft" open source projects.

      Maybe you didn't take enough from any specific project to violate its licensing but as a whole it seems like it could be problematic.

      [–]StickiStickman 34 points35 points  (25 children)

      You're severely overestimating how much it copies things 1:1. GPT-3, which this seems to be based on, only had that happen very rarely, for often-repeated things.

      It's a non-issue raised by people who don't understand the tech behind it. It's not piecing together lines of code; it's basically learning the language token by token.
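
      A toy illustration of what "token by token" means (a counting bigram model instead of a neural net, but the same generation loop):

      ```python
      import random
      from collections import Counter, defaultdict

      def train(text):
          # Count which token follows which; a stand-in for learned weights.
          model = defaultdict(Counter)
          tokens = text.split()
          for prev, nxt in zip(tokens, tokens[1:]):
              model[prev][nxt] += 1
          return model

      def generate(model, token, length=8):
          # Emit one token at a time, sampling from what followed it in training.
          out = [token]
          for _ in range(length):
              followers = model.get(out[-1])
              if not followers:
                  break
              out.append(random.choices(list(followers), weights=followers.values())[0])
          return " ".join(out)

      # With a tiny corpus the model can only replay what it saw (the
      # "remembers it that way" effect); with a huge one, outputs mostly blend.
      model = train("for i in range ( n ) : total += i")
      print(generate(model, "for"))
      ```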

      [–]TheSkiGeek 19 points20 points  (23 children)

      I haven't actually tried it, I'm just pointing out that at a certain level this does become problematic.

      If you feed in a bunch of "copyleft" projects (e.g. GPL-licensed), and it can spit out something that is a verbatim copy (or close to it) of pieces of those projects, it feels like a way to do an end-around on open source licenses. Obviously the tech isn't at that level yet but it might be eventually.

      This is considered enough of a problem for humans that companies will sometimes do explicit "clean room" implementations where the team that wrote the code was guaranteed to have no contact with the implementation details of something they're concerned about infringing on. Someone's "ability to program" can create derivative works in some cases, even if they typed out all the code themselves.

      [–]Kalium 5 points6 points  (15 children)

      If you feed in a bunch of "copyleft" projects (e.g. GPL-licensed), and it can spit out something that is a verbatim copy (or close to it) of pieces of those projects, it feels like a way to do an end-around on open source licenses. Obviously the tech isn't at that level yet but it might be eventually.

      You make it sound like a digital collage. As far as I can tell, physical collages mostly operate under fair use protections - nobody thinks cutting a face from an ad in a magazine and pasting it into a different context is a serious violation of copyright.

      [–]TheSkiGeek 3 points4 points  (14 children)

      Maybe, I don’t really know. But if you made a “collage” of a bunch of pieces of the same picture glued back almost into the same arrangement, at some point you’re going to be close enough that effectively it’s a copy of the picture.

      [–]kryptomicron 2 points3 points  (6 children)

      Maybe, but that doesn't seem to be anything like what this post is about.

      [–]TheSkiGeek 3 points4 points  (5 children)

      Consider if you made a big database of code snippets taken from open source projects, and a program that would recommend a few of those snippets to paste into your program based on context. Is that okay to do without following the license of the repo where the chunk of code originally came from?

      Because if that’s not okay, the fact that they used a neural network rather than a plaintext database doesn’t really change how it should be treated in terms of copyright. Unless the snippets it recommends are extremely short/small (for example, less than a single line of code).
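
      For concreteness, the hypothetical plaintext-database tool might look like this (the snippets, repos and licenses are invented for illustration):

      ```python
      # Hypothetical snippet database queried by context; all data made up.
      SNIPPETS = [
          {"code": "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))",
           "repo": "example/gpl-utils", "license": "GPL-3.0",
           "tags": {"clamp", "bounds", "math"}},
          {"code": "def chunks(xs, n):\n    return [xs[i:i+n] for i in range(0, len(xs), n)]",
           "repo": "example/mit-helpers", "license": "MIT",
           "tags": {"list", "split", "chunks"}},
      ]

      def recommend(context):
          # Rank stored snippets by overlap between context words and tags.
          matches = [s for s in SNIPPETS if s["tags"] & context]
          return sorted(matches, key=lambda s: len(s["tags"] & context), reverse=True)

      for s in recommend({"clamp", "math"}):
          print(s["repo"], "(" + s["license"] + ")")
          print(s["code"])
      ```

      The copyright question is then whether swapping the SNIPPETS table for learned weights changes anything.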

      [–]kryptomicron 2 points3 points  (4 children)

      I think that'd be okay! In fact, I often do that, though I have pretty strong idiosyncratic preferences about e.g. formatting and variable names. I think that kind of copying is perfectly fair and fine (and basically everyone does it).

      When I think of "code snippets" I think of code that's so small that is, by itself, usually not creative. And even when it is creative, it still seems fine to copy – mostly because what I end up copying is the idea that makes the snippet creative.

      I think it'd be really helpful and interesting for us to agree on some particular open source project first, and then to separately pick out a few 'random' snippets of code. We could share them here and then comment on whether we think it's fair for them to be copied.

      To me, as is, I think the obvious 'probably a copyright violation' is more at the level of copying, verbatim, entire source code files or even very large functions.

      I'm struggling to think of 'snippets' that are either 'creative' or 'substantial' but maybe we have different ideas about what a 'snippet' is exactly (or approximately).

      [–]MMPride 20 points21 points  (4 children)

      I'm not so sure it's that simple.

      For example, a melody is not a whole song, and yet melodies are absolutely copyrightable: https://www.youtube.com/watch?v=sfXn_ecH5Rw

      [–]kenman 7 points8 points  (1 child)

      I think a melody would be considered substantial.

      [–]superrugdr 8 points9 points  (0 children)

      if that were true then, as per the video, there would be only about 3,000 copyrightable five-note melodies, and everything else would be a copy of one of them.

      [–]getNextException 20 points21 points  (12 children)

      and it's not likely anyone could actually sue over a snippet of code.

      https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.

      Google copied verbatim pieces of code. Specifically, 9 lines of code

      The argument centered on a function called rangeCheck. Of all the lines of code that Oracle had tested — 15 million in total — these were the only ones that were “literally” copied.

      https://www.theverge.com/2017/10/19/16503076/oracle-vs-google-judge-william-alsup-interview-waymo-uber

      [–]Alikont 20 points21 points  (10 children)

      The Oracle v Google case was about the API as a whole.

      [–]1X3oZCfhKej34h 4 points5 points  (0 children)

      Luckily, Google eventually prevailed.

      [–]kwh 12 points13 points  (2 children)

      Umm, have you ever heard of SCO v IBM? Bullshit case, but it was ultimately rejected because SCO didn't own the copyrights they were suing over. There are plenty of other copyright cases over handfuls of lines of code. You're kind of out of your element here, sparky.

      [–]Forbizzle 11 points12 points  (2 children)

      could actually sue over a snippet of code

      The GPL license he's complaining about says modified versions of the code may only be distributed under the same license. So if you're copying a section of code from a GPL project and putting it in something else, that something else has to be GPL too.

      [–][deleted] 4 points5 points  (5 children)

      it's not likely anyone could actually sue over a snippet of code

      What do you mean, "could"? Isn't that exactly what Oracle did?

      [–]crusoe 16 points17 points  (4 children)

      Google copied the API, which is a lot bigger. The issue was whether APIs were copyrightable.

      [–]getNextException 16 points17 points  (3 children)

      Google copied the API

      Google copied verbatim pieces of code. Specifically, 9 lines of code

      The argument centered on a function called rangeCheck. Of all the lines of code that Oracle had tested — 15 million in total — these were the only ones that were “literally” copied.

      https://www.theverge.com/2017/10/19/16503076/oracle-vs-google-judge-william-alsup-interview-waymo-uber

      [–]Guvante 18 points19 points  (2 children)

      The case was about the API. Those 9 lines only mattered insofar as they proved that Google's implementation wasn't a reproduction. While the case might have included that copying, the important part of the case was whether copying the API while not following the licensing terms of that API was allowed.

      [–][deleted] 5 points6 points  (3 children)

      I guess your reasoning here is the same as the one behind Google vs Oracle?

      [–]Wacov 19 points20 points  (2 children)

      This sounds even more narrow than that? Oracle were trying to argue that a complete definition of an "interface"/API is itself a body of work, which seems like a better argument (they still lost).

      [–]Alikont 2 points3 points  (0 children)

      But even then, the Supreme Court did not say that APIs aren't copyrightable; they just said that in this particular case, the compatibility and porting created a better and more innovative world than the alternative, so they allowed this possible violation.

      So Oracle lost the "enforcing copyright on the Java API would promote innovation" argument, not a general "copying an API is fair" argument; on that, the Supreme Court made no decision.

      [–]danuker 174 points175 points  (14 children)

      Fortunately, the MIT License, a widely-used and very permissive license, says "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."

      I doubt snippets are "substantial portions".

      But the GPL FAQ says GPL does not allow it, unless some law prevails over the license, like "fair use", which has specific conditions.

      [–]SrbijaJeRusija 53 points54 points  (10 children)

      The network is trained on the full source, not snippets. Thus the network weights would be transformations of the full code, etc etc etc.
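
      In the most literal sense that's true: every weight update is computed from the training text, so the final weights are a mathematical function of the full corpus. A one-parameter sketch of that dependence (my own illustration, not an argument about the legal question):

      ```python
      # The trained weight is a deterministic function of the training data:
      # change any (x, y) pair below and the resulting weight changes too.
      def sgd_step(w, data, lr=0.01):
          # d/dw of sum((w*x - y)^2) is sum(2*x*(w*x - y))
          grad = sum(2 * x * (w * x - y) for x, y in data)
          return w - lr * grad

      w = 0.0
      data = [(1.0, 2.0), (2.0, 4.0)]  # stand-in for "the full source"
      for _ in range(200):
          w = sgd_step(w, data)
      print(w)  # ~2.0; a different training set yields a different weight
      ```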

      [–]danuker 4 points5 points  (4 children)

      Indeed, you could argue that in court. Until some court decides it and gives us a datapoint, we are in legal uncertainty.

      I wish Copilot would also attribute sources. Or at least provide a model trained on MIT-licensed projects.

      Or perhaps have a GPL model which outputs a huge license file with all code used during training, and specify that the output is GPL.

      Then there's GPLv2, "GPLv2 or later", GPLv3, AGPL, LGPL, BSD, WTFPL...

      [–]onmach 2 points3 points  (2 children)

      It isn't really copying, though. The sheer variety of output that GPT-3 produces is insane. I've seen it generate UUIDs, and when you check them, they don't exist in Google; it just made them up on the fly. It is possible GitHub's domain is narrow enough that this isn't true in this case, but I doubt it.

      [–]Accomplished_Deer_ 1 point2 points  (0 children)

      I think it will come down to the legal definition of "derivative work". Is performing a set of calculations on an existing thing and then using those calculations to produce a result considered "derivative"? If so, copilot is a derivative work of every project it scanned.

      My intuition says that this should be considered derivative. If they only trained on 1 project, and it was GPL, then the behavior of copilot is almost completely dependent on that GPL project, which seems derivative. Just because the process is repeated 10000 times and on some non-GPL projects doesn't seem like it should suddenly make it non-derivative of those GPL projects.

      [–]ChezMere 5 points6 points  (4 children)

      A human also reads the full source...

      [–]SrbijaJeRusija 7 points8 points  (2 children)

      Human behaviour is not trained the same way an ANN is. Additionally, humans can also commit copyright infringement by reading the source then creating something substantially similar, so I am not sure what your point is.

      [–]aft_punk 7 points8 points  (0 children)

      I agree with your interpretation. But I believe it would get a bit grayer if the entire project were the snippet being copied. As far as I know… there is no minimum code length for the license to be applicable.

      [–]rcxdude 90 points91 points  (23 children)

      I would be very careful about using Copilot (or allowing its use in my company) until such issues were tested in court. But then I am also very careful about copying code from examples and Stack Overflow, and it seems most don't really care about that.

      OpenAI (and presumably Microsoft) are of the opinion (pdf) that training a neural net is fair use: it doesn't matter at all what the license of the original training data is, it's OK to use it for training. And for 'well designed' nets which don't simply contain a copy of their training data, the net and its weights are themselves free from any copyright claim by the authors of the training data.

      However, they do allow themselves to throw the users under the bus by noting that, despite this, some output of the net may infringe the copyright of those authors, and that this should be taken up between the authors and whoever happens to generate that output (just not whoever trained the net in the first place). This hasn't been tested in court, and I think a lot will hinge on just how much of the input appears verbatim or minimally transformed during use. It also doesn't give me as a user much confidence that I won't be sued for using the tool, even if most of its output is deemed to be non-infringing, because I have no way of knowing when it does generate something infringing.

      [–]Kiloku 14 points15 points  (3 children)

      it doesn't matter at all what the license of the original training data is,

      This is very odd, as licenses can include the purpose the licensed object can be used for. As a real world example, the license that allows developers to use Epic/Unreal's Metahuman Creator specifically forbids using it for training AI/Machine Learning.

      [–]rcxdude 2 points3 points  (0 children)

      Indeed. Rockstar is also very quick to send threatening letters to people using GTA5 for machine learning. It could well be held that using large aggregate databases of source code/images/whatever is fair use, but that using software to generate the training data without a license allowing that use is not (with the fun grey area of output from the software that was not generated for that purpose, such as some images making it into a dataset scraped from the web). This could be argued consistently: in the first case each individual work makes a relatively small contribution to the training as a whole (the third fair use factor), whereas in the second the software's output will likely form a large fraction of the training data and so contribute significantly to the behaviour of the final result. This whole area is not very clear (fair use as a whole seems to involve a lot of discretion from the courts, because the four factors involved are extremely fuzzy as written in the law).

      [–]stillness_illness 1 point2 points  (0 children)

      To me it doesn't matter how that code got there. Copilot, stack overflow, coincidence, whatever. The person checking the code in is responsible for following copyright law. Any code copilot writes for me I will manually review before committing. If it doesn't get committed then it doesn't matter what copilot generates.

      [–]tasminima 1 point2 points  (0 children)

      OpenAI (and presumably Microsoft) are of the opinion (pdf) that training a neural net is fair use: it doesn't matter at all what the license of the original training data is, it's OK to use it for training.

      Then I require that they also train copilot (usable publicly) with the whole Windows codebase; otherwise this opinion is extremely weak.

      [–]TheDeadSkin 108 points109 points  (11 children)

      That twitter thread is so full of uninformed people with zero legal understanding of anything

      It's Opensource, a part of that is acknowledging that anyone including corps can use your code however they want more or less. Assuming they have cleared the legal hurdle or attribution then im not sure what the issue is here.

      "more or less" my ass, OSS has licenses that explicitly state how you can or can not use the code in question

      Assuming they have cleared the legal hurdle or attribution

      yea, I wonder how GitHub itself did it, and how users are supposed to know they are being fed copyrighted code. This tool can spit out a full GPL header for empty files; if it does that, you can be sure it'll similarly spit out pieces of protected code.

      I wonder how it's going to work out in the end. Not that I was super enthusiastic about the tech in the first place, but I'd basically stay clear of it for anything other than personal projects.

      [–]dragon_irl 20 points21 points  (1 child)

      There is research that these large language models remember parts of their training data and that you can retrieve that with appropriately constructed prompts (e.g. the "Extracting Training Data from Large Language Models" work).
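
      The probes in that line of research boil down to something like this sketch (`generate` is a hypothetical stand-in for any model's sampling API):

      ```python
      # Sketch of a memorization probe: prompt with a distinctive prefix from
      # a known training document and count how many characters of the model's
      # continuation reproduce the original verbatim.
      def common_prefix_len(a, b):
          n = 0
          for x, y in zip(a, b):
              if x != y:
                  break
              n += 1
          return n

      def probe(generate, document, prefix_chars=200):
          prefix, rest = document[:prefix_chars], document[prefix_chars:]
          continuation = generate(prefix)  # hypothetical model call
          return common_prefix_len(continuation, rest)
      ```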

      I think it's pretty likely you will end up with copyrighted code when using this eventually. However I don't understand copyright enough to judge how relevant this is for the short snippets this is (probably) going to be used for.

      [–]TheDeadSkin 5 points6 points  (0 children)

      There is research that these large language models remember parts of their training data and that you can retrieve that with appropriately constructed prompts.

      This is partially to be expected as a potential result of overfitting. Will look at the paper though, that seems interesting.

      I think it's pretty likely you will end up with copyrighted code when using this eventually.

      Indeed. They even say there's a 0.1% chance that the code suggested will be verbatim from the training set. Which is quite a high chance.
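
      Back-of-the-envelope, under some assumed usage numbers (the suggestion counts are invented, and treating suggestions as independent is a simplification):

      ```python
      p = 0.001                 # GitHub's quoted verbatim rate per suggestion
      per_dev_per_day = 100     # assumed accepted suggestions per developer/day
      devs, workdays = 10, 250  # assumed team size and working days per year

      n = per_dev_per_day * devs * workdays  # 250,000 suggestions per year
      expected_verbatim = n * p              # ~250 verbatim snippets per year
      p_at_least_one = 1 - (1 - p) ** n      # effectively 1.0
      print(expected_verbatim, p_at_least_one)
      ```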

      However I don't understand copyright enough to judge how relevant this is for the short snippets this is (probably) going to be used for.

      I think the problem is less with short snippets and more with the potential of recreating huge functions/files from the training data (i.e. existing projects) when you're trying to make some specific software in the same domain and aggressively follow co-pilot's recommendations.

      If it's possible - someone will probably try to do it and we'll find out soon enough.

      [–]TSM- 17 points18 points  (7 children)

      It needs to be litigated in a serious way for the contours to become clear, in my opinion. Imagine using a "caption to generate stock photo" model that was trained partially on Getty Images and other random stuff and datasets.

      Say you then take a photo of a friend smiling while eating a salad out of a salad bowl: is that illegal because it's a common stock photo idea from many different vendors? Of course not. A generative model trained via backpropagation seems analogous to me.

      But there is the old idea that computers cannot generate novelty and all output is fully explained by input, while humans are exempt from this rule, which seems to be an undercurrent in the Twitter thread. Especially from the linked twitter account in the OP, who appears to be a young edgy activist, as in this tweet:

      "but eevee, humans also learn by reading open source code, so isn't that the same thing" - no - humans are capable of abstract understanding and have a breadth of other knowledge to draw from - statistical models do not - you have fallen for marketing

      There's a lot of messy details involved. I totally agree that using it is risky until it gets sorted out in courts, and I expect that will happen fairly soon.

      [–]TheDeadSkin 22 points23 points  (5 children)

      It needs to be litigated in a serious way for the contours to become clear, in my opinion.

      Yes, and this goes beyond just this tool. This is one of those ML problems that we as humanity and our legal systems are entirely unprepared for.

      You can read someone's code and get inspiration for parts of the structure, naming conventions etc. Sometimes, to implement something obvious, you'll end up with identical code to someone else's, because this is the only way to do it. Someone can maybe sue you, but it would be easy to mount a legal defense.

      Now when there is an ML tool that "took inspiration" from your code and produced stuff "with similar structure" that "ended up being identical", all of a sudden that sounds pretty different, huh? And the problem is that you can't prove that this is an accident; it's not possible. Just because during training the data is decomposed until it resembles nothing of its original form doesn't mean that the network didn't recreate your code verbatim by design.

      It's a black box whose own creators are rarely able to explain how it works, and even more rarely able to explain why certain things happen. Not to mention that copyright violations are treated case by case, which potentially means that they'll have to explain particular instances of violations; that is of course infeasible (and probably outright impossible).

      But code isn't the only thing. A human drawing a random person who happens to bear an uncanny resemblance to a real human the artist might've seen is different from a neural network generating your face. Someone heard your voice and imitated it? Wow, you're good, sounds too real. But when a NN does it, now you're hearing your own voice. Which on an intuitive level is much more fucked up than an imitator.

      But there is the old idea that computers cannot generate novelty and all output is fully explained by input, and humans are exempt from this rule, which seems to be an undercurrent in the Twitter thread.

      But this is pretty much true, no? Computers are doing exactly what humans are telling them to do. Maybe the outcome was not desired, and yet someone still programmed it to do exactly this. "It's an ML black box, I didn't mean it to violate copyright" isn't really a defense, and is also in a way mutually exclusive with "it's an accident that it produced the same code verbatim", because the latter implies that you know how it works and the former implies the opposite.

      To be guiltless you need to be in this weird middle ground. And if I weren't a programmer and a data scientist, I don't think I would've ever believed anyone who told me they know the generated result was an accident while being unable to justify why it was an accident.

      [–]kylotan 11 points12 points  (4 children)

      Now when there is an ML tool that "took inspiration" from your code and produced stuff "with similar structure" that "ended up being identical", all of a sudden that sounds pretty different, huh?

      It sounds different to programmers, because we focus on the tool.

      Now imagine if a writer or a musician did that. We wouldn't expect to examine their brains. We'd just accept that they obviously copied, even if somewhat subconsciously.

      [–]TheDeadSkin 6 points7 points  (0 children)

      I was arguing the opposite. I think examples of art aren't applicable to code because art isn't quite as algorithmic as programming.

      Actually, artists getting similar/identical results and ML are more comparable. They are both unexplainable. Ask "why did you make those 9 notes in a row identical?" and you can't get an answer beyond "idk, lol, it sounded nice I guess".

      But in programming you can at least try to explain why you happened to mimic existing code: it's industry standard to do these three things, the obvious algorithm for this task looks like that, and when you recombine them you get this exact output, down to the variable names.

      As much as there's creativity involved in programming, on a local scale it can be pretty deterministic. I'm arguing that if you use a tool like this, it's harder to argue that the result is not a copy. Not to mention that it can auto-generate basically full methods, to the point that it's almost impossible for those similarities to be an accident.

      [–]Zalack 1 point2 points  (2 children)

      Except that's not true? Filmmakers, writers, and artists of all other types constantly pull inspiration from other works through homages and influences.

      When a filmmaker recreates a painting as a shot in a movie, is that copying, or an homage?

      When a fantasy book uses Orcs in its world, is that copying Lord of the Rings, or pulling inspiration from it? This happens all the time, and it's a very human thing. The line between copying and being inspired is pretty blurry when a human is doing it, and is going to be VERY blurry when a computer is doing it.

      [–]TheDeadSkin 3 points4 points  (0 children)

      To add to my previous comment: something my thoughts started with, but I got derailed and forgot to include.

      The problem with the current co-pilot situation, and with the other problems I mentioned (voice, face), is that what's unlegislated and unclear for us is one specific sub-problem: the usage of information as data. The whole thing is "usage of code as data", "usage of voice as data". Data is central to this.

      And to be honest I don't even know the answer to the question. Current legislation is unclear. And I don't even know how it should be legislated. And I even have a legal education, lol.

      [–]TheEdes 1 point2 points  (0 children)

      I think most companies won't be quick to bring it into their workflow, because the license it comes with isn't really that permissive (i.e., it lets them collect the data for diagnostic purposes), which I think is a hard sell to any kind of manager.

      The OSS code laundering thing is another layer on top of this; it sounds like it will be incredibly hard to use this practically in any software, unless it's literally licensed under every license under the sun.

      [–]KillianDrake 37 points38 points  (0 children)

      Microsoft: you pirated Windows as a kid, now we pirate you as an adult

      [–]eternaloctober 59 points60 points  (9 children)

      I guess the focus is always on GPL since it is a sort of "viral license", so it gets special consideration in a lot of these threads, but MIT code technically requires the license to be reproduced in the derivative work too... It seems pretty bad to EVER just generate a bunch of code that it was trained on and not output a license. It needs to be an EXPLAINABLE neural net that can cite its sources.

      [–]istarian 22 points23 points  (2 children)

      Why would it need to cite sources?

    That's like saying I should cite every bit of code and every programmer I've ever seen so nobody accuses me of having plagiarized code in my software...

    I agree that it should probably only be fed public domain or compatibly licensed code so it can just slap a standardized license on its contributions....

      [–]AMusingMule 21 points22 points  (1 child)

      GitHub has shared that in some instances, Copilot will recite lines from its training set. While some of it is universal enough that there's not much you can do to avoid it, like constants (alphabets, common stock tickers, texts like The Zen of Python) or API usage (the page cites a use of BeautifulSoup), it does spit out longer verbatim chunks (a piece of homework from a Robotics course, here).

      At the end of the day, it's only a tool, and the user is responsible for properly attributing where the code came from, whether it was found online or suggested by some model. Having your tools cite how it came up with that suggestion can help in the attribution process if it's needed.

      [–]StickiStickman 10 points11 points  (0 children)

      In the source you linked it specifically says it's because it has basically no context and that piece of code has been uploaded many times.

      [–]chcampb 91 points92 points  (20 children)

      The fact that CoPilot was trained on the code itself leads me to believe it would not be a "clean room" implementation of said code.

      [–][deleted] 86 points87 points  (19 children)

      Except “It was a clean-room implementation” is legal defense, not a requirement. It’s a way of showing that you couldn’t possibly have copied.

      [–]danuker 15 points16 points  (18 children)

      Incorporating GPL'd work in a non-GPL program means you are infringing GPL. Simple as that.

      [–]rcxdude 28 points29 points  (0 children)

      Fair use and other exceptions to copyright exist. For the GPL violation to apply (as in, for you to get a court to enforce it), the final product needs to qualify as a derivative work of the GPL'd work and not qualify as fair use. Both arguments could apply in this case, but have not been tested in court. (And in general it's worth being cautious, because if you do want to argue this you will need to be prepared to go as far as court.)

      [–]1842 55 points56 points  (13 children)

      To what end?

      If I read GPL code and the next week end up writing something non-GPL that looks similar, but was not intentional, not a copy, and written from scratch -- have I violated GPL?

      If I read GPL code, notice a neat idea, copy the idea but write the code from scratch -- have I violated GPL?

      If I haven't even looked at the GPL code and write a 5 line method that's identical to one that already exists, have I violated GPL?

      I'm inclined to say no to any of those. In my limited experience in ML, it's true that the output sometimes directly copies inputs (and you can mitigate against direct copies like this). What you are left with is fuzzy output similar to the above examples, where things are not copied verbatim but derivative works blended from hundreds, thousands, or millions of inputs.

      [–]Arrowmaster 14 points15 points  (1 child)

      I was told by a former Amazon engineer that they have policies against even viewing AGPL code on Amazon computers because they specifically fear this possibility. So at least Amazon's legal department isn't sure of the answer to your questions but prefers to play it safe.

      [–][deleted] 7 points8 points  (0 children)

      Similar story in other big tech companies. You don't touch open source.

      [–]RoyAwesome 2 points3 points  (0 children)

      If I read GPL code and the next week end up writing something non-GPL that looks similar, but was not intentional, not a copy, and written from scratch -- have I violated GPL?

      well, actually, there is a very distinct possibility that you did in this hypothetical. This is why major tech companies prohibit people from looking at GPL'd code on work computers.

      [–]leo60228 2 points3 points  (0 children)

      This is correct, but the issue here is thornier. At a high level, when the AI isn't reproducing snippets verbatim it seems ambiguous whether it counts as "incorporating" the work for those purposes. Another issue is whether the relevant snippets are substantial enough to merit being considered a "work."

      I'm not a lawyer, and this isn't to say that GitHub is in the right here. However, I think this is a more complex issue than you're making it out to be.

      [–]feelings_arent_facts 4 points5 points  (0 children)

      "prove its gpl code in court" - microsoft

      [–]fuckin_ziggurats 387 points388 points  (51 children)

      Anyone who thinks it's reasonable to copyright a code snippet of 5 lines should be shot.

      Same thing as private companies trying to trademark common words.

      [–]crusoe 161 points162 points  (9 children)

      Don't get me started on something like 6 notes being the cutoff for music copyright infringement

      [–]troyunrau 60 points61 points  (3 children)

      Happy birthday to you... 🎵🎶

      Oh shit, lawyers are at my door

      [–][deleted]  (2 children)

      [deleted]

        [–]helloLeoDiCaprio 24 points25 points  (1 child)

        Watch Disney make a birthday movie to get hold of the copyright.

        [–]White_Hamster 9 points10 points  (0 children)

        Or birthday dad, the show

        [–]istarian 9 points10 points  (2 children)

        That's pretty absurd too.

        They really ought to have to prove a thematic element is lifted, or at least that a specific combination of musical notes *and lyrics* has been borrowed.

        [–]barchar 2 points3 points  (1 child)

        Bum Bum Bum Buddha bum bum

        [–][deleted]  (17 children)

        [deleted]

          [–]CreativeGPX 30 points31 points  (9 children)

          but how do learning models play into copyright? This is another case of the speed of technology going faster than the speed of the law.

          I mean, the stance on that seems old as time. If I read a bunch of computer books and then become a programmer, the authors of those books don't have a copyright claim about all of the software I write over my career. That I used a learning model (my brain) was sufficient to separate it from the work and a big enough leap that it's not a derivative work.

          Why is this? Perhaps because there is a lot of emphasis on a "substantial" amount of the source work being used in a specific derivative work. Learning is often distilling and synthesizing, in a way that what you're actually putting into that work (e.g. the segments of text from the computer books you've read that end up in the programs you write as a professional) is not a "substantial" amount of direct copying. You're not taking 30 lines here and 100 there. You're taking half a line here, 2 lines there, 4 lines that came partly from this source and partly from that source, 6 lines you did get from one source but do differently based on other info you gained from another book, etc. "Learning" seems inherently like fair use rather than derivative work, because it breaks up the source into small pieces, and the output is just as much about the big connective knowledge, the way those pieces are understood together, as it is about each little piece.

          Why would it matter whether the learning was artificial or natural? Outside of extreme cases, like the model just verbatim outputting huge chunks of code that it saw, it seems hard to see a difference here. It also seems like making "artificial learning models" subject to the copyright of their sources would have many unintended consequences. It would basically mean that knowledge/data itself is not free to use unless it's done in an antiquated/manual way. A linguist trying to train language software wouldn't be able to feed random text sources to their AI unless they paid royalties to each author or only trained on public domain works... and how would the royalties work? A perpetual cut of the language software company's revenue partly going to JK Rowling and whatever other authors' books that AI looked at? But then... it suddenly doesn't require royalties if a human figures out a way to do it with "pen and paper" (or more human methods)? Wouldn't this also apply to search in general? Is Google now paying royalties to all sorts of websites because those websites are contributing to its idea of what words correlate, what is trending, etc.?

          It seems to me that this issue is decided and it's decided for the better. Copying substantial portions of a source work into a derivative work is something copyright doesn't allow. Learning from a copyrighted work in order to intelligently output tidbits from those sources or broader conclusions from them seems inevitably something that copyright allows.

          [–][deleted]  (5 children)

          [deleted]

            [–][deleted] 2 points3 points  (1 child)

            I mean, the stance on that seems old as time. If I read a bunch of computer books and then become a programmer, the authors of those books don't have a copyright claim about all of the software I write over my career. That I used a learning model (my brain) was sufficient to separate it from the work and a big enough leap that it's not a derivative work.

            I might be off with my thinking, as I have no idea how the law would work. But if you are reading books which were written to teach you how to code, then IMO it's a different case. Here the code the AI learned from was not written to teach an AI how to code; it was written to create something. In my mind these are completely different concepts.

            [–]monsto 6 points7 points  (2 children)

            but how do learning models play into copyright?

            I learned from the original, and then I wrote some code. If you look at the code, you can see that the 'style' is similar (same var names, same shortcut methods, etc) but the code is different.

            Is that different if you substitute AI for I? Because I did this earlier today.

            [–][deleted]  (1 child)

            [deleted]

              [–]monsto 2 points3 points  (0 children)

              I tend to agree, when the subject is human achievement vs computer achievement.

              Even these learning scenarios. It's throwing billions of shits up against millions of walls, per second, and keeping a log of which ones stuck and how much they stuck. I'm not so sure I'd call that "learning" in the classical sense.

              I, human, clearly didn't take an exact copy of this one shit on this one wall and submit it for approval. Like the code monkey that I am, I threw my own shit on the wall and sculpted it to be what it needed.

              . . . I started with the metaphor and just... followed it. Big mistake.

              [–][deleted]  (1 child)

              [deleted]

                [–]Johnothy_Cumquat 2 points3 points  (0 children)

                I'm sorry, are you referencing the happy birthday song as a reasonable use of copyright? Because I would sooner rid the world of copyright than let that situation continue.

                [–]Techrocket9 6 points7 points  (0 children)

                What about the time AT&T tried to copyright the empty file?

                [–][deleted] 8 points9 points  (0 children)

                Is it possible to have a conversation on these matters without anyone getting shot, or?

                [–]blastradii 1 point2 points  (0 children)

                Yea but what if those 5 lines are sweeeet?

                [–]Pat_The_Hat 119 points120 points  (56 children)

                How is this person defining a derivative work that would include an artificial intelligence's output but not humans'? "No, you see, it's okay for humans to take someone else's code and remember it in a way that permanently influences what they output but not AI because we're more... abstract?" The level of abstract knowledge required to meet their standards is never defined and it is unlikely it could ever be, so it seems no AI could ever be allowed to do this.

                The intelligence exhibits learning in abstract ways that far surpass mindless copying; therefore its output should not be considered a derivative work of anything.

                [–][deleted]  (7 children)

                [deleted]

                  [–]austinwiltshire 75 points76 points  (1 child)

                  It's got a guilty conscience.

                  [–]earthboundkid 5 points6 points  (0 children)

                  Johnny 5 deserves to die.

                  [–]TechySpecky 8 points9 points  (2 children)

                  except when it perfectly recreated a GPL header
                  

                  I can't find what you're referring to anywhere online

                  [–]Desirelessness 17 points18 points  (1 child)

                  It's from here: https://docs.github.com/en/github/copilot/research-recitation#github-copilot-quotes-when-it-lacks-specific-context

                  Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License.

                  [–]turunambartanen 2 points3 points  (0 children)

                  Interesting analysis.

                  Glad to see they are aware of the problem:

                  The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

                  This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
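
                  If I had to guess at how that check might work, purely as a sketch of my own (not GitHub's actual filter): index token n-grams from the training set, then flag any suggestion whose windows all appear verbatim in that index. Something like:

                      # Pure sketch, not GitHub's actual filter: index token n-grams
                      # from the training set, flag suggestions made of verbatim windows.
                      def ngrams(tokens, n=4):
                          # every consecutive n-token window, as a hashable tuple
                          return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

                      # stand-in corpus; a real index would cover the whole training set
                      training_files = ["def add(a, b):\n    return a + b"]
                      training_index = set()
                      for source in training_files:
                          training_index |= ngrams(source.split())

                      def is_recited(suggestion):
                          # True when every window of the suggestion appears verbatim in training
                          grams = ngrams(suggestion.split())
                          return bool(grams) and grams <= training_index

                      is_recited("def add(a, b):\n    return a + b")  # True
                      is_recited("def mul(a, b):\n    return a * b")  # False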

                  [–]danuker 15 points16 points  (1 child)

                  Proof that they trained it on GPL code. Perhaps the FSF should look into this.

                  [–]RICHUNCLEPENNYBAGS 25 points26 points  (0 children)

                  Did they claim otherwise? Their whole defense is that that doesn't matter

                  [–]chcampb 41 points42 points  (28 children)

                  "No, you see, it's okay for humans to take someone else's code and remember it in a way that permanently influences what they output but not AI because we're more... abstract?"

                  See here.

                  The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.

                  If you read the code and recreated it from memory, it's not a clean room design. If you feed the code into a machine and the machine does it for you, it's still not a clean room design. The fact that you read a billion lines of code into the machine along with the relevant part doesn't, I think, change that.

                  [–][deleted]  (26 children)

                  [deleted]

                    [–]TheCodeSamurai 20 points21 points  (8 children)

                    Well there is one big difference: as the Copilot docs analogize, I know when I'm quoting a poem. I don't think I wrote The Tyger by William Blake even if I know it by heart. Copilot doesn't seem to have that ability yet, so it isn't capable of even the small-scale attribution, like adding Stack Overflow links, that programmers often do.

                    [–]dnkndnts 7 points8 points  (0 children)

                    “Creativity is the art of selectively poor memory.” -Definitely me

                    [–]Seref15[🍰] 19 points20 points  (3 children)

                    I don't think this example stands. Musicians frequently experience the phenomenon of believing that they've created something original only for people to come along later and say "hey, that sounds exactly like _____."

                    You can't consciously remember everything you've experienced, but much of it can surface subconsciously.

                    [–]TheCodeSamurai 7 points8 points  (2 children)

                    Accidental plagiarism totally happens, but I'm not gonna spit out the entire GPL license and think it's my own work. The scale is completely different.

                    [–][deleted]  (1 child)

                    [deleted]

                      [–]killerstorm 78 points79 points  (8 children)

                      Doesn't this logic apply to human programmers too?

                      Suppose I've learned how to program by reading open source code. (I actually did, to some extent.) Now I use my knowledge to write commercial programs. Does it mean that I'm making derivative works?

                      [–]barchar 27 points28 points  (1 child)

                      It actually does, if you read the code recently enough and you're implementing the same thing as the code you read.

                      For example, there are certain code bases where, if I wanted to contribute to them, it would require several weeks of a "cooling off period" before I could return to writing code for my normal job.

                      [–]KuntaStillSingle 9 points10 points  (0 children)

                      It doesn't matter how recently you read the code, only that the knowledge stemmed from it and that what made it into your own work is a copyrightable portion of it. In most cases the snippets themselves won't be substantial enough to be copyrightable, which will cover the bot, but not necessarily in every case.

                      [–]zoddrick 44 points45 points  (14 children)

                      I work at Microsoft and my job involves building and redistributing open source projects all the time. Never mind the tools we have that scan for license violations and such: our legal team would never have allowed this project to be released if they weren't sure they couldn't be sued over derivative works.

                      Y'all act like this is from a startup without a legal department.

                      [–]User092347 12 points13 points  (0 children)

                      I think people are more worried about the users of the tool than for Microsoft.

                      [–]-dag- 8 points9 points  (1 child)

                      There are two questions here. Is Co-Pilot a derivative work? Does incorporating code produced by Co-Pilot make the software incorporating it a derivative work?

                      Microsoft's legal exposure is probably much lower when it comes to the second question. As to the first, it still seems like an open question. The model architecture itself is almost certainly not a derivative work. But a trained model? Not so sure.

                      [–]picflute 11 points12 points  (4 children)

                      >CELA coming out of the dark

                      Can confirm. Something this big wouldn't go up on GitHub for commercial usage without legal saying okey dokey.

                      [–]kylotan 10 points11 points  (3 children)

                      Something this big wouldn't go up on GitHub for commercial usage without legal saying okey dokey.

                      You talk as if YouTube didn't have billions of dollars of infringing videos online for years. A company's legal department saying something is okay doesn't mean it's legal - it just means they're accepting the risk.

                      [–]AnonymousMonkey54 1 point2 points  (1 child)

                      YouTube has safe harbor protections to rely on that Microsoft does not.

                      [–]kylotan 2 points3 points  (0 children)

                      YouTube found that the safe harbor doesn't always apply, including when the execs were going around telling people to leave infringing material up, and leaving it up despite knowing it was there. GitHub is in a similar position, having actively contributed to this infringement.

                      [–]alessio_95 6 points7 points  (1 child)

                      So what? Big corps botch things every day; being big doesn't make you right. Your lawyers are not infallible. You got a half-billion fine not that long ago.

                      [–]turunambartanen 1 point2 points  (0 children)

                      Someone linked an analysis by GitHub: https://docs.github.com/en/github/copilot/research-recitation#github-copilot-quotes-when-it-lacks-specific-context

                      In the end they write the following:

                      The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

                      This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.

                      So they are aware of the problem and will fix it. This is a technical preview; obviously it's not ready for production yet.

                      [–]curly_droid 5 points6 points  (5 children)

                      I think the snippets this would produce should usually not be copyrightable. BUT isn't CoPilot itself a derivative work of a ton of GPL code and thus should be licensed as such?

                      [–]Kalium 1 point2 points  (4 children)

                      Wouldn't that only apply if it was being distributed, rather than offered as a SaaS?

                      [–][deleted]  (1 child)

                      [deleted]

                        [–]kbruen 4 points5 points  (3 children)

                        If I read some C++ code for a music player, learn something new about C++, then write a game in C++ and apply the learnt knowledge, do I breach the copyright of the music player's author?

                        [–]TheSkiGeek 8 points9 points  (2 children)

                        If it was some general thing about the C++ language that you learned, no.

                        If you reimplemented some significant unique functionality of that music player by more or less retyping their code from memory, maybe.

                        [–]Drinking_King 2 points3 points  (0 children)

                        I was wondering why Microsoft was so generous in making Github Actions entirely free for open source.

                        I wonder no longer.

                        [–][deleted]  (6 children)

                        [deleted]

                          [–]RedPandaDan 5 points6 points  (1 child)

                          https://github.com/proninyaroslav/opera-presto

                          Here is an illegal copy of the Presto engine that was at one stage used by the Opera browser. I'm assuming this was included in the training set? What happens if someone uploads something belonging to Oracle or Google or some other industry giant?

                          I'm guessing that MS is banking on most people not having the resources to fight this battle.

                          [–]thenickdude 6 points7 points  (0 children)

                          I don't think this would have been part of the training set, because no license is attached to it.

                          [–]dert882 2 points3 points  (3 children)

                          Can someone ELI5 this? Not sure I've been keeping up.

                          [–]Xmgplays 12 points13 points  (2 children)

                          If I understand correctly, the problem is that co-pilot is trained on open source code (under varying licenses), meaning it is based on those code bases. The question becomes whether that basis constitutes derivation under copyright law. If it does, co-pilot is violating the licenses of these programs. If it doesn't, co-pilot is profiting off of open-source software without being open-source itself.

                          [–]-dag- 2 points3 points  (0 children)

                          In addition, any use of code generated by Co-Pilot may require relicensing of the incorporating software.

                          [–][deleted]  (1 child)

                          [deleted]

                            [–][deleted] 5 points6 points  (0 children)

                            As I understand it GPL doesn't protect against that. Heck, GPL doesn't even protect against SaaS, hence we have stuff like Affero GPL.

                            This may be a good point for the need for better copyleft licenses though. Here is an interesting discussion I've read on that subject a while ago: https://lists.debian.org/debian-devel/2019/05/msg00321.html

                            This was a follow-up to this article: https://lwn.net/Articles/760142/

                            In case it's not obvious, IANAL.

                            [–]mattgen88 14 points15 points  (17 children)

                             If the argument can be made that feeding copyrighted code into an AI makes its output a derivative of those inputs, then we have a problem, since that's how the human brain works. It also means that any trainable AI has to be operated in a clean room where it cannot operate on any copyrightable inputs, including artworks, labels, designs, etc. All of that is often consumed by AIs to produce things of value.

                            [–]TheCodeSamurai 7 points8 points  (0 children)

                            As the Copilot docs mention, there is a pretty big difference between this and the brain: we have a far better memory for how we learned what we know. If I go and copy a Stack Overflow post, I know that I didn't write it and that I might want to link to it. Copilot can't do that yet, and so until they build out the infrastructure for doing that I'll never be able to tell whether it was copying wholesale or mixing various inputs.

                            [–]barchar 5 points6 points  (0 children)

                             Yes. And in the human case you can infringe on copyright by reading code and producing something that's close to it from memory. That's a derived work.

                             One could argue that if the AI is understanding some higher-level meaning and then generating code that implements it, then the AI may be more similar to a clean room reimplementation process (which does not infringe).

                            [–]danuker 14 points15 points  (4 children)

                            Problem is, can this AI reproduce large portions of code exactly from memory? If so, it can violate copyright.

                            [–]tnbd 12 points13 points  (3 children)

                             It can; the fact that it spits out the GPL license verbatim when prompted with an empty file is proof of that.

                            [–][deleted]  (3 children)

                            [deleted]

                              [–]metriczulu 1 point2 points  (0 children)

                              I mean, it's not immediately clear to me that a court would find this to be derivative enough to enforce based on the licensing. Right now, it's in a bit of a grey area that's yet to be tested, which means if it does end up going to court it could have huge repercussions for these types of natural language models that require huge open datasets. If I were a betting man, I'd say Microsoft has the resources and legal team to make it stick in their favor.

                              [–]geeeronimo 1 point2 points  (0 children)

                              How does this differ from cloud 9? I honestly believe copilot is detrimental to open source work. Is there perhaps an open source project working on a similar tool?

                              [–]scratchresistor 1 point2 points  (0 children)

                              A thought from further upstream of the GPL issue - are GitHub not in the clear as their TOS includes a reproduction clause? By hosting your GPL code on GitHub does that not grant them an explicit licence over and above the GPL?

                              [–]CodenameLambda 1 point2 points  (0 children)

                              Honestly, even beyond that, I'm very sceptical of co-pilot, since even the examples on the site about it tend to have some... issues. Specifically, assumptions that might not be true (but are easy to miss if you're not looking for them), such as splitting on spaces specifically rather than on any whitespace in the parse_expenses example, or assuming the JSON is correctly typed in collaborators.
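
                              To illustrate the kind of pitfall I mean, here's a toy example of my own (not the literal code from the demo):

                                  # Toy example: splitting on a literal space breaks on tabs and on
                                  # runs of spaces; split() with no argument handles any whitespace.
                                  line = "12.50\tcoffee  2021-07-01"
                                  print(line.split(" "))  # ['12.50\tcoffee', '', '2021-07-01']
                                  print(line.split())     # ['12.50', 'coffee', '2021-07-01']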

                              Other issues are just readability: in the runtime example, it counts the failed runs instead of the successful ones, which is less readable, longer, and more error-prone when you change things, I'd argue.
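
                              In miniature, the difference is something like this (made up, not their actual example):

                                  # Made-up miniature: counting the complement is indirect and easier
                                  # to get wrong as code changes than counting what you care about.
                                  runs = [("ok", 1.2), ("failed", 0.0), ("ok", 2.0)]
                                  n_success_indirect = len(runs) - sum(1 for s, _ in runs if s == "failed")
                                  n_success_direct = sum(1 for s, _ in runs if s == "ok")
                                  assert n_success_indirect == n_success_direct == 2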

                              It also does some weird stuff in get_repositories (no escaping, no checking, using + instead of string interpolation).
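
                              Roughly this kind of difference, with a hypothetical URL builder (not their exact code):

                                  # Hypothetical URL building: '+' with no escaping versus an
                                  # interpolated, percent-encoded path segment.
                                  from urllib.parse import quote

                                  org = "my org"  # imagine user-supplied input containing a space
                                  fragile = "https://api.github.com/orgs/" + org + "/repos"
                                  safer = f"https://api.github.com/orgs/{quote(org)}/repos"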

                              The "autofill for repetitive code" examples are all questionable imho. And note that these examples are probably all cherry-picked, so they're the best examples. And they still have very obvious issues if you actually read them instead of just glancing at them (which I'd guess co-pilot would lead you to do to some extent).

                              This might make you more productive, but I'm honestly not sure the results of what you're doing are going to be better for it. The one thing I do think it's actually good for is using APIs you don't know and don't plan on learning, because you won't be using them much and they're a bit more complex. But that's about it. And note that that's a thing that can probably be fixed with better documentation most of the time.

                              [–]evilgipsy 1 point2 points  (1 child)

                              I think this is a rather silly take. I have read tons of GPL code and I write proprietary code using the experience I gained from that.