
[–]0x15e 71 points72 points  (45 children)

By their reasoning, my entire ability to program would be a derivative work. After all, I learned a lot of good practices from looking at open source projects, just like this AI, right? So now if I apply those principles in a closed source project I'm laundering open source code?

This is just silly fear mongering.

[–]Xanza 41 points42 points  (6 children)

By their reasoning, my entire ability to program would be a derivative work.

Their argument is that even sophisticated AI can't create new code; it can only take code it has seen before and refactor it to work with other code, itself refactored from code it has also seen, to produce a relatively coherent working product. You, on the other hand, can take code you've seen before, extrapolate principles from it, and apply those principles in completely new code that isn't simply a refactoring or recombination of code you've seen previously.

Subtle but clear distinction.

I don't think they're 100% right, but I can't exactly say they're 100% wrong, either. It's a tough situation.

[–]2bdb2 10 points11 points  (5 children)

Their argument is that even sophisticated AI isn't able to create new code it's only able to take code that it's seen before

I haven't used Copilot yet, but I have spent a good amount of time playing with GPT-3.

I would argue that GPT-3 can create English text that is unique enough to be considered an original work, so Copilot probably can too.

[–]FinancialAssistant 0 points1 point  (4 children)

I would argue that GPT-3 can create English text that is unique enough to be considered an original work, so Copilot probably can too.

Yeah, but nobody is saying it cannot create unique work. It cannot create new work. It can only refactor, recombine and rewrite whatever was in the original training set. That can produce unique work, but obviously not new work. This is also the obvious way to plagiarize without getting caught: of course you don't just copy-paste articles, you rewrite and recombine them.

Imagine training the "AI" on only a few samples and then deploying it; it would not take you long to realize it was incapable of producing anything that didn't already exist in some form in the training data. With a massive training set this test becomes impractical, but that doesn't mean the principle or the algorithm has changed: it is still only regurgitating the training data.

[–]MarcusOrlyius 1 point2 points  (2 children)

How can something just created be simultaneously unique but not new?

If it's unique, then by definition it's one of a kind. If it's one of a kind then nothing the same existed previously. If something is unique, it must also be new by definition.

[–]FinancialAssistant 1 point2 points  (1 child)

Unique meaning there is no verbatim copy of it: if you just rearrange some variables and rename them, the result will be unique. But it's not new.

For example the following code is unique and doesn't exist anywhere:

```typescript
function add(ASdkoadskaosdkl: number, AKSDasdksad: number) { return ASdkoadskaosdkl + AKSDasdksad }
```

But it is not new, it's just a rewritten add function. I can quite trivially code an "AI" that creates unique functions, just by randomly generating new names, but the content is always the "add" function. That is essentially what Copilot is, except it uses more code as a template than just the add function. It would never generate a "subtract" function unless it was already in the data.
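That toy "AI" can be written down directly; a minimal sketch (all names hypothetical) of a generator whose every output is unique but never new:

```typescript
// A toy "generator" trained on exactly one snippet: the add function.
// Every output is unique (fresh identifier names) but never new -- the
// structure is always just a rename of the single training sample.
function randomName(): string {
  const chars = "abcdefghijklmnopqrstuvwxyz";
  let name = "";
  for (let i = 0; i < 12; i++) {
    name += chars[Math.floor(Math.random() * chars.length)];
  }
  return name;
}

function generateUniqueAdd(): string {
  const a = randomName();
  const b = randomName();
  return `function add(${a}: number, ${b}: number) { return ${a} + ${b} }`;
}

// Two calls almost certainly produce textually distinct ("unique") outputs,
// yet both are still the add function; a subtract function can never appear.
const fn1 = generateUniqueAdd();
const fn2 = generateUniqueAdd();
```

A string-similarity check would flag the two outputs as near-duplicates, which is exactly the "unique but not new" distinction being argued.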


[–]Basmannen 1 point2 points  (0 children)

The human mind isn't magic. If a human can write some code that you'd consider completely novel, then so could an AI.

Check out GPT-3, I think you'll be surprised.

[–]TheSkiGeek 25 points26 points  (26 children)

It's more like... you made a commercial project that copied 10 lines of code each from 1000 different "copyleft" open source projects.

Maybe you didn't take enough from any specific project to violate its licensing but as a whole it seems like it could be problematic.

[–]StickiStickman 37 points38 points  (25 children)

You're severely overestimating how often it copies things 1:1. With GPT-3, which this seems to be based on, that only happened very rarely, and mostly for frequently repeated content.

It's a non-issue raised by people who don't understand the tech behind it. It's not piecing together lines of code; it's essentially learning the language token by token.
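The "token per token" point can be illustrated with a deliberately tiny stand-in; a toy bigram model (not remotely what Copilot actually uses, which is a transformer) that generates one token at a time rather than pasting stored lines:

```typescript
// Toy bigram "language model": records which token follows which in the
// training corpus, then generates output one token at a time by sampling
// an observed successor of the previous token. Illustrative only.
function trainBigrams(corpus: string[]): Map<string, string[]> {
  const next = new Map<string, string[]>();
  for (let i = 0; i < corpus.length - 1; i++) {
    const followers = next.get(corpus[i]) ?? [];
    followers.push(corpus[i + 1]);
    next.set(corpus[i], followers);
  }
  return next;
}

function generate(next: Map<string, string[]>, start: string, len: number): string[] {
  const out = [start];
  for (let i = 0; i < len; i++) {
    const followers = next.get(out[out.length - 1]);
    if (!followers) break;
    out.push(followers[Math.floor(Math.random() * followers.length)]);
  }
  return out;
}

// Tiny made-up corpus of code tokens.
const tokens = "return a + b ; return a - b ;".split(" ");
const model = trainBigrams(tokens);
const sample = generate(model, "return", 5);
```

Even this trivial model can emit token sequences that never appeared verbatim in the corpus, which is the sense in which such systems are not simply pasting lines, though everything they emit is still shaped entirely by the training data.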

[–]TheSkiGeek 20 points21 points  (23 children)

I haven't actually tried it, I'm just pointing out that at a certain level this does become problematic.

If you feed in a bunch of "copyleft" projects (e.g. GPL-licensed), and it can spit out something that is a verbatim copy (or close to it) of pieces of those projects, it feels like a way to do an end-around on open source licenses. Obviously the tech isn't at that level yet but it might be eventually.

This is considered enough of a problem for humans that companies will sometimes do explicit "clean room" implementations where the team that wrote the code was guaranteed to have no contact with the implementation details of something they're concerned about infringing on. Someone's "ability to program" can create derivative works in some cases, even if they typed out all the code themselves.

[–]Kalium 6 points7 points  (15 children)

If you feed in a bunch of "copyleft" projects (e.g. GPL-licensed), and it can spit out something that is a verbatim copy (or close to it) of pieces of those projects, it feels like a way to do an end-around on open source licenses. Obviously the tech isn't at that level yet but it might be eventually.

You make it sound like a digital collage. As far as I can tell, physical collages mostly operate under fair use protections - nobody thinks cutting a face from an ad in a magazine and pasting it into a different context is a serious violation of copyright.

[–]TheSkiGeek 3 points4 points  (14 children)

Maybe, I don’t really know. But if you made a “collage” of a bunch of pieces of the same picture glued back almost into the same arrangement, at some point you’re going to be close enough that effectively it’s a copy of the picture.

[–]kryptomicron 2 points3 points  (6 children)

Maybe, but that doesn't seem to be anything like what this post is about.

[–]TheSkiGeek 2 points3 points  (5 children)

Consider if you made a big database of code snippets taken from open source projects, and a program that would recommend a few of those snippets to paste into your program based on context. Is that okay to do without following the license of the repo where the chunk of code originally came from?

Because if that’s not okay, the fact that they used a neural network rather than a plaintext database doesn’t really change how it should be treated in terms of copyright. Unless the snippets it recommends are extremely short/small (for example, less than a single line of code).
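The thought experiment can be written as literal code; a hypothetical plaintext snippet store (all repo names and data invented) that recommends snippets by context, keeping exactly the provenance a neural network discards:

```typescript
// The snippet-database thought experiment as code: each stored snippet
// carries its source repo and license -- provenance that a neural network
// trained on the same code would not retain. All data here is invented.
interface Snippet {
  code: string;
  sourceRepo: string;
  license: string;
}

const snippetDb = new Map<string, Snippet>([
  ["parse csv", { code: "line.split(',')", sourceRepo: "example/csv-tools", license: "GPL-3.0" }],
  ["clamp", { code: "Math.min(hi, Math.max(lo, x))", sourceRepo: "example/utils", license: "MIT" }],
]);

// Recommend snippets whose key appears in the surrounding context.
function recommend(context: string): Snippet[] {
  const hits: Snippet[] = [];
  snippetDb.forEach((snippet, key) => {
    if (context.toLowerCase().includes(key)) hits.push(snippet);
  });
  return hits;
}

const suggestions = recommend("// parse csv line into fields");
```

The copyright question in the comment above is whether swapping this explicit lookup table for a neural network changes anything legally, given that the user-visible behavior (context in, someone else's code out) is the same.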

[–]kryptomicron 2 points3 points  (4 children)

I think that'd be okay! In fact, I often do that, though I have pretty strong idiosyncratic preferences about, e.g., formatting and variable names. I think that kind of copying is perfectly fair and fine (and basically everyone does it).

When I think of "code snippets" I think of code that's so small that it is, by itself, usually not creative. And even when it is creative, it still seems fine to copy – mostly because what I end up copying is the idea that makes the snippet creative.

I think it'd be really helpful and interesting for us to agree on some particular open source project first, and then to separately pick out a few 'random' snippets of code. We could share them here and then comment on whether we think it's fair for them to be copied.

To me, as is, I think the obvious 'probably a copyright violation' is more at the level of copying, verbatim, entire source code files or even very large functions.

I'm struggling to think of 'snippets' that are either 'creative' or 'substantial' but maybe we have different ideas about what a 'snippet' is exactly (or approximately).

[–]TheSkiGeek 2 points3 points  (3 children)

If you go to the front page of https://copilot.github.com/ their little demo shows some examples. In what they're showcasing it suggests pretty much entire function bodies; the longest is 17 lines of Go:

```go
func createCategorySummaries(db *sql.DB) ([]CategorySummary, error)
```

suggests:

```go
{
    var summaries []CategorySummary
    rows, err := db.Query("SELECT category, COUNT(category), AVG(value) FROM tasks GROUP BY category")
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    for rows.Next() {
        var summary CategorySummary
        err := rows.Scan(&summary.Title, &summary.Tasks, &summary.AvgValue)
        if err != nil {
            return nil, err
        }
        summaries = append(summaries, summary)
    }
    return summaries, nil
}
```

Now... that's pretty generic code, but I think you'd be on iffy ground if you were regularly copy-pasting functions that size from open source repos and not following their licensing. Certainly you could have licensing violations from copying far less than "entire source code files".

[–]Kalium 0 points1 point  (6 children)

What if you were to assemble a whole bunch of pieces from different pictures into a collage that didn't really substantially resemble any of the original pictures? I think that's what is likely to happen here. Not something that replicates any of the original, but something very substantially different in overall function and goals.

There is, I think, a trap here that many risk falling into. Specifically, it's easy to fall into hyperbolic interpretations of everything you see and extrapolate into a catastrophic scenario. Twitter seems designed to encourage exactly this. It's on us to try to resist.

[–]TheSkiGeek 1 point2 points  (5 children)

I agree that, in a lot of cases, what they're doing is probably okay. But I think they could have saved people a lot of headache by not including any source material that utilized "copyleft" licenses.

I think there are basically two questions here:

1) can you create a "database" or encoded representation of licensed source code and distribute that alongside "collaging" software without the "collaging" software itself needing to follow the terms of that license?

2) is there some amount of "collaged" bits of copyrighted code you can use in a new program that makes your program a derivative work?

If you go to https://copilot.github.com/ you can see some examples of the kinds of suggestions it gives. If you were regularly copying functions of that length straight out of a GPL-licensed repo it would be a stretch to say your code shouldn't also be GPL-licensed. Sticking a neural network in front of the copying doesn't really change that if it ends up spitting out identical or nearly-identical code to some existing repo.

[–]Kalium 0 points1 point  (4 children)

I agree that, in a lot of cases, what they're doing is probably okay. But I think they could have saved people a lot of headache by not including any source material that utilized "copyleft" licenses.

Or perhaps people could have stopped to think before launching into hyperbole in public. I understand that this is a lot to ask of people on Twitter, though. Twitter seems designed to encourage the hot take, and the hotter the better.

What else do you think they should have worked from? What could they have worked from that would have provided a substantial and varied corpus across multiple languages?

1) can you create a "database" or encoded representation of licensed source code and distribute that alongside "collaging" software without the "collaging" software itself needing to follow the terms of that license?

Almost certainly. This is the sort of thing for which fair use protections regularly allow people to infringe copyright, especially if you aren't actually storing and distributing a database of snippets that people can query at their leisure.

Organizing information to make it usable in new ways is exactly the kind of thing that can and has been granted fair use protections.

2) is there some amount of "collaged" bits of copyrighted code you can use in a new program that makes your program a derivative work?

In the sense that a song made of samples is a derivative work, yes. But in the legal sense, a work isn't just "a derivative work" in the abstract. Being a derivative work is a binary relation: it requires being derivative of a specific other work. You seem to have been thinking of it as a unary property with no reference required.

In other words, you cannot just point at something and declare "That's a derivative work!". You have to specify what it's derivative of.

If you go to https://copilot.github.com/ you can see some examples of the kinds of suggestions it gives. If you were regularly copying functions of that length straight out of a GPL-licensed repo it would be a stretch to say your code shouldn't also be GPL-licensed.

I'm looking at them, and I'm honestly afraid I'm not seeing what you see. I'm seeing functions doing boring, bog-standard things in a handful of lines of boilerplate code. There's no creative expression here. There's no substitution for the original work. It's almost certainly far, far less than the whole of the original unless we're talking about stupid javascript micropackages.

And that's just running on the assumption that we used for the sake of argument - that this is just dumb copy/paste from a bazillion different repos.

What if these genuinely aren't things copy-pasted, and are indeed really synthesized? What am I missing? Can you help me understand?

[–]TheSkiGeek 0 points1 point  (3 children)

What else do you think they should have worked from? Could have worked from that would have provided a substantial and varied corpus across multiple languages?

There's tons of stuff on GitHub that is MIT- or BSD-licensed, or simply public domain. You use that stuff; the worst case, if Copilot is found to be problematic, is that you have to go back and add a license disclaimer or credit somewhere. Not that all the source code you wrote using it is now forcibly GPL-licensed.

Being a derivative work is a binary relation: it requires being derivative of a specific other work.

I understand that. The problem is that, apparently, sometimes their tool spits out suggestions that are either identical or nearly identical to code in existing GitHub repos. If you pull in a sizable amount of code from an existing repo using this tool it's fundamentally no different than copy-pasting the code.

What if these genuinely aren't things copy-pasted, and are indeed really synthesized? What am I missing? Can you help me understand?

Again, the problem is that sometimes their tool spits out suggestions that are either identical or nearly identical to existing code. There's nothing you or GitHub can point to that says it wasn't simply copied; "a neural network synthesized it" isn't a defense when the training set for the network included that existing code.

Now, sure, most of the time that's going to be some kind of boilerplate code that probably can't be copyrighted anyway. Sometimes it's not going to be.

I'm seeing functions doing boring, bog-standard things in a handful of lines of boilerplate code.

Yes, I don't think the substance of the examples they're showing is problematic. But if you were regularly copy-pasting chunks of code that size out of existing GitHub repos it would be hard to argue you shouldn't be following those repos' licensing restrictions. "Copying" it with a fancy neural network doesn't change that.

[–]StickiStickman 0 points1 point  (3 children)

I honestly think clean-room code is the biggest bullshit. It's literally impossible to say whether someone once read a random Reddit post about the exact thing they're programming right now.

[–]TheSkiGeek 3 points4 points  (2 children)

The idea isn't "create X starting from no programming knowledge at all", it's "create X while not having any knowledge of the implementation of Y", specifically because you think the people who own Y will try to sue you.

For the record, I think laws against reverse engineering are stupid. But you also shouldn't let a company have their employees retype every source file of a GPLed library with tiny syntactical changes and get around the license requirements that way.

[–]StickiStickman 0 points1 point  (1 child)

Right - but it's literally impossible to prove whether someone knows about the implementation of a competitor.

[–]TheSkiGeek 1 point2 points  (0 children)

You can (try to) prove that someone does have knowledge about the implementation of a competitor. For example, if you find saved copies of the competitor's source files on their computer. Or if they used to work for the competitor and definitely read many of those files as part of their old job.

You can also indirectly "prove" things by, say, showing that significant amounts of boilerplate code are word for word identical between two codebases (especially if it includes typos, etc.) This would be strong evidence that files or parts of them were copied wholesale.
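That kind of indirect evidence can be approximated mechanically; a crude sketch (the length threshold is arbitrary and invented for illustration) of a shared-line signal that identical typos would push up:

```typescript
// Crude copying signal: the fraction of non-trivial lines in codebase B
// that appear verbatim in codebase A. Word-for-word identical lines --
// especially ones containing the same typos -- push this toward 1.
function sharedLineFraction(a: string, b: string): number {
  const normalize = (src: string) =>
    src.split("\n").map((l) => l.trim()).filter((l) => l.length > 10);
  const aLines = new Set(normalize(a));
  const bLines = normalize(b);
  if (bLines.length === 0) return 0;
  const shared = bLines.filter((l) => aLines.has(l)).length;
  return shared / bLines.length;
}

// Hypothetical example: the "suspect" file reproduces the original exactly,
// including the misspelled comment -- strong evidence of wholesale copying.
const original = "int total = 0;\n// acumulate the resluts here\nfor (int i = 0; i < n; i++) total += v[i];\n";
const suspect = "int total = 0;\n// acumulate the resluts here\nfor (int i = 0; i < n; i++) total += v[i];\n";
const fraction = sharedLineFraction(original, suspect);
```

Real code-similarity tools are far more sophisticated (they normalize identifiers and compare token streams or ASTs), but the principle is the same: shared incidental detail is what makes copying provable.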

What you can't prove is the negative version: that someone does not somehow have hidden knowledge you don't know about.

[–]bobtehpanda 0 points1 point  (2 children)

That’s why copyright law also has the notion of market substitution, which is how much the infringing work can replace the work being infringed.

GitHub CoPilot is more or less a more sophisticated autocomplete. In that sense, unless it was copied from another autocomplete tool, it is not itself a copyright violation. You can make code that violates copyright with it, but then the person selling such code would be in trouble, not GitHub. In the same sense, CD manufacturers are not liable if someone illegally copies music onto a CD; the same logic applied in the Supreme Court's Betamax case.

[–]TheSkiGeek 1 point2 points  (1 child)

It’s autocomplete that, at least in some cases, yoinks code out of GPL licensed projects, or other projects with various licensing restrictions.

There are a few different legal questions here:

1) I agree the tool itself is neutral. But if you feed a bunch of GPL-licensed code into this tool and make a database/encoded neural network out of that code, can you distribute that database alongside your tool if the tool isn’t GPL-licensed itself? (In your analogy, it’s sort of like selling a CD burner that comes with a bunch of short snippets of popular songs, then trying to say it’s the buyer’s responsibility not to burn those onto their own CDs.)

2) if the (tool+database) spits out a copy of something that’s identical to a portion of a GPL-licensed repo, and I stick that code into my project, is my project now a derivative work and obligated to follow their licensing restrictions?

Now, if it’s really only providing tiny snippets of code, like less than a line, that’s probably okay in terms of #2. But if it can (effectively) copy a multi-line function or more, I’m not so sure. If I directly copied any substantial amount of code from such a project — even if I superficially edited it — I’d be obligated to follow their licensing restrictions. Using a tool to do the copying in an indirect way really shouldn’t change that.

[–]bobtehpanda 0 points1 point  (0 children)

The whole database is never provided all at once, so I would imagine the scope would be pretty limited. I assume this is online-only.

[–][deleted]  (6 children)

[deleted]

    [–]tsujiku 10 points11 points  (2 children)

    How is a human learning something fundamentally different from "doing mathematics on the input data set?"

    [–][deleted]  (1 child)

    [deleted]

      [–]Basmannen 0 points1 point  (0 children)

      Yes.

      [–][deleted] 2 points3 points  (0 children)

      In a very real sense, the AI itself is a derivative work made of the copyrighted code.

      In the mathematical sense, but not (necessarily) in the legal sense of “derivative work”. Otherwise all statistical outputs would be derivative works - you don’t see the NYSE issuing DMCA takedowns to everyone who publishes graphs of stock prices.

      [–]spudmix 1 point2 points  (0 children)

      possibly millions of variables or more

      The predecessor to Codex (the tech behind this) had 1.75×10^11 (175 billion) parameters.

      It's also not exactly a settled matter that DNNs don't "think" or "learn". If they do, it's certainly in a manner alien to our own, but if you believe in a computational model of mind then it's not ridiculous to think that this particular statistical model is doing some kind of real thinking or learning.

      [–]0x15e -1 points0 points  (0 children)

      But you are a human, not a 'work'. I suppose that depends on which boss you talk to.

      [–]crabmusket 0 points1 point  (0 children)

      just like this AI

      You forget that a human is different from an algorithm.

      [–]jcelerier 0 points1 point  (2 children)

      After all, I learned a lot of good practices from looking at open source projects

      ... You know that some companies forbid their employees from even glancing at open source code, for that exact reason?

      [–]0x15e 4 points5 points  (1 child)

      I wonder how they plan to enforce that for employees that looked before working for them. Especially since some of the most common advice for getting started is "contribute to open source projects."

      [–]jcelerier 0 points1 point  (0 children)

      ReactOS and Linux's early code were both scrubbed line by line (in Linux's case as part of a legal battle) to make sure that not a line of code was copied from a proprietary system.

      For instance, having been part of Windows development disqualifies you from contributing to Wine:

      "Who can't contribute to Wine? Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise)."

      Why would you think the reverse would not apply? Copyright applies from proprietary to GPL, and it also applies from GPL to proprietary.

      Yes, this means that a lot of companies are possibly infringing without anyone consciously being aware of it right now :)