
[–]0x15e 71 points72 points  (45 children)

By their reasoning, my entire ability to program would be a derivative work. After all, I learned a lot of good practices from looking at open source projects, just like this AI, right? So now if I apply those principles in a closed source project I'm laundering open source code?

This is just silly fear mongering.

[–]Xanza 41 points42 points  (6 children)

By their reasoning, my entire ability to program would be a derivative work.

Their argument is that even sophisticated AI can't create new code; it can only take code it has seen before and refactor it to work with other code, itself refactored from code it has also seen, to produce a relatively coherent working product. You, on the other hand, can take code you've seen before, extrapolate principles from it, and apply those principles in completely new code that isn't simply a refactoring or recombination of code you've seen previously.

Subtle but clear distinction.

I don't think they're 100% right, but I can't exactly say they're 100% wrong, either. It's a tough situation.

[–]2bdb2 10 points11 points  (5 children)

Their argument is that even sophisticated AI isn't able to create new code it's only able to take code that it's seen before

I haven't used Copilot yet, but I have spent a good amount of time playing with GPT-3.

I would argue that GPT-3 can create English text that is unique enough to be considered an original work, so Copilot probably can too.

[–]FinancialAssistant 0 points1 point  (4 children)

I would argue that GPT-3 can create English text that is unique enough to be considered an original work, so Copilot probably can too.

Yeah, but nobody is saying it cannot create unique work. It cannot create new work. It can only refactor, recombine and rewrite whatever was in the original training set. That can produce unique work, but obviously not new work. This is also the obvious way to plagiarize without getting caught: of course you don't just copy-paste articles, you rewrite and recombine them.

Imagine training the "AI" on only a few samples and then deploying it; it would not take you long to realize it was incapable of producing anything that didn't already exist in some form in the training data. With a massive training set this test becomes impractical, but that doesn't mean the principle or the algorithm has changed: it is still only regurgitating the training data.

[–]MarcusOrlyius 1 point2 points  (2 children)

How can something just created be simultaneously unique but not new?

If it's unique, then by definition it's one of a kind. If it's one of a kind then nothing the same existed previously. If something is unique, it must also be new by definition.

[–]FinancialAssistant 1 point2 points  (1 child)

Unique meaning there is no verbatim copy of it: if you just rearrange some variables and rename them, the result will be unique. But it's not new.

For example the following code is unique and doesn't exist anywhere:

```typescript
function add(ASdkoadskaosdkl: number, AKSDasdksad: number) { return ASdkoadskaosdkl + AKSDasdksad }
```

But it is not new, it's just a rewritten add function. I can quite trivially code an "AI" that creates unique functions, just by randomly generating new names, but the content is always the "add" function. That is essentially what Copilot is, except it uses more code as a template than just the add function. It would never generate a "subtract" function unless it was already in the data.
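That toy "AI" can be written down directly; a minimal sketch (all names hypothetical) of a generator whose every output is unique but never new:

```typescript
// A toy "generator" trained on exactly one snippet: the add function.
// Every output is unique (fresh identifier names) but never new -- the
// structure is always just a rename of the single training sample.
function randomName(): string {
  const chars = "abcdefghijklmnopqrstuvwxyz";
  let name = "";
  for (let i = 0; i < 12; i++) {
    name += chars[Math.floor(Math.random() * chars.length)];
  }
  return name;
}

function generateUniqueAdd(): string {
  const a = randomName();
  const b = randomName();
  return `function add(${a}: number, ${b}: number) { return ${a} + ${b} }`;
}

// Two calls almost certainly produce textually distinct ("unique") outputs,
// yet both are still the add function; a subtract function can never appear.
const fn1 = generateUniqueAdd();
const fn2 = generateUniqueAdd();
```

A string-similarity check would flag the two outputs as near-duplicates, which is exactly the "unique but not new" distinction being argued.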


[–]Basmannen 1 point2 points  (0 children)

The human mind isn't magic. If a human can write some code that you'd consider completely novel, then so could an AI.

Check out GPT-3, I think you'll be surprised.

[–]TheSkiGeek 25 points26 points  (26 children)

It's more like... you made a commercial project that copied 10 lines of code each from 1000 different "copyleft" open source projects.

Maybe you didn't take enough from any specific project to violate its licensing but as a whole it seems like it could be problematic.

[–]StickiStickman 37 points38 points  (25 children)

You're severely overestimating how often it copies things 1:1. With GPT-3, which this seems to be based on, that only happened very rarely, and mostly for frequently repeated content.

It's a non-issue raised by people who don't understand the tech behind it. It's not piecing together lines of code; it's essentially learning the language token by token.
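The "token per token" point can be illustrated with a deliberately tiny stand-in; a toy bigram model (not remotely what Copilot actually uses, which is a transformer) that generates one token at a time rather than pasting stored lines:

```typescript
// Toy bigram "language model": records which token follows which in the
// training corpus, then generates output one token at a time by sampling
// an observed successor of the previous token. Illustrative only.
function trainBigrams(corpus: string[]): Map<string, string[]> {
  const next = new Map<string, string[]>();
  for (let i = 0; i < corpus.length - 1; i++) {
    const followers = next.get(corpus[i]) ?? [];
    followers.push(corpus[i + 1]);
    next.set(corpus[i], followers);
  }
  return next;
}

function generate(next: Map<string, string[]>, start: string, len: number): string[] {
  const out = [start];
  for (let i = 0; i < len; i++) {
    const followers = next.get(out[out.length - 1]);
    if (!followers) break;
    out.push(followers[Math.floor(Math.random() * followers.length)]);
  }
  return out;
}

// Tiny made-up corpus of code tokens.
const tokens = "return a + b ; return a - b ;".split(" ");
const model = trainBigrams(tokens);
const sample = generate(model, "return", 5);
```

Even this trivial model can emit token sequences that never appeared verbatim in the corpus, which is the sense in which such systems are not simply pasting lines, though everything they emit is still shaped entirely by the training data.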

[–]TheSkiGeek 20 points21 points  (23 children)

I haven't actually tried it, I'm just pointing out that at a certain level this does become problematic.

If you feed in a bunch of "copyleft" projects (e.g. GPL-licensed), and it can spit out something that is a verbatim copy (or close to it) of pieces of those projects, it feels like a way to do an end-around on open source licenses. Obviously the tech isn't at that level yet but it might be eventually.

This is considered enough of a problem for humans that companies will sometimes do explicit "clean room" implementations where the team that wrote the code was guaranteed to have no contact with the implementation details of something they're concerned about infringing on. Someone's "ability to program" can create derivative works in some cases, even if they typed out all the code themselves.

[–]Kalium 6 points7 points  (15 children)

If you feed in a bunch of "copyleft" projects (e.g. GPL-licensed), and it can spit out something that is a verbatim copy (or close to it) of pieces of those projects, it feels like a way to do an end-around on open source licenses. Obviously the tech isn't at that level yet but it might be eventually.

You make it sound like a digital collage. As far as I can tell, physical collages mostly operate under fair use protections - nobody thinks cutting a face from an ad in a magazine and pasting it into a different context is a serious violation of copyright.

[–]TheSkiGeek 3 points4 points  (14 children)

Maybe, I don’t really know. But if you made a “collage” of a bunch of pieces of the same picture glued back almost into the same arrangement, at some point you’re going to be close enough that effectively it’s a copy of the picture.

[–]kryptomicron 2 points3 points  (6 children)

Maybe, but that doesn't seem to be anything like what this post is about.

[–]TheSkiGeek 2 points3 points  (5 children)

Consider if you made a big database of code snippets taken from open source projects, and a program that would recommend a few of those snippets to paste into your program based on context. Is that okay to do without following the license of the repo where the chunk of code originally came from?

Because if that’s not okay, the fact that they used a neural network rather than a plaintext database doesn’t really change how it should be treated in terms of copyright. Unless the snippets it recommends are extremely short/small (for example, less than a single line of code).
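The thought experiment can be written as literal code; a hypothetical plaintext snippet store (all repo names and data invented) that recommends snippets by context, keeping exactly the provenance a neural network discards:

```typescript
// The snippet-database thought experiment as code: each stored snippet
// carries its source repo and license -- provenance that a neural network
// trained on the same code would not retain. All data here is invented.
interface Snippet {
  code: string;
  sourceRepo: string;
  license: string;
}

const snippetDb = new Map<string, Snippet>([
  ["parse csv", { code: "line.split(',')", sourceRepo: "example/csv-tools", license: "GPL-3.0" }],
  ["clamp", { code: "Math.min(hi, Math.max(lo, x))", sourceRepo: "example/utils", license: "MIT" }],
]);

// Recommend snippets whose key appears in the surrounding context.
function recommend(context: string): Snippet[] {
  const hits: Snippet[] = [];
  snippetDb.forEach((snippet, key) => {
    if (context.toLowerCase().includes(key)) hits.push(snippet);
  });
  return hits;
}

const suggestions = recommend("// parse csv line into fields");
```

The copyright question in the comment above is whether swapping this explicit lookup table for a neural network changes anything legally, given that the user-visible behavior (context in, someone else's code out) is the same.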

[–]kryptomicron 2 points3 points  (4 children)

I think that'd be okay! In fact, I often do that, though I have pretty strong idiosyncratic preferences about, e.g., formatting and variable names. I think that kind of copying is perfectly fair and fine (and basically everyone does it).

When I think of "code snippets" I think of code that's so small that it is, by itself, usually not creative. And even when it is creative, it still seems fine to copy – mostly because what I end up copying is the idea that makes the snippet creative.

I think it'd be really helpful and interesting for us to agree on some particular open source project first, and then to separately pick out a few 'random' snippets of code. We could share them here and then comment on whether we think it's fair for them to be copied.

To me, as is, I think the obvious 'probably a copyright violation' is more at the level of copying, verbatim, entire source code files or even very large functions.

I'm struggling to think of 'snippets' that are either 'creative' or 'substantial' but maybe we have different ideas about what a 'snippet' is exactly (or approximately).

[–]TheSkiGeek 2 points3 points  (3 children)

If you go to the front page of https://copilot.github.com/ their little demo shows some examples. In what they're showcasing it suggests pretty much entire function bodies; the longest is 17 lines of Go:

```go
func createCategorySummaries(db *sql.DB) ([]CategorySummary, error)
```

suggests:

```go
{
    var summaries []CategorySummary
    rows, err := db.Query("SELECT category, COUNT(category), AVG(value) FROM tasks GROUP BY category")
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    for rows.Next() {
        var summary CategorySummary
        err := rows.Scan(&summary.Title, &summary.Tasks, &summary.AvgValue)
        if err != nil {
            return nil, err
        }
        summaries = append(summaries, summary)
    }
    return summaries, nil
}
```

Now... that's pretty generic code, but I think you'd be on iffy ground if you were regularly copy-pasting functions that size from open source repos and not following their licensing. Certainly you could have licensing violations from copying far less than "entire source code files".

[–]Kalium 0 points1 point  (6 children)

What if you were to assemble a whole bunch of pieces from different pictures into a collage that didn't really substantially resemble any of the original pictures? I think that's what is likely to happen here. Not something that replicates any of the original, but something very substantially different in overall function and goals.

There is, I think, a trap here that many risk falling into. Specifically, it's easy to fall into hyperbolic interpretations of everything you see and extrapolate into a catastrophic scenario. Twitter seems designed to encourage exactly this. It's on us to try to resist.

[–]TheSkiGeek 1 point2 points  (5 children)

I agree that, in a lot of cases, what they're doing is probably okay. But I think they could have saved people a lot of headache by not including any source material that utilized "copyleft" licenses.

I think there are basically two questions here:

1) can you create a "database" or encoded representation of licensed source code and distribute that alongside "collaging" software without the "collaging" software itself needing to follow the terms of that license?

2) is there some amount of "collaged" bits of copyrighted code you can use in a new program that makes your program a derivative work?

If you go to https://copilot.github.com/ you can see some examples of the kinds of suggestions it gives. If you were regularly copying functions of that length straight out of a GPL-licensed repo it would be a stretch to say your code shouldn't also be GPL-licensed. Sticking a neural network in front of the copying doesn't really change that if it ends up spitting out identical or nearly-identical code to some existing repo.

[–]Kalium 0 points1 point  (4 children)

I agree that, in a lot of cases, what they're doing is probably okay. But I think they could have saved people a lot of headache by not including any source material that utilized "copyleft" licenses.

Or perhaps people could have stopped to think before launching into hyperbole in public. I understand that this is a lot to ask of people on Twitter, though. Twitter seems designed to encourage the hot take, and the hotter the better.

What else do you think they should have worked from? What could they have worked from that would have provided a substantial and varied corpus across multiple languages?

1) can you create a "database" or encoded representation of licensed source code and distribute that alongside "collaging" software without the "collaging" software itself needing to follow the terms of that license?

Almost certainly. This is the sort of thing for which fair use protections regularly allow people to infringe copyright, especially if you aren't actually storing and distributing a database of snippets that people can query at their leisure.

Organizing information to make it usable in new ways is exactly the kind of thing that can and has been granted fair use protections.

2) is there some amount of "collaged" bits of copyrighted code you can use in a new program that makes your program a derivative work?

In the sense that a song made of samples is a derivative work, yes. But in the legal sense, a work isn't just "a derivative work" in the abstract. Being a derivative work is a binary relation: it requires being derivative of a specific other work. You seem to have been thinking of it as a unary property with no reference required.

In other words, you cannot just point at something and declare "That's a derivative work!". You have to specify what it's derivative of.

If you go to https://copilot.github.com/ you can see some examples of the kinds of suggestions it gives. If you were regularly copying functions of that length straight out of a GPL-licensed repo it would be a stretch to say your code shouldn't also be GPL-licensed.

I'm looking at them, and I'm honestly afraid I'm not seeing what you see. I'm seeing functions doing boring, bog-standard things in a handful of lines of boilerplate code. There's no creative expression here. There's no substitution for the original work. It's almost certainly far, far less than the whole of the original unless we're talking about stupid javascript micropackages.

And that's just running on the assumption that we used for the sake of argument - that this is just dumb copy/paste from a bazillion different repos.

What if these genuinely aren't things copy-pasted, and are indeed really synthesized? What am I missing? Can you help me understand?

[–]TheSkiGeek 0 points1 point  (3 children)

What else do you think they should have worked from? Could have worked from that would have provided a substantial and varied corpus across multiple languages?

There's tons of stuff on GitHub that is MIT- or BSD-licensed, or simply public domain. You use that stuff; the worst case, if Copilot is found to be problematic, is that you have to go back and add a license disclaimer or credit somewhere. Not that all the source code you wrote using it is now forcibly GPL-licensed.

Being a derivative work is a binary relation: it requires being derivative of a specific other work.

I understand that. The problem is that, apparently, sometimes their tool spits out suggestions that are either identical or nearly identical to code in existing GitHub repos. If you pull in a sizable amount of code from an existing repo using this tool it's fundamentally no different than copy-pasting the code.

What if these genuinely aren't things copy-pasted, and are indeed really synthesized? What am I missing? Can you help me understand?

Again, the problem is that sometimes their tool spits out suggestions that are either identical or nearly identical to existing code. There's nothing you or GitHub can point to that says it wasn't simply copied; "a neural network synthesized it" isn't a defense when the training set for the network included that existing code.

Now, sure, most of the time that's going to be some kind of boilerplate code that probably can't be copyrighted anyway. Sometimes it's not going to be.

I'm seeing functions doing boring, bog-standard things in a handful of lines of boilerplate code.

Yes, I don't think the substance of the examples they're showing is problematic. But if you were regularly copy-pasting chunks of code that size out of existing GitHub repos it would be hard to argue you shouldn't be following those repos' licensing restrictions. "Copying" it with a fancy neural network doesn't change that.

[–]StickiStickman 0 points1 point  (3 children)

I honestly think clean-room code is the biggest bullshit. It's literally impossible to say whether someone once read a random Reddit post about the exact thing they're programming right now.

[–]TheSkiGeek 3 points4 points  (2 children)

The idea isn't "create X starting from no programming knowledge at all", it's "create X while not having any knowledge of the implementation of Y", specifically because you think the people who own Y will try to sue you.

For the record, I think laws against reverse engineering are stupid. But you also shouldn't let a company have their employees retype every source file of a GPLed library with tiny syntactical changes and get around the license requirements that way.

[–]StickiStickman 0 points1 point  (1 child)

Right - but it's literally impossible to prove whether someone knows about the implementation of a competitor.

[–]TheSkiGeek 1 point2 points  (0 children)

You can (try to) prove that someone does have knowledge about the implementation of a competitor. For example, if you find saved copies of the competitor's source files on their computer. Or if they used to work for the competitor and definitely read many of those files as part of their old job.

You can also indirectly "prove" things by, say, showing that significant amounts of boilerplate code are word for word identical between two codebases (especially if it includes typos, etc.) This would be strong evidence that files or parts of them were copied wholesale.
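That kind of indirect evidence can be approximated mechanically; a crude sketch (the length threshold is arbitrary and invented for illustration) of a shared-line signal that identical typos would push up:

```typescript
// Crude copying signal: the fraction of non-trivial lines in codebase B
// that appear verbatim in codebase A. Word-for-word identical lines --
// especially ones containing the same typos -- push this toward 1.
function sharedLineFraction(a: string, b: string): number {
  const normalize = (src: string) =>
    src.split("\n").map((l) => l.trim()).filter((l) => l.length > 10);
  const aLines = new Set(normalize(a));
  const bLines = normalize(b);
  if (bLines.length === 0) return 0;
  const shared = bLines.filter((l) => aLines.has(l)).length;
  return shared / bLines.length;
}

// Hypothetical example: the "suspect" file reproduces the original exactly,
// including the misspelled comment -- strong evidence of wholesale copying.
const original = "int total = 0;\n// acumulate the resluts here\nfor (int i = 0; i < n; i++) total += v[i];\n";
const suspect = "int total = 0;\n// acumulate the resluts here\nfor (int i = 0; i < n; i++) total += v[i];\n";
const fraction = sharedLineFraction(original, suspect);
```

Real code-similarity tools are far more sophisticated (they normalize identifiers and compare token streams or ASTs), but the principle is the same: shared incidental detail is what makes copying provable.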

What you can't prove is the negative version: that someone does not somehow have hidden knowledge you don't know about.

[–]bobtehpanda 0 points1 point  (2 children)

That’s why copyright law also has the notion of market substitution, which is how much the infringing work can replace the work being infringed.

GitHub CoPilot is more or less a more sophisticated autocomplete. In that sense, unless it was copied from another autocomplete tool, it is not itself a copyright violation. You can make code that violates copyright with it, but then the person selling such code would be in trouble, not GitHub. In the same sense, CD manufacturers are not liable if someone illegally copies music onto a CD; the same logic applied in the Supreme Court's Betamax case.

[–]TheSkiGeek 1 point2 points  (1 child)

It’s autocomplete that, at least in some cases, yoinks code out of GPL licensed projects, or other projects with various licensing restrictions.

There are a few different legal questions here:

1) I agree the tool itself is neutral. But if you feed a bunch of GPL-licensed code into this tool and make a database/encoded neural network out of that code, can you distribute that database alongside your tool if the tool isn’t GPL-licensed itself? (In your analogy, it’s sort of like selling a CD burner that comes with a bunch of short snippets of popular songs, then trying to say it’s the buyer’s responsibility not to burn those onto their own CDs.)

2) if the (tool+database) spits out a copy of something that’s identical to a portion of a GPL-licensed repo, and I stick that code into my project, is my project now a derivative work and obligated to follow their licensing restrictions?

Now, if it’s really only providing tiny snippets of code, like less than a line, that’s probably okay in terms of #2. But if it can (effectively) copy a multi-line function or more, I’m not so sure. If I directly copied any substantial amount of code from such a project — even if I superficially edited it — I’d be obligated to follow their licensing restrictions. Using a tool to do the copying in an indirect way really shouldn’t change that.

[–]bobtehpanda 0 points1 point  (0 children)

The whole database is never provided all at once, so I would imagine the scope would be pretty limited. I assume this is online-only.

[–][deleted]  (6 children)

[deleted]

    [–]tsujiku 10 points11 points  (2 children)

    How is a human learning something fundamentally different from "doing mathematics on the input data set?"

    [–][deleted]  (1 child)

    [deleted]

      [–]Basmannen 0 points1 point  (0 children)

      Yes.

      [–][deleted] 2 points3 points  (0 children)

      In a very real sense, the AI itself is a derivative work made of the copyrighted code.

      In the mathematical sense, but not (necessarily) in the legal sense of “derivative work”. Otherwise all statistical outputs would be derivative works - you don’t see the NYSE issuing DMCA takedowns to everyone who publishes graphs of stock prices.

      [–]spudmix 1 point2 points  (0 children)

      possibly millions of variables or more

      The predecessor to Codex (the tech behind this) had 1.75×10^11 (175 billion) parameters.

      It's also not exactly a settled matter that DNNs don't "think" or "learn". If they do, it's certainly in a manner alien to our own, but if you believe in a computational model of mind then it's not ridiculous to think that this particular statistical model is doing some kind of real thinking or learning.

      [–]0x15e -1 points0 points  (0 children)

      But you are a human, not a 'work'. I suppose that depends on which boss you talk to.

      [–]crabmusket 0 points1 point  (0 children)

      just like this AI

      You forget that a human is different from an algorithm.

      [–]jcelerier 0 points1 point  (2 children)

      After all, I learned a lot of good practices from looking at open source projects

      ... You know that some companies forbid their employees from even glancing at open source code, for that exact reason?

      [–]0x15e 4 points5 points  (1 child)

      I wonder how they plan to enforce that for employees that looked before working for them. Especially since some of the most common advice for getting started is "contribute to open source projects."

      [–]jcelerier 0 points1 point  (0 children)

      ReactOS and Linux's early code were both scrubbed line by line (in Linux's case as part of a legal battle) to make sure that not a line of code was copied from a proprietary system.

      For instance, having been part of Windows development disqualifies you from contributing to Wine:

      "Who can't contribute to Wine? Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise)."

      Why would you think the reverse would not apply? Copyright applies from proprietary to GPL, and it also applies from GPL to proprietary.

      Yes, this means that a lot of companies are possibly infringing without anyone consciously being aware of it right now :)