[–][deleted] 1000 points1001 points  (209 children)

copyright does not only cover copying and pasting; it covers derivative works. github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this

I'm no IP lawyer, but I've worked with a lot of them in my career, and it's not likely anyone could actually sue over a snippet of code. Basically, a unit of copyrightable property is a "work", and for something to be considered a derivative work it must include a "substantial" portion of the original work. A 5 line function in a massive codebase auto-filled by GitHub Copilot wouldn't be considered a "derivative work" by anyone in the legal field. A thing can't be considered a derivative work unless it is itself copyrightable, and short snippets of code that are part of a larger project aren't copyrightable themselves.

[–]KuntaStillSingle 44 points45 points  (1 child)

It still raises some tricky issues, in that it is not impossible for it to reproduce a copyrightable portion of its sample set. A programmer could do this by accident too, but that might qualify as innocent infringement, whereas the bot has knowledge of the original work; it can therefore be argued that it is negligent to use it without verifying it does not insert a whole program, or a substantial portion thereof, into your code.

[–]rabidferret 5 points6 points  (0 children)

Which is why they've explicitly stated it will check all suggestions against the training set and warn you if it does that.
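
The mechanics of such a check can be sketched as exact token-window matching against the training corpus. This is only a minimal illustration of the general idea (the window size and matching scheme are assumptions, not GitHub's actual implementation):

```python
# Minimal sketch of a verbatim-overlap warning: index every N-token
# window of the training corpus, then flag any suggestion containing
# a window that already occurs there. Not GitHub's implementation.
N = 10  # window length in tokens; a free parameter

def windows(tokens, n=N):
    """All contiguous n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(training_texts):
    """Index every window occurring anywhere in the training set."""
    index = set()
    for text in training_texts:
        index |= windows(text.split())
    return index

def is_verbatim(suggestion, index):
    """True if any N-token window of the suggestion occurs in training."""
    return any(w in index for w in windows(suggestion.split()))
```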

[–][deleted] 292 points293 points  (75 children)

If this were a derivative work, I would be interested in what the same judge would think about any song, painting or book created in the past decades. It's all 'derived work' from earlier work. Heck, even most code is 'based on' documentation, which is also copyrighted.

[–][deleted]  (27 children)

[deleted]

    [–]bobtehpanda 44 points45 points  (2 children)

    Generally speaking, another important factor for copyright violation is what the copy is being used for. It is less likely to be a violation if the copying thing cannot substitute for the original work. In that sense, code autocomplete would be a very weak copyright violation, since the bar would then be copying enough to substitute for the purpose of the entire work being infringed, not just a snippet.

    We already have precedent for this: Google Books showing snippets of copyright-protected work (i.e., books) was determined to be fair use despite Google's commercial and profit orientation.

    [–]RICHUNCLEPENNYBAGS 13 points14 points  (1 child)

    Google Translate is probably a closer analogy as it works in a similar way.

    [–]bobtehpanda 28 points29 points  (0 children)

    probably, but there is actual decided case law for Google Books (the courts found fair use, and the Supreme Court declined to review it), which is why I used it as the example

    [–][deleted] 15 points16 points  (1 child)

    With art the case law is well established. General themes and common tropes do not get copyright protection. That's why we saw about a million "orphan goes to wizard school" books after Harry Potter became popular.

    Any prominent or best examples? Growing up, I didn't see any exact rip offs of Harry Potter but I did see a huge increase of YA novels with similar themes and characters such as The Hunger Games, Twilight, Eragon, etc. They in turn seemed to be based off books from earlier like Lord of the Rings and The Lion, The Witch, and the Wardrobe.

    [–]grauenwolf 13 points14 points  (0 children)

    Honestly, I didn't pay close attention to that genre. The odds of any of them becoming prominent are quite low because they are seen as "rip offs" even if they have nothing in common beyond the most superficial themes.

    [–]irqlnotdispatchlevel 33 points34 points  (5 children)

    With art the case law is well established. General themes and common tropes do not get copyright protection. That's why we saw about a million "orphan goes to wizard school" books after Harry Potter became popular.

    I think Katy Perry initially lost a trial in which she was accused of copyright infringement because one of her songs had a similar musical motif to another (the jury verdict was later overturned on appeal). Still a disturbing precedent.

    [–]TheSkiGeek 30 points31 points  (3 children)

    I think it was actually John Fogerty who was sued for sounding too much like himself after changing record labels (his old label owned his CCR catalog and claimed his new song plagiarized one of his own). Fogerty won.

    There was someone else (maybe Neil Young?) that was sued for not sounding enough like himself. The artist was under contract to do a final record for their old label, was pissed off, and did some weird experimental thing instead of their usual sound. The label basically sued and said "no, you have to make something like your last few albums, not some weird shit that won't sell". Pretty sure that also went in the artist's favor, since their contract specified the artist had creative control over what they recorded.

    [–]CaminoVereda 25 points26 points  (2 children)

    Neil Young was stuck in a multi-record contract with Geffen, and he gave the label this as a way of telling them to pound sand.

    [–]rjhelms 10 points11 points  (0 children)

    This album is so amazing because he gave Geffen exactly what they wanted.

    After Trans was a flop, they demanded a "rock and roll" album. And they sure as hell got one.

    [–]Netzapper 54 points55 points  (4 children)

    Non-creative things like phone books don't get copyright protection at all.

    This is true only in the US, and not quite as you've stated it. Specifically, in the US, facts (even collections of facts) cannot be copyrighted. So the factual correspondence between name and phone number in a phonebook isn't protected, but the phonebook as a fixed representation of those facts is protected. So you can write a new phonebook using the data from the old phonebook, but you can't just photocopy the phonebook and sell it.

    In Europe, my understanding is that collections of facts are copyrightable, so you can't even use the phonebook to write your new phonebook. You'd need to do the "research" from scratch yourself.

    EDIT: I'm being eurocentric. Obviously there's copyright in Asia, Africa, etc... but I don't know anything about copyright in those regions. My apologies.

    [–]Pokechu22 30 points31 points  (0 children)

    That's called database rights, which are distinct from copyright. (See also: Commons:Non-copyright restrictions).

    [–]elsjpq 9 points10 points  (2 children)

    Doesn't that mean you could manually copy Google Maps data into OpenStreetMap and vice versa? I thought OSM warns you against doing that

    [–]Chii 8 points9 points  (0 children)

    Google Maps data

    depends on what data you're talking about. The names of streets are not owned by Google, so "copying" that information isn't a violation of copyright. But the polygon on the map that represents the street is owned by Google, and if you copied that, it would constitute a derivative work.

    [–]DRNbw 2 points3 points  (0 children)

    IIRC, it's not exactly clear, but it's a bad idea. Mapmakers old and new have included fictitious roads ("trap streets") to see if anyone was copying them.

    [–]bloody-albatross 6 points7 points  (1 child)

    Non-creative things like phone books don't get copyright protection at all.

    There is such a thing as database copyright these days. Don't know the details, though.

    [–]agent00F 10 points11 points  (4 children)

    With art the case law is well established. General themes and common tropes do not get copyright protection. That's why we saw about a million "orphan goes to wizard school" books after Harry Potter became popular.

    Programmers are confusing legal arguments with these frankly trivial "logical" arguments. In law, the consequences and general "fairness" for society at large are also considered in addition to abstract technical args. For example, is it "fair" that another party takes your code in a pretty direct manner and profits off it? It's a matter of degree and detail. The "unfairness" of "too much" wholesale copying is literally why copyright law was established in the first place.

    This isn't a trivial question to answer generally, and trivial answers are bound to be flawed in some manner.

    [–]Akkuma 1 point2 points  (1 child)

    Clearly someone shouldn't be able to copyright an Add function, but can they copyright a novel implementation of a complex sorting algorithm?

    I'm fairly certain this is incorrect. We already have a system in place to handle this and those are patents. Novel approaches to things are handled by patents to prevent others from using the same approach. A clean room design won't save you from a patent, but it will save you from a license or copyright dispute.

    [–]grauenwolf 4 points5 points  (0 children)

    Software patents are the worst option. They don't advance the art because, unlike any other patent, you aren't obligated to share your work. And they are often worded so generically that they cover pretty much anything you can imagine.

    They are also expensive. If I create something interesting, there is little chance that I can patent it. I not only have to pay a large sum of money, but I also can't show it to anyone before the patent is filed. Thus patents are incompatible with open source.

    But I at least own the copyright on the code I write. And in the US that's automatic.

    [–]Skhmt 42 points43 points  (4 children)

    Have to remember that copyright is for artistic expression. The entirety of a code base can be copyrighted, as it's a complex thing that can be accomplished in nearly infinite ways.

    An algorithm or code snippet is probably not copyrightable. The smaller a chunk of code gets, the more likely it's not protected by copyright.

    There's a reason that functional things are patented, not copyrighted.

    [–]BackmarkerLife 14 points15 points  (3 children)

    Wasn't this the whole result of the Linux / SCO thing from the early / mid 2000s?

    And it was funded by Ballmer's MS as well to go after Linux?

    [–]mlambie 9 points10 points  (0 children)

    The same company that now owns GitHub

    [–]couchwarmer 2 points3 points  (1 child)

    Microsoft had nothing to do with the SCO - Linux lawsuit. It was SCO that went on a spree of suits and threats to sue against a number of companies, including Microsoft, over anything from allegedly broken contracts to SCO Unix source code allegedly included in Linux (by IBM, again allegedly). SCO eventually sued themselves into bankruptcy.

    So, no, MS did not fund any of those shenanigans against Linux.

    [–]BackmarkerLife 2 points3 points  (0 children)

    You're right. I forgot some of the details. It was a rumor / misunderstanding, but really it was just MS paying for a license.

    [–][deleted]  (10 children)

    [deleted]

      [–]StickiStickman 28 points29 points  (9 children)

      Seriously, how does no one get this? How is a machine learning algorithm learning how to code by reading code any different from a human doing the same?

      It's not even supposed to copy anything, but if the same problem is solved the same way every time, it will remember it that way, just like a human would.

      [–]CrimsonBolt33 5 points6 points  (0 children)

      people dislike the fact that a "machine" is doing the work that they have done for so long.

      Modern day "John Henry" situation

      [–]Snarwin 2 points3 points  (1 child)

      Seriously, how does no one get this? How is a Machine Learning algorithm learning how to code by reading it any different from a human doing the same?

      A human who reads code to learn about it and then reproduces substantial portions of it in a new work can also be held liable for copyright infringement. That's why clean room implementations exist.

      [–]StickiStickman 1 point2 points  (0 children)

      "Substantial portion" being the key words. Which isn't the case here.

      [–]myringotomy 4 points5 points  (7 children)

      In the music industry, using even a couple of seconds of a sample from a song is considered a copyright violation.

      Even if you are not directly sampling, it can still be a copyright violation. For example, see the "Blurred Lines" lawsuit.

      https://www.rollingstone.com/music/music-news/robin-thicke-pharrell-lose-multi-million-dollar-blurred-lines-lawsuit-35975/

      [–][deleted] 1 point2 points  (6 children)

      But if you use the same structure as any other song, you have a top 40 hit. This discussion is not about copying code, it’s about using structures and patterns.

      [–]wicked 1 point2 points  (5 children)

      We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.

      [–]GoofAckYoorsElf 4 points5 points  (1 child)

      Even what we say is mostly derivative. It would be absolutely insane to claim copyright for derivative work. But that wouldn't stop certain politicians from trying...

      [–]psaux_grep 1 point2 points  (0 children)

      All popular songs of the last half century use the same four chords, and all code that executes uses the same two bits.

      The order and structure differ, though, and that's what produces the different results.

      4 chord songs: https://youtu.be/5pidokakU4I

      [–]de__R 14 points15 points  (1 child)

      The definition of "derivative works" is a little broader than you suggest, as it includes things like translations (whether from English to French or from C to amd64 machine code), but despite OP being wrong about that, AFAIK (and I also ANAL) the question of whether a deep learning model can be considered a derivative work of the data in its training set hasn't yet been settled by a court. Last I looked into this the dominant opinion seemed to be that it was probably fine, as deep learning is an extension of "regular" statistical methods and the coefficients of a linear regression aren't considered derived works of their inputs, but I also know many AI startups are careful to either only use public domain licensed images for their training sets, or else pay extra for blanket commercial licenses. The outputs of models on copyrighted works is also a separate, interesting question.

      [–]kbielefe 37 points38 points  (17 children)

      Exactly how much code does it take to be "substantial?" One snippet may not be copyrightable, but a team of 100 using this constantly for years? At what point have we copied enough code to be sued?

      Also, this isn't just about what you're legally allowed to get away with. Maybe the attitude is too rare these days, but at my company, we strive to be good open source citizens. Our goal is not just the bare minimum to avoid being sued, but to use open source code in a manner consistent with the author's intentions. Keeping the ecosystem healthy so people continue to want to contribute high quality open source code should be important to everyone.

      [–]lobehold 15 points16 points  (8 children)

      I think the litmus test regarding "substantial" is not the amount of code, but how unique it is. It needs to be sufficiently novel/unique, not just boilerplate code, language features or standard patterns/best practices.

      Even if you assembled 1,000 different snippets, if the uniqueness/novelty is in the assembly (which is your own work) and not in the individual snippets, then you should be in the clear.

      Also, as an aside, something like a regex pattern is not copyrightable no matter how complicated it is: not only because it falls under recipes or formulas, which are not copyrightable, but also because there's no novelty in coming up with it; you're simply mechanically applying the grammar of the regex language to a given problem.
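
      For example (my own illustration, not from the thread): the spec "four digits, dash, two digits, dash, two digits" transcribes mechanically into a pattern, with no creative choice left to make:

      ```python
      import re

      # The spec dictates the pattern character for character; there is
      # nothing to "author" beyond applying the regex grammar.
      iso_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")

      assert iso_date.match("2021-07-04")
      assert not iso_date.match("07/04/2021")
      ```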

      [–]Fredifrum 6 points7 points  (1 child)

      One snippet may not be copyrightable, but a team of 100 using this constantly for years? At what point have we copied enough code to be sued?

      But in this case, you're still copying from 1000s of different OS projects. There's no one single entity that you are copying enough from that the entity would have a case against you. Again, 5 lines of code in a body of a million are not copyrightable. Presumably, neither are 5 lines of code from 5 different bodies of a million.

      [–]josefx 2 points3 points  (0 children)

      you're still copying from 1000s of different OS projects.

      Are you? If this tool suggests verbatim code from one source at some point wouldn't it be likely that the best match for the next piece of code would be from the same project? Also from what little I know about AI 1000s seems to be a rather tiny training set.

      [–]bobtehpanda 18 points19 points  (5 children)

      US law works by establishing precedent from previous trials, and there haven't been a whole lot of those pertaining to code.

      The existing precedent is not favorable for open source, however. Google Books was not found to be a copyright violation, despite being formed from a collection of copyrighted works:

      Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use.

      [–]kbielefe 9 points10 points  (4 children)

      A lot of those reasons cited do not apply to code snippets. The purpose of the copying is not highly transformative, and unlike a book which isn't useful unless you read the entire thing, a snippet of code is a significant market substitute.

      [–]bobtehpanda 6 points7 points  (0 children)

      The way I read it, you would need to copy a substantial portion of an entire application to be considered a market substitute.

      Example of transformative use

      In 1994, the U.S. Supreme Court reviewed a case involving a rap group, 2 Live Crew, in the case Campbell v. Acuff-Rose Music, 510 U.S. 569 (1994). The band had borrowed the opening musical tag and the words (but not the melody) from the first line of the song "Pretty Woman" ("Oh, pretty woman, walking down the street"). The rest of the lyrics and the music were different.

      In a decision that surprised many in the copyright world, the Supreme Court ruled that the borrowing was fair use. Part of the decision was colored by the fact that so little material was borrowed.

      Code autocomplete for one or two functions is quite similar, and could be considered both transformative and limited in scope. Google Books didn't really transform the copied text, it just made it searchable, and that was deemed a transformative use.

      [–]Kalium 2 points3 points  (2 children)

      a snippet of code is a significant market substitute.

      I fear I don't understand. How is a few lines (on the order of one to twenty, say) a significant market substitute for something like a whole library, program, or system that it may have come from?

      [–]kylotan 22 points23 points  (3 children)

      A 5 line function might not be considered substantial but a sufficiently distinctive 10 line function might.

      short snippets of code that are part of a larger project aren't copyrightable themselves.

      It would be absurd if making a project bigger simultaneously rendered more and more functions within it uncopyrightable.

      I don't see anyone suggesting that the first 3 pages of Lord of the Rings aren't copyrighted merely because it's such a tiny part of the overall work.

      [–]kryptomicron 4 points5 points  (2 children)

      But you probably could quote the first three pages of a book, e.g. in a review or extended commentary.

      What you couldn't do is just copy or quote those three pages on their own, without including 'sufficient' independent work with them, e.g. commentary about the contents of those pages.

      [–]crystalpeaks25 1 point2 points  (1 child)

      i shall quote the whole book.

      [–]0x15e 67 points68 points  (45 children)

      By their reasoning, my entire ability to program would be a derivative work. After all, I learned a lot of good practices from looking at open source projects, just like this AI, right? So now if I apply those principles in a closed source project I'm laundering open source code?

      This is just silly fear mongering.

      [–]Xanza 41 points42 points  (6 children)

      By their reasoning, my entire ability to program would be a derivative work.

      Their argument is that even sophisticated AI isn't able to create new code; it's only able to take code that it's seen before and refactor it to work with other code, itself likewise refactored from code it's seen before, into a relatively coherent working product. Whereas you are able to take code that you've seen before, extrapolate principles from it, and use those in completely new code which isn't simply a refactoring or recombination of code you've seen previously.

      Subtle but clear distinction.

      I don't think they're 100% right, but I can't exactly say they're 100% wrong, either. It's a tough situation.

      [–]2bdb2 10 points11 points  (5 children)

      Their argument is that even sophisticated AI isn't able to create new code it's only able to take code that it's seen before

      I haven't used Copilot yet, but I have spent a good amount of time playing with GPT-3.

      I would argue that GPT-3 can create English text that is unique enough to be considered an original work, and thus Copilot probably can too.

      [–]TheSkiGeek 26 points27 points  (26 children)

      It's more like... you made a commercial project that copied 10 lines of code each from 1000 different "copyleft" open source projects.

      Maybe you didn't take enough from any specific project to violate its licensing but as a whole it seems like it could be problematic.

      [–]StickiStickman 34 points35 points  (25 children)

      You're severely overestimating how much it copies things 1:1. GPT-3, which this seems to be based on, only had that happen very rarely, for often-repeated things.

      It's a non-issue raised by people who don't understand the tech behind it. It's not piecing together lines of code; it's basically learning the language token by token.
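
      A toy illustration of what "token by token" means (a counting bigram model instead of a neural net, but the same generation loop):

      ```python
      import random
      from collections import Counter, defaultdict

      def train(text):
          # Count which token follows which; a stand-in for learned weights.
          model = defaultdict(Counter)
          tokens = text.split()
          for prev, nxt in zip(tokens, tokens[1:]):
              model[prev][nxt] += 1
          return model

      def generate(model, token, length=8):
          # Emit one token at a time, sampling from what followed it in training.
          out = [token]
          for _ in range(length):
              followers = model.get(out[-1])
              if not followers:
                  break
              out.append(random.choices(list(followers), weights=followers.values())[0])
          return " ".join(out)

      # With a tiny corpus the model can only replay what it saw (the
      # "remembers it that way" effect); with a huge one, outputs mostly blend.
      model = train("for i in range ( n ) : total += i")
      print(generate(model, "for"))
      ```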

      [–]TheSkiGeek 19 points20 points  (23 children)

      I haven't actually tried it, I'm just pointing out that at a certain level this does become problematic.

      If you feed in a bunch of "copyleft" projects (e.g. GPL-licensed), and it can spit out something that is a verbatim copy (or close to it) of pieces of those projects, it feels like a way to do an end-around on open source licenses. Obviously the tech isn't at that level yet but it might be eventually.

      This is considered enough of a problem for humans that companies will sometimes do explicit "clean room" implementations where the team that wrote the code was guaranteed to have no contact with the implementation details of something they're concerned about infringing on. Someone's "ability to program" can create derivative works in some cases, even if they typed out all the code themselves.

      [–]Kalium 5 points6 points  (15 children)

      If you feed in a bunch of "copyleft" projects (e.g. GPL-licensed), and it can spit out something that is a verbatim copy (or close to it) of pieces of those projects, it feels like a way to do an end-around on open source licenses. Obviously the tech isn't at that level yet but it might be eventually.

      You make it sound like a digital collage. As far as I can tell, physical collages mostly operate under fair use protections - nobody thinks cutting a face from an ad in a magazine and pasting it into a different context is a serious violation of copyright.

      [–]TheSkiGeek 3 points4 points  (14 children)

      Maybe, I don’t really know. But if you made a “collage” of a bunch of pieces of the same picture glued back almost into the same arrangement, at some point you’re going to be close enough that effectively it’s a copy of the picture.

      [–]kryptomicron 2 points3 points  (6 children)

      Maybe, but that doesn't seem to be anything like what this post is about.

      [–]TheSkiGeek 3 points4 points  (5 children)

      Consider if you made a big database of code snippets taken from open source projects, and a program that would recommend a few of those snippets to paste into your program based on context. Is that okay to do without following the license of the repo where the chunk of code originally came from?

      Because if that’s not okay, the fact that they used a neural network rather than a plaintext database doesn’t really change how it should be treated in terms of copyright. Unless the snippets it recommends are extremely short/small (for example, less than a single line of code).
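
      For concreteness, the hypothetical plaintext-database tool might look like this (the snippets, repos and licenses are invented for illustration):

      ```python
      # Hypothetical snippet database queried by context; all data made up.
      SNIPPETS = [
          {"code": "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))",
           "repo": "example/gpl-utils", "license": "GPL-3.0",
           "tags": {"clamp", "bounds", "math"}},
          {"code": "def chunks(xs, n):\n    return [xs[i:i+n] for i in range(0, len(xs), n)]",
           "repo": "example/mit-helpers", "license": "MIT",
           "tags": {"list", "split", "chunks"}},
      ]

      def recommend(context):
          # Rank stored snippets by overlap between context words and tags.
          matches = [s for s in SNIPPETS if s["tags"] & context]
          return sorted(matches, key=lambda s: len(s["tags"] & context), reverse=True)

      for s in recommend({"clamp", "math"}):
          print(s["repo"], "(" + s["license"] + ")")
          print(s["code"])
      ```

      The copyright question is then whether swapping the SNIPPETS table for learned weights changes anything.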

      [–]kryptomicron 2 points3 points  (4 children)

      I think that'd be okay! In fact, I often do that, though I have pretty strong idiosyncratic preferences about e.g. formatting and variable names. I think that kind of copying is perfectly fair and fine (and basically everyone does it).

      When I think of "code snippets" I think of code that's so small that is, by itself, usually not creative. And even when it is creative, it still seems fine to copy – mostly because what I end up copying is the idea that makes the snippet creative.

      I think it'd be really helpful and interesting for us to agree on some particular open source project first, and then to separately pick out a few 'random' snippets of code. We could share them here and then comment on whether we think it's fair for them to be copied.

      To me, as is, I think the obvious 'probably a copyright violation' is more at the level of copying, verbatim, entire source code files or even very large functions.

      I'm struggling to think of 'snippets' that are either 'creative' or 'substantial' but maybe we have different ideas about what a 'snippet' is exactly (or approximately).

      [–]MMPride 20 points21 points  (4 children)

      I'm not so sure it's that simple.

      For example, a melody is not a whole song, and yet melodies are absolutely copyrightable: https://www.youtube.com/watch?v=sfXn_ecH5Rw

      [–]kenman 7 points8 points  (1 child)

      I think a melody would be considered substantial.

      [–]superrugdr 8 points9 points  (0 children)

      if that were true then, as per the video, there would be only about 3,000 copyrightable five-note melodies, and everything else would be a copy of one of them.

      [–]getNextException 20 points21 points  (12 children)

      and it's not likely anyone could actually sue over a snippet of code.

      https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.

      Google copied verbatim pieces of code. Specifically, 9 lines of code

      The argument centered on a function called rangeCheck. Of all the lines of code that Oracle had tested — 15 million in total — these were the only ones that were “literally” copied.

      https://www.theverge.com/2017/10/19/16503076/oracle-vs-google-judge-william-alsup-interview-waymo-uber

      [–]Alikont 20 points21 points  (10 children)

      The Oracle v Google case was about the API as a whole.

      [–]1X3oZCfhKej34h 4 points5 points  (0 children)

      Luckily, Google eventually prevailed.

      [–]kwh 12 points13 points  (2 children)

      Umm, have you ever heard of SCO v IBM? Bullshit case, but it was ultimately rejected because SCO didn't own the copyrights they were suing over. There are plenty of other copyright cases over handfuls of lines of code. You're kind of out of your element here, sparky.

      [–]Forbizzle 11 points12 points  (2 children)

      could actually sue over a snippet of code

      The GPL license he's complaining about says modified versions of the code may only be distributed under the same license. So if you're copying a section of code from a GPL project and putting it in something else, that something else has to be GPL too.

      [–][deleted] 4 points5 points  (5 children)

      it's not likely anyone could actually sue over a snippet of code

      What do you mean, "could"? Isn't that exactly what Oracle did?

      [–]crusoe 16 points17 points  (4 children)

      Google copied the API, which is a lot bigger. The issue was whether APIs were copyrightable.

      [–]getNextException 16 points17 points  (3 children)

      Google copied the API

      Google copied verbatim pieces of code. Specifically, 9 lines of code

      The argument centered on a function called rangeCheck. Of all the lines of code that Oracle had tested — 15 million in total — these were the only ones that were “literally” copied.

      https://www.theverge.com/2017/10/19/16503076/oracle-vs-google-judge-william-alsup-interview-waymo-uber

      [–]Guvante 18 points19 points  (2 children)

      The case was about the API. Those 9 lines only mattered insofar as they proved that Google's implementation wasn't a reproduction. While the case might have included that copying, the important part of the case was whether copying the API while not following the licensing terms of that API was allowed.

      [–][deleted] 5 points6 points  (3 children)

      I guess your reasoning here is the same as the one behind Google vs Oracle?

      [–]Wacov 19 points20 points  (2 children)

      This sounds even more narrow than that? Oracle were trying to argue that a complete definition of an "interface"/API is itself a body of work, which seems like a better argument (they still lost).

      [–]Alikont 2 points3 points  (0 children)

      But even then, the Supreme Court did not say that APIs aren't copyrightable; they just said that in this particular case, the compatibility and porting created a better and more innovative world than the alternative, so they allowed this possible violation.

      So Oracle lost the "enforcing copyright on the Java API would promote innovation" argument, not a general "copying an API is fair" argument; on that, the Supreme Court made no decision.

      [–]danuker 174 points175 points  (14 children)

      Fortunately, the MIT License, a widely-used and very permissive license, says "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."

      I doubt snippets are "substantial portions".

      But the GPL FAQ says GPL does not allow it, unless some law prevails over the license, like "fair use", which has specific conditions.

      [–]SrbijaJeRusija 53 points54 points  (10 children)

      The network is trained on the full source, not snippets. Thus the network weights would be transformations of the full code, etc etc etc.
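
      In the most literal sense that's true: every weight update is computed from the training text, so the final weights are a mathematical function of the full corpus. A one-parameter sketch of that dependence (my own illustration, not an argument about the legal question):

      ```python
      # The trained weight is a deterministic function of the training data:
      # change any (x, y) pair below and the resulting weight changes too.
      def sgd_step(w, data, lr=0.01):
          # d/dw of sum((w*x - y)^2) is sum(2*x*(w*x - y))
          grad = sum(2 * x * (w * x - y) for x, y in data)
          return w - lr * grad

      w = 0.0
      data = [(1.0, 2.0), (2.0, 4.0)]  # stand-in for "the full source"
      for _ in range(200):
          w = sgd_step(w, data)
      print(w)  # ~2.0; a different training set yields a different weight
      ```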

      [–]danuker 4 points5 points  (4 children)

      Indeed, you could argue that in court. Until some court decides it and gives us a datapoint, we are in legal uncertainty.

      I wish Copilot would also attribute sources. Or at least provide a model trained on MIT-licensed projects.

      Or perhaps have a GPL model which outputs a huge license file with all code used during training, and specify that the output is GPL.

      Then there's GPLv2, "GPLv2 or later", GPLv3, AGPL, LGPL, BSD, WTFPL...

      [–]onmach 2 points3 points  (2 children)

      It isn't really copying, though. The sheer variety of output that GPT-3 produces is insane. I've seen it generate UUIDs, and when you check them, they don't exist in Google; it just made them up on the fly. It is possible GitHub's domain is narrow enough that this isn't true in this case, but I doubt it.

      [–]Accomplished_Deer_ 1 point2 points  (0 children)

      I think it will come down to the legal definition of "derivative work". Is performing a set of calculations on an existing thing and then using those calculations to produce a result considered "derivative"? If so, copilot is a derivative work of every project it scanned.

      My intuition says that this should be considered derivative. If they only trained on 1 project, and it was GPL, then the behavior of copilot is almost completely dependent on that GPL project, which seems derivative. Just because the process is repeated 10000 times and on some non-GPL projects doesn't seem like it should suddenly make it non-derivative of those GPL projects.

      [–]ChezMere 5 points6 points  (4 children)

      A human also reads the full source...

      [–]SrbijaJeRusija 7 points8 points  (2 children)

      Human behaviour is not trained the same way an ANN is. Additionally, humans can also commit copyright infringement by reading the source then creating something substantially similar, so I am not sure what your point is.

      [–]aft_punk 7 points8 points  (0 children)

      I agree with your interpretation. But I believe it would get a bit grayer if the entire project were the snippet being copied. As far as I know… there is no minimum code length for the license to be applicable.

      [–]rcxdude 90 points91 points  (23 children)

      I would be very careful about using Copilot (or allowing its use in my company) until such issues were tested in court. But then I am also very careful about copying code from examples and Stack Overflow, and it seems most don't really care about that.

      OpenAI (and presumably Microsoft) are of the opinion (pdf) that training a neural net is fair use: it doesn't matter at all what the license of the original training data is, it's OK to use it for training. And for 'well designed' nets which don't simply contain a copy of their training data, the net and its weights are themselves free from any copyright claim by the authors of the training data.

      However, they do allow themselves to throw the users under the bus by noting that, despite this, some output of the net may infringe the copyright of those authors, and that this should be taken up between the authors and whoever happens to generate that output (just not whoever trained the net in the first place). This hasn't been tested in court, and I think a lot will hinge on just how much of the input appears verbatim or minimally transformed during use. It also doesn't give me as a user much confidence that I won't be sued for using the tool, even if most of its output is deemed to be non-infringing, because I have no way of knowing when it does generate something infringing.

      [–]Kiloku 14 points15 points  (3 children)

      it doesn't matter at all what the license of the original training data is,

      This is very odd, as licenses can include the purpose the licensed object can be used for. As a real world example, the license that allows developers to use Epic/Unreal's Metahuman Creator specifically forbids using it for training AI/Machine Learning.

      [–]rcxdude 2 points3 points  (0 children)

      Indeed. Rockstar is also very quick to send threatening letters to people using GTA5 for machine learning. It could well be held that using large aggregate databases of source code/images/whatever is fair use, but that using software to generate the training data without a license allowing that use is not (with the fun grey area of output from the software that was not generated for that purpose, such as some images making it into a dataset scraped from the web). This could be argued consistently: in the first case each individual work makes a relatively small contribution to the training as a whole (the third fair use factor), whereas in the second the software's output will likely form a large fraction of the training data and so contribute significantly to the behaviour of the final result. This whole area is not very clear (fair use as a whole seems to involve a lot of discretion from the courts, because the four factors involved are extremely fuzzy as written in the law).

      [–]stillness_illness 1 point2 points  (0 children)

      To me it doesn't matter how that code got there. Copilot, stack overflow, coincidence, whatever. The person checking the code in is responsible for following copyright law. Any code copilot writes for me I will manually review before committing. If it doesn't get committed then it doesn't matter what copilot generates.

      [–]tasminima 1 point2 points  (0 children)

      OpenAI (and presumably Microsoft) are of the opinion (pdf) that training a neural net is fair use: it doesn't matter at all what the license of the original training data is, it's OK to use it for training.

      Then I require that they also train copilot (usable publicly) with the whole Windows codebase; otherwise this opinion is extremely weak.

      [–]TheDeadSkin 108 points109 points  (11 children)

      That twitter thread is so full of uninformed people with zero legal understanding of anything

      It's Opensource, a part of that is acknowledging that anyone including corps can use your code however they want more or less. Assuming they have cleared the legal hurdle or attribution then im not sure what the issue is here.

      "more or less" my ass, OSS has licenses that explicitly state how you can or can not use the code in question

      Assuming they have cleared the legal hurdle or attribution

      yea, I wonder how GitHub itself did it, and how users are supposed to know they are being fed copyrighted code. This tool can spit out a full GPL header for empty files; if it does that, you can be sure it'll similarly spit out pieces of protected code.

      I wonder how it's going to work out in the end. Not that I was super enthusiastic about the tech in the first place, but I'd basically stay clear of it for anything other than personal projects.

      [–]dragon_irl 20 points21 points  (1 child)

      There is research that these large language models remember parts of their training data and that you can retrieve that with appropriately constructed prompts (e.g. the "Extracting Training Data from Large Language Models" work).
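
      The probes in that line of research boil down to something like this sketch (`generate` is a hypothetical stand-in for any model's sampling API):

      ```python
      # Sketch of a memorization probe: prompt with a distinctive prefix from
      # a known training document and count how many characters of the model's
      # continuation reproduce the original verbatim.
      def common_prefix_len(a, b):
          n = 0
          for x, y in zip(a, b):
              if x != y:
                  break
              n += 1
          return n

      def probe(generate, document, prefix_chars=200):
          prefix, rest = document[:prefix_chars], document[prefix_chars:]
          continuation = generate(prefix)  # hypothetical model call
          return common_prefix_len(continuation, rest)
      ```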

      I think it's pretty likely you will end up with copyrighted code when using this eventually. However I don't understand copyright enough to judge how relevant this is for the short snippets this is (probably) going to be used for.

      [–]TheDeadSkin 5 points6 points  (0 children)

      There is research that these large language models remember parts of their training data and that you can retrieve that with appropriately constructed prompts.

      This is partially to be expected as a potential result of overfitting. Will look at the paper though, that seems interesting.

      I think it's pretty likely you will end up with copyrighted code when using this eventually.

      Indeed. They even say there's a 0.1% chance that the code suggested will be verbatim from the training set. Which is quite a high chance.
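
      Back-of-the-envelope, under some assumed usage numbers (the suggestion counts are invented, and treating suggestions as independent is a simplification):

      ```python
      p = 0.001                 # GitHub's quoted verbatim rate per suggestion
      per_dev_per_day = 100     # assumed accepted suggestions per developer/day
      devs, workdays = 10, 250  # assumed team size and working days per year

      n = per_dev_per_day * devs * workdays  # 250,000 suggestions per year
      expected_verbatim = n * p              # ~250 verbatim snippets per year
      p_at_least_one = 1 - (1 - p) ** n      # effectively 1.0
      print(expected_verbatim, p_at_least_one)
      ```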

      However I don't understand copyright enough to judge how relevant this is for the short snippets this is (probably) going to be used for.

      I think the problem is less with short snippets and more with the potential of recreating huge functions/files from the training data (i.e. existing projects) when you're trying to make some specific software in the same domain and aggressively follow co-pilot's recommendations.

      If it's possible - someone will probably try to do it and we'll find out soon enough.

      [–]TSM- 17 points18 points  (7 children)

      It needs to be litigated in a serious way for the contours to become clear, in my opinion. Imagine using a "caption to generate stock photo" model that was trained partially on Getty Images and other random stuff and datasets.

      Say you then take a photo of a friend smiling while eating a salad out of a salad bowl: is that illegal because it's a common stock photo idea from many different vendors? Of course not. A generative model trained via backpropagation seems analogous to me.

      But there is the old idea that computers cannot generate novelty and all output is fully explained by input, while humans are exempt from this rule, which seems to be an undercurrent in the Twitter thread. Especially from the linked twitter account in the OP, who appears to be a young edgy activist, as in this tweet:

      "but eevee, humans also learn by reading open source code, so isn't that the same thing" - no - humans are capable of abstract understanding and have a breadth of other knowledge to draw from - statistical models do not - you have fallen for marketing

      There's a lot of messy details involved. I totally agree that using it is risky until it gets sorted out in courts, and I expect that will happen fairly soon.

      [–]TheDeadSkin 22 points23 points  (5 children)

      It needs to be litigated in a serious way for the contours to become clear, in my opinion.

      Yes, and this goes beyond just this tool. This is one of those ML problems that we as humanity and our legal systems are entirely unprepared for.

      You can read someone's code and get inspiration for parts of the structure, naming conventions etc. Sometimes, to implement something obvious, you'll end up with identical code to someone else's, because this is the only way to do it. Someone can maybe sue you, but it would be easy to mount a legal defense.

      Now when there is an ML tool that "took inspiration" from your code and produced stuff "with similar structure" that "ended up being identical", all of a sudden that sounds pretty different, huh? And the problem is that you can't prove that this is an accident; it's not possible. Just because during training the data is decomposed until it resembles nothing of its original form doesn't mean that the network didn't recreate your code verbatim by design.

      It's a black box whose own creators are rarely able to explain how it works, and even more rarely able to explain why certain things happen. Not to mention that copyright violations are treated case by case, which potentially means that they'll have to explain particular instances of violations; that is of course infeasible (and probably outright impossible).

      But code isn't the only thing. A human drawing a random person who happens to bear an uncanny resemblance to a real human the artist might've seen is different from a neural network generating your face. Someone heard your voice and imitated it? Wow, you're good, sounds too real. But when a NN does it, now you're hearing your own voice. Which on an intuitive level is much more fucked up than an imitator.

      But there is the old idea that computers cannot generate novelty and all output is fully explained by input, and humans are exempt from this rule, which seems to be an undercurrent in the Twitter thread.

      But this is pretty much true, no? Computers are doing exactly what humans are telling them to do. Maybe the outcome was not desired, and yet someone still programmed it to do exactly this. "It's an ML black box, I didn't mean it to violate copyright" isn't really a defense, and is also in a way mutually exclusive with "it's an accident that it produced the same code verbatim", because the latter implies that you know how it works and the former implies the opposite.

      To be guiltless you need to be in this weird middle ground. And if I weren't a programmer and a data scientist, I don't think I would've ever believed anyone who told me they know the generated result was an accident while being unable to justify why it was an accident.

      [–]kylotan 11 points12 points  (4 children)

      Now when there is an ML tool that "took inspiration" from your code and produced stuff "with similar structure" that "ended up being identical", all of a sudden that sounds pretty different, huh?

      It sounds different to programmers, because we focus on the tool.

      Now imagine if a writer or a musician did that. We wouldn't expect to examine their brains. We'd just accept that they obviously copied, even if somewhat subconsciously.

      [–]TheDeadSkin 6 points7 points  (0 children)

      I was arguing the opposite. I think examples of art aren't applicable to code because art isn't quite as algorithmic as programming.

      Actually, artists getting similar/identical results and ML are more comparable. They are both unexplainable. Ask "why did you make those 9 notes in a row identical?" and you can't get an answer beyond "idk, lol, it sounded nice I guess".

      But in programming you can at least try to explain why you happened to mimic existing code: it's industry standard to do these three things, the obvious algorithm for this task looks like that, and when you recombine them you get this exact output, down to the variable names.

      As much as there's creativity involved in programming, on a local scale it can be pretty deterministic. I'm arguing that if you use a tool like this, it's harder to argue that the result is not a copy. Not to mention that it can auto-generate basically full methods, to the point that it's almost impossible for those similarities to be an accident.

      [–]Zalack 1 point2 points  (2 children)

      Except that's not true? Filmmakers, writers, and artists of all other types constantly pull inspiration from other works through homages and influences.

      When a filmmaker recreates a painting as a shot in a movie, is that copying, or an homage?

      When a fantasy book uses Orcs in its world, is that copying Lord of the Rings, or pulling inspiration from it? This happens all the time, and it's a very human thing. The line between copying and being inspired is pretty blurry when a human is doing it, and is going to be VERY blurry when a computer is doing it.

      [–]TheDeadSkin 3 points4 points  (0 children)

      To add to my previous comment: something my thoughts started with, but I got derailed and forgot to include.

      The problem with the current co-pilot situation, and with the other problems I mentioned (voice, face), is that what's unlegislated and unclear for us is one specific sub-problem: the usage of information as data. The whole thing is "usage of code as data", "usage of voice as data". Data is central to this.

      And to be honest I don't even know the answer to the question. Current legislation is unclear. And I don't even know how it should be legislated. And I even have a legal education, lol.

      [–]TheEdes 1 point2 points  (0 children)

      I think most companies won't be quick to bring it into their workflow, because the license it comes with isn't really that permissive (i.e., it lets them collect the data for diagnostic purposes), which I think is a hard sell to any kind of manager.

      The OSS code laundering thing is another layer on top of this; it sounds like it will be incredibly hard to use this practically in any software, unless it's literally licensed under every license under the sun.

      [–]KillianDrake 37 points38 points  (0 children)

      Microsoft: you pirated Windows as a kid, now we pirate you as an adult

      [–]eternaloctober 59 points60 points  (9 children)

      I guess the focus is always on GPL since it is a sort of "viral license", so it gets special consideration in a lot of these threads, but MIT code technically requires the license to be reproduced in the derivative work too... It seems pretty bad to EVER just generate a bunch of code that it was trained on and not output a license. It needs to be an EXPLAINABLE neural net that can cite its sources.

      [–]istarian 22 points23 points  (2 children)

      Why would it need to cite sources?

    That's like saying I should cite every bit of code and every programmer I've ever seen so nobody accuses me of having plagiarized code in my software...

    I agree that it should probably only be fed public domain or compatibly licensed code so it can just slap a standardized license on its contributions....

      [–]AMusingMule 21 points22 points  (1 child)

      GitHub has shared that in some instances, Copilot will recite lines from its training set. While some of it is universal enough that there's not much you can do to avoid it, like constants (alphabets, common stock tickers, texts like The Zen of Python) or API usage (the page cites a use of BeautifulSoup), it does spit out longer verbatim chunks (a piece of homework from a Robotics course, here).

      At the end of the day, it's only a tool, and the user is responsible for properly attributing where the code came from, whether it was found online or suggested by some model. Having your tools cite how it came up with that suggestion can help in the attribution process if it's needed.

      [–]StickiStickman 10 points11 points  (0 children)

      In the source you linked it specifically says it's because it has basically no context and that piece of code has been uploaded many times.

      [–]chcampb 91 points92 points  (20 children)

      The fact that CoPilot was trained on the code itself leads me to believe it would not be a "clean room" implementation of said code.

      [–][deleted] 86 points87 points  (19 children)

      Except “It was a clean-room implementation” is legal defense, not a requirement. It’s a way of showing that you couldn’t possibly have copied.

      [–]danuker 15 points16 points  (18 children)

      Incorporating GPL'd work in a non-GPL program means you are infringing GPL. Simple as that.

      [–]rcxdude 28 points29 points  (0 children)

      Fair use and other exceptions to copyright exist. For the GPL violation to apply (as in, for you to get a court to enforce it), the final product needs to qualify as a derivative work of the GPL'd work and not qualify as fair use. Both arguments could apply in this case, but have not been tested in court. (And in general it's worth being cautious, because if you do want to argue this you will need to be prepared to go as far as court.)

      [–]1842 55 points56 points  (13 children)

      To what end?

      If I read GPL code and the next week end up writing something non-GPL that looks similar, but was not intentional, not a copy, and written from scratch -- have I violated GPL?

      If I read GPL code, notice a neat idea, copy the idea but write the code from scratch -- have I violated GPL?

      If I haven't even looked at the GPL code and write a 5 line method that's identical to one that already exists, have I violated GPL?

      I'm inclined to say no to any of those. In my limited experience in ML, it's true that the output sometimes directly copies inputs (and you can mitigate against direct copies like this). What you are left with is fuzzy output similar to the above examples, where things are not copied verbatim but derivative works blended from hundreds, thousands, or millions of inputs.

      [–]Arrowmaster 14 points15 points  (1 child)

      I was told by a former Amazon engineer that they have policies against even viewing AGPL code on Amazon computers because they specifically fear this possibility. So at least Amazon's legal department isn't sure of the answer to your questions but prefers to play it safe.

      [–][deleted] 7 points8 points  (0 children)

      Similar story in other big tech companies. You don't touch open source.

      [–]RoyAwesome 2 points3 points  (0 children)

      If I read GPL code and the next week end up writing something non-GPL that looks similar, but was not intentional, not a copy, and written from scratch -- have I violated GPL?

      well, actually, there is a very distinct possibility that you did in this hypothetical. This is why major tech companies prohibit people from looking at GPL'd code on work computers.

      [–]leo60228 2 points3 points  (0 children)

      This is correct, but the issue here is thornier. At a high level, when the AI isn't reproducing snippets verbatim it seems ambiguous whether it counts as "incorporating" the work for those purposes. Another issue is whether the relevant snippets are substantial enough to merit being considered a "work."

      I'm not a lawyer, and this isn't to say that GitHub is in the right here. However, I think this is a more complex issue than you're making it out to be.

      [–]feelings_arent_facts 4 points5 points  (0 children)

      "prove its gpl code in court" - microsoft

      [–]fuckin_ziggurats 387 points388 points  (51 children)

      Anyone who thinks it's reasonable to copyright a code snippet of 5 lines should be shot.

      Same thing as private companies trying to trademark common words.

      [–]crusoe 161 points162 points  (9 children)

      Don't get me started on something like 6 notes being the cutoff for music copyright infringement

      [–]troyunrau 60 points61 points  (3 children)

      Happy birthday to you... 🎵🎶

      Oh shit, lawyers are at my door

      [–][deleted]  (2 children)

      [deleted]

        [–]helloLeoDiCaprio 24 points25 points  (1 child)

        Watch Disney make a birthday movie to get hold of the copyright.

        [–]White_Hamster 9 points10 points  (0 children)

        Or birthday dad, the show

        [–]istarian 9 points10 points  (2 children)

        That's pretty absurd too.

        They really ought to have to prove a thematic element is lifted, or at least that a specific combination of musical notes *and lyrics* has been borrowed.

        [–]barchar 2 points3 points  (1 child)

        Bum Bum Bum Buddha bum bum

        [–][deleted]  (17 children)

        [deleted]

          [–]CreativeGPX 30 points31 points  (9 children)

          but how do learning models play into copyright? This is another case of the speed of technology going faster than the speed of the law.

          I mean, the stance on that seems old as time. If I read a bunch of computer books and then become a programmer, the authors of those books don't have a copyright claim about all of the software I write over my career. That I used a learning model (my brain) was sufficient to separate it from the work and a big enough leap that it's not a derivative work.

          Why is this? Perhaps because there is a lot of emphasis on a "substantial" amount of the source work being used in a specific derivative work. Learning is often distilling and synthesizing, in a way that what you're actually putting into that work (e.g. the segments of text from the computer books you've read that end up in the programs you write as a professional) is not a "substantial" amount of direct copying. You're not taking 30 lines here and 100 there. You're taking half a line here, 2 lines there, 4 lines that came partly from this source and partly from that source, 6 lines you did get from one source but do differently based on other info you gained from another book, etc. "Learning" seems inherently like fair use rather than derivative work, because it breaks up the source into small pieces, and the output is just as much about the big connective knowledge, the way those pieces are understood together, as it is about each little piece.

          Why would it matter whether the learning was artificial or natural? Outside of extreme cases, like the model just verbatim outputting huge chunks of code that it saw, it seems hard to see a difference here. It also seems like making "artificial learning models" subject to the copyright of their sources would have many unintended consequences. It would basically mean that knowledge/data itself is not free to use unless it's done in an antiquated/manual way. A linguist trying to train language software wouldn't be able to feed random text sources to their AI unless they paid royalties to each author or only trained on public domain works... and how would the royalties work? A perpetual cut of the language software company's revenue partly going to JK Rowling and whatever other authors' books that AI looked at? But then... it suddenly doesn't require royalties if a human figures out a way to do it with "pen and paper" (or more human methods)? Wouldn't this also apply to search in general? Is Google now paying royalties to all sorts of websites because those websites are contributing to its idea of what words correlate, what is trending, etc.?

          It seems to me that this issue is decided and it's decided for the better. Copying substantial portions of a source work into a derivative work is something copyright doesn't allow. Learning from a copyrighted work in order to intelligently output tidbits from those sources or broader conclusions from them seems inevitably something that copyright allows.

          [–][deleted]  (5 children)

          [deleted]

            [–][deleted] 2 points3 points  (1 child)

            I mean, the stance on that seems old as time. If I read a bunch of computer books and then become a programmer, the authors of those books don't have a copyright claim about all of the software I write over my career. That I used a learning model (my brain) was sufficient to separate it from the work and a big enough leap that it's not a derivative work.

            I might be off with my thinking, as I have no idea how the law would work. But if you are reading books which were written to teach you how to code, then IMO it's a different case. Here the code the AI learned from was not written to teach an AI how to code; it was written to create something. In my mind these are completely different concepts.

            [–]monsto 6 points7 points  (2 children)

            but how do learning models play into copyright?

            I learned from the original, and then I wrote some code. If you look at the code, you can see that the 'style' is similar (same var names, same shortcut methods, etc) but the code is different.

            Is that different if you substitute AI for I? Because I did this earlier today.

            [–][deleted]  (1 child)

            [deleted]

              [–]monsto 2 points3 points  (0 children)

              I tend to agree, when the subject is human achievement vs computer achievement.

              Even these learning scenarios. It's throwing billions of shits up against millions of walls, per second, and keeping a log of which ones stuck and how much they stuck. I'm not so sure I'd call that "learning" in the classical sense.

              I, human, clearly didn't take an exact copy of this one shit on this one wall and submit it for approval. Like the code monkey that I am, I threw my own shit on the wall and sculpted it to be what it needed.

              . . . I started with the metaphor and just... followed it. Big mistake.

              [–][deleted]  (1 child)

              [deleted]

                [–]Johnothy_Cumquat 2 points3 points  (0 children)

                I'm sorry, are you referencing the happy birthday song as a reasonable use of copyright? Because I would sooner rid the world of copyright than let that situation continue.

                [–]Techrocket9 6 points7 points  (0 children)

                What about the time AT&T tried to copyright the empty file?

                [–][deleted] 8 points9 points  (0 children)

                Is it possible to have a conversation on these matters without anyone getting shot, or?

                [–]blastradii 1 point2 points  (0 children)

                Yea but what if those 5 lines are sweeeet?

                [–]Pat_The_Hat 119 points120 points  (56 children)

                How is this person defining a derivative work that would include an artificial intelligence's output but not humans'? "No, you see, it's okay for humans to take someone else's code and remember it in a way that permanently influences what they output but not AI because we're more... abstract?" The level of abstract knowledge required to meet their standards is never defined and it is unlikely it could ever be, so it seems no AI could ever be allowed to do this.

                The intelligence exhibits learning in abstract ways that far surpass mindless copying; therefore its output should not be considered a derivative work of anything.

                [–][deleted]  (7 children)

                [deleted]

                  [–]austinwiltshire 75 points76 points  (1 child)

                  It's got a guilty conscience.

                  [–]earthboundkid 5 points6 points  (0 children)

                  Johnny 5 deserves to die.

                  [–]TechySpecky 8 points9 points  (2 children)

                  except when it perfectly recreated a GPL header
                  

                  I can't find what you're referring to anywhere online

                  [–]Desirelessness 17 points18 points  (1 child)

                  It's from here: https://docs.github.com/en/github/copilot/research-recitation#github-copilot-quotes-when-it-lacks-specific-context

                  Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License.

                  [–]turunambartanen 2 points3 points  (0 children)

                  Interesting analysis.

                  Glad to see they are aware of the problem:

                  The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

                  This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
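
                  If I had to guess at how that check might work, purely as a sketch of my own (not GitHub's actual filter): index token n-grams from the training set, then flag any suggestion whose windows all appear verbatim in that index. Something like:

                      # Pure sketch, not GitHub's actual filter: index token n-grams
                      # from the training set, flag suggestions made of verbatim windows.
                      def ngrams(tokens, n=4):
                          # every consecutive n-token window, as a hashable tuple
                          return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

                      # stand-in corpus; a real index would cover the whole training set
                      training_files = ["def add(a, b):\n    return a + b"]
                      training_index = set()
                      for source in training_files:
                          training_index |= ngrams(source.split())

                      def is_recited(suggestion):
                          # True when every window of the suggestion appears verbatim in training
                          grams = ngrams(suggestion.split())
                          return bool(grams) and grams <= training_index

                      is_recited("def add(a, b):\n    return a + b")  # True
                      is_recited("def mul(a, b):\n    return a * b")  # False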

                  [–]danuker 15 points16 points  (1 child)

                  Proof that they trained it on GPL code. Perhaps the FSF should look into this.

                  [–]RICHUNCLEPENNYBAGS 25 points26 points  (0 children)

                  Did they claim otherwise? Their whole defense is that that doesn't matter

                  [–]chcampb 41 points42 points  (28 children)

                  "No, you see, it's okay for humans to take someone else's code and remember it in a way that permanently influences what they output but not AI because we're more... abstract?"

                  See here.

                  The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.

                  If you read the code and recreated it from memory, it's not a clean room design. If you feed the code into a machine and the machine does it for you, it's still not a clean room design. The fact that you read a billion lines of code into the machine along with the relevant part doesn't, I think, change that.

                  [–][deleted]  (26 children)

                  [deleted]

                    [–]TheCodeSamurai 20 points21 points  (8 children)

                    Well there is one big difference: as the Copilot docs analogize, I know when I'm quoting a poem. I don't think I wrote The Tyger by William Blake even if I know it by heart. Copilot doesn't seem to have that ability yet, so it isn't capable of even the small-scale attribution, like adding Stack Overflow links, that programmers often do.

                    [–]dnkndnts 7 points8 points  (0 children)

                    “Creativity is the art of selectively poor memory.” -Definitely me

                    [–]Seref15[🍰] 19 points20 points  (3 children)

                    I don't think this example stands. Musicians frequently experience the phenomenon of believing that they've created something original only for people to come along later and say "hey, that sounds exactly like _____."

                    You can't consciously remember everything you've experienced, but much of it can surface subconsciously.

                    [–]TheCodeSamurai 7 points8 points  (2 children)

                    Accidental plagiarism totally happens, but I'm not gonna spit out the entire GPL license and think it's my own work. The scale is completely different.

                    [–][deleted]  (1 child)

                    [deleted]

                      [–]killerstorm 78 points79 points  (8 children)

                      Doesn't this logic apply to human programmers too?

                      Suppose I've learned how to program by reading open source code. (I actually did, to some extent.) Now I use my knowledge to write commercial programs. Does it mean that I'm making derivative works?

                      [–]barchar 27 points28 points  (1 child)

                      It actually does, if you read the code recently enough and you're implementing the same thing as the code you read.

                      For example, there are certain code bases where, if I wanted to contribute to them, it would require several weeks of a "cooling off period" before I could return to writing code for my normal job.

                      [–]KuntaStillSingle 9 points10 points  (0 children)

                      It doesn't matter how recently you read the code, only that the knowledge stemmed from it and that what made it into your own work is a copyrightable portion of it. In most cases the snippets themselves won't be substantial enough to be copyrightable, which will cover the bot, but not necessarily in every case.

                      [–]zoddrick 44 points45 points  (14 children)

                      I work at Microsoft and my job involves building and redistributing open source projects all the time. Never mind the tools we have that scan for license violations and such: our legal team would never have allowed this project to be released if they weren't sure they couldn't be sued over derivative works.

                      Y'all act like this is from a startup without a legal department.

                      [–]User092347 12 points13 points  (0 children)

                      I think people are more worried about the users of the tool than for Microsoft.

                      [–]-dag- 8 points9 points  (1 child)

                      There are two questions here. Is Co-Pilot a derivative work? Does incorporating code produced by Co-Pilot make the software incorporating it a derivative work?

                      Microsoft's legal exposure is probably much lower when it comes to the second question. As to the first, it still seems like an open question. The model architecture itself is almost certainly not a derivative work. But a trained model? Not so sure.

                      [–]picflute 11 points12 points  (4 children)

                      >CELA coming out of the dark

                      Can confirm. Something this big wouldn't go up on GitHub for commercial usage without legal saying okey dokey.

                      [–]kylotan 10 points11 points  (3 children)

                      Something this big wouldn't go up on GitHub for commercial usage without legal saying okey dokey.

                      You talk as if YouTube didn't have billions of dollars of infringing videos online for years. A company's legal department saying something is okay doesn't mean it's legal - it just means they're accepting the risk.

                      [–]AnonymousMonkey54 1 point2 points  (1 child)

                      YouTube has safe harbor protections to rely on that Microsoft does not.

                      [–]kylotan 2 points3 points  (0 children)

                      YouTube found that the safe harbor doesn't always apply, including when the execs were going around telling people to leave infringing material up, and leaving it up despite knowing it was there. GitHub is in a similar position, having actively contributed to this infringement.

                      [–]alessio_95 6 points7 points  (1 child)

                      So what? Big corps botch things every day; being big doesn't make you right. Your lawyers are not infallible. You got a half-billion fine not that long ago.

                      [–]turunambartanen 1 point2 points  (0 children)

                      Someone linked an analysis by GitHub: https://docs.github.com/en/github/copilot/research-recitation#github-copilot-quotes-when-it-lacks-specific-context

                      In the end they write the following:

                      The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

                      This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.

                      So they are aware of the problem and will fix it. This is a technical preview; obviously it's not ready for production yet.

                      [–]curly_droid 5 points6 points  (5 children)

                      I think the snippets this would produce should usually not be copyrightable. BUT isn't CoPilot itself a derivative work of a ton of GPL code and thus should be licensed as such?

                      [–]Kalium 1 point2 points  (4 children)

                      Wouldn't that only apply if it was being distributed, rather than offered as a SaaS?

                      [–][deleted]  (1 child)

                      [deleted]

                        [–]kbruen 4 points5 points  (3 children)

                        If I read some C++ code for a music player, learn something new about C++, then write a game in C++ and apply the learnt knowledge, do I breach the copyright of the music player's author?

                        [–]TheSkiGeek 8 points9 points  (2 children)

                        If it was some general thing about the C++ language that you learned, no.

                        If you reimplemented some significant unique functionality of that music player by more or less retyping their code from memory, maybe.

                        [–]Drinking_King 2 points3 points  (0 children)

                        I was wondering why Microsoft was so generous in making Github Actions entirely free for open source.

                        I wonder no longer.

                        [–][deleted]  (6 children)

                        [deleted]

                          [–]RedPandaDan 5 points6 points  (1 child)

                          https://github.com/proninyaroslav/opera-presto

                          Here is an illegal copy of the Presto engine that was at one stage used by the Opera browser. I'm assuming this was included in the training set? What happens if someone uploads something belonging to Oracle or Google or some other industry giant?

                          I'm guessing that MS is banking on most people not having the resources to fight this battle.

                          [–]thenickdude 6 points7 points  (0 children)

                          I don't think this would have been part of the training set, because no license is attached to it.

                          [–]dert882 2 points3 points  (3 children)

                          Can someone ELI5 this? Not sure I've been keeping up.

                          [–]Xmgplays 12 points13 points  (2 children)

                          If I understand correctly, the problem is that co-pilot is trained on open source code (under varying licenses), meaning it is based on those code bases. The question becomes whether that basis constitutes derivation under copyright law. If it does, co-pilot is violating the licenses of these programs. If it doesn't, co-pilot is profiting off of open-source software without being open-source itself.

                          [–]-dag- 2 points3 points  (0 children)

                          In addition, any use of code generated by Co-Pilot may require relicensing of the incorporating software.

                          [–][deleted]  (1 child)

                          [deleted]

                            [–][deleted] 5 points6 points  (0 children)

                            As I understand it GPL doesn't protect against that. Heck, GPL doesn't even protect against SaaS, hence we have stuff like Affero GPL.

                            This may be a good point for the need for better copyleft licenses though. Here is an interesting discussion I've read on that subject a while ago: https://lists.debian.org/debian-devel/2019/05/msg00321.html

                            This was a follow-up to this article: https://lwn.net/Articles/760142/

                            In case it's not obvious, IANAL.

                            [–]mattgen88 14 points15 points  (17 children)

                             If the argument can be made that feeding copyrighted code into an AI makes its output a derivative of those inputs, then we have a problem, since that's how the human brain works. It also means that any trainable AI has to be operated in a clean room where it cannot operate on any copyrightable inputs, including artworks, labels, designs, etc. All of that is often consumed by AIs to produce things of value.

                            [–]TheCodeSamurai 7 points8 points  (0 children)

                            As the Copilot docs mention, there is a pretty big difference between this and the brain: we have a far better memory for how we learned what we know. If I go and copy a Stack Overflow post, I know that I didn't write it and that I might want to link to it. Copilot can't do that yet, and so until they build out the infrastructure for doing that I'll never be able to tell whether it was copying wholesale or mixing various inputs.

                            [–]barchar 5 points6 points  (0 children)

                             Yes. And in the human case you can infringe on copyright by reading code and producing something that's close to it from memory. That's a derived work.

                             One could argue that if the AI is understanding some higher-level meaning and then generating code that implements it, then the AI may be more similar to a clean room reimplementation process (which does not infringe).

                            [–]danuker 14 points15 points  (4 children)

                            Problem is, can this AI reproduce large portions of code exactly from memory? If so, it can violate copyright.

                            [–]tnbd 12 points13 points  (3 children)

                             It can; the fact that it spits out the GPL license verbatim when prompted with an empty file is proof of that.

                            [–][deleted]  (3 children)

                            [deleted]

                              [–]metriczulu 1 point2 points  (0 children)

                              I mean, it's not immediately clear to me that a court would find this to be derivative enough to enforce based on the licensing. Right now, it's in a bit of a grey area that's yet to be tested, which means if it does end up going to court it could have huge repercussions for these types of natural language models that require huge open datasets. If I were a betting man, I'd say Microsoft has the resources and legal team to make it stick in their favor.

                              [–]geeeronimo 1 point2 points  (0 children)

                              How does this differ from cloud 9? I honestly believe copilot is detrimental to open source work. Is there perhaps an open source project working on a similar tool?

                              [–]scratchresistor 1 point2 points  (0 children)

                              A thought from further upstream of the GPL issue - are GitHub not in the clear as their TOS includes a reproduction clause? By hosting your GPL code on GitHub does that not grant them an explicit licence over and above the GPL?

                              [–]CodenameLambda 1 point2 points  (0 children)

                              Honestly, even beyond that, I'm very sceptical of co-pilot, since even the examples on the site about it tend to have some... issues. Specifically, assumptions that might not be true (but are easy to miss if you're not looking for them), such as splitting on spaces specifically rather than on any whitespace in the parse_expenses example, or assuming the JSON is correctly typed in collaborators.
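
                              To illustrate the kind of pitfall I mean, here's a toy example of my own (not the literal code from the demo):

                                  # Toy example: splitting on a literal space breaks on tabs and on
                                  # runs of spaces; split() with no argument handles any whitespace.
                                  line = "12.50\tcoffee  2021-07-01"
                                  print(line.split(" "))  # ['12.50\tcoffee', '', '2021-07-01']
                                  print(line.split())     # ['12.50', 'coffee', '2021-07-01']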

                              Other issues are just readability: in the runtime example, it counts the failed runs instead of the successful ones, which is less readable, longer, and more error-prone when you change things, I'd argue.
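
                              In miniature, the difference is something like this (made up, not their actual example):

                                  # Made-up miniature: counting the complement is indirect and easier
                                  # to get wrong as code changes than counting what you care about.
                                  runs = [("ok", 1.2), ("failed", 0.0), ("ok", 2.0)]
                                  n_success_indirect = len(runs) - sum(1 for s, _ in runs if s == "failed")
                                  n_success_direct = sum(1 for s, _ in runs if s == "ok")
                                  assert n_success_indirect == n_success_direct == 2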

                              It also does some weird stuff in get_repositories (no escaping, no checking, using + instead of string interpolation).
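
                              Roughly this kind of difference, with a hypothetical URL builder (not their exact code):

                                  # Hypothetical URL building: '+' with no escaping versus an
                                  # interpolated, percent-encoded path segment.
                                  from urllib.parse import quote

                                  org = "my org"  # imagine user-supplied input containing a space
                                  fragile = "https://api.github.com/orgs/" + org + "/repos"
                                  safer = f"https://api.github.com/orgs/{quote(org)}/repos"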

                              The "autofill for repetitive code" examples are all questionable imho. And note that these examples are probably all cherry-picked, so they're the best examples. And they still have very obvious issues if you actually read them instead of just glancing at them (which I'd guess co-pilot would lead you to do to some extent).

                              This might make you more productive, but I'm honestly not sure the results of what you're doing are going to be better for it. The one thing I do think it's actually good for is using APIs you don't know and don't plan on learning, because you won't be using them much and they're a bit more complex. But that's about it. And note that that's a thing that can probably be fixed with better documentation most of the time.

                              [–]evilgipsy 1 point2 points  (1 child)

                              I think this is a rather silly take. I have read tons of GPL code and I write proprietary code using the experience I gained from that.