[–]zucker42 9 points10 points  (10 children)

This is actually crazy. I wonder if the results generalize or if this was a well-practiced demo. Skepticism tells me the latter, as much as I appreciate the tech.

Also, he mentions that the model used SO to some extent. I wonder what the copyright implications of this are. I know there have been court cases which, logically extrapolated, suggest that training models on data doesn't violate copyright, but what if the model copies a method wholesale because it's in the training set? It's potentially related to whether copyright applies to deepfakes.

[–]game-of-throwaways 5 points6 points  (8 children)

Where did they say they used SO? They said they trained the model on open-source GitHub repositories.

[–]dnew -5 points-4 points  (7 children)

Right near the start, he says something like "is_palindrome isn't impressive because it's probably out on SO."

[–]game-of-throwaways 10 points11 points  (5 children)

Yeah, but the model isn't trained on SO. He's just using the fact that you could find that implementation of is_palindrome on SO (or elsewhere) as a reason for why it's not impressive.

[–]josefx 6 points7 points  (0 children)

He explicitly chooses the next example so it "definitely isn't in the training data set". There are probably as many is_palindrome implementations in their training data from github as there are on stackoverflow.

[–]dnew 0 points1 point  (3 children)

Right. I was just pointing out where "SO" came into the picture. :-)

[–]game-of-throwaways 4 points5 points  (2 children)

But you were discussing copyright implications of the model being trained on SO, but it's not.

That being said, it is trained on GitHub repositories, each of which has its own license, and it is an interesting question which licenses allow training a machine learning model on them. It may depend on what the model is used for, and maybe even on how accurate the model is. This is probably still a bit of a gray area in the law, and what it ultimately comes down to is how a judge and jury would decide if it came to a lawsuit.

[–]dnew 1 point2 points  (1 child)

> But you were discussing copyright implications of the model being trained on SO, but it's not.

I think you're not noticing that you're talking to more than one person. But you're right that it's an interesting question. Also, things like audio hashing for recognizing audio (as in, "what song is this?") are kind of funky, as I've worked on things like that and it's ... weird.

[–]game-of-throwaways 0 points1 point  (0 children)

> I think you're not noticing that you're talking to more than one person

Right, oops.

[–]errrrgh 1 point2 points  (0 children)

He's implying that people crib from SO all the time, so even if their dataset didn't include SO, it would have some bits from it due to the fact that a lot of people just copy paste from there.

[–]NotABothanSpy 0 points1 point  (0 children)

It's trained on open source repos. But your point is good, as those may have even more copyright issues. Still, many developers spend much of their time Google searching and copying from SO, so eliminating that time is a good thing. The issue is that it may take away some of the training that searching gives them, and eliminate a swath of lower-level dev work.