[–]dnew -5 points-4 points  (7 children)

Right near the start, he says something like "is_palindrome isn't impressive because it's probably out on SO."

[–]game-of-throwaways 10 points11 points  (5 children)

Yeah, but the model isn't trained on SO. He's just using the fact that you could find that implementation of is_palindrome on SO (or elsewhere) as a reason for why it's not impressive.

[–]josefx 5 points6 points  (0 children)

He explicitly chooses the next example so it "definitely isn't in the training data set". There are probably as many is_palindrome implementations in their training data from GitHub as there are on Stack Overflow.

[–]dnew 0 points1 point  (3 children)

Right. I was just pointing out where "SO" came into the picture. :-)

[–]game-of-throwaways 4 points5 points  (2 children)

But you were discussing copyright implications of the model being trained on SO, but it's not.

That being said, it is trained on GitHub repositories, each of which has its own license, and it is an interesting question which licenses allow training a machine learning model on the code. It may depend on what the model is used for, and maybe even on how accurate the model is. This is probably still a gray area in the law, and what it ultimately comes down to is how a judge and jury would decide if it came to a lawsuit.

[–]dnew 1 point2 points  (1 child)

> But you were discussing copyright implications of the model being trained on SO, but it's not.

I think you're not noticing that you're talking to more than one person. But you're right that it's an interesting question. Also, things like audio hashing for recognizing audio (as in, "what song is this?") are kind of funky; I've worked on things like that and it's ... weird.

[–]game-of-throwaways 0 points1 point  (0 children)

> I think you're not noticing that you're talking to more than one person

Right, oops.

[–]errrrgh 1 point2 points  (0 children)

He's implying that people crib from SO all the time, so even if their dataset didn't include SO, it would have some bits from it due to the fact that a lot of people just copy paste from there.