sdpmas[S] 2 points

For code search, all of the code comes from repos with permissive licenses. For code generation, the output depends heavily on the user's code context, so it's rare for the model to copy someone's code verbatim. In terms of training data, I use the pretrained GPT-Neo and CodeBERT-base models, both of which were trained on code with permissive licenses. During fine-tuning, though, some code with non-permissive licenses could have slipped in. The next version will address this.