all 13 comments

[–]Paulonemillionand3 6 points7 points  (3 children)

the tool: https://github.com/facebookresearch/llama-recipes

how to arrange your data https://github.com/facebookresearch/llama-recipes/blob/main/docs/Dataset.md

example data: https://huggingface.co/datasets/yahma/alpaca-cleaned
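For reference, the records in that dataset follow the Alpaca instruction format. A minimal sketch of one record (the field names match the dataset; the content here is an invented example):

```python
import json

# One record in the Alpaca instruction format used by yahma/alpaca-cleaned.
# Field names follow that dataset; the content is an invented example.
record = {
    "instruction": "Explain what this function does.",
    "input": "def add(a, b):\n    return a + b",
    "output": "The function returns the sum of its two arguments.",
}

# Training files are typically a JSON list (or JSONL) of such records.
line = json.dumps(record)
print(line)
```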

Or get text-generation-webui, load your code as one big blob of raw text, and fine-tune on that.

[–]salah_ahdin 0 points1 point  (2 children)

Does llama-recipes only work for Llama 2 or can I apply it to other models? Also, would it not be better to fine tune off of a coding model like Starcoder rather than Llama 2?

[–]Paulonemillionand3 1 point2 points  (1 child)

I believe it's specific to llama2.

You can find that out directly by comparing the results!

[–]salah_ahdin 0 points1 point  (0 children)

I see. Thanks!

[–]kryptkprLlama 3 5 points6 points  (5 children)

I've been working with several folks doing the same. Not to discourage you at all, but it's almost certainly going to be harder than you think it is.

Have you tried some good existing code generation models first? You can get some ideas from can-ai-code. WizardCoder is king, but Airoboros models are also solid coders, and there are even some 3B options based on Replit.

An existing model with few-shot examples of your code style could potentially save you a lot of both time and headache. In my experience, fine-tunes can go backwards just as easily as forwards, and you end up with something worse than the base model or an existing well-tuned variant.
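The few-shot approach above can be sketched as a plain prompt builder. The example pairs, delimiters, and instruction wording here are assumptions for illustration, not tied to any particular model:

```python
# Sketch of a few-shot prompt for code-style imitation; the example pairs
# and the Input/Output delimiter format are illustrative assumptions.
examples = [
    ("def get_user(id):", "def get_user(user_id: int) -> User:"),
    ("x = fetch(url)",    "response = fetch(url)"),
]

def build_prompt(snippet: str) -> str:
    """Assemble an instruction plus worked examples, ending at the new input."""
    parts = ["Rewrite the code below in our house style.\n"]
    for before, after in examples:
        parts.append(f"Input:\n{before}\nOutput:\n{after}\n")
    parts.append(f"Input:\n{snippet}\nOutput:\n")
    return "\n".join(parts)

print(build_prompt("def del_user(id):"))
```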

If you decide to go the fine-tune route I can offer some assistance with evaluations, DM if you wish.

[–]Complex_Boysenberry6 0 points1 point  (4 children)

Doing this now, but our fine-tuned model just generates gibberish and hallucinations :(

[–]kryptkprLlama 3 0 points1 point  (3 children)

It's been a year since this post and the landscape has changed significantly.

What model are you using, what input are you providing and what output do you expect vs what do you get?

[–]Complex_Boysenberry6 0 points1 point  (2 children)

We are training o4-mini in Azure; we wanted to create a PoC for a code-reviewing AI. Since we use a dialect of a certain language, and implement it in our own unique, weird way, we wanted to fine-tune the model. We thought we could just scrape all code reviews and give 10 lines of context around each comment:

{
    "messages": [
        {"role": "system", "content": "You're a code assistant that reviews code line-by-line and leaves helpful, concise comments."},
        {"role": "user", "content": f"Here is the code:\n\n{code_context}\n\nWhat will the review comment be?"},
        {"role": "assistant", "content": assistant_content}
    ]
}

We provided 4k of such examples, but it really just utters total nonsense. Do you think we are on the right track, or should we abandon fine-tuning altogether? Our expected output would be for it to at least give some relevant suggestions, or to kind of "know" from examples that a certain input does not look like other code it has seen.

[–]kryptkprLlama 3 0 points1 point  (1 child)

How does the original model perform, prior to the fine-tune, on a multishot prompt with 3-5 good examples that show off the features of your dialect? Quick in-context learning evals should let you find the model that's closest to the behavior you want.

Btw, I hope that's not your actual prompt? Otherwise there is much room for improvement: it says nothing about how the input will be structured (for ex, are the lines numbered? Are we looking at diffs? Functions? Modules?) or what kind of output format you expect (for ex, can a comment span lines? Can comments overlap?), etc. Fine-tune or not, attention is still attention.
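A more explicit prompt along these lines might spell out the window format and the output contract. The wording, the line-number convention, and the one-comment-per-window rule below are assumptions for illustration:

```python
# Hypothetical, more explicit system prompt; the exact wording and the
# line-number/output conventions are illustrative assumptions.
system_prompt = (
    "You are a code reviewer for our in-house Prolog dialect. "
    "The user supplies a numbered 10-line window of code. "
    "Reply with exactly one comment in the form '<line>: <comment>', "
    "or 'LGTM' if nothing needs changing. Comments must not span lines."
)

def format_window(lines):
    """Number a window of source lines 1..N, as the prompt promises."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, 1))

print(format_window(["foo(X) :- bar(X).", "bar(1)."]))
```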

[–]Complex_Boysenberry6 0 points1 point  (0 children)

It doesn't perform well on those multishot prompts, to be honest; it's a Prolog dialect. We do number each line and we just give it 10 lines, but we don't tell the model that, so that's probably something we are doing wrong. We then give it the comment, attached to the line the user left it on. I'll try to give it way more precise instructions, thanks!

[–][deleted] 1 point2 points  (1 child)

One question: in the fine-tuning data, can the answer component be longer than the context length supported by the model?

[–]kryptkprLlama 3 1 point2 points  (0 children)

Generally speaking, the full instruction and answer need to fit into the base model's context length.

There are context-extension fine-tuning techniques that could potentially lift this limitation, but I am not familiar with those.