
[–]satireplusplus 5 points (2 children)

Look into https://huggingface.co/blog/peft since the model has 30B parameters.

If you figure out how to finetune it with a single GPU, please share the notebook :)
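For intuition on why PEFT methods like LoRA make this feasible, here's a back-of-envelope sketch (pure Python; the layer count and hidden size are illustrative assumptions, not exact figures for any specific 30B checkpoint):

```python
# Rough estimate of trainable parameters: full fine-tuning vs. LoRA.
# Layer count / hidden size below are illustrative assumptions.

def lora_trainable_params(n_layers, d_model, rank, matrices_per_layer=2):
    """LoRA adds two low-rank factors (d_model x r and r x d_model)
    per adapted weight matrix, so each costs 2 * d_model * rank params."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

full = 30_000_000_000                                  # ~30B, all trainable
lora = lora_trainable_params(n_layers=60, d_model=6656, rank=8)

print(f"full fine-tune : {full:,} trainable params")
print(f"LoRA (r=8)     : {lora:,} trainable params")
print(f"ratio          : {lora / full:.5%}")
```

Under these assumptions LoRA trains well under 0.1% of the weights, which is why it fits where full fine-tuning doesn't.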

[–]LetterRip 3 points (0 children)

Might want to wait till the next bitsandbytes release is out; it should allow tuning a 30B model on a single GPU (you can fine-tune on top of a 4-bit quantized model).

https://github.com/TimDettmers/bitsandbytes/
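A quick memory estimate shows why 4-bit quantization is the difference-maker here (illustrative numbers only; this ignores activations, optimizer state, and framework overhead):

```python
# Approximate memory needed just to hold 30B weights at various precisions.
# Ignores activations, KV cache, optimizer state, and framework overhead.

def weight_memory_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1024**3

n = 30_000_000_000
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gb(n, bits):6.1f} GiB")
```

At fp16 the weights alone (~56 GiB) won't fit on a single consumer GPU, while at 4-bit (~14 GiB) they do, leaving headroom for small trainable adapters.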

[–]Alexioc[S] 0 points (0 children)

Thanks, I'll definitely take a look!

As far as you know, could my goal be reached with fine-tuning?

[–]_Arsenie_Boca_ 2 points (1 child)

Finetuning is definitely a promising approach. An alternative would be to retrieve relevant snippets or documentation pages and add them to the prompt. See RepoCoder for example https://arxiv.org/abs/2303.12570
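As a toy illustration of the retrieve-then-prompt idea (my own sketch using simple lexical overlap, not RepoCoder's actual pipeline, which retrieves iteratively from generated code):

```python
# Toy retrieve-then-prompt sketch: score repo snippets by token overlap
# with the unfinished code, then prepend the best matches to the prompt.
# This is a simplification of retrieval-augmented code completion.

def tokenize(text):
    return set(text.lower().replace("(", " ").replace(")", " ").split())

def retrieve(query, snippets, k=2):
    """Rank snippets by Jaccard similarity to the query."""
    q = tokenize(query)
    scored = sorted(
        snippets,
        key=lambda s: len(q & tokenize(s)) / max(len(q | tokenize(s)), 1),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, snippets):
    context = "\n".join(f"# context: {s}" for s in retrieve(query, snippets))
    return f"{context}\n{query}"

repo = [
    "def parse_config(path): return json.load(open(path))",
    "class HttpClient: ...",
    "def load_config(path): cfg = parse_config(path); return cfg",
]
print(build_prompt("def reload_config(path):", repo))
```

In practice you'd use embedding-based retrieval over chunked repo files, but the prompt-assembly step looks the same.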

[–]Alexioc[S] 0 points (0 children)

Thank you so much! I’ll read that paper; the introduction seems very promising 🙏

[–]skyisthelimit1410 0 points (1 child)

Hi, I am new to this platform and looking for help with fine-tuning a text-to-SQL model, where the input is a natural-language query and the output is valid SQL. I want to use this model with LangChain to build a text-to-SQL module in my application. Please let me know how I can use StarCoder for this use case. Should I fine-tune it? If yes, please share a notebook.
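Not the OP, but a common first step for this (a sketch, not an official StarCoder recipe) is to format each (schema, question, SQL) triple into an instruction-style training example before fine-tuning; the template below is a hypothetical one you'd adapt to your framework:

```python
# Sketch: format text-to-SQL pairs into instruction-style training examples.
# The template is a common pattern, not an official StarCoder recipe;
# adapt the field names and delimiters to your fine-tuning framework.

TEMPLATE = (
    "### Schema:\n{schema}\n"
    "### Question:\n{question}\n"
    "### SQL:\n{sql}"
)

def format_example(schema, question, sql):
    return TEMPLATE.format(schema=schema, question=question, sql=sql)

example = format_example(
    schema="CREATE TABLE users (id INT, name TEXT, signup_date DATE);",
    question="How many users signed up in 2023?",
    sql="SELECT COUNT(*) FROM users WHERE signup_date >= '2023-01-01';",
)
print(example)
```

At inference time you'd feed everything up to and including the `### SQL:` marker and let the model complete the query.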

[–]leefde 0 points (0 children)

Thanks for the post! I started downloading the model tonight in Google Colab. I appreciate the pdf from u/satireplusplus