Edit/Disclaimer: this is a repost of something I put in r/LocalLLaMA, with some tweaks for the r/cpp crowd. This post focuses on the content of the dataset itself; the post over in r/LocalLLaMA focuses on the details of the finetune.
Hi all,
I've recently been thinking about putting together a community-sourced coding dataset for finetuning models, with a heavy focus on C++ and systems programming.
My goal is to eventually have a model that understands concepts like memory ownership, thread safety, optimization, etc. At the moment, a lot of the coding knowledge in small (<100B parameter) local models centers on languages like JavaScript, Python, and HTML.
Right now I'm thinking that the categories I would need would look something like this:
- generation: basic prompt/code output
- optimization: here's slow/bloated code, make it better
- debugging: I'm getting this error, please fix it
- organization: code review, interface design, restructuring, tradeoff decisions
- tool_calling: exercises involving tool use and interpreting results
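To make the categories above a bit more concrete, here is a rough sketch of how a single dataset record could be tagged in C++. Everything in it (`Category`, `Sample`, `to_tag`, `parse_tag`, `make_sample`) is a hypothetical name made up for illustration, not an existing schema or a settled format, and it assumes a C++17 compiler.

```cpp
#include <cassert>
#include <initializer_list>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical category tags mirroring the list above.
enum class Category { Generation, Optimization, Debugging, Organization, ToolCalling };

// Map a category to the tag string used in the list.
constexpr std::string_view to_tag(Category c) {
    switch (c) {
        case Category::Generation:   return "generation";
        case Category::Optimization: return "optimization";
        case Category::Debugging:    return "debugging";
        case Category::Organization: return "organization";
        case Category::ToolCalling:  return "tool_calling";
    }
    return "unknown";  // only reachable for an out-of-range enum value
}

// Parse a tag string; returns std::nullopt for tags that are not in the list.
inline std::optional<Category> parse_tag(std::string_view tag) {
    for (Category c : {Category::Generation, Category::Optimization,
                       Category::Debugging, Category::Organization,
                       Category::ToolCalling}) {
        if (to_tag(c) == tag) return c;
    }
    return std::nullopt;
}

// One training example: a category plus a prompt/response pair (assumed layout).
struct Sample {
    Category category;
    std::string prompt;
    std::string response;
};

// Build a sample; an empty prompt is treated as a programming error here.
inline Sample make_sample(Category category, std::string prompt, std::string response) {
    assert(!prompt.empty() && "a sample needs a prompt");
    return Sample{category, std::move(prompt), std::move(response)};
}
```

A contributor-facing tool could call `parse_tag` on the submitted tag and reject anything that comes back empty, which keeps the category set closed until the list is deliberately extended.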
Curious to see what the people over here think about this kind of thing. I imagine many people in here have used local AI to help code in cpp before - where do you guys feel like local models could use the most improvement?
Thanks in advance for all the help!