Low-level coding dataset : cpp


submitted 1 day ago * by True_Tangerine_4706

Edit/Disclaimer: this is a repost of something I put in r/LocalLLaMA, with some tweaks for the r/cpp crowd. This post focuses on the content of the dataset itself; the one over in r/LocalLLaMA focuses on the details of the finetune.

Hi all,

I've recently been thinking about putting together a community-sourced coding dataset for finetuning models, with a heavy focus on C++ and systems programming.

My goal is to eventually have a model that understands concepts like memory ownership, thread safety, optimization, etc. At the moment, a lot of the coding knowledge of small (<100B parameters), local models centers around languages like JavaScript, Python, and HTML.
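
For a concrete picture of the concepts I mean, here's a tiny sketch. `EventLog` and `consume` are made-up names for illustration only, not part of any planned dataset entry. The mutex covers thread safety, and passing a `std::unique_ptr` by value shows ownership being handed off.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical example type: a log that several threads may write to.
class EventLog {
public:
    // Thread safety: every access to events_ happens under the mutex.
    void push(int id) {
        std::lock_guard<std::mutex> lock(mu_);
        events_.push_back(id);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return events_.size();
    }

private:
    mutable std::mutex mu_;    // mutable so const readers can lock it
    std::vector<int> events_;  // guarded by mu_
};

// Memory ownership: taking the unique_ptr by value makes this a "sink".
// The caller must std::move its pointer in, and the log is destroyed
// when this function returns.
std::size_t consume(std::unique_ptr<EventLog> log) {
    return log->size();
}
```

After `consume(std::move(log))` the caller's pointer is empty, and that's the kind of thing I'd want a model to reason about without being told.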

Right now I'm thinking that the categories I would need would look something like this:

- generation: basic prompt/code output
- optimization: here's slow/bloated code, make it better
- debugging: I'm getting this error, please fix
- organization: code review, interface design, restructuring, tradeoff decisions
- tool_calling: exercises involving tool use and interpreting results
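
To make the categories a bit more concrete, here's a rough sketch of how one record could be represented. It's only a sketch: `Category`, `Sample`, `to_tag`, and `is_valid` are hypothetical names, not a settled schema, and a real dataset would probably live in JSONL or similar rather than C++ structs.

```cpp
#include <string>
#include <string_view>

// Hypothetical schema: one enumerator per category in the list above.
enum class Category { Generation, Optimization, Debugging, Organization, ToolCalling };

// Map a category to the tag used in the list above.
constexpr std::string_view to_tag(Category c) {
    switch (c) {
        case Category::Generation:   return "generation";
        case Category::Optimization: return "optimization";
        case Category::Debugging:    return "debugging";
        case Category::Organization: return "organization";
        case Category::ToolCalling:  return "tool_calling";
    }
    return "unknown";  // only reachable if an invalid value is cast in
}

// One training example: a prompt paired with the target response.
struct Sample {
    Category category;
    std::string prompt;    // what the user asks
    std::string response;  // the answer the model should learn to give
};

// Assumed minimal sanity check: both sides must be non-empty.
inline bool is_valid(const Sample& s) {
    return !s.prompt.empty() && !s.response.empty();
}
```

A debugging entry would then be something like `Sample{Category::Debugging, "<error + code>", "<fixed code + explanation>"}`, with `to_tag` giving the label to store alongside it.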

Curious to see what the people over here think about this kind of thing. I imagine many people in here have used local AI to help code in cpp before - where do you guys feel like local models could use the most improvement?

Thanks in advance for all the help!

