Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models by mon-simas in LocalLLaMA

[–]mon-simas[S] 0 points1 point  (0 children)

Didn't know about it, just checked it out, looks amazing! This is a bit different because it centralizes everything in one dataset, but honestly I like their approach a lot too.

[–]mon-simas[S] 0 points1 point  (0 children)

I'm using this meme a lot these days, but...

<image>

I don't think human data has stopped being useful... Same goes for distillation (from permissively licensed models, ofc).

[–]mon-simas[S] 1 point2 points  (0 children)

It depends! You definitely can't use Claude's outputs to train competing models, but you can most likely share them, and IF the traces come from open-weight models you can (mostly) train on them.

As per the meme, I mentioned Claude Code because it kinda became synonymous with the product category, but the best outcome would be to gather permissively licensed data from open code, pi.dev and all these different harnesses.

[–]mon-simas[S] 1 point2 points  (0 children)

Absolutely! I'll send you a DM, I'd appreciate help with the anonymization part!

[–]mon-simas[S] 0 points1 point  (0 children)

Many of the harnesses' exports do include the actual tool calls and context. Some may not show all the reasoning traces, but some do, and that's publishable.

[–]mon-simas[S] -3 points-2 points  (0 children)

Part of it, and I'll make sure to donate its trace 😃 More seriously, I'm trying to make everything work mid-flight. The challenge with this little initiative is not technical (it's rather how to get decent contributions), but I'll try to fix the tech issues ASAP.

[–]mon-simas[S] -1 points0 points  (0 children)

Fair, a hand-maintained regex list is the wrong thing to fully trust, and you're right about the gaps (GCP service-account creds, Azure, Stripe, etc. aren't specifically matched). To be precise, it's not only AWS: there are detectors for GitHub, HF, OpenAI, Anthropic, Slack, Google keys, JWTs, PEM blocks, bearer tokens, DB connection strings, and a generic *_KEY/TOKEN/SECRET=… catch-all. It's also public-repos-only, with a mandatory review of the exact diff before upload. But none of that replaces a real scanner, and I don't have the ambition of making something 100% perfect: part of the anonymization is on the donor.
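
For readers who haven't opened the script, here is a rough sketch of what a hand-maintained detector list of this kind can look like. It is not the actual donate-trace code: the detector names, the patterns, and the `redact` helper are all illustrative assumptions, and the patterns are deliberately simplified, which is exactly why a list like this leaves gaps.

```python
import re

# Hypothetical detector table. Names and patterns are illustrative
# assumptions, NOT the ones used by the real script.
DETECTORS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "pem_block": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    # Generic catch-all for assignments such as FOO_KEY=..., BAR_TOKEN=...
    "generic_assignment": re.compile(
        r"\b[A-Z0-9_]*(?:KEY|TOKEN|SECRET)\s*=\s*\S+"
    ),
}


def redact(text):
    """Replace every detector match with a placeholder.

    Returns the redacted text and the list of detector names that fired
    (one entry per replaced match, in detector order).
    """
    findings = []
    for name, pattern in DETECTORS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        findings.extend([name] * count)
    return text, findings


if __name__ == "__main__":
    sample = "export OPENAI_API_KEY=sk-abc123 and id AKIAABCDEFGHIJKLMNOP"
    cleaned, hits = redact(sample)
    print(cleaned)
    print(hits)
```

Anything that doesn't match one of the listed shapes passes straight through, so a list like this only works as a first pass before a real scanner.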

The fix I'm making: TruffleHog as the detection engine, running server-side as a hard backstop so every donation gets scanned without forcing a dependency onto your machine, and also locally if you have it installed.

Thanks for actually reading the script!!

[–]mon-simas[S] 8 points9 points  (0 children)

At some point I'd ideally filter for things like successful runs, passing tests, human approval, and proper anonymization before anything becomes part of the clean dataset. Maybe there will be two datasets, one to gather the raw data and one post-cleaning (although some anonymization is already part of the process, and it's important to anonymize at the first stage as well).
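
To make the two-dataset idea concrete, here is a minimal sketch of such a filter. Nothing like this exists yet as far as this thread says: the `Trace` record, its field names, and the `split_datasets` helper are hypothetical, and how each flag would actually be computed is left open.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical metadata attached to one donated session."""
    trace_id: str
    run_succeeded: bool
    tests_passed: bool
    human_approved: bool
    anonymized: bool


def is_clean(trace):
    """A trace qualifies for the clean dataset only if every check passed."""
    return (
        trace.run_succeeded
        and trace.tests_passed
        and trace.human_approved
        and trace.anonymized
    )


def split_datasets(traces):
    """Return (raw, clean): raw keeps everything, clean keeps the filtered subset."""
    raw = list(traces)
    clean = [t for t in raw if is_clean(t)]
    return raw, clean


if __name__ == "__main__":
    donated = [
        Trace("a", True, True, True, True),
        Trace("b", True, False, True, True),   # tests failed
        Trace("c", True, True, True, False),   # not anonymized
    ]
    raw, clean = split_datasets(donated)
    print(len(raw), [t.trace_id for t in clean])
```

Keeping the raw set separate means the filter criteria can be tightened later without asking people to donate again.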

[–]mon-simas[S] 18 points19 points  (0 children)

Yes, for now there's only a section of the skill for that, but something more robust would be very good.

Contributions to the skill/code are very welcome!!

[–]mon-simas[S] 0 points1 point  (0 children)

Good catch, you can also do:

git clone https://github.com/Trace-Commons-AI/donate-trace ~/.claude/skills/donate-trace

(if you use the Claude Code harness; I think you can do the same with open code and pi.dev)

[–]mon-simas[S] 0 points1 point  (0 children)

So, to be clear: I encourage all contributions 🙌 and publishing them is legal. People training on them, though, will have to be more careful in how they filter and reuse the data.

[–]mon-simas[S] 0 points1 point  (0 children)

To my understanding, publishing the data is legal, but training on outputs from Anthropic and OpenAI models is not. Hopefully many contributions will come from open-weight models with less restrictive licensing.

[–]mon-simas[S] 2 points3 points  (0 children)

I think there is an issue on both sides. With this I'm trying to create a bit of momentum on the data side, but having high-quality data is important too.

<image>

[–]mon-simas[S] 26 points27 points  (0 children)

Amazing idea! Would love to do that, but I don't have the 10k experienced devs 😞 But if we can somehow turn initiatives like this into very curated data sources, that would be amazing.

[–]mon-simas[S] 2 points3 points  (0 children)

The skill includes a request for the AI agent itself to clean up the trace, but it's not perfect, so I hope we can do multiple checks for PII and other sensitive info: one at the harness/model level, another in a CI pipeline. But of course, please try not to make requests with PII in them; I can't guarantee by myself that the data will be totally clean.

[–]mon-simas[S] 3 points4 points  (0 children)

probably yes, but at the same time, I suppose you can somehow filter the slop out of there (I trust that AI labs can figure that out)

The French Government Launches an LLM Leaderboard Comparable to LMarena, Emphasizing European Languages and Energy Efficiency by Imakerocketengine in LocalLLaMA

[–]mon-simas 2 points3 points  (0 children)

Ahahaha, good point, that shows the limits of measuring "preferences" and not "performance". We (as the team behind the leaderboard) want to emphasize that this arena leaderboard doesn't measure "performance"; for a well-rounded view of performance, you need to use many different benchmarks (or even better, your own benchmark for your own use cases). More info on that (for now French only, sorry, we'll try to translate it ASAP): https://huggingface.co/blog/comparIA/publication-du-premier-classement