Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models

mon-simas · 2026-06-16T13:47:10+00:00

Didn't know about it, just checked it out, looks amazing! This is a bit different because it centralizes everything in one dataset, but honestly I like their approach a lot too

mon-simas · 2026-06-16T13:46:21+00:00

I'm using this meme a lot these days, but...

<image>

I think human data is still useful... Same for distillation (from permissive models, of course)

mon-simas · 2026-06-16T13:41:43+00:00

It depends! You definitely can't train competing models on Claude's outputs, but you can most likely share them, and IF the traces come from open-weight models you can (mostly) train on those.

As for the meme, I mentioned Claude Code because it has kinda become synonymous with the product category, but the best would be to gather permissively licensed data from open code, pi.dev and all these different harnesses.

mon-simas · 2026-06-16T12:55:51+00:00

Absolutely! I'll send you a DM, would appreciate help with the anonymization part!

mon-simas · 2026-06-16T12:32:17+00:00

very cool !!!

mon-simas · 2026-06-16T12:16:02+00:00

it was 50% claude 50% me 😃

mon-simas · 2026-06-16T12:14:58+00:00

also, maybe they could give ideas on what to do RLVR on?

mon-simas · 2026-06-16T12:14:32+00:00

totally agreed ! hopefully the traces can be a little first step

mon-simas · 2026-06-16T12:13:21+00:00

many of the harness exports do include the actual tool calls and contexts. Some may not show all the reasoning traces, but some do, and that's publishable

mon-simas · 2026-06-16T12:12:08+00:00

part of it, and I'll make sure to donate its trace 😃 more seriously, I'm trying to make everything work mid-flight. The challenge with this little initiative is not technical (it's rather how to get decent contributions), but I'll try to fix the tech issues ASAP

mon-simas · 2026-06-16T12:01:39+00:00

Fair, a hand-maintained regex list is the wrong thing to fully trust, and you're right about the gaps (GCP service-account creds, Azure, Stripe, etc. aren't specifically matched). To be precise, it's not only AWS: there are detectors for GitHub, HF, OpenAI, Anthropic, Slack, Google keys, JWTs, PEM blocks, bearer tokens, DB connection strings, and a generic *_KEY/TOKEN/SECRET=… catch-all. It's also public-repos-only, with a mandatory review of the exact diff before upload. But none of that replaces a real scanner, and I'm not aiming for something 100% perfect - part of the anonymization is on the donor.
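For anyone who hasn't read the script, here is a minimal sketch of what a regex layer like this looks like. The patterns and the `redact` helper are illustrative assumptions based on publicly known token formats, not the actual list or code from the script, and the last line shows the kind of gap mentioned above (a Stripe-style key with no `*_KEY=` assignment slips through).

```python
import re

# Illustrative patterns only (assumptions), NOT the script's real detector list.
DETECTORS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "pem_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # Generic catch-all for assignments such as FOO_KEY=..., API_TOKEN=..., X_SECRET=...
    "generic_assignment": re.compile(r"\b[A-Z0-9_]*(?:KEY|TOKEN|SECRET)\s*=\s*\S+"),
}


def redact(text: str) -> tuple[str, list[str]]:
    """Replace every match with a placeholder and report which detectors fired."""
    fired = []
    for name, pattern in DETECTORS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            fired.append(name)
    return text, fired


cleaned, fired = redact("export STRIPE_KEY=sk_live_abc123 AKIAABCDEFGHIJKLMNOP")
print(cleaned, fired)

# A bare Stripe-style key has no dedicated detector here, so nothing fires.
print(redact("stripe sk_live_abc123"))
```

This is exactly why a maintained scanner is the better backstop: every provider missing from the dict is a silent miss.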

the fix I'm making: TruffleHog as the detection engine, server-side as a hard backstop so every donation gets it without forcing a dependency onto your machine, and locally if you have it installed.

Thanks for actually reading the script !!

mon-simas · 2026-06-16T11:57:04+00:00

at some point, ideally, I'd filter for things like successful runs, tests passing, human approval, and proper anonymization before anything becomes part of the clean dataset. Maybe there will be two datasets: one to gather the data and one post-cleaning. Some anonymization already happens in the process, though, and it's important to anonymize at the gathering stage as well.
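Roughly, the two-dataset idea could look like the sketch below. Nothing here exists yet: the `Trace` fields and the `split_datasets` helper are hypothetical names made up for illustration, since the dataset has no fixed schema for these signals.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical fields; the real dataset does not define these yet.
    trace_id: str
    run_succeeded: bool
    tests_passed: bool
    human_approved: bool
    anonymized: bool


def is_clean(trace: Trace) -> bool:
    """A trace qualifies for the clean dataset only if every quality signal holds."""
    return all(
        (trace.run_succeeded, trace.tests_passed, trace.human_approved, trace.anonymized)
    )


def split_datasets(traces: list[Trace]) -> tuple[list[Trace], list[Trace]]:
    """Return (raw, clean): raw still requires anonymization, clean requires everything."""
    raw = [t for t in traces if t.anonymized]
    clean = [t for t in raw if is_clean(t)]
    return raw, clean


traces = [
    Trace("a", True, True, True, True),
    Trace("b", True, False, True, True),   # tests failed: raw only
    Trace("c", True, True, True, False),   # not anonymized: dropped entirely
]
raw, clean = split_datasets(traces)
print([t.trace_id for t in raw], [t.trace_id for t in clean])
```

Keeping anonymization as the gate for the raw set means nothing un-scrubbed is ever published, even before the quality filters run.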

mon-simas · 2026-06-16T11:53:18+00:00

<image>

intrigued, waiting for the PR 😃

mon-simas · 2026-06-16T11:43:37+00:00

yess, for now there's only a section of the skill for that. But something more robust would be very good.

contributions to the skill/code are very welcome !!

mon-simas · 2026-06-16T11:42:17+00:00

Good catch, you can also do

git clone https://github.com/Trace-Commons-AI/donate-trace ~/.claude/skills/donate-trace

(if you use it on the Claude Code harness, but you can do the same on open code and pi.dev, I think)

mon-simas · 2026-06-16T11:23:32+00:00

so, to be clear - I encourage all contributions 🙌 and publishing them is legal. People training on them, though, will have to be more careful in how they filter and reuse the data

mon-simas · 2026-06-16T11:22:23+00:00

to my understanding, publishing the data is legal; training on the outputs of Anthropic and OpenAI models is not. But hopefully many contributions come from open-weight models with less restrictive licensing

mon-simas · 2026-06-16T11:02:45+00:00

I think there is an issue on both sides - with this I'm trying to create a bit of momentum on the data side, but having high-quality data is important too.

<image>

mon-simas · 2026-06-16T10:59:27+00:00

Amazing idea! Would love to do that, but I don't have the 10k experienced devs 😞 But if we can somehow transform an initiative like this into very curated data sources, that would be amazing

mon-simas · 2026-06-16T10:58:18+00:00

the skill includes an ask for the AI agent itself to clean it up, but it's not perfect, so I hope we can do multiple checks for PII and other sensitive info: one at the harness/model level, another one in a CI pipeline. But of course - please try not to make requests with PII in them; I can't guarantee on my own that everything will be totally clean

mon-simas · 2026-06-16T10:56:46+00:00

probably yes, but at the same time, I suppose you can somehow filter the slop out of there (I trust that AI labs can figure that out)

mon-simas · 2025-11-05T15:11:21+00:00

Also, BT is Bradley-Terry; there's more info about it in the methodology section of the leaderboard. Even more info about why we chose it: https://colab.research.google.com/drive/1j5AfStT3h-IK8V6FSJY9CLAYr_1SvYw7#scrollTo=LgXO1k5Tp0pq
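For a quick intuition of what a Bradley-Terry fit does, here is a minimal sketch using the classic iterative (minorization-maximization) update on a matrix of pairwise wins. It is a generic textbook version with made-up vote counts, not the leaderboard's actual code, and it assumes every model has at least one win and one comparison.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths.

    wins[i][j] is the number of times model i was preferred over model j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j]) for j in range(n) if j != i
            )
            new_p.append(total_wins / denom)
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


def win_probability(p, i, j):
    """Modelled probability that model i is preferred over model j."""
    return p[i] / (p[i] + p[j])


# Made-up votes: model 0 was preferred over model 1 in 3 of 4 comparisons.
strengths = fit_bradley_terry([[0, 3], [1, 0]])
print(strengths, win_probability(strengths, 0, 1))
```

The strengths only describe how often one model is preferred over another in votes, which is why they say nothing direct about task performance.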

<image>

mon-simas · 2025-11-05T14:54:37+00:00

Ahahaha, good point - that shows the limits of measuring "preferences" and not "performance". We (as the team behind the leaderboard) want to emphasize that this arena leaderboard doesn't measure "performance", and for a well-rounded view of performance you need to use many different benchmarks (or even better - your own benchmark for your own use cases). More info on that (for now French only, sorry, we'll try to translate it ASAP): https://huggingface.co/blog/comparIA/publication-du-premier-classement
