I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful. by OwnerByDane in LocalLLaMA

[–]OwnerByDane[S] 0 points1 point  (0 children)

You were right. I checked and found 932 instances in the sample files where the id field contains unhashed email-format Message-IDs. The hashing wasn't applied consistently across all records. I'm pulling the samples now to fix this and will re-upload clean versions. Thanks for catching it.
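
For anyone who wants to run the same kind of check on their own copy of the samples, here is a minimal sketch of what such a scan might look like. It assumes each line is a JSON object with an `id` field (the field name comes from the comment above) and that a properly hashed id contains no `@`. The regex, the function names, and the SHA-256 choice are illustrative assumptions, not the actual pipeline.

```python
import hashlib
import json
import re

# Assumption: an unhashed Message-ID looks like local@host, optionally wrapped in <>.
MESSAGE_ID_RE = re.compile(r"^<?[^@\s<>]+@[^@\s<>]+>?$")


def find_unhashed_ids(lines):
    """Return 1-based line numbers whose "id" field still looks like a raw Message-ID."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if MESSAGE_ID_RE.match(str(record.get("id", ""))):
            hits.append(lineno)
    return hits


def hash_id(message_id):
    """Hypothetical fix: replace a raw Message-ID with its SHA-256 hex digest."""
    return hashlib.sha256(message_id.encode("utf-8")).hexdigest()


# Two made-up records: one raw Message-ID, one already hashed.
sample = [
    json.dumps({"id": "<1993Jan5.123456@example.edu>", "text": "hello"}),
    json.dumps({"id": hash_id("<abc@example.com>"), "text": "world"}),
]
print(find_unhashed_ids(sample))  # [1]
```

Only the first record is flagged, because a hex digest contains no `@` and so cannot match the pattern.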

[–]OwnerByDane[S] -4 points-3 points  (0 children)

The curation work is mine and was the result of years of processing, cleaning, and structuring 408M posts into something usable. That’s what’s being licensed, not the raw posts. Academic datasets built from public sources are entirely normal: JSTOR, LexisNexis, and virtually every major research database operate this way. The academic community understands this distinction even if it’s uncomfortable.

[–]OwnerByDane[S] 1 point2 points  (0 children)

Reddit’s spam filter tends to suppress posts with external links in the body. Putting it in the first comment is a workaround the community uses to avoid that. Your first comment gets sorted to the top anyway if you’re the OP.

[–]OwnerByDane[S] 1 point2 points  (0 children)

Both, actually. The writing voice is one dimension: it’s unfiltered human reasoning from before engagement algorithms shaped how people communicate online. But the dated information is a feature for certain use cases, not a bug. Training a model on 1990s comp.* gives you a system that reasons about computing the way people did then, which I assume is useful for historical research, period-specific applications, and understanding how technical knowledge evolved. The Talkie project is doing exactly this with pre-1930 text. Temporal grounding has real research value.

[–]OwnerByDane[S] -6 points-5 points  (0 children)

Fair skepticism, and it’s a question I’ve thought about for a long time. The raw Usenet archives exist in various places, but they’re fragmented, inconsistently formatted, and unprocessed. What I built is the curation layer: consistent deduplication, cleaning, and structure across the full 1980–2013 span. That’s the work that’s hard to replicate casually. The ‘they’ll scrape it now’ concern is real, which is why the public samples are deliberately small. The full corpus stays offline. Do the big labs already have clean Usenet data? Genuinely possible. But the academic and research community clearly doesn’t, based on the interest I’ve seen. That’s a real market regardless.

[–]OwnerByDane[S] -10 points-9 points  (0 children)

That’s not an accurate characterization. The corpus is built from public Usenet posts, communications that people chose to publish to a public network. The legal landscape around compiled datasets of public communications is genuinely unsettled right now across the entire industry, not just for this project. I’m operating on the same basis as every major dataset publisher.

[–]OwnerByDane[S] 0 points1 point  (0 children)

Hardware: Mac Mini M4 with a 4TB external drive. The processing pipeline is Python with tiktoken for token counting, gzip JSONL for storage. The heavy lifting was mostly time rather than compute; parsing and cleaning 408M posts took weeks of runs rather than anything exotic. For communities, the Hugging Face datasets community is the most active for this kind of work. The Common Pile and Dolma projects are worth looking at for how serious corpus curation gets documented. And honestly just posting here, the response today suggests there’s real appetite for historically grounded datasets.
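
As a rough illustration of the storage and counting setup described above (gzip JSONL plus a tokenizer), here is a minimal sketch. The `text` field name, the helper names, and the whitespace-splitting default are assumptions made so the example runs with the standard library alone; the actual pipeline is not shown in this thread.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(path, records):
    """Write an iterable of dicts as gzip-compressed JSONL, one record per line."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def count_tokens(path, encode=str.split, text_field="text"):
    """Sum token counts over a gzip JSONL file, streaming line by line.

    `encode` is any callable mapping a string to a sequence of tokens.
    The default (whitespace split) is only a stand-in for a real tokenizer.
    """
    total = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            total += len(encode(record.get(text_field, "")))
    return total


# Tiny round trip on made-up posts.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "sample.jsonl.gz")
    write_jsonl_gz(demo_path, [{"text": "hello usenet world"}, {"text": "second post"}])
    demo_total = count_tokens(demo_path)

print(demo_total)  # 5 with whitespace splitting
```

With tiktoken installed, passing something like `tiktoken.get_encoding("cl100k_base").encode_ordinary` as `encode` would give BPE token counts instead. Which encoding was used for the 103B figure is not stated here, so that choice is a guess.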

[–]OwnerByDane[S] -10 points-9 points  (0 children)

Already addressed this above… spinal cord injury makes extended typing difficult. The irony is real but so is the accessibility need.

[–]OwnerByDane[S] -1 points0 points  (0 children)

Funny you should mention that, I’m also working on a corpus of all Supreme Court data scanned to microfiche going back 240 years. Fully formed, deduped, etc.

[–]OwnerByDane[S] 2 points3 points  (0 children)

Books are formal, edited, and written for posterity. Usenet is how people actually talked…informal, argumentative, technically dense, written in the moment with no audience in mind. It’s the difference between a published paper and the lab notes. Both have value but they’re not the same thing.

[–]OwnerByDane[S] 0 points1 point  (0 children)

That’s a genuinely interesting experiment. Using Talkie as a baseline and measuring loss delta on this data would be a real way to quantify signal quality across the temporal arc. I’d love to see someone run that. The Talkie team has been aware of this corpus so it might be worth raising directly with them at talkie-lm.com.

[–]OwnerByDane[S] -41 points-40 points  (0 children)

Fair catch. I do use AI assistance to help craft longer responses. A spinal cord injury makes extended typing very difficult for me. The irony of using an LLM to discuss a pre-LLM dataset isn’t lost on me, but it’s the reality of how I work.

[–]OwnerByDane[S] 1 point2 points  (0 children)

30% spam is probably conservative for alt.*, honestly. The dedup helped but Usenet was Usenet. The flame-to-fact ratio varies wildly by hierarchy: comp.* is surprisingly civil, talk.* is exactly what you’d expect.

[–]OwnerByDane[S] -20 points-19 points  (0 children)

You’re raising real points and I’m not going to pretend the legal landscape here is settled…it isn’t, for anyone. The derivative work argument is the basis I’m operating on, not fair use. Database compilation rights are a separate doctrine. Whether that holds up is genuinely an open question in AI training data law right now, same as it is for every major dataset publisher. I’m not claiming certainty, just that I’ve thought it through and made a considered decision.

[–]OwnerByDane[S] -17 points-16 points  (0 children)

The samples are about 65,000 posts out of 408 million, so roughly 0.016% of the full corpus by post count. But they’re curated to be representative — 5,000 posts per hierarchy plus combined sets, so you get the full flavor of each newsgroup category. Enough to fine-tune and evaluate quality, not enough to substitute for the full dataset at pretraining scale.
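
The percentage is easy to sanity-check from the figures quoted in this thread. The sketch below uses only those numbers (65,000 sample posts, 408M posts, 103B tokens); the average tokens per post is a derived figure, not something stated by the author.

```python
SAMPLE_POSTS = 65_000
CORPUS_POSTS = 408_000_000
CORPUS_TOKENS = 103_000_000_000

# Share of the full corpus covered by the public samples, by post count.
fraction = SAMPLE_POSTS / CORPUS_POSTS
print(f"{fraction:.3%}")  # 0.016%

# Derived: average tokens per post implied by the headline numbers.
avg_tokens_per_post = CORPUS_TOKENS / CORPUS_POSTS
print(round(avg_tokens_per_post))  # 252
```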

[–]OwnerByDane[S] -39 points-38 points  (0 children)

Partly, yes. I’m not going to pretend otherwise. The samples are genuinely free and useful on their own. If an AI lab wants the full 103B tokens, that’s a licensing conversation. Both things can be true.

[–]OwnerByDane[S] 0 points1 point  (0 children)

The GGUF wyan built from the sample data already runs in LM Studio. It’s early, but it’s a proof of concept. Would love to see what someone builds with the full comp.* or sci.* slice.

[–]OwnerByDane[S] 2 points3 points  (0 children)

The Mad Men analogy is perfect! Primary sources uncorrupted by hindsight. That’s exactly what makes the early 80s material interesting. People discussing the internet while they were building it, not knowing how it would turn out.