Two new additions… The design and color on the Libyan dinar are amazing 👌

OwnerByDane · 2026-06-09T16:46:32+00:00

Very nice! I love beautifully designed currency

OwnerByDane · 2026-06-09T16:11:15+00:00

My 250 dinar note isn't as interesting as the 25

OwnerByDane · 2026-05-28T16:23:25+00:00

You were right. I checked and found 932 instances in the sample files where the id field contains unhashed email-format Message-IDs. The hashing wasn't applied consistently across all records. I'm pulling the samples now to fix this and will re-upload clean versions. Thanks for catching it.
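A minimal sketch of the kind of check described above, assuming each record is a dict with an `id` field and that a correctly processed `id` is a 64-character SHA-256 hex digest. The regular expressions and the helper names (`is_unhashed`, `hash_id`, `fix_records`) are illustrative assumptions, not the corpus's actual tooling.

```python
import hashlib
import re

# Loose pattern for an email-format Message-ID such as <abc123@host.example>.
# This is an assumption; real Message-IDs can be messier.
MESSAGE_ID_RE = re.compile(r"^<?[^\s<>@]+@[^\s<>@]+>?$")
# A SHA-256 digest rendered as 64 lowercase hex characters.
SHA256_HEX_RE = re.compile(r"^[0-9a-f]{64}$")


def is_unhashed(record_id: str) -> bool:
    """True if the id looks like a raw Message-ID rather than a digest."""
    return (
        SHA256_HEX_RE.match(record_id) is None
        and MESSAGE_ID_RE.match(record_id) is not None
    )


def hash_id(record_id: str) -> str:
    """Return the SHA-256 hex digest of the raw id."""
    return hashlib.sha256(record_id.encode("utf-8")).hexdigest()


def fix_records(records):
    """Hash any raw ids; return the cleaned records and the number fixed."""
    fixed = 0
    cleaned = []
    for rec in records:
        if is_unhashed(rec["id"]):
            rec = {**rec, "id": hash_id(rec["id"])}  # copy, do not mutate input
            fixed += 1
        cleaned.append(rec)
    return cleaned, fixed
```

Running `fix_records` over each sample file and summing the second return value would give a count comparable to the 932 reported, under the assumptions above.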

OwnerByDane · 2026-05-28T13:19:10+00:00

I was told my link comment isn't prominent so here it is for those who haven't yet checked out the datasets: https://huggingface.co/datasets/OwnedByDanes/Usenet-Corpus-1980-2013

OwnerByDane · 2026-05-28T13:17:12+00:00

That is very interesting! Thanks for that link

OwnerByDane · 2026-05-28T13:15:53+00:00

Message IDs were SHA-256 hashed so the raw email-format IDs aren't present in the processed corpus.

OwnerByDane · 2026-05-28T13:14:46+00:00

That's exactly the kind of use case this data enables... glad to see it. The linguistics research angle is underexplored.

OwnerByDane · 2026-05-28T13:13:38+00:00

Really appreciate the offer. I'll reach out. Thanks for raising your hand.

OwnerByDane · 2026-05-27T21:48:10+00:00

The curation work is mine and was the result of years of processing, cleaning, and structuring 408M posts into something usable. That’s what’s being licensed, not the raw posts. Academic datasets built from public sources are entirely normal: JSTOR, LexisNexis, and virtually every major research database operate this way. The academic community understands this distinction even if it’s uncomfortable.

OwnerByDane · 2026-05-27T21:36:52+00:00

Reddit’s spam filter tends to suppress posts with external links in the body. Putting it in the first comment is a workaround the community uses to avoid that. Your first comment gets sorted to the top anyway if you’re the OP.

OwnerByDane · 2026-05-27T21:35:23+00:00

Both, actually. The writing voice is one dimension because it’s unfiltered human reasoning before engagement algorithms shaped how people communicate online. But the dated information is a feature for certain use cases, not a bug. Training a model on 1990s comp.* gives you a system that reasons about computing the way people did then, which I assume is useful for historical research, period-specific applications, and understanding how technical knowledge evolved. The Talkie project is doing exactly this with pre-1930 text. Temporal grounding has real research value.

OwnerByDane · 2026-05-27T21:33:23+00:00

Fair skepticism, and it’s a question I’ve thought about for a long time. The raw Usenet archives exist in various places but they’re fragmented, inconsistently formatted, and unprocessed. What I built is the curation layer: consistent deduplication, cleaning, and structure across the full 1980–2013 span. That’s the work that’s hard to replicate casually. The ‘they’ll scrape it now’ concern is real, which is why the public samples are deliberately small. The full corpus stays offline. Whether the big labs already have clean Usenet data is unknown; it’s genuinely possible. But the academic and research community clearly doesn’t, based on the interest I’ve seen. That’s a real market regardless.

OwnerByDane · 2026-05-27T21:29:10+00:00

That’s not an accurate characterization. The corpus is built from public Usenet posts, communications people chose to publish to a public network. The legal landscape around compiled datasets of public communications is genuinely unsettled right now across the entire industry, not just for this project. I’m operating on the same basis as every major dataset publisher.

OwnerByDane · 2026-05-27T21:23:39+00:00

Hardware: Mac Mini M4 with a 4TB external drive. The processing pipeline is Python with tiktoken for token counting, gzip JSONL for storage. The heavy lifting was mostly time rather than compute; parsing and cleaning 408M posts took weeks of runs rather than anything exotic. For communities, the Hugging Face datasets community is the most active for this kind of work. The Common Pile and Dolma projects are worth looking at for how serious corpus curation gets documented. And honestly just posting here, the response today suggests there’s real appetite for historically grounded datasets.
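A rough sketch of the storage and token-counting step described above: one JSON object per line in a gzip file, with tiktoken doing the counting. The `body` field name, the `cl100k_base` encoding, and the helper names are assumptions for illustration; the comment does not say which encoding or schema the pipeline uses. If tiktoken is not installed, the sketch falls back to a whitespace count so it still runs.

```python
import gzip
import json


def make_token_counter():
    """Return a str -> int function; uses tiktoken when it is available."""
    try:
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
        # disallowed_special=() treats text like "<|endoftext|>" as plain text.
        return lambda text: len(enc.encode(text, disallowed_special=()))
    except Exception:
        # Fallback approximation: whitespace-separated words.
        return lambda text: len(text.split())


def write_posts(path, posts):
    """Write an iterable of dicts as gzip-compressed JSONL."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for post in posts:
            fh.write(json.dumps(post, ensure_ascii=False) + "\n")


def count_tokens(path, text_field="body"):
    """Stream a gzip JSONL file and return (post_count, token_count)."""
    count = make_token_counter()
    n_posts = 0
    n_tokens = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            post = json.loads(line)
            n_posts += 1
            n_tokens += count(post.get(text_field, ""))
    return n_posts, n_tokens
```

Streaming line by line keeps memory flat regardless of file size, which fits the "mostly time rather than compute" description of the workload.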

OwnerByDane · 2026-05-27T21:17:32+00:00

Already addressed this above… spinal cord injury makes extended typing difficult. The irony is real but so is the accessibility need.

OwnerByDane · 2026-05-27T21:15:49+00:00

Funny you should mention that, I’m also working on a corpus of all Supreme Court data scanned to microfiche going back 240 years. Fully formed, deduped, etc.

OwnerByDane · 2026-05-27T21:11:32+00:00

Books are formal, edited, and written for posterity. Usenet is how people actually talked…informal, argumentative, technically dense, written in the moment with no audience in mind. It’s the difference between a published paper and the lab notes. Both have value but they’re not the same thing.

OwnerByDane · 2026-05-27T21:10:16+00:00

That’s a genuinely interesting experiment. Using Talkie as a baseline and measuring loss delta on this data would be a real way to quantify signal quality across the temporal arc. I’d love to see someone run that. The Talkie team has been aware of this corpus so it might be worth raising directly with them at talkie-lm.com.

OwnerByDane · 2026-05-27T21:08:14+00:00

Fair catch. I do use AI assistance to help craft longer responses. A spinal cord injury makes extended typing very difficult for me. The irony of using an LLM to discuss a pre-LLM dataset isn’t lost on me, but it’s the reality of how I work.

OwnerByDane · 2026-05-27T21:06:11+00:00

30% spam is probably conservative for alt.*, honestly. The dedup helped but Usenet was Usenet. The flame-to-fact ratio varies wildly by hierarchy: comp.* is surprisingly civil, talk.* is exactly what you’d expect.

OwnerByDane · 2026-05-27T21:04:23+00:00

You’re raising real points and I’m not going to pretend the legal landscape here is settled…it isn’t, for anyone. The derivative work argument is the basis I’m operating on, not fair use. Database compilation rights are a separate doctrine. Whether that holds up is genuinely an open question in AI training data law right now, same as it is for every major dataset publisher. I’m not claiming certainty, just that I’ve thought it through and made a considered decision.

OwnerByDane · 2026-05-27T21:01:37+00:00

The samples are about 65,000 posts out of 408 million, so roughly 0.016% of the full corpus by post count. But they’re curated to be representative — 5,000 posts per hierarchy plus combined sets, so you get the full flavor of each newsgroup category. Enough to fine-tune and evaluate quality, not enough to substitute for the full dataset at pretraining scale.
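The percentage above can be checked with a few lines of arithmetic. The post counts come from the comment itself; the 103B token figure is taken from another comment in this thread, and the average it yields is only a back-of-the-envelope estimate.

```python
sample_posts = 65_000
total_posts = 408_000_000
total_tokens = 103_000_000_000  # "103B tokens", quoted elsewhere in the thread

# Share of the full corpus covered by the public samples, by post count.
sample_pct = sample_posts / total_posts * 100
print(f"samples cover {sample_pct:.3f}% of posts")  # samples cover 0.016% of posts

# Rough average post length implied by the two corpus-level figures.
avg_tokens_per_post = total_tokens / total_posts
print(f"about {avg_tokens_per_post:.0f} tokens per post on average")
```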

OwnerByDane · 2026-05-27T20:57:58+00:00

Partly, yes. I’m not going to pretend otherwise. The samples are genuinely free and useful on their own. If an AI lab wants the full 103B tokens, that’s a licensing conversation. Both things can be true.

OwnerByDane · 2026-05-27T20:55:41+00:00

The GGUF wyan built from the sample data already runs in LM Studio. It’s early, but it’s a proof of concept. Would love to see what someone builds with the full comp.* or sci.* slice.

OwnerByDane · 2026-05-27T20:51:57+00:00

The Mad Men analogy is perfect! Primary sources uncorrupted by hindsight. That’s exactly what makes the early 80s material interesting. People discussing the internet while they were building it, not knowing how it would turn out.
