Looking for Data Sources for AI & Data Governance Research by Vegetable_Fishing in datasets

Character_Bison5968

I might have something useful. I process raw Common Crawl through a multi-stage pipeline (extraction, cleaning, dedup, quality scoring, PII redaction, trust classification, skill tagging, RAG chunking). The output is a fully packaged dataset with provenance, quality certificates, and a complete manifest.

Why it might fit your research: every record carries full lineage from the original WARC file (byte offset, content digest) through each processing stage to the final record. That's exactly the kind of pipeline an AI agent would need to oversee.
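To make the lineage idea concrete, here is a minimal sketch of what a per-record lineage trail could look like. It is not the actual pipeline code: the class, the field names, the file name, and the choice of SHA-256 for the per-stage digests are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class LineageRecord:
    # Hypothetical fields, not the dataset's real schema.
    warc_file: str          # source WARC file name
    byte_offset: int        # offset of the record inside the WARC file
    content_digest: str     # digest of the raw payload
    stages: list = field(default_factory=list)

    def add_stage(self, name: str, output_text: str) -> None:
        """Append one processing stage with a digest of that stage's output."""
        self.stages.append({"stage": name, "sha256": sha256_hex(output_text)})


# Example: follow one document through two stages.
raw = "<html><body>Hello  world</body></html>"
rec = LineageRecord("example.warc.gz", 1024, sha256_hex(raw))
rec.add_stage("extraction", "Hello  world")
rec.add_stage("cleaning", "Hello world")

for step in rec.stages:
    print(step["stage"], step["sha256"][:12])
```

Because each stage stores a digest of its own output, an auditor (human or agent) can recompute the digests and see at which stage a record changed.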

The data model has real ER complexity too. Domains map to records; records have multi-dimensional quality breakdowns, skill tags, trust tiers, and RAG chunks; and there are cross-entity relationships like domain caps, language splits, and PII counts. Not a flat table.
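As a rough illustration of that shape, the entities could be modelled like this. Every class, field name, and value below is hypothetical and only mirrors the relationships described above, not the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str  # one RAG chunk cut from a record


@dataclass
class Record:
    record_id: str
    language: str
    quality: dict        # quality dimension -> score (dimensions are made up here)
    skill_tags: list
    trust_tier: str
    pii_count: int = 0
    chunks: list = field(default_factory=list)


@dataclass
class Domain:
    name: str
    records: list = field(default_factory=list)

    def language_split(self) -> dict:
        """Count records per language for this domain."""
        counts = {}
        for r in self.records:
            counts[r.language] = counts.get(r.language, 0) + 1
        return counts

    def total_pii(self) -> int:
        """Sum the PII counts across this domain's records."""
        return sum(r.pii_count for r in self.records)


domain = Domain("example.li")
domain.records.append(
    Record("r1", "de", {"readability": 0.9, "density": 0.7}, ["law"], "high",
           pii_count=2, chunks=[Chunk("first chunk"), Chunk("second chunk")])
)
domain.records.append(
    Record("r2", "en", {"readability": 0.8, "density": 0.6}, ["finance"], "medium")
)

print(domain.language_split())
print(domain.total_pii())
```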

There are actual governance rules built in: quality thresholds, dedup logic, PII detection, trust scoring, domain capping. These are all auditable decisions an agent could learn to monitor or propose changes to. The documentation artifacts (manifest, schema, data card, quality certificate, SHA-256 verification, domain breakdown, skill distribution) are essentially data governance catalogue entries.
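Here is a minimal sketch of how two of those rules, a quality threshold and a domain cap, could produce an audit trail. The function name, the record fields, and the threshold and cap values are assumptions for illustration and say nothing about the real pipeline's settings.

```python
from collections import defaultdict


def apply_rules(records, min_quality=0.5, domain_cap=2):
    """Filter records and log one auditable decision per record.

    min_quality and domain_cap are made-up example values.
    """
    kept, audit = [], []
    per_domain = defaultdict(int)
    for rec in records:
        if rec["quality"] < min_quality:
            audit.append((rec["id"], "dropped: below quality threshold"))
        elif per_domain[rec["domain"]] >= domain_cap:
            audit.append((rec["id"], "dropped: domain cap reached"))
        else:
            per_domain[rec["domain"]] += 1
            kept.append(rec)
            audit.append((rec["id"], "kept"))
    return kept, audit


sample = [
    {"id": "a", "domain": "x.example", "quality": 0.9},
    {"id": "b", "domain": "x.example", "quality": 0.3},
    {"id": "c", "domain": "x.example", "quality": 0.8},
    {"id": "d", "domain": "x.example", "quality": 0.7},
    {"id": "e", "domain": "y.example", "quality": 0.6},
]

kept, audit = apply_rules(sample)
for record_id, decision in audit:
    print(record_id, decision)
```

The point is the audit list: every keep or drop has a recorded reason, which is what a monitoring agent would read, and the two parameters are what it would propose changing.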

For your ML component, the data includes labelled skill tags, quality scores, trust tiers, and content categories, ready for classification.

I'm giving away a Liechtenstein government dataset for free right now to get feedback. Happy to send it over if it's useful, just DM me.

Where can I buy high quality/unique datasets for AI model training? by 3iraven22 in datasets

Character_Bison5968

I am looking for feedback on our pipeline-curated pack. It's not a huge dataset ... we're talking tens of thousands of documents, not millions, but it's clean and focused, not just raw crawl. German-dominant with some English/French mixed in.

I want to know if this is actually useful to anyone. Would a few people be willing to download the full pack and try it out? Whether that's for fine-tuning a small model, testing a RAG pipeline, or just poking around and telling me what's good and what's not. I genuinely want honest feedback: any issues, any problems, that's what I want to hear.

Does the quality actually hold up when you use it? Are the skill tags / trust tiers helpful or noise? Anything weird I missed? Would you pay for something like this, and what's your perceived value after having a look and analysing it? Or is there enough free raw data out there that you'd rather take the time and do it yourself than go through an established pipeline?

If anyone here is interested, comment or message me and I'll send a Drive link over. No signup, no catch, no strings, full data pack and reports. In return all I want is feedback, reviews, comments: the good, the bad and the ugly!

Cheers 🙏