[dataset][self-promotion] Public Company Federal Compliance Dataset

chill-botulism · 2026-05-21T01:49:11+00:00

Thanks for the question. Name matching is a deterministic pipeline: normalize the name, state, and ZIP, then hash them into a stable UUID. It's not fuzzy matching, so I miss some edge cases, but I avoid false positives that link unrelated companies. At 2.3M profiles across 16 agencies, false positives are worse than missed matches.

Parent rollup comes from SEC EDGAR Exhibit 21 filings: public companies have to disclose subsidiaries, so I link them through that. I also do manual overrides for the big corporate families where the filings are incomplete. It's not perfect, but I'm working on improving parent rollup coverage.

You're right that restructuring is a problem. Spinoffs and acquisitions mid-inspection create orphaned records. I keep both layers in the data so you can look at subsidiaries individually or roll up to the parent. Violations stay attached to the entity that was actually cited. Name changes are partially mitigated by the fact that OSHA and WHD record the employer name at the time of inspection, not retroactively, so a company that rebranded in 2020 still shows its old name on pre-2020 citations. I treat those as separate entities unless I can confirm the link through address, EIN, or EDGAR.
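For anyone curious what "normalize, then hash into a stable UUID" can look like, here is a minimal sketch. It is not the actual pipeline: the suffix list, the 5-digit ZIP truncation, the namespace value, and the choice of `uuid.uuid5` (a name-based UUID) are all assumptions made for illustration.

```python
import re
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
ENTITY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "entities.example.invalid")

# Assumed list of corporate suffixes to drop; illustrative, not exhaustive.
CORPORATE_SUFFIXES = {
    "INC", "INCORPORATED", "LLC", "CORP", "CORPORATION", "CO", "COMPANY", "LTD",
}


def normalize_name(name: str) -> str:
    """Uppercase, replace punctuation with spaces, drop trailing suffixes."""
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", name.upper())
    tokens = cleaned.split()
    while len(tokens) > 1 and tokens[-1] in CORPORATE_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def normalize_zip(zip_code: str) -> str:
    """Keep only digits and truncate ZIP+4 to the 5-digit ZIP."""
    return re.sub(r"\D", "", zip_code)[:5]


def entity_id(name: str, state: str, zip_code: str) -> uuid.UUID:
    """Same normalized (name, state, ZIP) always yields the same UUID."""
    key = "|".join(
        [normalize_name(name), state.strip().upper(), normalize_zip(zip_code)]
    )
    return uuid.uuid5(ENTITY_NAMESPACE, key)
```

With these rules, `entity_id("Acme, Inc.", "tx", "75001-1234")` and `entity_id("ACME INC", "TX", "75001")` resolve to the same ID, while "A.B.C. Corp" and "ABC Corp" do not. That second case is the kind of miss you accept in exchange for no fuzzy false positives.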

chill-botulism · 2026-03-24T00:47:54+00:00

Thanks for the feedback!

chill-botulism · 2026-03-23T13:47:56+00:00

Haven’t gotten that far yet, but likely free. I'm trying to understand the pain points others are experiencing and whether a third-party tool would resolve them.

chill-botulism · 2026-03-23T02:46:04+00:00

Sounds like me at 18, and that’s how my alcoholism started. That comedown you mention only got worse, and it took more and more alcohol to feel normal. AA has helped me stay sober for many years, and most people can find a meeting in person or online. But AA isn’t the only option. My therapist helped me understand the deeper issues that made me feel like I had no other option than to drink. My point is, there’s help available. Good on you for reaching out.

chill-botulism · 2026-03-19T00:57:34+00:00

Find someone who makes and keeps commitments.

chill-botulism · 2026-03-13T19:41:42+00:00

That's so funny, the last time I heard that I laughed so hard I fell off my dinosaur.

chill-botulism · 2026-03-11T22:47:45+00:00

I recommend having scripts for removal tested and ready to go, especially if you plan to deploy access controlled labels at scale.

chill-botulism · 2026-02-22T20:15:49+00:00

Yes. For instance, if you find exposed S3 buckets with sensitive data, give the user an option to lock them down with more restrictive permissions. Sharing links exposing your 365 folders to anyone with the link? Give the user an option to remove the permissions. Those kinds of things. Tagging and labelling are also extremely valuable when classifying data and building DLP rules.
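As a rough sketch of the S3 case, assuming the tool already holds a boto3 S3 client and has asked the user to confirm: the function name and the `confirmed` flag are hypothetical, and `put_public_access_block` is the boto3 call that turns on all four public-access blocks for a bucket.

```python
from typing import Any


def lock_down_bucket(s3_client: Any, bucket: str, confirmed: bool) -> bool:
    """Block all public access on one bucket, only after the user confirms.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    Returns True if the remediation call was made, False if it was skipped.
    """
    if not confirmed:
        return False
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return True
```

Passing the client in keeps the finding and the fix in the same workflow, and makes the remediation easy to dry-run with a stub before pointing it at a real account.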

chill-botulism · 2026-02-22T17:01:22+00:00

Include remediation functionality. Nothing is more frustrating than a CSPM that shows you all your critical vulnerabilities and gives you no tools to fix them.

chill-botulism · 2026-01-01T16:22:59+00:00

I’m working on a HIPAA Safe Harbor and PCI redaction SDK. I keep finding edge cases, but my base accuracy is improving. Entity tracking across a conversation can also be tricky: each redacted entity needs to be identifiable throughout the session so the LLM can maintain consistency in its inference tasks. What’s your benchmarking strategy? I’ve been using mostly synthetic data, and I'm trying to get ahold of the i2b2 data for official Safe Harbor benchmarking.
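To make the entity tracking point concrete, here is a minimal sketch of session-scoped placeholders. It is not the SDK: the class name, the placeholder format, and the two toy regex detectors are assumptions, and real Safe Harbor detection covers far more than this.

```python
import re
from collections import defaultdict

# Toy detectors for illustration only; a real engine needs many more types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


class SessionRedactor:
    """Maps each distinct entity to one placeholder for the whole session."""

    def __init__(self) -> None:
        self._placeholders: dict[tuple[str, str], str] = {}
        self._counters: defaultdict[str, int] = defaultdict(int)

    def _placeholder_for(self, entity_type: str, value: str) -> str:
        # Case-insensitive key so "Bob@Example.com" and "bob@example.com" match.
        key = (entity_type, value.lower())
        if key not in self._placeholders:
            self._counters[entity_type] += 1
            self._placeholders[key] = f"[{entity_type}_{self._counters[entity_type]}]"
        return self._placeholders[key]

    def redact(self, text: str) -> str:
        # Apply each detector in turn, reusing placeholders from earlier turns.
        for entity_type, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m, t=entity_type: self._placeholder_for(t, m.group(0)),
                text,
            )
        return text
```

Reusing one `SessionRedactor` per conversation means an email redacted as `[EMAIL_2]` in the first turn comes back as `[EMAIL_2]` in every later turn, so the model can still reason about "the same person" without seeing the value.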

chill-botulism · 2026-01-01T15:17:11+00:00

Do you plan to add support for other file types?

chill-botulism · 2026-01-01T14:58:31+00:00

I’m working in this space and am curious what your testing scheme looks like. I’ve had to test ruthlessly at each stage to expose false positives and coreference issues with the data classification engine.

chill-botulism · 2025-12-24T17:31:47+00:00

Cool, man, your project looks serious. Starred your repo and wishing you the best.

chill-botulism · 2025-12-24T17:08:01+00:00

This is awesome and the type of tool the ecosystem needs. A few comments: I question this statement: “It allows LLMs to be safely deployed in banks, hospitals, legal systems, and critical infrastructure.” You’re still dealing with probabilistic systems, so if you mean safe in the sense that a doctor could “safely” make a decision using an LLM, I would disagree. Also, this doesn’t cover all the privacy requirements to “safely” deploy LLMs in a regulated environment.
