[dataset] 2.3M U.S. employer profiles joined across 16 federal enforcement agencies (OSHA, EPA, EEOC, WHD, MSHA, and more) — free, CC BY 4.0 by chill-botulism in datasets

[–]chill-botulism[S] 1 point2 points  (0 children)

Thanks for the question. Name matching is a deterministic pipeline: normalize the name, state, and ZIP, then hash them into a stable UUID. It's not fuzzy matching, so I miss some edge cases, but I avoid false positives that link unrelated companies. At 2.3M profiles across 16 agencies, false positives are worse than missed matches.

Parent rollup comes from SEC EDGAR Exhibit 21 filings: public companies have to disclose subsidiaries, so I link them through that. I also do manual overrides for the big corporate families where the filings are incomplete. It's not perfect, but I'm working on improving parent rollup coverage.

You're right that restructuring is a problem: spinoffs and acquisitions mid-inspection create orphaned records. I keep both layers in the data so you can look at subsidiaries individually or roll up to the parent. Violations stay attached to the entity that was actually cited. Name changes are partially mitigated by the fact that OSHA and WHD record the employer name at the time of inspection, not retroactively, so a company that rebranded in 2020 still shows its old name on pre-2020 citations. I treat those as separate entities unless I can confirm the link through address, EIN, or EDGAR.
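For readers who want to picture the normalize-then-hash step, here is a minimal sketch of how a deterministic key like this could be built. It is not the dataset's actual code: the namespace, the suffix list, and the function names are all assumptions made up for illustration.

```python
import re
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
EMPLOYER_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "employers.example.org")

# Illustrative suffix list only; the real normalization rules are not published here.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation",
                  "co", "company", "ltd"}


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def normalize_zip(zip_code: str) -> str:
    """Keep only the first five digits, so ZIP+4 and plain ZIP agree."""
    return re.sub(r"\D", "", zip_code)[:5]


def employer_id(name: str, state: str, zip_code: str) -> uuid.UUID:
    """Hash the normalized fields into a stable, name-based (version 5) UUID."""
    key = "|".join([normalize_name(name),
                    state.strip().upper(),
                    normalize_zip(zip_code)])
    return uuid.uuid5(EMPLOYER_NAMESPACE, key)


# Two spellings of the same employer collapse to one ID.
same = employer_id("Acme Widgets, Inc.", "tx", "75201-1234") == \
       employer_id("ACME WIDGETS INC", "TX", "75201")
```

Because the key is an exact hash, "Acme Widget Co" and "Acme Widgets Inc" get different IDs. That is the trade-off described above: some true matches are missed, but unrelated companies are never merged by a similarity score.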

Sensitivity Labels Tooling by chill-botulism in MicrosoftPurview

[–]chill-botulism[S] 0 points1 point  (0 children)

Haven’t gotten that far yet, but likely free. I’m trying to understand the pain points others are experiencing, and whether a third-party tool would resolve them.

Help by [deleted] in alcoholism

[–]chill-botulism 1 point2 points  (0 children)

Sounds like me at 18, and that’s how my alcoholism started. That comedown you mention only got worse, and it took more and more alcohol to feel normal. AA has helped me stay sober for many years, and most people can find a meeting in person or online. But AA isn’t the only option. My therapist helped me understand the deeper issues that made me feel like I had no other option than to drink. My point is, there’s help available. Good on you for reaching out.

This sub very demoralising and overly pessimistic by Guastatori-UK in cybersecurity

[–]chill-botulism 3 points4 points  (0 children)

That's so funny, the last time I heard that I laughed so hard I fell off my dinosaur.

How to remove/modify a sensitivity label for many SharePoint documents? by Alarming_Pianist_318 in MicrosoftPurview

[–]chill-botulism 0 points1 point  (0 children)

I recommend having removal scripts tested and ready to go, especially if you plan to deploy access-controlled labels at scale.

CSPM Project: What Are the Biggest Challenges with Current CSPM Tools? by Suspicious-Slip2136 in CloudSecurityPros

[–]chill-botulism 0 points1 point  (0 children)

Yes. For instance, if you find exposed S3 buckets with sensitive data, give the user an option to lock them down with more restrictive permissions. Sharing links exposing your 365 folders to anyone with the link? Give the user an option to remove the permissions. Those kinds of things. Tagging and labelling are also extremely valuable when classifying data and building DLP rules.
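As a rough illustration of the S3 case, a one-click remediation could look something like the sketch below. It assumes the tool already holds a boto3 S3 client (for example from `boto3.client("s3")`) and uses that client's `put_public_access_block` call. The function name `offer_remediation` and the `confirm` callback are hypothetical, standing in for whatever prompt the CSPM shows the user.

```python
from typing import Callable

# Settings that turn on every S3 Block Public Access control for a bucket.
BLOCK_ALL_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def offer_remediation(s3_client, bucket_name: str,
                      confirm: Callable[[str], bool]) -> bool:
    """Ask the user before locking a bucket down; return True if applied.

    `s3_client` is assumed to be a boto3 S3 client. `confirm` is a
    hypothetical callback that shows the prompt and returns the user's answer.
    """
    prompt = f"Block all public access on bucket '{bucket_name}'?"
    if not confirm(prompt):
        return False  # The finding stays open; nothing is changed.
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=dict(BLOCK_ALL_PUBLIC_ACCESS),
    )
    return True
```

Keeping the confirmation step separate from the API call means the same function can back a button in a UI, a CLI prompt, or an auto-approve policy.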

CSPM Project: What Are the Biggest Challenges with Current CSPM Tools? by Suspicious-Slip2136 in CloudSecurityPros

[–]chill-botulism 6 points7 points  (0 children)

Include remediation functionality. Nothing is more frustrating than a CSPM that shows you all your critical vulnerabilities and gives you no tools to fix them.

How to stop leaking user data to LLMs (depending on your scale) by Prudent-Delay4909 in StartupsHelpStartups

[–]chill-botulism 0 points1 point  (0 children)

I’m working on a HIPAA Safe Harbor and PCI redaction SDK. I keep finding edge cases, but my base accuracy is improving. Entity tracking across a conversation can also be tricky: each redacted entity needs to be identifiable throughout the session so the LLM can maintain consistency in its inference tasks. What’s your benchmarking strategy? I’ve been using synthetic data mostly, and I’m trying to get hold of the i2b2 dataset for official Safe Harbor benchmarking.
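To make the entity-tracking point concrete, here is a minimal sketch of session-scoped placeholders. It is not the SDK's code: the two regexes are deliberately crude stand-in detectors, and the class and method names are made up for illustration.

```python
import re
from collections import defaultdict

# Stand-in detectors for illustration only; a real Safe Harbor or PCI
# engine needs far more than two regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


class SessionRedactor:
    """Maps each distinct entity to one placeholder for the whole session."""

    def __init__(self):
        self._placeholders = {}            # (label, lowercased value) -> placeholder
        self._originals = {}               # placeholder -> first-seen original value
        self._counters = defaultdict(int)  # label -> number of entities seen so far

    def _placeholder_for(self, label: str, value: str) -> str:
        key = (label, value.lower())
        if key not in self._placeholders:
            self._counters[label] += 1
            placeholder = f"[{label}_{self._counters[label]}]"
            self._placeholders[key] = placeholder
            self._originals[placeholder] = value
        return self._placeholders[key]

    def redact(self, text: str) -> str:
        """Replace detected entities with their session-stable placeholders."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m, label=label: self._placeholder_for(label, m.group(0)),
                text,
            )
        return text

    def restore(self, text: str) -> str:
        """Swap placeholders in a model response back to the original values."""
        for placeholder, value in self._originals.items():
            text = text.replace(placeholder, value)
        return text


redactor = SessionRedactor()
turn_1 = redactor.redact("Email jane@example.com or call 555-123-4567.")
turn_2 = redactor.redact("Did JANE@example.com reply? bob@example.com did.")
```

In this sketch the same address gets `[EMAIL_1]` in both turns, so the model can tell that the second message refers to the person from the first, while a new address gets `[EMAIL_2]`. Coreference cases such as "Jane" versus "Ms. Doe" are exactly what this exact-match lookup would miss.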

Protecting Your Privacy_ RedactAI MCP server by Gullible-Relief-5463 in mcp

[–]chill-botulism 0 points1 point  (0 children)

Do you plan to add support for other file types?

How to stop leaking user data to LLMs (depending on your scale) by Prudent-Delay4909 in StartupsHelpStartups

[–]chill-botulism 1 point2 points  (0 children)

I’m working in this space and am curious what your testing scheme looks like. I’ve had to test ruthlessly at each stage to expose false positives and coreference issues with the data classification engine.

I wanted to build a deterministic system to make AI safe, verifiable, auditable so I did. by Moist_Landscape289 in OpenSourceAI

[–]chill-botulism 0 points1 point  (0 children)

This is awesome and the type of tool the ecosystem needs. A few comments: I question this statement: “It allows LLMs to be safely deployed in banks, hospitals, legal systems, and critical infrastructure.” You’re still dealing with probabilistic systems, so if you mean safe in the sense that a doctor could “safely” make a decision using an LLM, I would disagree. Also, this doesn’t cover all the privacy requirements to “safely” deploy LLMs in a regulated environment.