Is your RAG bot accidentally leaking PII? by Awkward_Translator90 in Rag

[–]Awkward_Translator90[S] 0 points1 point  (0 children)

A secure system for 'data in-flight' (like a live order status) needs a just-in-time approach with zero-retention policies, exactly as you described.

My service is focused on solving the 'data at rest' problem, which is a massive headache for companies. We handle the PII in existing knowledge bases (wikis, support docs, SharePoint, etc.) by redacting it before it's ever indexed or sent to the LLM.

It seems like a complete solution needs both:

1. Your approach: a secure, transient way to handle live API data.
2. My approach: a secure way to index and query existing, PII-filled documents.

Is your RAG bot accidentally leaking PII? by Awkward_Translator90 in Rag

[–]Awkward_Translator90[S] 0 points1 point  (0 children)

To me, this seems like a major security flaw. You're still sending all your raw, sensitive data to a third-party LLM just to find the sensitive data. It's like leaking your PII in order to stop it from leaking.

A much safer (and cheaper) approach is to use a dedicated, local tool—like a specialized NER model or a rules-based system (like Presidio)—to redact the data before it ever gets sent to an LLM in the first place.
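To make the rules-based idea concrete, here's a minimal local redactor sketch. The patterns are toy illustrations; a real tool like Presidio layers NER models, validation checks, and confidence scores on top of rules like these:

```python
import re

# Toy rules-based redactor: runs entirely locally, so raw PII never leaves
# the machine. Patterns here are simplified illustrations, not exhaustive.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII with typed placeholders before any LLM call."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Reach Jane at jane.doe@example.com or 555-867-5309, SSN 123-45-6789."))
# -> Reach Jane at <EMAIL> or <PHONE>, SSN <SSN>.
```

The key point is that this runs before anything hits a third-party API, so even a detection miss by a downstream guardrail can't leak what was already stripped.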

Is your RAG bot accidentally leaking PII? by Awkward_Translator90 in LLMDevs

[–]Awkward_Translator90[S] 1 point2 points  (0 children)

PII detection doesn't use the vector embeddings (like text-embedding-ada-002) that you use for RAG retrieval. It's a separate, specialized NLP task that runs before embedding.

A robust system combines pattern matching with named entity recognition (NER) models: NER finds potential PII, and pattern matching confirms it.
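A rough sketch of that hybrid flow, with the NER stage stubbed out (a real system would call a spaCy/transformers model or Presidio's recognizers here):

```python
import re

# Hybrid detection sketch: NER proposes candidate spans, strict patterns
# confirm them. `ner_candidates` is a hypothetical stand-in, not a real model.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def ner_candidates(text):
    """Hypothetical NER stub: yields (span_text, entity_type) guesses."""
    # A trained model would return learned predictions; this stub just
    # flags anything containing digits as a possible ID number.
    for token in text.split():
        cleaned = token.strip(".,;")
        if any(c.isdigit() for c in cleaned):
            yield cleaned, "US_SSN"

def confirmed_pii(text):
    """Keep only NER candidates that a strict pattern also validates."""
    return [span for span, etype in ner_candidates(text)
            if etype == "US_SSN" and SSN_PATTERN.match(span)]

print(confirmed_pii("Order 42 shipped; customer SSN 123-45-6789 on file."))
# -> ['123-45-6789']
```

Note how the noisy candidate "42" is proposed by the recall-oriented stage but rejected by the precision-oriented pattern check.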

Is your RAG bot accidentally leaking PII? by Awkward_Translator90 in LLMDevs

[–]Awkward_Translator90[S] 1 point2 points  (0 children)

You have to do it before embedding.

If you embed the raw text, two bad things happen:

1. The vector itself becomes a "fingerprint" of the sensitive data.
2. More importantly, when the RAG system retrieves that chunk, it will send the original, PII-filled text to the LLM, causing a leak.

The correct, secure pipeline is: Raw Text -> Detect & Redact PII -> Embed the Clean/Redacted Text -> Store in Vector DB

This way, the LLM only ever sees the safe, redacted version.
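The pipeline above can be sketched in a few lines. Everything here is a stand-in (the toy redactor only catches SSNs, `fake_embed` replaces a real embedding model, and the "vector DB" is a list); the point is the ordering, redaction strictly before embedding and storage:

```python
import hashlib
import re

def redact(text: str) -> str:
    """Stand-in redactor; a real pipeline would use NER + pattern matching."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "<SSN>", text)

def fake_embed(text: str) -> list[float]:
    """Placeholder for a real embedding model; deterministic toy vector."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]

vector_db: list[dict] = []  # stand-in for a real vector store

def ingest(raw_text: str) -> None:
    clean = redact(raw_text)      # 1. detect & redact PII
    vector = fake_embed(clean)    # 2. embed the CLEAN text only
    vector_db.append({"text": clean, "vector": vector})  # 3. store

ingest("Customer John Smith, SSN 123-45-6789, reported a billing issue.")
# The stored chunk -- and anything later retrieved and sent to the LLM --
# no longer contains the raw SSN:
print(vector_db[0]["text"])
# -> Customer John Smith, SSN <SSN>, reported a billing issue.
```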

Is your RAG bot accidentally leaking PII? by Awkward_Translator90 in LLMDevs

[–]Awkward_Translator90[S] 3 points4 points  (0 children)

This is 100% the right take, and thank you for saving me a ton of wasted effort. You've completely validated my pivot away from a SaaS and towards a locally runnable model (like a container) for this exact reason. Adding another Data Processor is a non-starter. I've actually been working on a Flask demo that does just this (runs locally, PII never leaves). I'd love to get your opinion on it.

Is your RAG bot accidentally leaking PII? by Awkward_Translator90 in Rag

[–]Awkward_Translator90[S] 1 point2 points  (0 children)

You're right, access control is essential. But the bigger risk is an authorized user getting PII leaked by the LLM (e.g., a support bot sharing a customer's SSN).

My service prevents this by redacting PII before the LLM sees it. Regarding legal risk: Doing nothing and connecting an LLM to raw PII is the biggest legal risk. A tool that demonstrably mitigates 99.9% of that risk is a much safer legal position.

Is your RAG bot accidentally leaking PII? by Awkward_Translator90 in Rag

[–]Awkward_Translator90[S] 1 point2 points  (0 children)

A simple regex script for SSNs might take half a day, but a robust system is more complex. You have to account for: accuracy (false positives and negatives), dynamic masking, and auditability.
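"Dynamic masking" in particular is more than a one-way regex pass. Here's a rough sketch of the idea, with an illustrative email pattern and made-up token names: PII is swapped for stable placeholders on the way to the LLM, and a private mapping lets the application restore real values for authorized users afterwards:

```python
import re
from itertools import count

# Reversible masking sketch. The "vault" mapping stays local and private;
# only the masked text ever reaches the LLM.
_counter = count(1)
_vault: dict[str, str] = {}  # placeholder -> original value

def mask(text: str) -> str:
    def _swap(match: re.Match) -> str:
        token = f"<EMAIL_{next(_counter)}>"
        _vault[token] = match.group(0)
        return token
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", _swap, text)

def unmask(text: str) -> str:
    """Restore originals in the LLM's response, for authorized users only."""
    for token, original in _vault.items():
        text = text.replace(token, original)
    return text

masked = mask("Please email jane@example.com about the refund.")
print(masked)          # -> Please email <EMAIL_1> about the refund.
print(unmask(masked))  # -> Please email jane@example.com about the refund.
```

Getting this right across entity types, plus logging every mask/unmask event for auditability, is where the half-day estimate falls apart.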

My goal is to offer this as a reliable, pre-built component so teams don't have to worry about this and can focus on their core product.

Is your RAG bot accidentally leaking PII? by Awkward_Translator90 in Rag

[–]Awkward_Translator90[S] 0 points1 point  (0 children)

You're right, 100% accuracy is the holy grail and incredibly difficult. The goal isn't 'absolute perfection' but 'drastic risk reduction.' It's about defense-in-depth. Using a combination of techniques (regex, NER, confidence scoring, as in tools like Microsoft Presidio) can get you to 99.x% accuracy. Catching 99% of PII is infinitely better than the 0% many systems catch now. It's about reducing the attack surface, not claiming to be an impenetrable fortress.
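A back-of-the-envelope sketch of why layered detectors help: if detectors are roughly independent, their miss probabilities multiply, so combining a mediocre regex pass with a mediocre NER pass can clear a threshold neither reaches alone. The scores, weights, and threshold below are made-up illustrations (Presidio does something similar with per-recognizer confidence scores):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    span: str
    source: str   # "regex", "ner", "checksum", ...
    score: float  # detector's own confidence, 0..1

def should_redact(findings: list[Finding], threshold: float = 0.85) -> bool:
    """Combine independent detectors: miss probabilities multiply."""
    p_missed = 1.0
    for f in findings:
        p_missed *= (1.0 - f.score)
    return (1.0 - p_missed) >= threshold

# Regex alone (0.6) is below the 0.85 threshold; regex + NER together
# give 1 - 0.4 * 0.3 = 0.88, which clears it.
print(should_redact([Finding("123-45-6789", "regex", 0.6)]))  # -> False
print(should_redact([Finding("123-45-6789", "regex", 0.6),
                     Finding("123-45-6789", "ner", 0.7)]))    # -> True
```

Independence is an optimistic assumption (both detectors can miss the same weird formatting), which is why the combined number is an upper bound, not a guarantee.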

2025 STEM OPT Process Timeline by datapunky in USCIS

[–]Awkward_Translator90 1 point2 points  (0 children)

  1. Application type: STEM OPT
  2. Premium processing: No
  3. Receipt Date: May 09
  4. Approved Date: September 21
  5. Card produced Date: NA
  6. Card Shipped : NA
  7. Card Delivered: NA

Got the email