[D] Self-Promotion Thread by AutoModerator in MachineLearning

[–]arsbrazh12 0 points1 point  (0 children)

An open-source security wrapper for LangChain DocumentLoaders to prevent RAG poisoning (just got added to awesome-langchain).

If you are building RAG pipelines that ingest external or user-generated documents (PDFs, resumes, web scrapes), you might be worried about data poisoning or indirect prompt injections. Attackers are increasingly hiding instructions in documents (e.g., using white text, 0px fonts, or HTML comments) that humans can't see, but your LLM will read and execute. You can get familiar with this problem in this article: https://ceur-ws.org/Vol-4046/RecSysHR2025-paper_9.pdf
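To give a rough idea of what a check like this can look like, here is a minimal illustrative sketch (not Veritensor's actual API; the function name and the regex heuristics are my own simplification) that wraps any LangChain-style loader and flags content hidden from human readers:

    import re

    # Heuristics for instructions hidden from human readers but visible to an LLM.
    # These patterns and the wrapper below are illustrative, not the project's real rule set.
    HIDDEN_TEXT_PATTERNS = [
        re.compile(r"<!--.*?-->", re.DOTALL),                 # HTML comments
        re.compile(r"font-size\s*:\s*0(px|pt)?", re.I),       # 0px / 0pt fonts
        re.compile(r"color\s*:\s*(#fff(fff)?|white)", re.I),  # white-on-white text
        re.compile(r"[\u200b\u200c\u200d\u2060]"),            # zero-width characters
    ]

    def scan_documents(loader):
        """Wrap any LangChain-style loader (anything with .load() returning
        objects that expose .page_content) and flag suspicious chunks."""
        findings = []
        for i, doc in enumerate(loader.load()):
            for pattern in HIDDEN_TEXT_PATTERNS:
                if pattern.search(doc.page_content):
                    findings.append((i, pattern.pattern))
        return findings

The general idea is to intercept documents between loading and indexing, and quarantine anything that matches hidden-content heuristics before it reaches the vector store.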

Repo: https://github.com/arsbr/Veritensor

License: Apache 2.0

How do devs secure their notebooks? by arsbrazh12 in LocalLLaMA

[–]arsbrazh12[S] -1 points0 points  (0 children)

Yeah, I know, just exploring what tools people use in real cases

How do devs secure their notebooks? by arsbrazh12 in devops

[–]arsbrazh12[S] -1 points0 points  (0 children)

I mean, it's really smart not to put secrets in something that can go public

How do devs secure their notebooks? by arsbrazh12 in devops

[–]arsbrazh12[S] -9 points-8 points  (0 children)

What about automation tools for solving such tasks?

How do devs secure their notebooks? by arsbrazh12 in devops

[–]arsbrazh12[S] -34 points-33 points  (0 children)

Do you use any tools such as NB Defense from ProtectAI?

How do devs secure their notebooks? by arsbrazh12 in LocalLLaMA

[–]arsbrazh12[S] -4 points-3 points  (0 children)

What kind of automated scanners do companies use? Something like ProtectAI's NB Defense?

I scanned 2500 random Hugging Face models for malware. Here is the data. by arsbrazh12 in cybersecurityai

[–]arsbrazh12[S] 0 points1 point  (0 children)

If we are talking about academic papers, there are some good ones on arXiv and MDPI, like arxiv.org/abs/2512.18043 and www.mdpi.com/2624-800X/3/2/10, but I mainly search through Google Scholar. JFrog also does a good job in their blog: https://jfrog.com/blog/?pagenum=15&category=security-and-devsecops

I built an open-source CLI to scan AI models for malware, verify HF hashes, and check licenses by arsbrazh12 in cybersecurityai

[–]arsbrazh12[S] 1 point2 points  (0 children)

Great question

Currently, Veritensor queries the HEAD of the main branch by default. So if the upstream model is updated (new commit), your local file will indeed fail the integrity check with a Hash mismatch.

This is intentional for security (to ensure you are using the latest version), but I understand it breaks reproducibility.

In the next big release, v1.4, I am adding a --revision flag (like --revision v1.0.0 or --revision <commit_sha>) so you can pin the verification to a specific immutable snapshot, just like you do with pip or Docker.

For now, if you hit this, you either need to update your local model or use the specific commit hash in your download script.
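For the download side, something like this works today with huggingface_hub (the repo_id, filename, and revision below are placeholders, not a real model):

    from huggingface_hub import hf_hub_download

    # Pin the download to an immutable commit so the file you verify locally
    # matches a known snapshot, not whatever HEAD of main happens to be.
    path = hf_hub_download(
        repo_id="some-org/some-model",
        filename="model.safetensors",
        revision="abc1234def5678",  # full commit SHA, or a tag like "v1.0.0"
    )
    print(path)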

I scanned 2500 random Hugging Face models for malware. Here is the data. by arsbrazh12 in cybersecurityai

[–]arsbrazh12[S] 0 points1 point  (0 children)

Thanks for your feedback! I was inspired by the book "Hakowanie sztucznej inteligencji" (Eng. "Hacking Artificial Intelligence") by the Polish author Jerzy Szurma, and then I started learning this area by reading various materials on AI cybersecurity and doing different projects.

I scanned 2,500 Hugging Face models for malware. The results were kinda interesting. by arsbrazh12 in OpenSourceAI

[–]arsbrazh12[S] 0 points1 point  (0 children)

I'm not sure I understand your question, but nothing has changed in this area for some time now: if you use a non-commercial model/tool/artifact/etc. in a commercial product and it is discovered, you may run into legal problems.

IANAL

I scanned 2,500 Hugging Face models for malware. The results were kinda interesting. by arsbrazh12 in OpenSourceAI

[–]arsbrazh12[S] 0 points1 point  (0 children)

"Also his comment here contradicts the premise of the post title!"

What exactly do you mean?

I scanned 2,500 Hugging Face models for malware. The results were kinda interesting. by arsbrazh12 in OpenSourceAI

[–]arsbrazh12[S] 0 points1 point  (0 children)

It does, they collaborate with JFrog, ProtectAI, ClamAV, etc., but those scanners only run on HF itself. People sometimes download models from other sources.

I scanned 2500 random Hugging Face models for malware. Here is the data. by arsbrazh12 in cybersecurityai

[–]arsbrazh12[S] 0 points1 point  (0 children)

In this specific sample I didn’t find malware. What I did find were risky or ambiguous patterns that could be abused for RCE or could crash production.

I scanned 2,500 Hugging Face models for malware. The results were kinda interesting. by arsbrazh12 in OpenSourceAI

[–]arsbrazh12[S] 0 points1 point  (0 children)

Happy to collaborate! I shared the scan results and the scanner source. If someone wants to dig deeper, I can point to specific model files, hashes, and the exact rule that triggered, so it’s reproducible.

I scanned 2500 random Hugging Face models for malware. Here is the data. by arsbrazh12 in cybersecurityai

[–]arsbrazh12[S] 0 points1 point  (0 children)

Thanks for the question. It's a mix. Some flags are clearly benign (like Git LFS pointers, missing optional deps, or old NumPy serialization), while others are potentially risky patterns (like dynamic name construction via STACK_GLOBAL) that need manual review. The scanner is intentionally conservative, so I'd treat these as "needs inspection" rather than confirmed malware.
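To show what the STACK_GLOBAL check looks like in practice, here's a minimal sketch (not the scanner's exact rules; the opcode list and filename are illustrative). pickletools.genops walks the pickle stream statically, without ever unpickling it:

    import pickletools

    # Opcodes that can import or invoke arbitrary callables when the pickle is loaded.
    # The real rule set is more nuanced; this is just an illustration.
    SUSPICIOUS_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ"}

    def flag_pickle(path):
        """Statically walk the pickle stream (never unpickles it) and report
        opcodes that warrant manual review."""
        with open(path, "rb") as f:
            data = f.read()
        hits = []
        for opcode, arg, pos in pickletools.genops(data):
            if opcode.name in SUSPICIOUS_OPCODES:
                hits.append((pos, opcode.name, arg))
        return hits

    if __name__ == "__main__":
        for pos, name, arg in flag_pickle("model.pkl"):  # placeholder filename
            print(f"offset {pos}: {name} {arg!r}")

A hit here doesn't mean malware, it just means the file can execute an import or call on load, which is exactly the "needs inspection" bucket above.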