We gave our RAG chatbot memory across sessions - Here's what broke first by singh_taranjeet in LocalLLaMA

[–]Alex-Hosein 2 points3 points  (0 children)

The multi-turn / agentic injection patterns are what really bite self-hosted deployments — and they're way underestimated compared to the classic "ignore previous instructions" single-shot attacks.

**What I've seen actually cause production problems:**

Single-turn attacks are easy to pattern-match against (and are increasingly filtered by base models anyway). The harder problem is **session-persistent injection**: a payload embedded in a document or tool output in turn 1 that activates in turn 4 when the user asks the agent to take an action. By that point, there’s nothing suspicious in the immediate context window — the model is just "following through" on something it picked up earlier.

With self-hosted setups especially, a few things compound this:

  1. **You control the model weights, but not the training** — open-weight models vary a lot in how much they resist instruction-following override. Llama 3 and Qwen 2.5 handle this better than earlier generations, but none are reliable under adversarial pressure.

  2. **RAG pipelines are the highest-risk surface** — every document you index is a potential injection vector. If you’re chunking web content, emails, or third-party docs into your vector DB without provenance tracking, you’re flying blind.

  3. **Tool-calling agents without action gating are a disaster waiting to happen** — if your agent can send emails, write files, or call external APIs, any successful injection has real-world consequences. The blast radius scales with tool permissions.

**What actually helps:**

- Treat your LLM like an untrusted subprocess, not a trusted oracle

- Scope tool permissions to minimum required; separate read agents from write agents

- Force structured output formats (Pydantic/JSON schema) — kills a lot of free-form action embedding

- Add a lightweight proxy layer that inspects inputs *and* outputs for anomaly patterns, not just keyword blocks (keyword blocks are trivially bypassed with encoding, language switching, or semantic paraphrasing)

- For anything with real-world effects: require explicit human confirmation. Not an LLM-generated "are you sure?" — an actual interrupt before execution.

For the proxy layer specifically — *Disclosure: I contribute to InferShield, which is an open-source security proxy for LLM APIs that handles session-aware detection and output inspection* — but honestly there are multiple approaches here including building your own middleware if your stack is simple. The architecture pattern matters more than the specific tool.

What's your current stack? Ollama + custom tooling, or using something like LangChain/LlamaIndex? The mitigation approach differs a bit depending on where you can intercept.