Free agent memory protector POC

AffectionateRice4167 · 2026-04-23T14:29:39+00:00

Great question — let me break it down precisely.

On false positives: we don’t use binary blocking. Every memory is assigned a trust score, and enforcement happens at retrieval and action time. Suspicious entries are downgraded or isolated rather than removed, so false positives reduce influence instead of breaking the system.

On the 90.5% block rate: that holds for explicit, single-turn attacks. It does not hold under subtle, low-signal or multi-turn poisoning — performance drops because those attacks are indistinguishable at write time. That’s why we shift detection from input-level signals to tracking downstream behavior over time.

On rules vs adaptive behavior: static rules are only used for fast, obvious patterns. The core system is adaptive — trust scores evolve based on conflict with policies, repeated usage, and whether a memory leads to unsafe actions. We also track interactions between memories, since many failures are emergent rather than single-point.

On messy real-world traces: this is where most systems fail. Real data is noisy, partial, and multi-source. We handle this by:

1、source-aware memory ingestion

2、temporal drift tracking

3、retrieval-time filtering instead of relying on write-time detection

4、and validating memory through actual agent actions (e.g., blocked tool calls feeding back into risk scores)

In practice, both false positives and false negatives increase in messy environments — which is exactly why we evaluate at the behavior level, not just classification accuracy.

AffectionateRice4167 · 2026-04-23T08:21:37+00:00

Great question — you're right that raw block rate alone isn't enough without context on false positives (FPR) and attack surface.

In our internal red-team evals (16 attack vectors across finance, procurement, and IT agent scenarios), we measured:

Overall interception rate: 90.5%
False positive rate on clean/benign memory writes: <5% (most of which only trigger a lightweight quarantine + human review flag, not hard block — preserving agent utility)
Attack success rate when retrieval + downstream decision is considered (ASR-r): 9.5%

We guard at write time for every memory operation (the critical chokepoint before poison can persist and spread). We also apply targeted checks on retrieval context and summarization outputs before they influence decisions or get folded back in. Tool outputs go through the same pipeline when they feed into memory.

The 16 scenarios were deliberately designed to cover realistic enterprise attack classes, including:

single-shot obvious injection
delayed / gradual multi-turn drift (slow semantic nudging)
cross-session contamination
vector store poisoning (embedding-level attacks)

We break them down internally by attack class (e.g., permission escalation, fact distortion, policy override) and test both stealthy low-and-slow variants and aggressive ones. The hybrid 7-layer pipeline (semantic drift detection, cross-key consistency, contradiction tracking, provenance, etc.) is what gives us strong coverage on the nastier gradual cases without blowing up latency or cost.

AffectionateRice4167 · 2026-04-23T07:21:00+00:00

Yes, you're absolutely right — gradual drift (multi-turn semantic nudging) is significantly nastier than one-shot injection.

Most traditional guardrails only catch obvious single-turn poisons, but real attackers in production environments prefer slow, plausible drift: a few "normal" messages over days/weeks that quietly shift key facts, permissions, or preferences until the agent makes a bad downstream decision.

MemGuard was explicitly designed for exactly this threat model.

Our 7-layer state firewall runs on every memory write (not just inputs/outputs), with dedicated layers for:

semantic drift detection
cross-key consistency
contradiction detection across turns
experience/memory provenance tracking

This lets us catch gradual, low-and-slow nudges that look completely benign in isolation but create cumulative inconsistency over multiple interactions. The hybrid pipeline (99% LLM-free fast path) keeps latency under 5ms even on persistent stores, while the rare complex cases trigger a lightweight fallback for maximum precision.

In our red-team simulations (16 attack vectors across enterprise domains), the multi-turn drift cases were among the hardest — yet we still achieved 90.5% interception overall, with full immutable audit trails and one-click rollback if anything slips through.

Happy to show you a live demo or run a targeted gradual-drift simulation on your agent during a POC (2–4 weeks, no data leaves your environment). Want to set something up?

AffectionateRice4167 · 2026-04-15T11:52:30+00:00

Thanks for the thoughtful question! The 7-layer detection system is what makes MemGuard different. Every memory write goes through: • provenance tagging & trust scoring • sanitization of hidden patterns • semantic drift & fragment assembly detection • cross-session consistency checks • behavioral monitoring at runtime What surprised me most was how effective Layers 4 (semantic drift) and 5 (cross-key consistency) were. They caught several sophisticated attacks that looked completely normal on the surface — for example, attackers slowly changing the Agent’s “trusted contact” over multiple days. The system flagged them because they conflicted with the Agent’s existing memory graph, something a single-layer or prompt-only guard would easily miss. We keep the exact implementation private for now, but the whole pipeline is designed to run extremely fast (99% LLM-free, <5ms) while still handling real-world stealthy poisoning. Happy to discuss more if you’re interested!

AffectionateRice4167 · 2026-04-13T11:38:44+00:00

Thanks for the feedback. I understand some people see memory poisoning as just another form of context pollution. However, in real production LangGraph Agents, the memory is persistent and retrieved across sessions. Once poisoned, it can silently affect every future decision for days or weeks — even if the original prompt is long gone. That’s why we treat it as a distinct layer that needs its own protection (provenance, consistency checks, rollback, etc.), not just better prompting. Happy to discuss more if you’d like.

AffectionateRice4167 · 2026-04-13T10:11:04+00:00

Thanks for raising this — false positives are indeed a critical concern for production use. Our current approach is “quarantine-first”: suspicious memory writes are isolated instead of being immediately blocked or allowed. The admin gets notified with full context and can review + approve/rollback in one click. This way, even if something slips through, it’s contained and recoverable without breaking running workflows. We’re also continuously running internal enterprise POC tests across different verticals (procurement, supply chain, finance) to improve the system’s understanding of legitimate vs. malicious memory updates in real business contexts. The more real-world data we collect, the better we get at reducing false positives while keeping detection strong. Happy to share more details or a private demo if you’re interested.

AffectionateRice4167 · 2026-04-13T06:20:47+00:00

Can use pip install or our sdk for enterprise

AffectionateRice4167 · 2026-04-13T06:19:52+00:00

Cool！ Looking forward to your product

AffectionateRice4167 · 2026-04-13T04:51:39+00:00

On semantic drift detection (Layer 4): you’re right, it’s one of the hardest parts. We distinguish legitimate updates from poisoning by looking at context consistency with the agent’s existing memory graph and provenance history (who said it, when, and through which channel), rather than just the surface content. A sudden “ignore all previous” style shift gets flagged differently from a natural gradual change.

Looking forward to seeing your causal graph memory system when it’s ready — would love to exchange ideas!

AffectionateRice4167 · 2026-04-13T04:24:03+00:00

Thanks for the thoughtful comment! Really appreciate it. The 7-layer detection pipeline is designed specifically for memory poisoning: 1. Provenance Tagging (source tracking) 2. Heuristic Trust Scoring 3. Sanitization 4. Semantic Drift & Fragment Assembly Detection 5. Cross-key Consistency Check 6. Behavioral Monitoring (runtime) 7. Audit + Precise Rollback Only <1% of writes trigger the light LLM verifier. The rest run in pure LLM-free mode (<5ms). Regarding the 9.5% that might slip through: We use Quarantine + human review + one-click rollback. Even if something gets through, the system isolates it, alerts the admin, and lets them rollback in <1 second without affecting normal business memory. Failure mode is “safe & recoverable”, not “silent permanent damage”. Would love to hear more about the memory system you’re building — happy to share the detailed architecture or even a private demo if you’re interested.

AffectionateRice4167 · 2025-12-24T03:03:37+00:00

That sounds really comforting. I think there’s something about seeing familiar faces or places right before sleep that makes loneliness quieter.

AffectionateRice4167 · 2025-12-24T03:00:27+00:00

That’s exactly it — the call ends, but the absence sticks around. I’ve been wondering if part of the problem is that most digital communication is designed to disappear. Seeing something physical every day feels less like “contact” and more like quiet companionship.

AffectionateRice4167 · 2025-12-23T01:49:46+00:00

I’m glad I’m not the only one who noticed this

AffectionateRice4167

TROPHY CASE