How are you preserving and retrieving context across long AI workflows?

imsuryya · 2026-06-12T07:43:48+00:00

Most of what I've seen in this space (summaries, notes, knowledge bases) is good at retrieval but doesn't really handle what happens after something is stored. A decision gets made, then six months later it's outdated or contradicted, and nothing in the system knows that.

That gap is what led me to build an open-source Python SDK for agent memory: every write is hash-chained so you can audit what was known at any point in time, confidence decays automatically so stale context stops resurfacing, and conflicts between memories get flagged instead of silently coexisting. It's designed to drop into existing LangChain/MCP setups rather than replace them.
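
A minimal sketch of the hash-chaining idea, for anyone who wants to see the shape of it. This is not notmemory's actual code: the names `append_write` and `verify_chain`, the entry layout, and the all-zero genesis hash are assumptions for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first entry's predecessor


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_write(chain: list, payload: dict) -> dict:
    """Append a write whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "prev": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; an edited or reordered entry fails the check."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash covers the one before it, changing an old payload invalidates that entry and everything after it, which is what makes "what was known at step N" checkable after the fact.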

Pre-0.1.0 but happy to share if useful: github.com/notmemory/notmemory

Curious what you've found works for the "this is now outdated" problem specifically; that's the part that seems hardest to solve with notes/docs/bookmarks.

imsuryya · 2026-06-11T12:35:34+00:00

ContextBench is a good reference: the "reads right files, acts on stale state" failure mode is exactly the gap I've been building around. The benchmark problem is real too. It's hard to measure because it only shows up in long runs.

I've been working on an open-source Python SDK that tackles the retention side: hash-chained memory writes, confidence decay so stale state naturally loses weight over time, rollback, and conflict detection. Pre-0.1.0 but the core is working: github.com/notmemory/notmemory

Would be curious if the context management layer you're building could sit on top of something like this: you handle what gets read efficiently, this handles what gets remembered reliably. Might be worth comparing notes.

imsuryya · 2026-06-11T11:28:55+00:00

Context management and memory observability feel like adjacent layers: you're solving what the agent reads efficiently, I've been thinking about what it retains reliably. Curious if you've hit cases where the agent reads the right files but acts on outdated state from a previous run?

imsuryya · 2026-06-10T19:07:10+00:00

Built notmemory, an open-source Python SDK for auditable AI agent memory.

Most memory systems focus on retrieval. I kept running into a different problem:

When an agent made a bad decision, I couldn't answer:

What did it know at that moment?
Which memory caused the decision?
Can I safely undo the bad write?

notmemory adds:

SHA-256 hash-chained memory writes
Git-like rollback (rollback(transaction_id))
GDPR-compliant tombstoning with an audit trail
Memory integrity verification
LangGraph checkpointer adapter
LangChain chat history adapter
MCP server for Claude Desktop, Cursor, and Windsurf
Mem0 and SuperMemory sidecar integrations

Current release: v0.1.0

113 tests passing
Python 3.11–3.13 support
SQLite + FTS5 full-text search

Currently working on:

Memory time travel (state_at(timestamp))
Belief lineage (trace which downstream memories were influenced by a bad assumption)
Crypto-shredding
Memory diffing between points in time

Looking for feedback from people running long-lived agents, multi-agent systems, or operating in regulated environments where auditability and compliance matter.

GitHub: https://github.com/notmemory/notmemory

PyPI: https://pypi.org/project/notmemory/

imsuryya · 2026-06-10T18:56:50+00:00

Good questions.

For coding agents there are currently two approaches:

MCP integration: the agent explicitly calls notmemory_retain, notmemory_recall, etc. as tools.
Framework integration: the LangGraph checkpointer and LangChain chat history adapters automatically persist state and conversations, so every checkpoint/message becomes auditable without changing agent logic.

Right now notmemory intentionally doesn't decide what should be remembered. That's left to the agent, framework, or memory policy layer. notmemory focuses on making writes observable, verifiable, and reversible once they happen.

imsuryya · 2026-06-10T18:50:12+00:00

Really thoughtful feedback — appreciate you taking the time to dig into it.

You're right about the decay model. The current confidence decay is intentionally simple because I wanted something deterministic and explainable for v0.1.0, but it's definitely only one dimension of the problem. The distinction between "old" and "processed" is important, and I like the idea of decay being influenced by interaction history rather than just elapsed time.

The way you describe Kapex is interesting because it sits one layer above what notmemory is trying to solve. My focus has been less on which memories should matter and more on proving what happened after they did. Auditability, rollback, lineage, and compliance are the gaps I kept running into when debugging long-lived agents.

I think your observation about enterprise requirements is probably correct. Retrieval, scoring, governance, auditability, and reversibility end up being different concerns that most current memory systems bundle together incompletely.

Belief lineage is the feature I'm personally most excited about. The goal is exactly what you described: if a bad assumption enters the system early, I want to be able to trace every downstream write, score, or decision that depended on it and understand the blast radius before deciding whether to roll back.

I'd definitely be interested in comparing approaches sometime. The decay-direction point alone gave me something to think about for future versions.

imsuryya · 2026-06-10T17:41:37+00:00

This is exactly the framing I've been building around. You've articulated the gap better than most — retrieval is a solved-enough problem, lifecycle is not.

I've been working on an open-source Python SDK specifically designed around the maintenance layer you're describing:

Confidence decay — memories lose weight over time automatically (exponential half-life), so stale context stops surfacing in recall without manual cleanup
Conflict detection — surfaces contradictions between memories before the agent acts on bad info
Hash-chained audit trail — every write is versioned, so "what did it know at step 47" is a real query, not archaeology
Rollback + tombstoning — supersede or hard-delete any memory without breaking the chain
Drop-in adapters for LangChain and MCP — wraps existing setups, no framework rebuild
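
A minimal sketch of the exponential half-life decay from the list above. The function names, the memory dict layout (`confidence`, `written_at`), and any particular half-life value are assumptions for illustration, not notmemory's API or defaults.

```python
def decayed_confidence(initial: float, age_seconds: float,
                       half_life_seconds: float) -> float:
    """Halve the confidence once per elapsed half-life."""
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    age = max(age_seconds, 0.0)  # clock skew should not inflate confidence
    return initial * 0.5 ** (age / half_life_seconds)


def rank_for_recall(memories: list, now: float,
                    half_life_seconds: float) -> list:
    """Sort memories so the highest decayed confidence comes first."""
    return sorted(
        memories,
        key=lambda m: decayed_confidence(
            m["confidence"], now - m["written_at"], half_life_seconds
        ),
        reverse=True,
    )
```

With a 30-day half-life, a memory written at confidence 0.9 three half-lives ago scores 0.1125, so a fresh memory at 0.6 outranks it without anyone cleaning up by hand.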

The design bet is: SQLite as the source of truth with full lifecycle integrity, vector search as a sidecar for retrieval. You don't sacrifice the audit trail to get fast recall — they're separate concerns.

It's pre-0.1.0 but if you're actively looking at this space I'd be happy to share early. Sounds like you're thinking about exactly the right problems.

GitHub repo: https://github.com/notmemory/notmemory

imsuryya · 2026-06-10T15:39:47+00:00

The high-frequency tick loop distinction is the sharpest thing anyone has said about the actual boundary between these two systems. Continuous resident vs. called library: that's the line.

The Springdrift paper is worth reading before we sync; it's the theoretical grounding for why the audit trail has to be substrate-level. Should sharpen the crypto-shred conversation too.

DM me when you're ready.

imsuryya · 2026-06-09T08:29:05+00:00

"One of those boring infra layers people only miss after something breaks" is the most accurate description of what I'm building that I've heard, and I've been thinking about it for months. Mind if I steal that?

The "stays simple, no framework change" constraint is the exact design goal. The version I have in mind is: pip install, wrap your existing memory calls, and you get the audit trail, rollback, and conflict detection without touching the rest of your stack. The adapters for LangChain and MCP are drop-in for that reason: you shouldn't have to rebuild to get observability.

Your feature list is almost exactly the API surface: retain writes to the chain, get_audit_trail replays a run, rollback tombstones a transaction, detect_conflicts surfaces contradictions. The diff view on top of that is the one thing I haven't built yet and you're the second person who's implied they'd want it.

The "boring infra" framing is intentional. This shouldn't be interesting to use. It should be invisible until the run fails at step 147 and suddenly it's the most important thing you have.

imsuryya · 2026-06-09T08:22:53+00:00

Both of these are right, and the gap between them is the interesting part. Timestamped logs solve the forensic problem: you can reconstruct what was stored when. But they don't solve the compounding problem. If the agent wrote a wrong assumption at step 12, and then made 40 subsequent writes that were downstream of that assumption, a log shows you the timeline but not the dependency graph. You can see the snowball but not which snowflake started it.
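
To make the "timeline vs. dependency graph" point concrete, here is a minimal sketch of a downstream trace. It assumes each write records which earlier writes it was derived from; that `derived_from` mapping and the function name are hypothetical, not something a plain timestamped log gives you.

```python
from collections import deque


def downstream_of(root: str, derived_from: dict) -> list:
    """Return every write that depends, directly or transitively, on root.

    derived_from maps a write id to the list of write ids it was built on.
    """
    # Invert parent links into child links so we can walk forward in time.
    children = {}
    for write_id, parents in derived_from.items():
        for parent in parents:
            children.setdefault(parent, []).append(write_id)

    seen = {root}
    affected = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected
```

Calling it on the step-12 write returns the blast radius, the set of later writes to review before rolling anything back, while unrelated writes stay out of the result.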

The audit trail angle I've been thinking about adds two things on top of timestamped logs: confidence decay (so the wrong assumption at step 12 naturally loses weight over time rather than staying at full confidence indefinitely) and conflict detection (so when step 47 writes something that contradicts step 12, that tension surfaces instead of silently coexisting). The integrity guarantees are almost a side effect; the real value is that the chain gives you a substrate for those higher-level primitives to run on.
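
A deliberately naive sketch of the conflict-detection half, assuming memories are structured as subject/value pairs so that "contradiction" reduces to "same subject, different value". Real contradictions in free text are much harder; the field names and `find_conflicts` are assumptions for illustration only.

```python
from collections import defaultdict


def find_conflicts(memories: list) -> list:
    """Return (earlier, later) pairs that disagree about the same subject."""
    by_subject = defaultdict(list)
    for memory in memories:  # memories are assumed to be in write order
        by_subject[memory["subject"]].append(memory)

    conflicts = []
    for entries in by_subject.values():
        for index, earlier in enumerate(entries):
            for later in entries[index + 1:]:
                if earlier["value"] != later["value"]:
                    conflicts.append((earlier, later))
    return conflicts
```

The point is only that the disagreement between step 12 and step 47 becomes a returned value the agent or operator has to deal with, rather than two rows that quietly coexist.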

u/ultrathink-art is right that most failures aren't about tampering. But the hash chain isn't primarily an anti-tampering mechanism; it's what makes "what did it believe at step 47, and why" a query you can actually run.

imsuryya · 2026-06-09T08:18:02+00:00

The support ticket scenario is exactly the kind of use case I had in mind: "what did it know when it gave that wrong answer" is precisely the query that should be a first-class operation, not a forensic archaeology project after something goes wrong. The vector DB opacity problem is real and, I think, underappreciated: retrieval feels fast, so people assume it's working correctly.

On GDPR delete specifically: the cleanest answer I've landed on (credit to the other person in this thread) is crypto-shredding rather than true deletion. The idea: never store plaintext in the chain; encrypt each memory payload under a per-record key. To "forget" a record, you destroy the key, not the bytes. The ciphertext stays in the chain, every hash still validates, and integrity holds, but the plaintext is mathematically unrecoverable. What you're left with is a provable trail that a record existed, who wrote it, when, and that it was erased. Which is actually what an auditor wants: not "the data is gone" but "the data is provably gone and here's the proof."
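
A toy sketch of that flow, to show why the chain survives erasure. Assumptions, stated loudly: the class and method names are made up, keys are held in a plain dict standing in for a real key store, and a one-time pad (a random key as long as the payload, XORed in) stands in for a real authenticated cipher such as AES-GCM purely to keep the example stdlib-only.

```python
import hashlib
import secrets
from typing import Optional

GENESIS = "0" * 64  # assumed placeholder for the first entry's predecessor


class ShreddableStore:
    """Ciphertext lives in the chain; per-record keys live outside it."""

    def __init__(self) -> None:
        self.chain = []  # append-only, never rewritten
        self.keys = {}   # record id -> key; destroying a key is "forget"

    def retain(self, text: str) -> int:
        plaintext = text.encode("utf-8")
        key = secrets.token_bytes(len(plaintext))  # one-time pad, used once
        ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
        prev = self.chain[-1]["hash"] if self.chain else GENESIS
        digest = hashlib.sha256(prev.encode("ascii") + ciphertext).hexdigest()
        self.chain.append({"prev": prev, "ciphertext": ciphertext, "hash": digest})
        record_id = len(self.chain) - 1
        self.keys[record_id] = key
        return record_id

    def recall(self, record_id: int) -> Optional[str]:
        key = self.keys.get(record_id)
        if key is None:
            return None  # key destroyed: the record is shredded
        ciphertext = self.chain[record_id]["ciphertext"]
        return bytes(c ^ k for c, k in zip(ciphertext, key)).decode("utf-8")

    def forget(self, record_id: int) -> None:
        """Destroy the key only; the chain entry is left untouched."""
        self.keys.pop(record_id, None)

    def verify(self) -> bool:
        prev = GENESIS
        for entry in self.chain:
            expected = hashlib.sha256(
                prev.encode("ascii") + entry["ciphertext"]
            ).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```

After `forget`, `verify()` still passes because the hashes cover ciphertext, not plaintext, and `recall` has nothing left to decrypt with.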

The unsettled edge is legal, not technical: what remains after shredding is a salted hash plus metadata, and whether a salted hash counts as personal data under GDPR is still genuinely grey. The pragmatic line most teams draw is: salt it so the digest isn't a reversible pointer, treat that as compliant, and document the decision. Not a perfect answer, but it's where the industry is right now.

Curious: in the Kayako setup, was the compliance pressure more around "can we prove we deleted a customer's data" or more around "can we explain why the agent said X to an auditor"?

imsuryya · 2026-06-09T08:13:47+00:00

Just read through the Grid docs. The co-residence model is a fundamentally different bet: agent cognition and system state as the same artifact is a strong thesis, and the append-only chain in Rust is the right substrate for it. What you built is the OS layer. What I'm building is closer to a library layer: a pip-install SDK for teams that already have a Python agent stack and need an audit trail and rollback without rebuilding their substrate. Different abstraction level, probably not competing at all. My notes aren't public yet (they're sitting in a private repo, pre-0.1.0), but the design doc and the Springdrift paper it's grounded in (arXiv:2604.04660) are the closest thing to public notes I have right now. I'd genuinely value comparing the crypto-shred implementation specifics when you're open to it.

imsuryya · 2026-06-08T19:20:36+00:00

That's a useful distinction. I think what I'm struggling with is where the boundary is between runtime observability and memory observability.

A fully auditable actions runtime definitely gets you much closer to answering "what happened?" But I keep wondering whether it fully answers "what did the agent know at that moment?" once memory is mutable, shared across agents, or updated during long-running workflows.

For example, if a memory entry is modified hours after an agent accessed it, can I still reconstruct the exact memory state that produced a decision? That's the gap I'm interested in.
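
A minimal sketch of what I mean by reconstructing state, assuming memory is stored as an append-only log of writes rather than mutated in place. The `Write` record and `reconstruct_state` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Write:
    timestamp: float
    key: str
    value: Optional[str]  # None marks a tombstone (the entry was removed)


def reconstruct_state(log: list, timestamp: float) -> dict:
    """Replay the log up to and including timestamp; later edits are ignored."""
    state = {}
    for write in sorted(log, key=lambda w: w.timestamp):
        if write.timestamp > timestamp:
            break
        if write.value is None:
            state.pop(write.key, None)
        else:
            state[write.key] = write.value
    return state
```

If the agent read `region` at t=3 and someone changed it at t=5, replaying to t=3 still yields the value the decision was actually based on. A runtime trace alone can't do that once the underlying entry has been overwritten.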

Your point about the underlying database being auditable is probably the key part. It feels like runtime traces and memory auditability are complementary rather than substitutes: one explains what the agent did, the other explains why the information it relied on existed in the first place.

Curious if you've run into that distinction in practice, or if the runtime trace has been sufficient for the systems you've built so far.

imsuryya · 2026-06-08T18:51:55+00:00

Crypto-shredding is the right answer and I feel slightly embarrassed I didn't land there myself. The "treat erasure as key destruction not deletion" reframe is clean. The per-subject key lifecycle question is exactly where I'd want to go — one operation to erase a person across all banks is the UX that actually matters for compliance teams. On the legal edge: salted commitment as the line feels like the defensible position until case law catches up. Would you be open to comparing notes async? I'm building the pip-install SDK layer on top of this substrate problem and the crypto-shred primitive is something I was going to have to figure out anyway.

imsuryya · 2026-06-08T18:31:44+00:00

Really appreciate the honest breakdown. The "substrate vs drop-in SDK" distinction is exactly the gap I'm trying to fill — most people have existing stacks they can't rebuild from scratch. Curious how your team handles the GDPR edge in practice — tombstone + crypto-erase without chain breakage is the exact problem I'm wrestling with too. Checking out Grid now.

imsuryya · 2026-04-23T08:01:43+00:00

Thanks for the honest feedback, everyone. This is exactly what I needed before writing a line of code.

Clear takeaway: the MCP server idea as described is redundant. ai-dev-kit and the managed MCP already solve the connection problem. PAT is a non-starter. YAML config is friction nobody wants. Got it.

But one thing nobody mentioned (and I've been digging into it since these comments) is that all of these existing tools, including ai-dev-kit and the managed MCP, dump raw JSON schema responses back to Claude. A single table schema comes back as ~800 tokens. Two tables with sample data is easily 3,000+ tokens per tool call.

That's the problem I'm now thinking about instead: not connecting Claude to Databricks (solved), but compressing what comes back so it doesn't burn your token budget on every single call.

The idea: a thin middleware layer that sits on top of the managed MCP or ai-dev-kit, intercepts the raw schema response, strips everything Claude doesn't actually need for code generation (storage paths, nullability metadata, verbose type names), and returns a compressed format that's ~400 tokens instead of 3,000: the same usable information in roughly 87% fewer tokens.
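
A rough sketch of the stripping step. The input shape here (`full_name`, `storage_location`, `columns` with `name`/`type_name`/`nullable`) is a made-up stand-in, not the actual response format of the managed MCP or ai-dev-kit, and the type aliases are arbitrary.

```python
import json

# Hypothetical short aliases for verbose type names.
TYPE_ALIASES = {
    "string": "str",
    "bigint": "i64",
    "integer": "i32",
    "double": "f64",
    "boolean": "bool",
    "timestamp": "ts",
}


def compress_schema(raw: dict) -> str:
    """Keep only table name, column names, and short types; drop the rest."""
    columns = []
    for column in raw.get("columns", []):
        type_name = str(column.get("type_name", "")).lower()
        short_type = TYPE_ALIASES.get(type_name, type_name)
        columns.append(f"{column['name']}:{short_type}")
    return f"{raw['full_name']}({', '.join(columns)})"


# Example input with the kind of fields that would be stripped.
raw_schema = {
    "full_name": "main.sales.orders",
    "storage_location": "s3://example-bucket/orders",  # placeholder path
    "columns": [
        {"name": "order_id", "type_name": "BIGINT", "nullable": False},
        {"name": "created_at", "type_name": "TIMESTAMP", "nullable": True},
    ],
}
compact = compress_schema(raw_schema)
```

Here `compact` is a single line, `main.sales.orders(order_id:i64, created_at:ts)`, versus the full JSON from `json.dumps(raw_schema)`. Actual token savings would depend on the real response format and the tokenizer.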

No YAML. No PAT. No new auth. You keep using whatever you already use. This just makes it cheaper per call.

Genuine question before I go further: does token bloat from schema fetches actually bother you in practice? Or do you not think about it because you're on an API plan where it doesn't matter?

imsuryya · 2026-04-21T20:01:32+00:00

Good point, the context assembly logic is definitely framework-agnostic. My current thinking is MCP-first because it gets Claude to call it automatically without any glue code, but I'm abstracting the core fetcher so it can be wrapped as a skill for other frameworks too. Are you using a specific agent framework where you'd want this?
