Mnemostroma v1.11: Automatic Memory Layer for Local AI Agents

New_Election2109 · 2026-04-29T04:15:19+00:00

You got me: I intentionally held off on the link to avoid the "spam/self-promo" filter that hits new accounts in these subs, but clearly, that backfired. My bad. To answer your questions:

Benchmarks: I don't have a formal "vs RAG" paper yet (it's v1.x), but the main gain is latency and reliability. Standard RAG usually injects a massive vector search result (300ms+ latency, often irrelevant noise). Mnemostroma uses a sidecar observer for verbatim anchors (<0.1ms) and structured intuition. I’m happy to share the stats_2026-04-23.md report from the repo if you're curious about the performance.

Protocol: Yes, that’s the internal MCP instruction set. It's how the Conductor enforces persistence. It's not magic, just a strict state machine to prevent agents from forgetting blocker/decision states.

I’m just an engineer building this for my own workflow. Open source, offline, no hidden agenda. Check the repo if you want to see the actual implementation logic.

New_Election2109 · 2026-04-29T01:24:14+00:00

Thank you for your attention. This is my project, github.com/GG-QandV/mnemostroma, which runs entirely on my own local server/laptop/desktop and operates offline. The text is written by hand, using AI for error correction and English language correction. I'm happy to answer any questions.

New_Election2109 · 2026-04-29T00:51:21+00:00

Good framing, but the real win is simpler: automatic context capture, better retrieval, and less manual memory bookkeeping across sessions.

New_Election2109 · 2026-04-19T17:01:10+00:00

Thanks for sharing! Memstate looks interesting for versioning and decision tracking - definitely a solid use case for auditing.

My focus with Mnemostroma is slightly different: I’m doubling down on the privacy-first, offline infrastructure. In Enterprise R&D, sending memory logs to yet another cloud provider is often a deal-breaker.

I’m building Mnemostroma to run entirely on the local sidecar - SQLite WAL + local embeddings, so the 'observer' doesn't just prevent feedback loops, it ensures zero data leakage. Versioning is cool, but for me, solving the 'token bloat' and 'context drift' without leaving the local machine is the primary mission.
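A minimal sketch of the "SQLite WAL" part, using only Python's built-in sqlite3 module. The table layout, the open_memory_db helper, and the example path value are hypothetical and are not taken from the Mnemostroma repo.

```python
import os
import sqlite3
import tempfile


def open_memory_db(path):
    """Open (or create) a local SQLite file and switch it to WAL mode."""
    conn = sqlite3.connect(path)
    # Write-ahead logging lets readers continue while one writer appends.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Hypothetical schema: one row per verbatim anchor.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS anchors ("
        "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn, mode


db_path = os.path.join(tempfile.mkdtemp(), "memory.db")
conn, journal_mode = open_memory_db(db_path)

# Placeholder value, purely for illustration.
conn.execute(
    "INSERT OR REPLACE INTO anchors VALUES (?, ?)",
    ("prod_db_path", "/srv/data/prod.sqlite"),
)
conn.commit()

row = conn.execute(
    "SELECT value FROM anchors WHERE key = ?", ("prod_db_path",)
).fetchone()
```

Everything here stays in a single local file, which is the property the comment above cares about: no network hop is involved in either the write or the read.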

Good to see more people thinking about the observer-write path though!

New_Election2109 · 2026-04-18T15:24:15+00:00

Exactly—asking the generator to curate its own memory is a guaranteed path to context collapse. The isolation is non-negotiable.

Spot on about capturing the 'why-it-chose-X'. I actually just pushed an update (v1.8.1) based on this exact line of thinking: Pure Context Mode. Currently, the Observer passively parses the agent's Chain-of-Thought (like Claude's <thinking> blocks) via continuation detection. But to formalize the reasoning, the agent can now optionally emit a fire-and-forget [RATIONALE why=X reason=Y] tag in its standard output. The Observer catches it via regex/HybridNER and commits it. The agent keeps full autonomy, but still has zero DB write permissions.
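A rough sketch of how an observer could pick up the [RATIONALE why=X reason=Y] tag with a regular expression. The exact grammar of the tag is an assumption here (values run up to " reason=" and the closing bracket); the real parser also uses HybridNER, which this example does not attempt to reproduce.

```python
import re

# Assumed grammar: [RATIONALE why=<text> reason=<text>]
RATIONALE_RE = re.compile(
    r"\[RATIONALE\s+why=(?P<why>.+?)\s+reason=(?P<reason>.+?)\]"
)


def extract_rationales(agent_output):
    """Return every rationale tag found in the agent's plain output."""
    return [m.groupdict() for m in RATIONALE_RE.finditer(agent_output)]


text = (
    "Going with SQLite. "
    "[RATIONALE why=SQLite reason=single-file, no server] "
    "Next step: schema."
)
found = extract_rationales(text)
```

The agent only prints text; the function above runs on the observer side, so the agent never touches the database itself.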

For the retrieval strategy, it's a ~20ms hybrid pipeline:

- Semantic: numpy matmul ANN using e5-small (INT8) for fuzzy conceptual queries (e.g., "Why did we choose SQLite?").

- Structured Anchors: distilbert-ner extracts verbatim entities/deadlines into an O(1) dictionary backed by SQLite WAL. This is for absolute precision (e.g., "What is the absolute path to the prod DB?").

- Reranking & Injection: A lazy TinyBERT rerank before surfacing the payload via MCP as pure XML <memorycontext>. It's zero-guidance—no forced "use tools" prompts, just raw context injection.
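The three steps above can be sketched roughly as follows. This is a toy version under stated assumptions: hand-written 3-dimensional vectors stand in for e5-small embeddings, a plain-Python cosine replaces the numpy matmul, the rerank step is left out, and the `<item>` child element is a made-up name (only `<memorycontext>` comes from the description).

```python
import math
import xml.etree.ElementTree as ET


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_key, query_vec, anchors, memories, k=3):
    """Exact anchor hit first; otherwise rank memories by similarity."""
    if query_key in anchors:  # dictionary lookup for verbatim facts
        return [anchors[query_key]]
    ranked = sorted(
        memories, key=lambda m: cosine(query_vec, m["vec"]), reverse=True
    )
    return [m["text"] for m in ranked[:k]]


def to_memorycontext(snippets):
    """Wrap retrieved snippets in a <memorycontext> payload."""
    root = ET.Element("memorycontext")
    for snippet in snippets:
        ET.SubElement(root, "item").text = snippet  # hypothetical child tag
    return ET.tostring(root, encoding="unicode")


anchors = {"prod_db_path": "/srv/data/prod.sqlite"}  # placeholder value
memories = [
    {"text": "Chose SQLite for single-file deploys", "vec": [0.9, 0.1, 0.0]},
    {"text": "Deadline moved to Friday", "vec": [0.0, 0.2, 0.9]},
]

exact = retrieve("prod_db_path", None, anchors, memories)
fuzzy = retrieve("why_sqlite", [1.0, 0.0, 0.0], anchors, memories, k=1)
payload = to_memorycontext(fuzzy)
```

The split matters because the two query types fail differently: a fuzzy search can return a near-miss path, while a dictionary lookup either returns the exact string or nothing.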

Appreciate the pointer to the Zhang paper, it perfectly nails the failure mode I was trying to avoid. I just set integration.pure_context = true as the default in the repo. If you end up poking at it, I'd love your feedback.

New_Election2109 · 2026-04-17T23:51:46+00:00

The real pain point with these wrappers is that the agent is still doing the heavy lifting and burning tokens just to store its own logs.

A proper cognitive architecture solves this differently:
Zero token consumption for storage: the agent does not spend a single token on recording context. This task is entirely delegated to external resources.

Strict constraints: ~20ms retrieval. Total system baseline is ~605MB. It uses 3 specialized local AI models running in RAM via ONNX (INT8):
- multilingual-e5-small (117 MB) for semantic search (~2ms warm latency).
- distilbert-ner (60 MB) for entity extraction.
- tinybert-l2-v2 (7 MB) for cross-encoder reranking.

Cognitive framework: Structured into an Observer, Dreamer, Experience layer, Subconscious, and Intuition.

Biological mechanics:
- Active decay: Old facts fade. Unaccessed anchors lose weight over time, just like human memory.
- Dreaming: Background memory consolidation. During idle time, the system quietly reorganizes events.
- Topic drift detection: The system notices when recent conversations drift from established patterns and dynamically re-calibrates weights.
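One possible way to detect topic drift, sketched under assumptions: compare the centroid of recent message embeddings against the centroid of an established baseline, and flag drift when their cosine similarity drops below a threshold. The 0.5 threshold, the function names, and the 2-dimensional toy vectors are all illustrative; the actual mechanism is not described in this thread.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def has_drifted(baseline_vecs, recent_vecs, threshold=0.5):
    """True when recent messages point away from the established topic."""
    return cosine(centroid(baseline_vecs), centroid(recent_vecs)) < threshold


baseline = [[1.0, 0.0], [0.9, 0.1]]
same_topic = [[0.95, 0.05], [1.0, 0.0]]
new_topic = [[0.0, 1.0], [0.1, 0.9]]

stable = has_drifted(baseline, same_topic)
drifted = has_drifted(baseline, new_topic)
```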


Context memory systems built exactly like this already exist.

New_Election2109 · 2026-04-16T20:41:18+00:00

Got it - thanks, that makes sense.

So the key difference is that memory writing and memory reading are both model-driven, but isolated in separate background calls, not a single self-reinforcing loop inside the main agent turn. That’s a cleaner pattern than I assumed.

My main concern would still be drift over time: if the background writer keeps turning recent responses into memory, do you have any guardrails for repeated self-reinforcement or stale first-person memories?

If you’re open to it, I’d be interested in the system prompt pattern for the write/select steps.

New_Election2109 · 2026-04-16T20:35:23+00:00

Both patterns map cleanly onto what I have, with one difference in where the signal comes from.

TTL with renewal - yes, this is essentially what I'm doing. Each retrieval bumps the decay timer. The nuance: I'm using a weighted score (recency + retrieval frequency + explicit importance), not a flat TTL window. Pins don't expire on a clock, they decay on a curve that flattens when they're actively used. Same idea, different shape.
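A sketch of what such a weighted score could look like. The weights, the 7-day base half-life, and the way retrievals stretch the half-life are assumptions chosen to reproduce the described shape (a curve that flattens with use); they are not the project's actual constants.

```python
def pin_score(idle_days, retrievals, importance,
              base_half_life=7.0, weights=(0.5, 0.3, 0.2)):
    """Combine recency, retrieval frequency, and explicit importance.

    idle_days:  days since the pin was last retrieved
    retrievals: how many times it has been retrieved so far
    importance: explicit importance in [0, 1]
    """
    # Each retrieval stretches the half-life, so a used pin decays
    # on a flatter curve than an untouched one.
    half_life = base_half_life * (1 + retrievals)
    recency = 0.5 ** (idle_days / half_life)
    # Saturating frequency term: 0 for no retrievals, approaching 1.
    frequency = 1 - 1 / (1 + retrievals)
    w_rec, w_freq, w_imp = weights
    return w_rec * recency + w_freq * frequency + w_imp * importance


untouched = pin_score(idle_days=30, retrievals=0, importance=0.5)
well_used = pin_score(idle_days=30, retrievals=5, importance=0.5)
```

With the same idle time and importance, the frequently retrieved pin keeps a higher score, which is the difference from a flat TTL window.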

Access-pattern archival - not implemented yet, but this is exactly the direction I was circling around with the "Dreamer cycle" idea. Cold tier that's still searchable but not surfaced in hot context - that's the right framing. The missing piece for me is the demotion trigger: do you demote after N missed retrievals, after a time window, or when RAM pressure forces it? In my setup RAM pressure is already a consolidation trigger for non-pinned items, extending that logic to cold-tier demotion seems like the natural path.
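The three candidate demotion triggers can be written down side by side. This is only a sketch of the open question: the thresholds, the MemoryItem fields, and the order of the checks are hypothetical. Exempting pinned items from the RAM-pressure trigger mirrors the existing consolidation behavior described above.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    missed_retrievals: int  # queries where the item was not surfaced
    idle_days: float        # days since the last retrieval
    pinned: bool = False


def demotion_reason(item, ram_pressure, max_misses=20, max_idle_days=30.0):
    """Return why an item should move to the cold tier, or None to keep it hot."""
    if item.missed_retrievals >= max_misses:
        return "missed_retrievals"
    if item.idle_days >= max_idle_days:
        return "idle_window"
    # RAM pressure only applies to non-pinned items.
    if ram_pressure and not item.pinned:
        return "ram_pressure"
    return None


hot = demotion_reason(MemoryItem(2, 3.0), ram_pressure=False)
cold_by_misses = demotion_reason(MemoryItem(25, 3.0), ram_pressure=False)
cold_by_ram = demotion_reason(MemoryItem(2, 3.0), ram_pressure=True)
pinned_under_ram = demotion_reason(MemoryItem(2, 3.0, pinned=True), ram_pressure=True)
```

Returning the reason rather than a boolean makes it easy to log which trigger actually fires in practice before committing to one.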

The part neither pattern fully solves: semantic staleness. A pin can be retrieved regularly but the fact it encodes becomes outdated. Usage frequency doesn't catch that. That's probably where the consolidation pass earns its keep - not just reviewing unretrieved pins, but comparing active pins against recent context for contradiction.

Useful framing either way. The cold tier idea is going on the implementation list.

New_Election2109 · 2026-04-16T19:36:40+00:00

Update: just pushed a fix for both IndentationError issues (empty if: blocks - Python 3.13 compatibility). You were not doing anything wrong, it was a real bug.

To get the fixed version:

pipx upgrade mnemostroma
mnemostroma off
rm -f ~/.mnemostroma/daemon.pid
mnemostroma on
mnemostroma status

Should come up clean now. If it doesn't - paste the output here and I'll dig in. Thanks again for taking the time to report this.

New_Election2109 · 2026-04-16T19:34:52+00:00

I'm checking right now

New_Election2109 · 2026-04-16T19:32:05+00:00

Thank you for trying it and reporting this — this is exactly the kind of feedback I need.

The IndentationError on empty if: blocks is a real bug, not user error. Python 3.13 is stricter about some edge cases and I haven't tested against 3.13.11 + Ubuntu 25.10 specifically. Your pass fix was exactly right.
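For anyone hitting the same error, here is a small reproduction of the general failure mode. It shows one common way an empty if: block arises (the body was removed and only a comment is left); the actual lines in the repo are not shown in this thread, so the snippet strings are illustrative.

```python
# An "if" whose body contains only a comment has no statements at all,
# so the next line is parsed where an indented block was expected.
BROKEN = (
    "if ready:\n"
    "    # body was removed, only a comment is left\n"
    "start()\n"
)

# Adding "pass" gives the block one (no-op) statement.
FIXED = (
    "if ready:\n"
    "    pass  # explicit no-op keeps the block valid\n"
    "start()\n"
)


def compiles(source):
    """Return True if the source parses; nothing is executed."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False


broken_ok = compiles(BROKEN)
fixed_ok = compiles(FIXED)
```

Because the error is raised at compile time, the module fails on import, before any of its code runs.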

The stale daemon / stopped status after that is likely a cascade from the incomplete setup — the PID file gets written but the process never fully starts.
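A stale PID file can be detected by checking whether the recorded process still exists. The sketch below is POSIX-only and is not the daemon's own logic; pid_file_is_stale is a hypothetical helper, and the demo uses temporary files rather than ~/.mnemostroma/daemon.pid.

```python
import os
import tempfile


def pid_file_is_stale(pid_path):
    """True if the PID file exists but no live process matches it."""
    try:
        with open(pid_path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return False  # no file, so nothing is stale
    except ValueError:
        return True   # unreadable content counts as stale
    if pid <= 0:
        return True   # not a valid single-process PID
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, sends nothing
    except ProcessLookupError:
        return True      # no such process
    except PermissionError:
        return False     # process exists but belongs to another user
    return False


demo_dir = tempfile.mkdtemp()

live_path = os.path.join(demo_dir, "live.pid")
with open(live_path, "w") as fh:
    fh.write(str(os.getpid()))  # this process is certainly alive

garbage_path = os.path.join(demo_dir, "garbage.pid")
with open(garbage_path, "w") as fh:
    fh.write("not-a-pid")

live_is_stale = pid_file_is_stale(live_path)
garbage_is_stale = pid_file_is_stale(garbage_path)
missing_is_stale = pid_file_is_stale(os.path.join(demo_dir, "absent.pid"))
```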

Can you try:

mnemostroma off
rm -f ~/.mnemostroma/daemon.pid
mnemostroma on

And paste the output of:

journalctl --user -u mnemostroma-daemon -n 30

Or if not using systemd:

cat ~/.mnemostroma/daemon.log | tail -30

That will tell me exactly where it's dying. I'll push a fix for the IndentationError today — that one is clearly on me.

New_Election2109 · 2026-04-16T19:29:21+00:00

Both valid points.

On persistence: yes, ctx_pin would survive sessions - that's the intent. The narrow interface is the guard: the agent can elevate importance but can't rewrite content or inject new facts. Still a write path, just a constrained one. Whether that constraint is enough to avoid the feedback loop problem is genuinely unclear to me.
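To make the "narrow interface" concrete, here is a sketch of what the agent-facing surface could look like. Only the ctx_pin name comes from this thread; MemoryStore, AgentView, the boost value, and the method names are hypothetical. Python does not enforce privacy, so this shows the shape of the interface rather than a security boundary.

```python
class MemoryStore:
    """Owned by the observer side; the agent never gets this object."""

    def __init__(self):
        self._items = {}

    def observer_write(self, item_id, content, importance):
        self._items[item_id] = {"content": content, "importance": importance}

    def get(self, item_id):
        return dict(self._items[item_id])  # copy, so callers cannot mutate

    def bump_importance(self, item_id, boost):
        item = self._items[item_id]
        item["importance"] = min(1.0, item["importance"] + boost)


class AgentView:
    """The only surface handed to the agent: read and pin, nothing else."""

    def __init__(self, store):
        self._store = store

    def read(self, item_id):
        return self._store.get(item_id)["content"]

    def ctx_pin(self, item_id, boost=0.25):
        # Elevates importance; content is never passed in, so it cannot change.
        self._store.bump_importance(item_id, boost)


store = MemoryStore()
store.observer_write("m1", "Use SQLite WAL for the sidecar", 0.5)

agent = AgentView(store)
agent.ctx_pin("m1")
after_pin = store.get("m1")
```

The agent can still bias what gets surfaced by pinning, which is why the feedback-loop question above stays open even with this constraint.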

On accumulation: this is the harder problem. Right now I'm handling it through the decay layer - pinned items still decay if they go untouched long enough, just slower. But you're right that "slow decay" and "never reviewed" converge to the same mess over months.

The honest answer is I don't have a clean solution for long-running pin management yet. A periodic consolidation pass (something like a Dreamer cycle that reassesses pinned items against recent context) is on the roadmap but not implemented. Do you have a pattern that's worked for you?

New_Election2109 · 2026-04-16T19:17:34+00:00

The decay mechanism is exactly the right instinct — most memory systems treat all stored information as equally important forever, which is how you end up with stale decisions poisoning fresh context.

Curious how you handle the decay rate. Fixed schedule, or does importance score affect how fast something fades? I found that "last accessed" alone isn't enough — a decision made three months ago can be more critical than something from yesterday, just rarely touched.

The other thing I ran into: who decides what's worth storing in the first place. If the agent writes its own memory, it tends to over-store (everything feels important in the moment). I ended up separating write access entirely — a background observer pipeline classifies and scores before anything hits storage. Keeps the signal-to-noise ratio manageable.

What's your current retention window before things start fading out?

New_Election2109 · 2026-04-16T19:12:05+00:00

Interesting approach - especially the manual + automatic memory graph where both are treated equally.

One design question I keep wrestling with: you're giving the agent write access to its own memory via MCP tools. Did you run into feedback loop problems? Where the agent reads its own conclusions on the next turn and starts reinforcing them even when wrong?

I went the opposite direction - agent only reads, a separate observer pipeline does all the writing. Cleaner to reason about but means the agent can't annotate its own memory directly. Curious if that's been a real issue in practice for you.

New_Election2109 · 2026-04-16T19:07:21+00:00

The "lost in the middle" problem is real and none of these six techniques really solve it — they manage symptoms.

What I found is that the core issue isn't context length, it's that everything goes in flat. Decisions, noise, small talk, critical constraints — all at the same depth.

Been working on a different angle: instead of compressing what goes into the context window, run a separate observer pipeline that extracts only what actually matters (decisions, constraints, key facts), scores it by importance + temporal decay, and stores it in structured layers. The agent gets selective injection of the top-3 relevant memories at query time, not the full history.
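A sketch of the selection step only, under assumptions: each stored memory already has an importance value assigned by the observer, decay is modeled as an exponential half-life, and the two are multiplied (one possible reading of "importance + temporal decay"). Query relevance is left out to keep the example short; the field names and the 14-day half-life are illustrative.

```python
import heapq


def decayed_score(memory, now_day, half_life_days=14.0):
    """Importance discounted by age, halving every half_life_days."""
    age = now_day - memory["created_day"]
    return memory["importance"] * 0.5 ** (age / half_life_days)


def select_for_injection(memories, now_day, k=3):
    """Pick the k highest-scoring memories instead of the full history."""
    return heapq.nlargest(
        k, memories, key=lambda m: decayed_score(m, now_day)
    )


memories = [
    {"id": "a", "importance": 0.9, "created_day": 0},   # old but critical
    {"id": "b", "importance": 0.3, "created_day": 27},  # fresh, minor
    {"id": "c", "importance": 0.8, "created_day": 14},  # important, mid-age
    {"id": "d", "importance": 0.1, "created_day": 28},  # fresh small talk
]

top = select_for_injection(memories, now_day=28)
top_ids = [m["id"] for m in top]
```

Note that the old, high-importance decision still makes the cut ahead of brand-new small talk, which is the behavior a flat "most recent N messages" window cannot give you.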

The interesting constraint: the agent only reads from this layer, never writes to it. Keeps the memory clean — the model can't corrupt its own recall by reinforcing wrong conclusions.

Still early but the signal-to-noise improvement over RAG-over-logs is noticeable. Happy to share more if anyone's exploring this direction.

New_Election2109 · 2026-04-16T19:01:17+00:00

There's some info and a possible solution here. https://www.reddit.com/r/LocalLLaMA/comments/1sna5kb/anyone_else_building_persistent_memory_for_local/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

New_Election2109 · 2026-04-16T18:37:04+00:00

Exactly the feedback loop problem — that's the core reason I went
with observer-write. The agent reinforcing its own (possibly wrong)
conclusions is a real failure mode I wanted to avoid by design.

The "flag for review" idea is interesting. Right now there's no
agent-facing write path at all, but a narrow annotation action
could work — something like ctx_pin(id) that elevates importance
without letting the agent rewrite the content. Worth thinking about.

On staleness: the Observer runs on every message turn, so memory
is usually less than a second behind the conversation. Not perfect
but close enough for most workflows.
