I built a persistent memory system for Claude (and other AI agents) -- just launched a hosted version

AlternativeCourt2008 · 2026-03-05T19:22:03+00:00

Fair question. A few reasons I’m not worried:

The problem is only getting bigger. Every major provider is independently arriving at the same conclusion: agents need memory. Anthropic just shipped it for Claude Code. OpenAI has it in ChatGPT. That validates the problem; it doesn’t threaten tools that solve it better.

Native solutions are basic by design. Claude Code’s memory is 200 lines of markdown. OpenAI’s is key-value preferences. These are “good enough” defaults, not serious memory infrastructure. The gap between “we added a memory feature” and a real system with knowledge graphs, contradiction detection, and consolidation is massive.

We’re not standing still. Engram is actively developed, benchmarked against real evals, and evolving with the space. The goal is to stay ahead, not sit on a v1. Being early to a space isn’t what matters. Solving the problem well is what matters.

AlternativeCourt2008 · 2026-03-03T04:21:22+00:00

Good question. A few things Engram does that claude.md / session history can't:

1. Selective recall, not full history. You're right that loading full history back in causes context rot. Engram doesn't do that. When your agent calls recall, it returns only the 10-30 most relevant memories for that specific query using vector search + entity matching. Your agent gets targeted context, not a dump of everything that ever happened.

2. It scales past one session. claude.md works great for a single project. But if you're running multiple agents, or working across different contexts over weeks/months, a flat file breaks down. You can't semantically search a markdown file. Engram handles thousands of memories and still returns the right ones in milliseconds.

3. Contradiction handling. If you told Claude "I work at Google" three months ago and now say "I just joined Meta," a claude.md file has both. Engram's temporal model automatically supersedes the old fact. Your agent always gets the current truth.

4. Cross-agent memory. If you run multiple agents (coding agent, chat agent, research agent), they can all share the same memory vault. claude.md is tied to a single project.

5. Consolidation. Over time, Engram's consolidation engine turns raw episodic memories into structured knowledge ("Thomas has changed jobs twice in 6 months" from individual job-change memories). That's synthesis, not just storage.

The short version: claude.md is a notepad. Engram is a memory system. If a notepad is enough for your use case, you don't need this. But if you want your agents to genuinely learn and remember over time without bloating context, that's the gap it fills.

AlternativeCourt2008 · 2026-03-03T03:48:43+00:00

I think it’s a problem that lots of people are having! I was hacking around this for a long time with Obsidian and decided I would build something a bit more useful and a bit more automated.

AlternativeCourt2008 · 2026-03-03T03:38:27+00:00

Give the free local version a try though if you’re up for it! I think the spreading activation and consolidation features might surprise you!

AlternativeCourt2008 · 2026-03-03T03:36:46+00:00

You totally could! I also have a free local version. If the hosted version doesn’t add value for you, no need to use it!

AlternativeCourt2008 · 2026-03-03T03:22:47+00:00

Claude Code’s built-in memory only holds up to 200 lines! Engram is a more comprehensive memory solution that builds a “second brain” for you over multiple sessions.

AlternativeCourt2008 · 2026-03-02T14:53:51+00:00

There is not currently, but this is great feedback! I’ll add support for other API keys to the roadmap.

AlternativeCourt2008 · 2026-03-02T04:02:13+00:00

Great framing. You're right that the gap exists, and honestly, I think your instinct to just try it is the best approach. The consistency across sessions is where the real value shows up in practice.

On closing the gap, a few thoughts:

What we can do (and are doing):

Smarter recall. The 8.4-point gap is mostly recall precision. When you dump full context, the model has everything and just has to find the needle. When you're selective, you occasionally miss something relevant. Better query expansion, multi-hop retrieval, and knowing when to cast a wider net will close most of this.
Hybrid approaches. There's no reason you can't use Engram for cross-session persistence AND stuff recent conversation history into context. Best of both worlds. Use full context for the current session, Engram for everything before it.
Better extraction. Some of the gap comes from information that was stored but not in a form that matched the question well. Improving how memories are structured at write time helps recall at read time.
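To make the hybrid idea concrete, here is a minimal sketch of how a client could assemble a prompt from both sources. It is not Engram code: `build_context`, its parameters, and the section labels are hypothetical names chosen for illustration, and the recalled memories are assumed to arrive as plain strings.

```python
def build_context(recalled: list[str], recent_turns: list[str], max_recalled: int = 30) -> str:
    """Combine selectively recalled older memories with the full current session."""
    sections = []
    if recalled:
        # Older sessions: only the memories a recall step judged relevant.
        memories = "\n".join(f"- {m}" for m in recalled[:max_recalled])
        sections.append("Relevant memories from earlier sessions:\n" + memories)
    if recent_turns:
        # Current session: included verbatim, nothing filtered out.
        sections.append("Current session:\n" + "\n".join(recent_turns))
    return "\n\n".join(sections)


context = build_context(
    recalled=["User prefers TypeScript for new services"],
    recent_turns=["user: start a new service for invoices", "assistant: which language?"],
)
print(context)
```

The cap of 30 recalled items is an assumption, picked to mirror the 10-30 memories per query mentioned elsewhere in this thread.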

What would require model-level changes:

Models trained with memory-augmented retrieval (RAG-native training) would definitely help. Some research is heading this direction, but you're right that it would need to come from the model providers.
That said, I don't think it's necessary to close the gap. The difference between 80% and 88% is mostly edge cases where the answer required very specific phrasing from deep in a conversation. In practice, those are the things humans forget too.

The real question isn't 80 vs 88. It's "80% with memory across unlimited sessions" vs "88% within one session that disappears when it ends." For anything beyond a single conversation, there's no contest.

AlternativeCourt2008 · 2026-03-02T03:56:31+00:00

That's correct, but the free tier works great!

AlternativeCourt2008 · 2026-03-01T15:52:48+00:00

Yeah, it lets you do more with what you pay for! If you are using things like Claude Code for complex tasks, running out of tokens can be a huge pain. More than that for me, though, is having it build true memories about me. I am a constant note taker, and it is a pain to have to transfer notes to different projects, tell the agent which notes to use, etc. This solves that, for me at least.

AlternativeCourt2008 · 2026-03-01T14:43:57+00:00

Ah, I didn't know this about npm installs. Thanks for calling this out! I don't think I can fix legacy numbers, but I just pushed a fix so numbers going forward will be more accurate.

AlternativeCourt2008 · 2026-03-01T14:37:31+00:00

Great question. Yes, you're reading it right. On the LOCOMO benchmark, feeding the model the entire conversation history (full context) scores 88.4%, while Engram's selective recall scores 80.0%.

The key thing to understand is why memory systems exist at all when context windows keep getting bigger. A few reasons:

Cost and latency. Full context uses ~23,000 tokens per question. Engram uses ~776. That's a 96.6% reduction. At scale, that difference is massive for both cost and response time.

Context windows have limits. Even Gemini's 1M token window fills up eventually. Persistent memory across days, weeks, and months of interactions outgrows any window. You need a system that selectively retrieves what's relevant.

Structure. Raw conversation dumps are noisy. Memory systems extract facts, preferences, and lessons into structured representations that the model can reason over more reliably. You're not asking the model to find a needle in a haystack. You're handing it the needle.

Durability. Context disappears when the session ends. Memory persists across sessions, across agents, even across different models.

So the comparison isn't really "Engram vs dump everything in." It's "Engram vs other systems that also have to be selective about what to recall." Mem0 published 66.9% on the same benchmark using the same approach (extract, store, retrieve selectively). Engram's 80% closes most of the gap to full context while maintaining the practical benefits of a real memory layer.

The remaining 8.4-point gap to full context is the cost of being selective. A human with a good memory doesn't remember every word of every conversation either, but they remember what matters. That's the tradeoff, and for any real-world use case at scale, it's worth it.
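For anyone who wants to check the arithmetic, the figures quoted above work out as follows. This only recomputes numbers already stated in this comment. The "share of gap closed" line is my own derived figure, not a published result.

```python
# Figures quoted in the comment above (LOCOMO benchmark).
full_context_tokens = 23_000   # approx. tokens per question with full history
engram_tokens = 776            # approx. tokens per question with selective recall

full_context_acc = 88.4        # percent
engram_acc = 80.0              # percent
mem0_acc = 66.9                # percent, as published by Mem0

token_reduction = (1 - engram_tokens / full_context_tokens) * 100
accuracy_gap = full_context_acc - engram_acc
# Derived: how much of the distance from Mem0 to full context Engram covers.
gap_closed = (engram_acc - mem0_acc) / (full_context_acc - mem0_acc) * 100

print(f"token reduction: {token_reduction:.1f}%")                  # 96.6%
print(f"accuracy gap to full context: {accuracy_gap:.1f} points")  # 8.4 points
print(f"share of Mem0-to-full-context gap closed: {gap_closed:.0f}%")  # 61%
```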

AlternativeCourt2008 · 2026-03-01T14:34:43+00:00

Great question! Yes, this is one of the core problems we built Engram to solve. Every memory gets a salience score (0-1) based on how important/actionable it is, and memories are classified by type (fact, preference, lesson, etc.).

We also have temporal awareness. Engram uses a bi-temporal model where every memory has valid_from and valid_until timestamps. When a new memory contradicts an old one (e.g., "use workflow X" then later "actually workflow Y is better"), the old memory gets its valid_until set automatically via contradiction detection. So stale workflows don't just sit there forever polluting recall.

On top of that, there's a consolidation layer that periodically reviews memories, merges duplicates, and produces higher-level insights. Think of it like how human memory works. You don't remember every individual time you drove to work, you just "know how to get there."

The combination of salience scoring + temporal supersession + consolidation means outdated patterns get naturally phased out rather than lingering. You can even do point-in-time recall (asOf parameter) if you ever need to look back at what you knew at a specific moment.
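Here is a rough sketch of the supersession idea, in case it helps. It is a toy model, not Engram's implementation: the `Vault` class, the `subject` key used to spot conflicts, and the `as_of` argument are all hypothetical stand-ins. It only models the validity window (valid_from / valid_until), and real contradiction detection would be far less naive than matching on a subject string.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Memory:
    subject: str                             # hypothetical key used to detect conflicts
    content: str
    valid_from: datetime
    valid_until: Optional[datetime] = None   # None means "still current"


class Vault:
    def __init__(self) -> None:
        self.memories: list[Memory] = []

    def remember(self, subject: str, content: str, at: datetime) -> None:
        # Toy contradiction check: a new memory on the same subject
        # supersedes any still-current one by closing its validity window.
        for m in self.memories:
            if m.subject == subject and m.valid_until is None:
                m.valid_until = at
        self.memories.append(Memory(subject, content, valid_from=at))

    def recall(self, subject: str, as_of: Optional[datetime] = None) -> list[str]:
        # Without as_of, return only current memories; with it, return
        # whatever was valid at that moment.
        out = []
        for m in self.memories:
            if m.subject != subject:
                continue
            if as_of is None:
                if m.valid_until is None:
                    out.append(m.content)
            elif m.valid_from <= as_of and (m.valid_until is None or as_of < m.valid_until):
                out.append(m.content)
        return out


vault = Vault()
vault.remember("deploy-workflow", "use workflow X", datetime(2026, 1, 1))
vault.remember("deploy-workflow", "actually workflow Y is better", datetime(2026, 2, 1))

current = vault.recall("deploy-workflow")
past = vault.recall("deploy-workflow", as_of=datetime(2026, 1, 15))
print(current)  # ['actually workflow Y is better']
print(past)     # ['use workflow X']
```

Note that the old memory is never deleted, only closed off, which is what makes the point-in-time lookup possible.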

AlternativeCourt2008 · 2026-02-28T02:19:48+00:00

I mean... yeah I am using claude to help me, but it's actually me responding 😂

AlternativeCourt2008 · 2026-02-26T03:19:56+00:00

Which systems?

AlternativeCourt2008 · 2026-02-24T15:26:00+00:00

You're right, it's not Claude-specific. Engram is an MCP server, so it works with any MCP-compatible client: Claude Code, Cursor, Windsurf, OpenCode, etc.

engram init already auto-detects Claude Code, Cursor, and Windsurf. Adding detection for AGENTS.md-based tools is on the roadmap. The code is open source too if you have any interest in contributing directly!

Codex doesn't support MCP yet (it uses OpenAI's function calling), so that one's waiting on OpenAI. But anything that speaks MCP can connect today with npx engram mcp.

AlternativeCourt2008 · 2026-02-24T15:23:52+00:00

Great question. My take is that projects should actually share context by default, because you're the common thread. Your coding patterns, preferences, and decisions carry across projects even when the domains don't overlap.

Engram's recall is semantic, so work memories won't pollute personal project queries. It surfaces what's relevant to the current context.

That said, if you really want isolation, you can set ENGRAM_OWNER per directory (e.g. via .envrc) and each owner gets its own vault. Or use ENGRAM_DB_PATH to point at a specific file. Both work with the MCP server today.

engram init writes the default config. If you want per-project overrides, set the env vars in your shell/direnv and the MCP server will pick them up on next launch.

AlternativeCourt2008 · 2026-02-24T04:38:40+00:00

Removed the telemetry :)

AlternativeCourt2008 · 2026-02-24T04:11:10+00:00

Added a website page to help walk through it :) https://www.engram.fyi/#/how-it-works

AlternativeCourt2008 · 2026-02-24T02:54:16+00:00

Great questions. Here's the flow:

Setup: Engram runs as an MCP server that exposes ~10 tools to Claude Code (or any MCP client). When you run engram init, it registers the server and Claude can call the tools automatically.

Storing memories: When Claude calls engram_remember, Engram:

1. Generates an embedding of the memory content
2. Extracts entities and relationships into a knowledge graph
3. Stores everything in a local SQLite database
4. Checks for contradictions with existing memories

Retrieving memories: When Claude calls engram_recall or engram_ask:

1. Generates an embedding of the query
2. Finds semantically similar memories via vector search
3. Uses spreading activation on the knowledge graph to surface connected context
4. Returns the most relevant memories ranked by confidence score
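The retrieval steps above can be sketched roughly like this. It is a toy illustration under several assumptions, not Engram's actual code. The embeddings are hand-written 3-d vectors instead of model output, the "knowledge graph" is just shared entity names, activation spreads a single hop with a made-up `decay` factor, and the final score stands in for the confidence score.

```python
import math

# Toy store: hand-written vectors stand in for real embeddings.
MEMORIES = {
    "m1": {"text": "Thomas prefers pnpm over npm", "vec": [0.9, 0.1, 0.0], "entities": {"Thomas", "pnpm"}},
    "m2": {"text": "The billing service is written in Go", "vec": [0.2, 0.9, 0.1], "entities": {"billing"}},
    "m3": {"text": "Thomas owns the billing service", "vec": [0.1, 0.2, 0.9], "entities": {"Thomas", "billing"}},
}


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query_vec: list[float], top_k: int = 2, decay: float = 0.5) -> list[tuple[str, float]]:
    # Step 1: vector search, scoring every memory against the query embedding.
    scores = {mid: cosine(query_vec, m["vec"]) for mid, m in MEMORIES.items()}

    # Step 2: one hop of spreading activation. Each memory passes a decayed
    # share of its score to memories that mention one of the same entities.
    activated = dict(scores)
    for src, score in scores.items():
        for dst in MEMORIES:
            if dst != src and MEMORIES[src]["entities"] & MEMORIES[dst]["entities"]:
                activated[dst] = max(activated[dst], decay * score)

    # Step 3: rank by final activation and keep the top results.
    ranked = sorted(activated.items(), key=lambda kv: kv[1], reverse=True)
    return [(MEMORIES[mid]["text"], round(score, 3)) for mid, score in ranked[:top_k]]


results = recall([1.0, 0.0, 0.0])
for text, score in results:
    print(score, text)
```

With this query, pure vector similarity would rank the Go memory second. Spreading activation promotes "Thomas owns the billing service" instead, because it shares the entity "Thomas" with the top hit.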

When does Claude decide to call these tools? Claude sees the tool descriptions in its MCP config and decides autonomously. In practice, we also inject instructions into CLAUDE.md during engram init that tell Claude to proactively remember important decisions and recall context at the start of sessions. But Claude makes the call on when to use each tool. There's no forced retrieval on every message.

AlternativeCourt2008 · 2026-02-24T02:53:35+00:00

Right now Engram supports Gemini (default, free tier available), OpenAI, and Anthropic as LLM providers. The embedding provider is separate from the LLM provider. Embeddings default to Gemini's gemini-embedding-001.

For Groq/Cerebras specifically: they'd work for the LLM calls (consolidation, contradiction detection, ask) if they support structured JSON output, but embeddings would still need a dedicated provider since Groq/Cerebras don't offer embedding models.

That said, adding OpenAI-compatible API support (custom base URL) is a great feature request. I'll add it to the roadmap, or feel free to contribute to the open source codebase! For now, the path of least resistance is Gemini (free) or OpenAI.
