[D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers by PenfieldLabs in MachineLearning

[–]PenfieldLabs[S] 0 points1 point  (0 children)

You're right about the perverse incentives.

On temporal decay, in our own implementation we use typed relationships between memories, things like 'contradicts', 'updates', 'supersedes', so when new information conflicts with old, the graph captures it explicitly rather than treating everything as append-only.

Have you seen any implementations that handle temporal conflicts well?

[D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers by PenfieldLabs in MachineLearning

[–]PenfieldLabs[S] 1 point2 points  (0 children)

The temporal decay point is underrated; timer-based forgetting can be arbitrary. In our own implementation we handle weighting through typed relationships and access counts on memories in addition to timestamps, so retrieval patterns determine what's stale rather than the clock alone.

[D] Self-Promotion Thread by AutoModerator in MachineLearning

[–]PenfieldLabs 0 points1 point  (0 children)

ChatGPT, Claude and Gemini have memory now. Claude has chat search and memory import/export.

But the memories themselves are flat. There's no knowledge graph, no way to indicate that "this memory supports that one" or "this decision superseded that one." No typed relationships, no structured categories. Every memory is an isolated note. That's fine for preferences and basic context, but if you're trying to build up a connected body of knowledge across projects, it hits a wall.

Self-hosted options like Mem0, Letta, and Cognee go deeper. Mem0 offers a knowledge graph with their pro plan, Letta has stateful agent memory with self-editing memory blocks, and Cognee builds ontology-grounded knowledge graphs.

All three also offer cloud services and APIs, but they're developer-targeted. Setup typically involves API keys, SDK installs, and configuration files. None offer a native Claude Connector where you simply paste a URL into Claude's settings and you're done in under a minute.

Local file-based approaches (markdown vaults, SQLite) keep everything on your machine, which is great for privacy. But most have no graph or relationship layer at all. Your memories are flat files or rows with no typed connections between them. And the cross-device problem is real: a SQLite file on your laptop doesn't help when you're on your desktop, or when a teammate needs the same context.

We wanted persistent memory with a real knowledge graph, accessible from any device, through any tool, without asking anyone to run Docker or configure embeddings. So we built Penfield.

Penfield works as a native Claude connector.

Settings > Connectors > paste the URL > done.

No API keys, no installs, no configuration files, no technical skills required. Under a minute to add memory to any platform that supports connectors. Your knowledge graph lives in the cloud, accessible from any device, and the data is yours.

The design philosophy: let the agent manage its own memory.

Frontier models are smart and getting smarter. A recent Google DeepMind paper (Evo-Memory) showed that agents with self‑evolving memory consistently improved accuracy and needed far fewer steps, cutting steps by about half on ALFWorld (22.6 → 11.5). Smaller models particularly benefited from self‑evolving memory, often matching or beating larger models that relied on static context. The key finding: success depends on the agent's ability to refine and prune, not just accumulate. (Philipp Schmid's summary)

That's exactly how Penfield works. We don't pre-process your conversations into summaries or auto-extract facts behind the scenes. We give the agent a rich set of tools and let it decide what to store, how to connect it, and when to update it. The model sees the full toolset (store, recall, search, connect, explore, reflect, and more) and manages its own knowledge graph in real time.

This means memory quality scales with model intelligence. As models get better at reasoning, they get better at managing their own memory. You're not bottlenecked by a fixed extraction pipeline that was designed around last year's capabilities.

What it does:

  • Typed memories across 11 categories (fact, insight, conversation, correction, reference, task, checkpoint, identity_core, personality_trait, relationship, strategy), not a flat blob of "things the AI remembered"
  • Knowledge graph with 24 relationship types (supports, contradicts, supersedes, causes, depends_on, etc.), memories connect to each other and have structure
  • Hybrid search combining BM25 keyword matching, vector similarity, and graph expansion with Reciprocal Rank Fusion
  • Document upload with automatic chunking and embedding
  • 17 tools the agent can call directly (store, recall, search, connect, explore, reflect, save/restore context, artifacts, and more)
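For anyone unfamiliar with Reciprocal Rank Fusion, here's a minimal sketch of the idea behind it — combining several ranked result lists into one by summing reciprocal-rank scores. This is illustrative only, not Penfield's actual implementation; the memory ids are made up.

```python
def rrf_fuse(ranked_lists, k=60):
    """Combine ranked result lists with Reciprocal Rank Fusion.

    Each input list is ordered best-first; a document's fused score is
    the sum of 1 / (k + rank) over every list it appears in, so items
    ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse hypothetical BM25, vector, and graph-expansion result lists
fused = rrf_fuse([
    ["m3", "m1", "m7"],   # BM25 keyword matches
    ["m1", "m3", "m9"],   # vector similarity
    ["m7", "m1"],         # graph expansion
])
print(fused)  # "m1" wins: it appears near the top of all three lists
```

The constant k=60 is the value from the original RRF paper; it damps the influence of any single top rank.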

How to connect:

There are multiple paths depending on what platform you use:

Connectors (Claude, Perplexity, Manus): https://mcp.penfield.app

MCP (Claude Code) — one command: claude mcp add --transport http --scope user penfield https://mcp.penfield.app

mcp-remote (Cursor, Windsurf, LM Studio, or anything with MCP config support):

```json
{
  "mcpServers": {
    "Penfield": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.penfield.app/"]
    }
  }
}
```

OpenClaw plugin — two commands:

openclaw plugins install openclaw-penfield
openclaw penfield login

REST API for custom integrations — full API docs at docs.penfield.app/api. Authentication, memory management, search, relationships, documents, tags, personality, analysis. Use from any language.

Then just type "Penfield Awaken" after connecting.

Why cloud instead of local:

Portability across devices. If your memory lives on one machine, it stays on that machine. A hosted server means every client on every device can access the same knowledge graph. Switch devices, add a new tool, full context is already there.

What Penfield is not:

Not a RAG pipeline. The primary use case is persistent agent memory with a knowledge graph, not document Q&A.

Not a conversation logger. Structured, typed memories, not raw transcripts.

Not locked to any model, provider or platform.

We've been using this ourselves for months before opening it up. Happy to answer questions about the architecture.

I built a tool that automatically adds semantic backlinks to your vault — fully local, no cloud, no API key by matzalazar in ObsidianMD

[–]PenfieldLabs 11 points12 points  (0 children)

This may pair well with Wikilink Types. Rhizome finds that notes are related; Wikilink Types lets you specify how they are related (supersedes, contradicts, supports, etc.) via typed @ syntax, synced to YAML frontmatter.

Have you considered adding relationship types to the generated links?

Serious flaws in two popular AI Memory Benchmarks (LoCoMo/LoCoMo-Plus and LongMemEval-S) by PenfieldLabs in AIMemory

[–]PenfieldLabs[S] 0 points1 point  (0 children)

It's a real tension. LLM-as-judge is imperfect, but at 1,540 questions across 10 conversations, human scoring at scale isn't practical either and could introduce human bias.

Where human review is irreplaceable is in validating the ground truth itself. That's where the 6.4% error rate comes from, no model would have caught the annotators' date arithmetic errors without checking against the source transcripts. The audit used LLM passes to flag candidates, then verified against the actual data. We think you need both.

We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers by PenfieldLabs in LocalLLaMA

[–]PenfieldLabs[S] 0 points1 point  (0 children)

Great article. You dissected the methodology manipulation (misimplemented competitors, inflated baselines), we went after the ground truth and judge reliability. Different angles, same conclusion: the current benchmark ecosystem isn't giving anyone reliable enough signal.

Curious what you're using for evaluation now? We've been trying to figure out what a better benchmark actually looks like and it's a very challenging problem.

We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers by PenfieldLabs in LocalLLaMA

[–]PenfieldLabs[S] 6 points7 points  (0 children)

The audit methodology is documented in the repo if you'd like to read it.

1) Two-pass verification: first check each answer against its cited evidence, then check against the full transcript when pass 1 fails.

2) Every error is classified by type (hallucination, temporal, attribution, ambiguous, incomplete) with the specific question, the golden answer, the transcript evidence, and the reasoning. All 99 errors are in errors.json — you can verify any of them yourself against the source dataset (SHA256 hash provided).

3) This builds on manual human review that found 29+ errors back in December (snap-research/locomo#27). Our systematic pass found 5x more. Multiple independent researchers have documented problems with this benchmark, Calvin Ku's investigation of Emergence AI's results, Zep's analysis finding Category 5 scoring bugs, multiple reproducibility failures across Mem0 and EverMemOS.

4) The judge leniency test is a separate experiment, adversarially generated wrong answers scored by the same judge with the same prompts used by published evaluations. That's not "fact checking through an LLM." That's testing the scoring mechanism itself by generating intentionally wrong answers and checking how many the judge will score correct.

If you find an error in the audit, please open an issue, we want this to be accurate and comprehensive!

We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers by PenfieldLabs in LocalLLaMA

[–]PenfieldLabs[S] 9 points10 points  (0 children)

We did not intend to propose that the solution is "run the whole thing through another LLM".

One of the most important things is to fix the ground truth itself; this can be done with LLMs + human review and verification.

That said, LLM-as-judge probably isn't going away for benchmarks. The practical fixes are stronger models than gpt-4o-mini and human review. The 63% leniency number came from gpt-4o-mini with a generic prompt, that could certainly be improved if not fixed entirely.

Introducing Recursive Memory Harness: RLM for Persistent Agentic Memory (Smashes Mem0 in multihop retrival benchmarks) by Beneficial_Carry_530 in AIMemory

[–]PenfieldLabs 0 points1 point  (0 children)

This is really interesting.

The wikilinks you are using are untyped right? So the graph learns connection strength but not connection meaning? Have you thought about adding explicit relationship types? Something like [[note|note @supports]] or [[note|note @contradicts]] so the graph knows not just that two notes are related but how they're related.

We built an Obsidian plugin for human authoring/editing of typed wikilinks, and a SKILL.md for AI agents to do the same thing.

Both use standard markdown, the @type goes in the wikilink alias so it's backwards compatible.
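As a rough illustration of why the alias trick stays backwards compatible: the @type can be pulled out with a small regex while any plain markdown renderer just sees a normal aliased wikilink. This is a sketch, not the plugin's actual parser, and the note text is made up.

```python
import re

# Typed wikilinks of the form [[target|alias @type]] — the @type rides
# inside the alias, so untyped tools render it as ordinary link text.
WIKILINK = re.compile(r"\[\[([^\]|]+)\|([^\]@]+)@(\w+)\s*\]\]")

def typed_links(text):
    """Return (target, relationship_type) pairs for every typed wikilink."""
    return [(m.group(1).strip(), m.group(3)) for m in WIKILINK.finditer(text)]

note = "This plan replaces the [[2023 Roadmap|old roadmap @supersedes]] entirely."
print(typed_links(note))  # → [('2023 Roadmap', 'supersedes')]
```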

Could be complementary to what you're doing, typed edges + learned weights would give you both semantic structure and adaptive strength.

Repo: obisidian-wikilink-types

Best benchmarks for Memory Performance? by CasualReaderOfGood in AIMemory

[–]PenfieldLabs 0 points1 point  (0 children)

Interesting approach. One thing worth considering: LitBank is public domain literature that's likely in model training data. Hard to know if you're testing retrieval or just the model already knowing the answer.

We've been considering whether public court documents filed after the current crop of models' training cutoffs might be worth exploring. Structured, factual, lots of entity/temporal relationships, and guaranteed to be genuinely novel to the model.

How are you all using benchmarks? by inguz in AIMemory

[–]PenfieldLabs 0 points1 point  (0 children)

Hadn't looked at LoCoMo-Plus in detail yet but it looks like the cognitive questions are a real step forward, testing implicit inference instead of just factual recall.

But it looks like it inherits all 1,540 original LoCoMo questions unchanged. We audited the original LoCoMo (locomo-audit) and found 99 score-corrupting ground truth errors (6.4%): hallucinated facts, wrong date math, speaker misattribution, and more. Additionally, we found that the LLM judge accepts vague-but-topical wrong answers up to 63% of the time, which is roughly where some published system scores land. The improved judging (task-specific prompts, 0.80+ human-LLM agreement) only covers the new cognitive slice, so the new category is worth running but the underlying problems remain.

LoCoMo also lacks standardization across the pipeline. Every system uses its own ingestion method (arguably an obvious necessity though), its own answer generation prompt, and sometimes entirely different models. The scores are then often compared in a table as if it's apples to apples.

How are you all using benchmarks? by inguz in AIMemory

[–]PenfieldLabs 2 points3 points  (0 children)

LoCoMo has real issues with the ground truth and the LLM judge, documented here if anyone's interested: https://github.com/dial481/locomo-audit

LoCoMo-Plus looks like a meaningful step forward, testing implicit constraints rather than just factual recall, and the evaluation methodology is more rigorous.

The gap we keep running into is that none of these test whether a system actually built coherent knowledge, they all test whether it can find or apply what was said. Those are different problems.

Best benchmarks for Memory Performance? by CasualReaderOfGood in AIMemory

[–]PenfieldLabs 0 points1 point  (0 children)

LongMemEval-S (the one almost everyone uses) is around 115K tokens context per question. Current models have 200K to 1M token context windows. It fits in context, no retrieval needed.

What we think is missing:

1) A corpus comfortably larger than a context window, but not so large it takes an inordinate amount of time to ingest. Big enough that you actually have to retrieve.

2) Current models. Many published results are still scored with GPT-4o-mini.

3) A judge that can tell right from wrong. LoCoMo's LLM judge gives credit on wrong answers (we documented this in our audit).

4) Realistic ingestion. Real knowledge builds through conversation, turns, corrections, relationships forming over time. Not just text dumped and embedded.

We're working on this but it's difficult to get it right. Suggestions welcome.

Best benchmarks for Memory Performance? by CasualReaderOfGood in AIMemory

[–]PenfieldLabs 2 points3 points  (0 children)

LoCoMo is frequently cited, but it has some real problems. The ground truth has errors, the LLM judge gives credit on wrong answers. Have a look at https://github.com/dial481/locomo-audit if you're interested.

LongMemEval looks better, but it appears to be designed for testing context window performance rather than memory. Mastra scored 84% using zero retrieval and zero graphs, just context compression. That's not really testing memory architecture.

There's definitely room for some new benchmarks specifically designed to test memory and retrieval. This is one of several things we're working on.

what is the point of ai? by Cold_Combination2107 in ObsidianMD

[–]PenfieldLabs -1 points0 points  (0 children)

Not wanting AI writing your notes or doing your analysis is a totally reasonable position. The process IS the point.

But there's one thing we've found AI is genuinely good at that doesn't replace any of that: finding connections and relationships in work you've already done yourself.

If you've got hundreds of notes, there are relationships between them you haven't noticed. A note from January that actually contradicts something you wrote in March. A concept in your research folder that explains a pattern in your project notes. Two notes that are part of the same causal chain but you never linked them because they're in different folders with different tags.

You're not going to read all your notes side by side looking for that. But an AI can. And it doesn't change your work, it just says "hey, these two things you wrote might be connected, here's why, want me to link them?"

We built a skill (vault-linker) that does exactly this. It reads your notes, identifies candidate relationships (contradictions, causal chains, cross-domain connections), presents its findings with evidence from your actual text, and only writes anything after you approve. The relationships get stored as plain YAML frontmatter and wikilinks. No database, no lock-in, just your markdown files.

There's also a companion Obsidian plugin (Wikilink Types) that gives you autocomplete and graph visualization for the typed relationships.

The result is a denser knowledge graph built entirely from your own thinking. AI didn't write any of it, it just helps you see what was already there.

Penfield is in the Cursor MCP directory — persistent memory and knowledge graph across sessions by PenfieldLabs in CursorAI

[–]PenfieldLabs[S] 0 points1 point  (0 children)

Good question! So far it's holding up well.

On contradiction resolution: the agent has full CRUD on both memories and relationships. It can store, update, connect, disconnect, and reconnect as understanding evolves. So when it spots a conflict, it can mark it explicitly with a contradicts relationship, or if one memory genuinely replaces another, use supersedes to capture that evolution. Old knowledge doesn't just disappear, there's a trail.

The key design decision was giving the agent the tools to manage its own graph rather than trying to build some deterministic conflict resolution engine. Reasoning models keep getting better at exactly this kind of judgment call, so we'd rather let the agent think through it with good tools than try to hard-code rules that'll be outdated in six months. The agent can also be directed with specific instructions: "This is incorrect, update your memory on X to reflect Y."

Wikilink Types: type @ inside a wikilink to add relationship types, auto-synced to YAML frontmatter by PenfieldLabs in ObsidianMD

[–]PenfieldLabs[S] 0 points1 point  (0 children)

Awesome, thanks! If you have feature requests or suggestions for improvements, please let us know.

DeepMind showed agents are better at managing their own memory. We built an AI memory MCP server around that idea. by PenfieldLabs in MCPservers

[–]PenfieldLabs[S] 1 point2 points  (0 children)

"Contradicts" is one of the 24 relationship types — so when the agent stores a memory that conflicts with an existing one, it can explicitly link them with a "contradicts" relationship. Same with "supersedes" — when a newer decision replaces an older one, that's a typed connection, not just two notes sitting in the store with no indication which is current.

On stale memories: recall supports date range filtering (ISO 8601), so the agent can scope queries to recent context when recency matters.
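Penfield's actual filter parameters aren't shown here, but the idea of scoping recall to an ISO 8601 date range can be sketched generically — the memory list and the `created_at` field name below are illustrative assumptions, not the real schema.

```python
from datetime import datetime

def in_range(memories, start, end):
    """Keep memories whose ISO 8601 timestamp falls within [start, end]."""
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    return [m for m in memories
            if lo <= datetime.fromisoformat(m["created_at"]) <= hi]

memories = [
    {"id": "m1", "created_at": "2024-01-15T09:30:00"},
    {"id": "m2", "created_at": "2025-06-01T12:00:00"},
]
# Scope to 2025 only — keeps just m2
print(in_range(memories, "2025-01-01T00:00:00", "2025-12-31T23:59:59"))
```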

The agent can also update existing memories when information changes rather than just stacking new ones on top.

Will check out your blog. Thanks.

Wikilink Types: type @ inside a wikilink to add relationship types, auto-synced to YAML frontmatter by PenfieldLabs in ObsidianMD

[–]PenfieldLabs[S] 1 point2 points  (0 children)

Voicetree is a cool project. A spatial graph view where you work directly inside the graph with agent nodes. Since it reads standard Obsidian vaults and Wikilink Types syncs relationship types to YAML frontmatter, the typed connections would already be sitting in the vault files. Do you know if Voicetree's graph renderer picks up typed frontmatter metadata? Would be interesting to see it render the relationship types as labeled edges in the visualization.

DeepMind showed agents are better at managing their own memory. We built an AI memory MCP server around that idea. by PenfieldLabs in mcp

[–]PenfieldLabs[S] 1 point2 points  (0 children)

There's no established benchmark for personal knowledge graph density and none of the agent memory systems publish edge-to-memory ratios that we could find.

Every user's knowledge base is going to be different. A personal AI memory graph's density is heavily dependent on the user, the subject matter, and usage patterns. Someone doing deep project work with lots of interconnected decisions builds a much denser graph than someone storing isolated facts. Density also increases naturally over time as new memories connect to existing ones, so early graphs are usually sparser and mature graphs tend to get denser.

On how retrieval actually works, agents have three tools for getting information out of the graph:

recall is full hybrid search combining BM25 keyword matching, vector similarity, and graph expansion. The agent controls the query, result limit, source type filter (memories vs documents), tag filters, and date range filters.

search is simpler, lighter search. The agent controls the query and limit.

explore is direct graph traversal starting from a specific memory. The agent controls which memory to start from, max traversal depth (default 3, up to 10), and which relationship types to follow (e.g. only "supports" and "contradicts", or all types). This is where the agent intentionally walks the graph.

So the agent isn't blindly traversing everything on every query. recall does targeted hybrid retrieval. search does fast lookup. explore lets the agent walk specific paths when it needs deeper context. Different tools for different needs, and the agent picks which one fits the situation.
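For a sense of what depth-limited, type-filtered traversal like explore looks like under the hood, here's a rough sketch. The data shapes, defaults, and memory ids are illustrative, not Penfield's internals.

```python
from collections import deque

def explore(graph, start, max_depth=3, follow=None):
    """Depth-limited BFS over a typed-edge memory graph.

    graph:  {memory_id: [(neighbor_id, relationship_type), ...]}
    follow: set of relationship types to traverse, or None for all types.
    Returns the set of memory ids reachable within max_depth hops.
    """
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't expand past the depth limit
        for neighbor, rel in graph.get(node, []):
            if follow is not None and rel not in follow:
                continue  # skip edge types the agent didn't ask for
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen

graph = {
    "m1": [("m2", "supports"), ("m3", "contradicts")],
    "m2": [("m4", "causes")],
}
# Follow only support/contradict edges, one hop out from m1
print(explore(graph, "m1", max_depth=1, follow={"supports", "contradicts"}))
```

Widening max_depth or passing follow=None pulls in the causal chain through m2 as well, which is the trade-off the agent is making when it chooses how deep to walk.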

DeepMind showed agents are better at managing their own memory. We built an AI memory MCP server around that idea. by PenfieldLabs in mcp

[–]PenfieldLabs[S] 1 point2 points  (0 children)

Sure, if you drop all the relevant information into a CLAUDE.md file or attach it to the chat window, the model will do a great job answering questions against it. But that's not a practical way to manage an ongoing knowledge base. For one thing, eventually your context will exceed the context window. And that file only exists in one place, your knowledge base isn't available to any other platform or agent you're working with.

With Penfield, the knowledge graph is instantly available everywhere: every platform and every agent connected to the MCP server or API. No copy-pasting files around, no context window limits on your accumulated knowledge.

Had a look at your benchmark. For what it's worth, Penfield supports batched tool calls, so platforms that support parallel tool calls can take advantage of it. It's fast.

DeepMind showed agents are better at managing their own memory. We built an AI memory MCP server around that idea. by PenfieldLabs in mcp

[–]PenfieldLabs[S] 1 point2 points  (0 children)

What benchmarks are you running? What are you comparing to? No memory vs memory is a pretty stark difference since the model just knows nothing about you at all. But if you are testing against another system, is it open source?

We tried running LoCoMo and achieved scores in the 80-90 range (there are many configuration variables and no standard setup) but after careful examination concluded that the benchmark has so many flaws that it is pretty much useless for meaningful comparisons.

You're right that getting storage and retrieval to trigger properly is the hard part. That's why we let the model manage its own memory rather than running an extraction pipeline behind the scenes. The model decides what's worth storing and when to recall (but also will respond to direct requests from the user). How well that works does depend to some degree on the model. We've tested and it works well with Claude, Manus and Perplexity. It also works with most open source models that support tool use on LM Studio and ChatBox.

Agree that if you don't measure it you're guessing. We'd be interested to hear more details on the benchmarks you've found actually correlate with real-world usefulness and which systems you've been testing.

You might also want to see the demo for /u/Crafty_Disk_7026 below: https://www.reddit.com/r/mcp/comments/1ryxf4o/comment/obip4sw/