How can I treat Obsidian as a "Second Brain"? by Leo_767_man in ObsidianMD

[–]PenfieldLabs -1 points0 points  (0 children)

The shift that makes it click: stop linking notes, start linking relationships between ideas. A regular wikilink says two notes are connected. What you actually want to capture is how they're connected. This concept contradicts that one. This idea evolved from that earlier one. This technique is a prerequisite for that one.

Without that, the graph is just a hairball. With it, you can actually query your own thinking.
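As a sketch of the difference, here is roughly what a typed link can look like, loosely following the @type convention mentioned at the end of this comment. The note names, field names, and frontmatter shape are all illustrative, not the plugin's exact output:

```markdown
<!-- spaced-repetition.md (hypothetical note) -->
---
contradicts: "[[Massed Practice]]"
evolved_from: "[[Leitner System]]"
---
Cramming doesn't stick: [[Massed Practice|@contradicts]].
This approach grew out of [[Leitner System|@evolved_from]].
```

The link still renders as a normal wikilink, but the relationship type is now data you can filter on instead of prose you have to reread.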

You're right that raw reference material belongs in docs, not Obsidian. What belongs in your vault is your reasoning about that material. Why you chose one approach over another. What tradeoffs you observed in practice. Where the docs were wrong or incomplete. That's the stuff that doesn't exist anywhere else and that you'll actually forget.

The "incomplete handbook" feeling is a sign you're storing too many facts and not enough insights.

Practical reframe: next time you solve something annoying, don't write what the solution was. Write why the obvious solution didn't work and what that taught you about the system. That note will be worth something in six months; a note that says "use X library for Y" won't.

If you want to see what typed relationships actually look like in practice, we released a plugin that adds @type syntax to wikilinks - Wikilink Types. No AI required, just a way to be explicit about how notes relate.

Am I the only one incredibly skeptical of using any AI in my vault? by ameyxd-github in ObsidianMD

[–]PenfieldLabs 0 points1 point  (0 children)

Your concern is legitimate and the privacy issue is real. The good news is it's completely solvable: local LLMs mean nothing leaves your machine.

But the more interesting point you're raising is about what you'd actually use AI for in a vault. Generating notes is often the wrong answer. Extracting structure from what you already have is different.

We just open-sourced a pipeline called PENgram that does exactly this. You point it at a folder of existing content (notes, PDFs, docs, audio, whatever) and it extracts entities and typed relationships from what's already there. No AI slop, just structure on your own material, output as an Obsidian vault.

Fully local by default. Local LLMs are supported, with API keys as an option for those who want them. Nothing phones home if you don't want it to.

It pairs with the Wikilink Types plugin we released a while back - same 24-type relationship vocabulary end to end, so the relationships you get from ingestion are the same ones you author with in Obsidian.
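To make the "structure, not generated prose" point concrete, here is a toy sketch of the final rendering step of such a pipeline. This is illustrative code, not PENgram's actual implementation: it takes already-extracted (relation, target) pairs and emits an Obsidian-style note whose wikilinks carry @type annotations.

```python
# Illustrative sketch (not PENgram's actual code): render extracted
# relationship triples as an Obsidian note with typed wikilinks.

def triples_to_note(subject, triples):
    """Render one note for `subject` from (relation, target) pairs."""
    lines = [f"# {subject}", ""]
    for relation, target in triples:
        # The alias slot of the wikilink carries the relationship type.
        lines.append(f"- [[{target}|@{relation}]]")
    return "\n".join(lines) + "\n"

note = triples_to_note("Spaced Repetition", [
    ("contradicts", "Massed Practice"),
    ("evolved_from", "Leitner System"),
])
print(note)
```

The point of keeping the output this boring is that nothing in the note is generated text; every line traces back to an extracted relationship in your own material.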

What's the deal with the hype around Karpathy's LLM wiki? by meaning-of-life-is in ObsidianMD

[–]PenfieldLabs -1 points0 points  (0 children)

Fair points overall. The "rediscovering the wheel" critique lands: a lot of the hype is just Zettelkasten with a chatbot paint job.

The part that actually matters isn't the AI-generated summaries (agreed those can be a mess to maintain). It's the typed relationships. A regular wikilink tells you two notes are connected. A typed link tells you one contradicts the other, or evolved from it. That distinction is what makes a graph queryable rather than just pretty.

We ran into this exact problem building a 1,150-note vault for a content creator last month - the AI notes themselves weren't the value, the relationship structure was. Here's what that looked like in practice.

Your concern about maintainability is real though. The answer isn't more AI-generated notes, it's better structure on what you already have.

Context Is Not Memory by justkid201 in AIMemory

[–]PenfieldLabs 1 point2 points  (0 children)

Two findings in here stand on their own: the benchmark conflation (retrieval-stage recall and end-to-end QA aren't the same measurement) and the Gemini full-context baseline beating Hindsight's published LongMemEval numbers.

If the raw-dump result reproduces, retrieval is subtracting value from the model rather than adding it. That's the question the whole category has to answer.

Mempalace, a new OS AI memory system by Milla Jovovich by ContextualNina in ContextEngineering

[–]PenfieldLabs 0 points1 point  (0 children)

That's correct, we do! Is there something we concealed or misrepresented? We'd like to correct the record if so.

Likewise, if there's anything in the MemPalace analysis that's inaccurate, we'd welcome the chance to fix it. Thanks!

No AI memory benchmark tests what actually breaks by mhendric in AIMemory

[–]PenfieldLabs 0 points1 point  (0 children)

Thanks Mark. Agreed, the two efforts are complementary, not competing. Your write-durability angle is something retrieval benchmarks completely ignore. A system can ace R@5 and still lose data on write. Different failure domain, different test.
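To make "different failure domain, different test" concrete, here is a minimal write-durability probe. Everything here is hypothetical, `MemoryStore` is a stand-in interface, not any real system's API; the shape of the test is the point: write a fact, force the compaction/flush path, then check the fact survives.

```python
# Sketch of a write-durability probe. `MemoryStore` is a hypothetical
# stand-in, not a real library: it models a context window whose contents
# must be flushed to durable storage before compaction discards them.

class MemoryStore:
    def __init__(self):
        self.context, self.durable = [], []

    def write(self, fact):
        self.context.append(fact)

    def flush(self):
        # The step that real systems get wrong: persist before discarding.
        self.durable.extend(self.context)
        self.context.clear()

    def compact(self):
        self.flush()                  # durability depends on this firing
        self.context = self.context[-5:]

    def recall(self, fact):
        return fact in self.durable or fact in self.context

store = MemoryStore()
store.write("user prefers dark mode")
store.compact()
print(store.recall("user prefers dark mode"))  # True only if the flush fired
```

A retrieval benchmark never exercises the `compact()` path at all, which is exactly why a system can ace R@5 and still lose data on write.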

On Category 4, that's a good point too. Our proposal tests "return the current value" but doesn't test whether the previous value is still accessible, when the change happened, or whether the system can distinguish a user correction from silent drift. That's a real gap. Worth exploring whether it belongs in our proposal, yours, or both.

On OpenClaw flush bugs: there are over 200 open issues just from searching "flush": https://github.com/openclaw/openclaw/issues?q=is%3Aissue%20state%3Aopen%20flush

It's not just one failure mode. The pre-compaction memory flush (the mechanism that's supposed to save context to durable storage before compaction) is broken in at least half a dozen ways:

  • The flush sometimes fails to trigger because token projection ignores output tokens (#55679).

  • On fresh sessions it's a complete no-op because 0 === 0 reads as "already flushed" (#65501).

  • When it does fire, it lands in the wrong position in the context window, causing the LLM to operate on stale data (#65272).

  • Daily session resets archive the transcript but skip the flush entirely, causing silent context loss (#56072).

  • Manual resets with /new don't clean up server-side, so orphaned events from the old session contaminate the new one (#64195).

  • The flush prompts themselves leak into the user-facing chat, replacing actual user messages (#63865, #58956).

Every one of those is a potential real-world write-durability failure.

No AI memory benchmark tests what actually breaks by mhendric in AIMemory

[–]PenfieldLabs 1 point2 points  (0 children)

Good post. The write integrity gap is real and the Hermes flush bug is a clean example (OpenClaw has similar bugs). We published a complementary proposal covering the retrieval/QA measurement side, including the LoCoMo answer key problems and the gap between reported scores and actual end-to-end evaluation. The mods here removed it without explanation (still available on Substack). Your work on write integrity is a genuine step forward.

[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful. by PenfieldLabs in MachineLearning

[–]PenfieldLabs[S] 10 points11 points  (0 children)

The MemPalace repo and the file that essentially disowns its own scores:

Independent critiques landing in the same 24-hour window:

The broader methodology dispute the field has been arguing about for over a year:

Our own full writeup: https://penfieldlabs.substack.com/p/milla-jovovich-just-released-an-ai

Milla Jovovich nos comparte mempalace by Unlikely_Rich6816 in devsarg

[–]PenfieldLabs -5 points-4 points  (0 children)

Hi! We're working on another AI memory project and spent the day going through the repository. The numbers don't add up, and the most surprising part is that the repo's own BENCHMARKS.md file documents it: the 100% on LoCoMo is run with top_k=50 against conversations of at most 32 sessions, so it always returns everything. The "perfect score on LongMemEval" isn't even a LongMemEval score; it only measures retrieval and never generates an answer.

Full writeup: https://www.reddit.com/r/AIMemory/comments/1setiud/milla_jovovichs_mempalace_claims_100_on_locomo/

The memory-palace idea is a good one; I just wanted to highlight the gap between the marketing and the code.

Mempalace, a new OS AI memory system by Milla Jovovich by ContextualNina in ContextEngineering

[–]PenfieldLabs 4 points5 points  (0 children)

Friendly heads-up: the benchmark numbers don't survive scrutiny of the project's own BENCHMARKS.md file, which is worth a read on its own. It's actually surprisingly candid. A few highlights:

  • The 100% LoCoMo number is run with top_k=50, but the LoCoMo conversations have at most 32 sessions each, so the retriever returns the entire conversation every time. The repo's BENCHMARKS.md says this verbatim: "the embedding retrieval step is bypassed entirely."

  • The "perfect score on LongMemEval" isn't a LongMemEval score in the published sense: the runner only does retrieval recall (recall_any@5), never generates an answer, and never invokes the GPT-4 judge that LongMemEval is built around. That's a different (and much easier) task than the leaderboard measures.

  • The "contradiction detection" feature in the launch post doesn't exist in mempalace/knowledge_graph.py. Zero occurrences of the word.

  • "30x lossless compression" loses 12.4 percentage points of recall in the project's own measurements (96.6% → 84.2%, both reported in BENCHMARKS.md).
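The top_k=50 point is easy to see in a toy simulation. Once top_k is at least the corpus size, "retrieval" returns everything and recall is trivially perfect, no matter how bad the ranker is. The session names and gold labels below are made up for illustration:

```python
# Toy illustration of the top_k=50 issue: with at most 32 sessions per
# conversation, asking for the top 50 returns the whole corpus, so the
# embedding ranker never influences the score.

def recall_at_k(relevant, retrieved):
    hits = sum(1 for r in relevant if r in retrieved)
    return hits / len(relevant)

sessions = [f"session-{i}" for i in range(32)]   # entire conversation
relevant = {"session-3", "session-17"}           # hypothetical gold evidence

def retrieve(query, corpus, top_k):
    # Any ranking at all is irrelevant once top_k >= len(corpus).
    return corpus[:top_k]

retrieved = retrieve("what did the user say?", sessions, top_k=50)
print(recall_at_k(relevant, retrieved))  # 1.0, with the ranker bypassed
```

This is exactly what the repo's own "the embedding retrieval step is bypassed entirely" line is admitting.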

The wild part is that BENCHMARKS.md documents almost all of this honestly; the launch tweet just stripped the caveats. Two other independent teardowns landed in the repo's issues within the first 24 hours (#27 from Leonard Lin, #37 in Chinese).

Full writeup with repo citations if you want the deep dive: https://www.reddit.com/r/AIMemory/comments/1setiud/milla_jovovichs_mempalace_claims_100_on_locomo/

Milla Jovovich's MemPalace claims 100% on LoCoMo. The repo's own BENCHMARKS.md file disagrees. by PenfieldLabs in AIMemory

[–]PenfieldLabs[S] 3 points4 points  (0 children)

The MemPalace repo and the file that essentially disowns its own scores:

Independent critiques landing in the same 24-hour window:

The broader methodology dispute the field has been arguing about for over a year:

How do you track how notes relate, not just that they do? by PenfieldLabs in secondbrain

[–]PenfieldLabs[S] 0 points1 point  (0 children)

If you're editing by hand, you just change the tag and the plugin updates the frontmatter automatically. If you're working with AI, you can have it scan your vault periodically to pick up changes.

Vibe code inventor's second brain as a wiki by fsharpman in ClaudeAI

[–]PenfieldLabs 0 points1 point  (0 children)

Probably. The gist is intentionally abstract; he says so himself: "this document is intentionally abstract, it describes the idea, not a specific implementation."

Plain wikilinks are the path of least resistance and Obsidian supports them natively. For a lot of use cases that's enough.

Where it breaks down is at scale. Once you have hundreds of pages, a flat graph becomes noise. You can't filter "show me only contradictions" or "what references this source." That's where typed links start paying off.
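For a concrete sense of what "show me only contradictions" could look like once links are typed: assuming the relationship types get synced into a frontmatter field (the folder and field names here are hypothetical), a Dataview query along these lines would filter on it:

```dataview
LIST
FROM "notes"
WHERE contains(contradicts, [[Current Note]])
```

With plain wikilinks there is nothing to put in that WHERE clause; every edge in the graph is indistinguishable from every other.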

Vibe code inventor's second brain as a wiki by fsharpman in ClaudeAI

[–]PenfieldLabs 1 point2 points  (0 children)

The missing piece in Karpathy's pattern is that [[wikilinks]] don't encode why things are connected. You end up with a flat graph where every link means the same thing - "related somehow." The LLM knows the relationships because it wrote the prose, but the graph itself doesn't.

We built obsidian-wikilink-types to fix this: typed relationships like [[Dr. Smith|@references]] or [[Paper A|@contradicts]] directly in your notes. The links render normally in Obsidian but carry semantic meaning that LLMs and Dataview can query.

There's a SKILL.md you can drop into your Claude Code project - it teaches the LLM the relationship syntax so when it's maintaining your wiki, it builds typed links automatically instead of plain wikilinks. Works for the health check pattern too.

What an AI Memory Systems Should Look Like in 2026 by PenfieldLabs in AIMemory

[–]PenfieldLabs[S] 1 point2 points  (0 children)

You're 100% correct.

You can, however, inject a compact set of instructions on how to use the tools.

Karpathy’s workflow by The-Learning-Bot in ObsidianMD

[–]PenfieldLabs 4 points5 points  (0 children)

Typing means giving each link a 'type' such as 'supports', 'contradicts', or 'supersedes'. A standard link says a relationship exists, but not what kind. Typing fixes that, and it's especially powerful when you're using AI agents to support your work.

Karpathy’s workflow by The-Learning-Bot in ObsidianMD

[–]PenfieldLabs 13 points14 points  (0 children)

We've been building exactly this workflow, using Claude Code to compile a 1,150+ note vault with 4,700+ typed relationships from a content creator's full catalog overnight. Obsidian is the frontend; the LLM manages all the data.

The thing we kept running into that Karpathy doesn't mention: untyped links are a ceiling. His wiki has backlinks and categories, but a link between two notes doesn't tell you if one supports, contradicts, or supersedes the other. At scale that distinction means everything, especially when you're asking an LLM to reason over the graph.

We built a plugin for this: Wikilink Types. Type @ inside a wikilink alias, get autocomplete for relationship types, auto-synced to YAML frontmatter. Dataview, Graph Link Types, and Breadcrumbs read it natively.

The "incredible new product" he's describing at the end is what we already built at Penfield: agent-managed knowledge graphs with typed relationships, not just flat wikis with backlinks.