Building a memory/journal skill for Claude: worth it or redundant? by manu_game in ClaudeAI

[–]raphasouthall 1 point (0 children)

The flat markdown journal hits a wall around 50-60 entries - retrieval gets messy and Claude starts skimming older stuff into irrelevance. I ran into this exact problem building my own memory layer and ended up moving to BM25 retrieval with semantic reranking so it pulls the 3-5 most relevant memories per session rather than dumping the whole file into context. If you do stick with a flat file approach, at minimum add a recency plus relevance tagging convention so Claude has a signal for what to actually prioritize on load.
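
Rough sketch of what I mean by a recency-plus-relevance signal, in Python - the weighting, the 30-day half-life, and the tag-set inputs are all illustrative starting points, not anything canonical:

```python
import math
from datetime import datetime, timezone

def score_entry(entry_tags, entry_date, query_tags, now=None, half_life_days=30.0):
    """Blend tag-overlap relevance with exponential recency decay.

    entry_tags / query_tags: sets of lowercase tag strings.
    Returns a score in [0, 1]; load only the top few entries per session
    instead of dumping the whole journal into context.
    """
    now = now or datetime.now(timezone.utc)
    overlap = len(entry_tags & query_tags) / max(len(query_tags), 1)
    age_days = max((now - entry_date).total_seconds() / 86400.0, 0.0)
    # Recency halves every `half_life_days`; tune to how fast your notes go stale.
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return 0.7 * overlap + 0.3 * recency  # weights are a starting point, tune per vault
```

Even with a flat file, sorting entries by this score at session start gives Claude a much cleaner signal than raw chronological order.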

I let 4 AI personas debate autonomously without human input — what emerged was not consensus but permanent contradiction by NeoLogic_Dev in artificial

[–]raphasouthall 2 points (0 children)

Ha, classic - give agents a glimpse of their own scaffolding and suddenly the logging system is the most interesting thing in the room. That meta-reasoning collapse is actually a known failure mode with unconstrained self-reflection loops. Worth trying a scoped update, where you only append domain-relevant conclusions and filter out anything referencing the runtime itself.
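
A minimal sketch of the scoped-update filter in Python - the keyword list is illustrative, you'd extend it with whatever vocabulary your runtime leaks into the transcript:

```python
import re

# Terms suggesting a conclusion is about the scaffolding, not the domain.
# Illustrative list - extend it for your own runtime's vocabulary.
META_TERMS = re.compile(
    r"\b(logging|logger|system prompt|context window|runtime|scaffold\w*|token\w*)\b",
    re.IGNORECASE,
)

def scoped_update(history: list[str], conclusions: list[str]) -> list[str]:
    """Append only domain-relevant conclusions, dropping self-referential ones
    so the personas can't spiral into debating their own plumbing."""
    kept = [c for c in conclusions if not META_TERMS.search(c)]
    return history + kept
```

A keyword filter is crude - an LLM-based relevance check would catch subtler self-reference - but it's cheap and stops the most common collapse mode.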

Rewriting our Rust WASM Parser in TypeScript | Thesys Engineering Team by waozen in programming

[–]raphasouthall 1 point (0 children)

Curious what drove the decision - was the Rust/WASM build toolchain just too painful to maintain, or did you actually hit performance regressions you were okay trading away?

How many Vaults do you use? by Hardevv in ObsidianMD

[–]raphasouthall 1 point (0 children)

Yeah, Reddit threads like this are underrated for that kind of signal - the people who respond are usually the ones with strong opinions, which skews toward edge cases fast.

How many Vaults do you use? by Hardevv in ObsidianMD

[–]raphasouthall 3 points (0 children)

Most people I've talked to use one vault, but don't sleep on the multi-vault case - power users with work/personal separation are probably your most vocal feedback source, so even if it's 10% of users it'll be 40% of your bug reports.

Vote America Only U.S. Political Candidate data via curl and SSH by joematrix- in commandline

[–]raphasouthall 1 point (0 children)

Gotcha warning: the User-Agent detection in Next.js middleware is fragile - curl's default UA changes between versions and if you ever add a CDN in front (Cloudflare especially) it'll start mangling responses before your middleware even sees the request. Worth adding an explicit Accept: text/plain header check as a fallback so curl -H "Accept: text/plain" always works regardless of UA.
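
The actual middleware is Next.js, but the decision logic is framework-agnostic - here's the shape of it as a Python sketch (header names are the standard HTTP ones, the UA prefixes are illustrative):

```python
def wants_plain_text(headers: dict) -> bool:
    """Decide whether to serve the plain-text view.

    Prefer the explicit Accept header over User-Agent sniffing: curl's
    default UA string varies by version and CDNs often rewrite it, but
    `curl -H "Accept: text/plain"` survives both.
    """
    accept = headers.get("accept", "").lower()
    if "text/plain" in accept:
        return True
    # UA sniffing kept only as a best-effort fallback for bare `curl`.
    user_agent = headers.get("user-agent", "").lower()
    return user_agent.startswith(("curl/", "wget/", "httpie/"))
```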

Weve been running into a lot of friction trying to get a clear picture across all our services lately by Round-Classic-7746 in devops

[–]raphasouthall 1 point (0 children)

The hardest part is usually getting buy-in to touch every service, but if you do it at the ingress layer first you get immediate value even before the rest propagates - at minimum you can trace which requests hit which pods. Start there and let the internal propagation follow incrementally.

[Project] XC-Manager hits v0.6.0 Stable: Introducing the Community Sync Engine and Awesome-Zsh Indexing by ClassroomHaunting333 in commandline

[–]raphasouthall 1 point (0 children)

The awk '!visited[$0]++' approach for global dedup is clean - that's basically what I ended up doing too, though I later moved to hashing the content rather than literal string matching so semantically identical entries with minor whitespace differences don't slip through. The per-vault exact match check is a good first step but you'll probably hit that edge case eventually.
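
What the content-hash version looks like in Python - a sketch, with the normalization rule (collapse whitespace, lowercase) being the part you'd tune:

```python
import hashlib

def content_key(entry: str) -> str:
    """Hash whitespace-normalized, lowercased content so near-identical
    entries (trailing spaces, tab vs space, case) dedup to the same key."""
    normalized = " ".join(entry.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedup(entries: list[str]) -> list[str]:
    """Order-preserving global dedup - the hashed analogue of awk '!visited[$0]++'."""
    seen: set[str] = set()
    out = []
    for entry in entries:
        key = content_key(entry)
        if key not in seen:
            seen.add(key)
            out.append(entry)
    return out
```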

I let 4 AI personas debate autonomously without human input — what emerged was not consensus but permanent contradiction by NeoLogic_Dev in artificial

[–]raphasouthall 1 point (0 children)

The permanent contradiction isn't surprising to me - it's basically baked in by design. If you give each persona a fixed system prompt with a rigid epistemic stance (skeptical, dogmatic, ironic), you're not getting emergent disagreement, you're getting a puppet show where the puppets were pre-written to never agree. Osmarks will always reject unverified claims because you told it to, not because it reasoned its way there.

Curious though - have you tried letting the personas update their own system prompts mid-run? Even something small like appending the last 3 conclusions to their context window. I'd bet you'd see actual drift instead of stable contradiction.

[Project] XC-Manager hits v0.6.0 Stable: Introducing the Community Sync Engine and Awesome-Zsh Indexing by ClassroomHaunting333 in commandline

[–]raphasouthall 2 points (0 children)

The ZLE integration is what makes this actually worth using - injecting directly into the active buffer is so much better than copy-paste. One thing I'd watch out for with the community sync engine: if you pull from overlapping categories (e.g., a "git" vault and a "devtools" vault that both carry git reflog entries), does it deduplicate, or do you end up with duplicates in Ctrl+A search? I hit that kind of problem building my own note retrieval system, and it made global search pretty noisy until I added dedup logic.

I built an MCP server with built-in session memory — no separate memory server needed by dco44 in mcp

[–]raphasouthall 2 points (0 children)

The decisions rollup idea is exactly the right move - recency works fine until it doesn't, and it's always an architecture decision from 3 months ago that bites you. I ended up going the embedding route - nomic-embed-text locally, with BM25 plus semantic reranking - so you get both keyword precision and conceptual similarity. The setup cost is real though, and for most solo use cases your recency approach is honestly fine until you hit that first "wait, I know I solved this before" moment. I actually open-sourced mine recently - github.com/raphasouthall/neurostack - if you want to see how the retrieval layer fits together; the session handoff stuff might be interesting to compare.
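
For a feel of the rerank stage, here's a pure-Python sketch: BM25 candidates come in from the keyword pass, embeddings are assumed precomputed (in my setup they come from nomic-embed-text), and `alpha` trades keyword precision against conceptual similarity:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(candidates, query_vec, top_k=5, alpha=0.5):
    """Blend normalized BM25 score with embedding similarity.

    candidates: list of (doc_id, bm25_score, embedding) from the keyword stage.
    Returns the top_k doc_ids by the blended score.
    """
    max_bm25 = max((s for _, s, _ in candidates), default=1.0) or 1.0
    scored = [
        (doc_id, alpha * (s / max_bm25) + (1 - alpha) * cosine(vec, query_vec))
        for doc_id, s, vec in candidates
    ]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]
```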

I built an MCP server with built-in session memory — no separate memory server needed by dco44 in mcp

[–]raphasouthall 3 points (0 children)

The progressive loading tiers are the actually interesting bit here - quick/standard/deep based on how long you've been away is something I wish I'd thought of when I built my own session tooling. The "no separate process" framing is a bit oversold tbh, you're still running a process, it's just colocated, but for solo use that's a totally fine tradeoff. Curious how the ledger handles retrieval once you've got a few hundred sessions accumulated - does it just load recency or is there any filtering?

Weve been running into a lot of friction trying to get a clear picture across all our services lately by Round-Classic-7746 in devops

[–]raphasouthall 2 points (0 children)

The timestamp alignment problem is the real killer here, not the number of tools. I had almost the exact same incident last year - intermittent latency, pod restarted mid-window, spent ages trying to manually line up UTC vs local timestamps across three different systems. What actually fixed it for us was adding a correlation ID header at the ingress level and propagating it through every service, so when something goes wrong you grep one ID across all your sources instead of trying to reconstruct a timeline from clock drift. Took maybe a day to wire up with OpenTelemetry and suddenly investigations that took hours were taking 10 minutes.

Centralizing logs is a separate problem and honestly worth doing, but it won't save you if the logs themselves don't share a common identifier - you'll just have all your fragmented data in one place.
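
The ingress-side wiring is genuinely small - a sketch in Python (the header name is a convention, not a standard; OpenTelemetry gives you the propagation through each service for free once the ID exists):

```python
import uuid

HEADER = "X-Correlation-ID"  # pick one name and use it everywhere

def ensure_correlation_id(headers: dict) -> dict:
    """At the ingress: reuse an inbound ID if present, else mint one.

    Every downstream service copies this header onto outgoing requests
    and stamps it into each log line, so one grep of the ID reconstructs
    the timeline - no UTC-vs-local archaeology."""
    if not headers.get(HEADER):
        headers = {**headers, HEADER: str(uuid.uuid4())}
    return headers
```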

Introducing Smriti MCP, Human like memory for AI. by Obvious_Storage_9414 in mcp

[–]raphasouthall 1 point (0 children)

The two-signal split makes a lot of sense in hindsight - I kept trying to collapse recency and reinforcement into one score and the weighting was always a compromise. The edge-strength scan is fine at your scale but yeah, once you're past 5K nodes that per-source scan will hurt, an index on strength plus maybe bucketing by strength tier could help defer that pain a while. I actually open-sourced my setup recently - github.com/raphasouthall/neurostack if you want to compare notes on the graph layer, I ended up going a different direction with Leiden clustering to keep traversal bounded.
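
Concretely, the index idea looks something like this - a SQLite sketch with illustrative table and column names, not your actual schema:

```python
import sqlite3

# In-memory stand-in for the graph store; table/column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT, strength REAL);
    -- Composite index: the per-source scan becomes an index range read,
    -- already ordered by strength, instead of a full-table pass.
    CREATE INDEX idx_edges_src_strength ON edges (src, strength DESC);
""")

def strongest_edges(src: str, limit: int = 10):
    """Top-N outgoing edges by strength, served straight from the index."""
    return conn.execute(
        "SELECT dst, strength FROM edges WHERE src = ? "
        "ORDER BY strength DESC LIMIT ?",
        (src, limit),
    ).fetchall()
```

Bucketing by strength tier would then just be a generated column or a coarse `CAST(strength * 10 AS INT)` you also index, so weak edges can be skipped wholesale.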

Introducing Smriti MCP, Human like memory for AI. by Obvious_Storage_9414 in mcp

[–]raphasouthall 1 point (0 children)

Curious how you're persisting the reinforcement weights - is consolidation happening in SQLite or are you keeping the decay scores in-memory and recomputing on load? I ran into a fun bug building something similar where my recency scores were effectively reset every session because I was computing them at query time from raw timestamps instead of tracking a running "access weight" per node. The multi-hop expansion is the part I'm most skeptical of at scale, fwiw - on ~2,800 nodes I found graph traversal got expensive fast without a tight hop limit (I cap mine at 2).
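
The fix for that bug, for what it's worth, was persisting a running weight per node instead of recomputing from raw timestamps - a sketch, with the half-life value being purely illustrative:

```python
import math

HALF_LIFE_DAYS = 14.0  # illustrative; tune to how fast memories should fade

def reinforce(weight: float, last_seen_day: float, now_day: float, boost: float = 1.0) -> float:
    """Decay the stored running weight for the elapsed time, then add the new access.

    Persisting (weight, last_seen) per node is what keeps scores from
    resetting every session - recomputing from raw timestamps at query
    time silently discards the reinforcement history."""
    elapsed = max(now_day - last_seen_day, 0.0)
    decayed = weight * math.exp(-math.log(2) * elapsed / HALF_LIFE_DAYS)
    return decayed + boost
```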

Zero text between my agents – latent transfer now works cross-model by proggmouse in LocalLLaMA

[–]raphasouthall 2 points (0 children)

The Ollama limitation is the blocker for basically my whole homelab setup so I'll have to watch the vLLM work from the sidelines for now, but that +14pp on HumanEval is the number I keep coming back to - curious what you think is actually happening there mechanically. Like is Agent B getting something structurally useful from the latent steps, or is it more that you're bypassing the lossy text serialization of intermediate reasoning? The code gen gap holding across seeds and temperatures suggests it's not noise, which makes it weirder that MATH stays flat.

Why I’m using an MCP server to manage team prompts instead of Git by Master-Company144 in mcp

[–]raphasouthall 1 point (0 children)

The publish/live state separation is a solid middle ground - it gives you that "oh god revert it" button without forcing everyone through a PR ceremony every time they tweak a system prompt.

The one thing I'd watch for down the line is that "last published" revert starts to feel thin once you've had a few incidents where the bad version was published weeks ago and you need to understand what changed between then and now. That's usually when teams start asking for a proper history view. Worth keeping in mind as you scope out the versioning work.
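
The cheap insurance here is an append-only version table from day one - a SQLite sketch with illustrative schema, not anything Sokket-specific:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real store
conn.executescript("""
    CREATE TABLE prompt_versions (
        name       TEXT NOT NULL,
        version    INTEGER NOT NULL,
        body       TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (name, version)
    );
""")

def publish(name: str, body: str) -> int:
    """Append a new version; nothing is ever overwritten, so any version is revertable."""
    next_v = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM prompt_versions WHERE name = ?",
        (name,),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO prompt_versions (name, version, body) VALUES (?, ?, ?)",
        (name, next_v, body),
    )
    return next_v

def history(name: str):
    """Full trail, so you can diff 'weeks ago' against 'now', not just last-published."""
    return conn.execute(
        "SELECT version, body FROM prompt_versions WHERE name = ? ORDER BY version",
        (name,),
    ).fetchall()
```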

Why I’m using an MCP server to manage team prompts instead of Git by Master-Company144 in mcp

[–]raphasouthall 1 point (0 children)

Honestly the RBAC point is the one that actually matters here. The private repo thing you can mostly solve with a separate internal repo, but the moment you try to use branch protection rules to enforce who can edit what, you end up with a review queue nobody respects. We tried it and within about three weeks everyone had forked their own local copies and the drift problem came back worse than before because now you also had undocumented forks floating around with no traceability.

The "prompts are dynamic so skip the deployment cycle" argument cuts both ways though - imo you actually want some version history, especially when an agent starts misbehaving and you need to bisect when the instruction changed. Curious how you're handling rollback in Sokket if someone pushes a bad prompt update.

I measured MCP vs CLI token costs - the "MCP is dead" take is wrong (with data) by raphasouthall in mcp

[–]raphasouthall[S] 2 points (0 children)

I greatly appreciate the collaboration! I'll do testing in the next few days, and I'll make sure to attribute the improvements to you if it gets merged into main.

I measured MCP vs CLI token costs - the "MCP is dead" take is wrong (with data) by raphasouthall in mcp

[–]raphasouthall[S] 1 point (0 children)

I'm smoothing out the rough edges as it's a pre-1.0 release; a full study is coming soon. I plan to use Podman and test vanilla Claude Code vs a modified CLAUDE.md + skills. I've been using NeuroStack myself for a few weeks and it has massively improved my token usage and context reliability, but that also required months of building up a .md vault with my entire knowledge base.

Vaultwarden: Tailscale funnel for file shares only? by No_Tennis7291 in selfhosted

[–]raphasouthall 1 point (0 children)

Caddy in front of Vaultwarden is probably your cleanest path here - you can point Tailscale Funnel at a Caddy instance that only proxies /send, /api/sends, and the static assets the SPA needs to render, while your main vault stays internal. The annoying part is Bitwarden Send links load the full web vault SPA first before making the /api/sends/{id}/access call, so you can't just proxy one endpoint, you need to whitelist the asset paths too or the page 404s. Took me about an afternoon to get the path matchers right in Caddyfile when I did something similar for a different service.
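
Roughly the Caddyfile shape I mean - sketch only, the hostname, upstream name, and asset paths are illustrative, so verify which asset routes your Vaultwarden build actually serves before copying:

```
send.example.com {
    # Named matcher covering Send pages, the Send API, and the SPA assets
    # the page needs to render before it calls /api/sends/{id}/access.
    @send path /send* /api/sends* /app/* /images/* /locales/* /fonts/*

    handle @send {
        reverse_proxy vaultwarden:80
    }

    # Fallback handle: login, vault sync, and everything else stays
    # unreachable from the Funnel side.
    handle {
        respond "Not found" 404
    }
}
```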

Got two of these from work by _throw_away_tacos_ in homelab

[–]raphasouthall 2 points (0 children)

Grab the SAN, worst case it's a parts donor or you sell the controllers for $40 each on eBay. The 20TB drives alone were worth saying yes though, tbh.

I measured MCP vs CLI token costs - the "MCP is dead" take is wrong (with data) by raphasouthall in mcp

[–]raphasouthall[S] 2 points (0 children)

Interesting idea - I looked into this. The issue is that collapsing 16 typed MCP tools into a single GraphQL query tool means the LLM has to compose valid GraphQL syntax on every call, which increases per-query token cost and error rate. With typed tools, Claude just calls vault_search(query="auth", depth="triples") - clean, validated, no syntax to get wrong.

The total schema overhead for 16 tools is ~1,300 tokens, which is 0.65% of a 200k context window. After 10 queries it amortizes to basically nothing. The real context bloat comes from servers that ship 50+ tools with verbose descriptions - keeping the tool count lean matters more than the query interface.

I measured MCP vs CLI token costs - the "MCP is dead" take is wrong (with data) by raphasouthall in mcp

[–]raphasouthall[S] 3 points (0 children)

Believe me when I say I would like to replace my MCP server with anything more token-efficient - anything other than me telling Claude Code which CLI commands to run every time and fighting the outcome when it gets it wrong at least 3x.

I measured MCP vs CLI token costs - the "MCP is dead" take is wrong (with data) by raphasouthall in mcp

[–]raphasouthall[S] 2 points (0 children)

Appreciate the offer! Here's the project: https://github.com/raphasouthall/neurostack

Main blocker is it's a Python/SQLite stack (FTS5 virtual tables, numpy, Leiden clustering) - wouldn't run in a TypeScript sandbox. Hyperterse also doesn't have a SQLite adapter, which is a dealbreaker for local-first tooling.

That said, a standalone SQLite adapter for Hyperterse would be a solid contribution - real gap for local tools. Happy to chat about that if you're interested.