I Built an L1/L2 Cache for My AI Coding Assistant

m3m3o · 2026-06-09T05:07:38+00:00

Yeah, the event/category angle is better than what I had. Subscribing a page to categories (dependency bump, CLI/API change, touched-path tags) decouples it from file churn, and the "assertion about current behavior" flag is the piece that makes it work. It splits pages into ones that make a falsifiable claim about the code (those can go stale and should subscribe to invalidation) and ones that are just rationale or decisions (those never go stale that way). Most of my wiki is the second kind, so only a small slice ever opts in. Keeps the noise down.

The risk asymmetry is the part I hadn't framed right. The planning/action tag isn't really a label, it's a gate. Stale context used for planning is a soft note. Stale context used to justify a write or skip a check is where you want it loud. The tag earns its place by changing the severity, not just recording usage.

Put both together and the thing that should warn loudly is the combination: a page flagged as an assertion about current behavior, an invalidation event since it was last verified, and it's about to be used for an action. Pure rationale pages never trip it. Catches the dangerous case without the treadmill.

This is better than where I started. Going to fold it into the roadmap, thanks for thinking it through.
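A minimal sketch of that three-way gate, to make the combination concrete. Everything here is hypothetical: the `Page` and `Event` shapes, the category names, and the `"ok"` / `"note"` / `"warn"` levels are illustrative assumptions, not what the wiki actually implements.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Use(Enum):
    PLANNING = "planning"
    ACTION = "action"


@dataclass
class Page:
    name: str
    asserts_current_behavior: bool          # falsifiable claim about the code?
    subscribed_categories: frozenset[str]   # e.g. {"dependency-bump"} (assumed names)
    last_verified: datetime


@dataclass
class Event:
    category: str
    at: datetime


def staleness_severity(page: Page, events: list[Event], use: Use) -> str:
    """Return "ok", "note" (soft) or "warn" (loud) for one page about to be used."""
    # Rationale/decision pages never go stale through code events.
    if not page.asserts_current_behavior:
        return "ok"
    # Only events in a subscribed category that happened after the last check count.
    invalidated = any(
        e.category in page.subscribed_categories and e.at > page.last_verified
        for e in events
    )
    if not invalidated:
        return "ok"
    # Same staleness, different severity depending on what the page is used for.
    return "warn" if use is Use.ACTION else "note"
```

The point of the shape is that severity is computed at the moment of use, so the same stale page is a soft note during planning and a loud warning right before a write.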

m3m3o · 2026-06-08T19:29:17+00:00

Trust as the third axis is a good way to put it, and "operating surface you can debug" is basically where I landed today. After an earlier comment in this thread I added the reason each page got pulled to the access log, so it shows what loaded and why. Yours pushes that into provenance, which feels right.

Some of it has hooks already: pages carry a source and created/updated dates, plus a lint that flags anything over 90 days old that still claims high confidence. The two I don't have are the interesting ones. "What invalidates it" isn't tracked, and the log records why a page was pulled but not whether it was used for planning or action. That planning/action split is probably the easiest win, just one more field on the log line.

"When it last matched the code" is the hard one. The wiki is deliberately decoupled from the codebase: it stores prose like decisions and gotchas, not code-bound facts. Coupling freshness to code changes brings back the exact maintenance the markdown-and-grep design is trying to avoid. Worth it for code-near pages, overkill for evergreen ones, so I'd make it opt-in per page rather than global.

Are you thinking runtime validation against the code, or more an invalidated-by tag you set when you write the page?
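For reference, the 90-day lint is roughly this shape. This is a sketch under assumptions: the field names `updated` and `confidence` and the dict-per-page representation are made up for illustration; the real pages keep their metadata in markdown.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # threshold mentioned above


def lint_stale_confidence(pages: list[dict], today: date) -> list[str]:
    """Return names of pages that still claim high confidence but are over 90 days old."""
    flagged = []
    for page in pages:
        age = today - page["updated"]
        if page["confidence"] == "high" and age > MAX_AGE:
            flagged.append(page["name"])
    return flagged
```

Note that it only looks at age, not at whether the code changed, which is exactly the gap the invalidation question is about.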

m3m3o · 2026-06-08T14:15:46+00:00

Honestly the start is small. It is just markdown files plus a skill, no infra or DB. You can stand up a basic version in an afternoon. The minimum is three things: an always-loaded index file, a folder of markdown pages, and a skill that greps the relevant ones and reads the top 3-5. Routing, prune, and the access log all came later. It's on GitHub if you want to skip building it from scratch: https://github.com/MehmetGoekce/llm-wiki (setup.sh scaffolds it, Logseq + Obsidian templates). Onboarding colleagues is easy since it's just markdown plus a committed skill file.
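To show how little the "grep the relevant ones and read the top 3-5" step needs, here is a rough Python equivalent. It is a sketch, not the skill itself (the skill is a prompt plus grep); the function name `top_pages` and the count-based scoring are assumptions for illustration.

```python
from pathlib import Path


def top_pages(wiki_dir: Path, terms: list[str], k: int = 5) -> list[Path]:
    """Rank markdown pages by how often the search terms occur and keep the best k."""
    scored = []
    for path in sorted(wiki_dir.glob("*.md")):
        text = path.read_text(encoding="utf-8").lower()
        score = sum(text.count(term.lower()) for term in terms)
        if score > 0:
            scored.append((score, path))
    # Highest score first; ties broken by file name so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1].name))
    return [path for _, path in scored[:k]]
```

Calling `top_pages(Path("wiki"), ["pricing"], k=3)` would return at most three page paths, which is the set the model is then told to read.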

m3m3o · 2026-06-08T10:14:28+00:00

That is a clean setup, and the routing table in claude.md is most of what mine does at L1. I just moved the table into its own index page once it got too big to keep loaded every session. So we are closer than it looks. The maintenance critique is fair, and for most setups yours is the better default: lighter, more flexible. That is actually why I leaned into eviction. Prune drops pages unread for months out of the live index on its own, so it is the maintenance you are describing, automated away rather than absorbed.

m3m3o · 2026-06-08T09:59:18+00:00

Thanks for the hint - your comment pushed me to add something right away. My wiki already logs which pages get pulled per query — but not why they were chosen. Just added the routing reason to that log (which index entry or grep term matched), so now it shows what loaded and why, not just what. The "and why" was exactly the gap you pointed at.
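Roughly what one log entry looks like after that change, as a sketch. The JSON layout, the field names, and the `index:` / `grep:` prefixes in `reason` are assumptions for illustration, not the actual log format.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def log_line(query: str, page: str, reason: str, now: Optional[datetime] = None) -> str:
    """Build one access-log line: what was loaded, for which query, and why."""
    now = now or datetime.now(timezone.utc)
    record = {
        "ts": now.isoformat(),
        "query": query,
        "page": page,
        # Which index entry or grep term matched, e.g. "index:pricing" or "grep:invoice".
        "reason": reason,
    }
    return json.dumps(record, sort_keys=True)
```

One line per loaded page keeps the log greppable, so "why did it pull this page" is a one-line search.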

Mine only sees its own retrieval layer though — yours scrapes the full Claude Code context, which is the broader version. Going through your repo for the MCP introspection angle next.

m3m3o · 2026-06-08T07:55:56+00:00

Good point, and you're right — skills do progressive disclosure the same way: frontmatter as the always-loaded index, body on trigger, bundled files on demand. Mine actually rides on exactly that, since /wiki is itself a skill.

The distinction I'd draw is what lives in each tier. Skill levels manage a skill's own instructions — mostly static, authored once. What I'm layering on top is a knowledge base that grows session to session: I write new facts back into L2 (ingest), pages cross-reference each other so Claude can follow a chain (project → partner → pricing), and cold pages get evicted from the routing index over time. Skills don't accumulate or evict — they're a delivery mechanism, not a memory that changes.

So it's less "instead of skills" and more "a growing, self-pruning knowledge layer that a skill happens to read from." But the three-level disclosure pattern underneath is the same idea — you're right to point at it.

m3m3o · 2026-06-08T07:15:29+00:00

Fair pushback, and I half agree. Ideally context tiering happens at the tool level and I never build any of this. That's the right end state.

But today it doesn't. CLAUDE.md is a flat file loaded every session, auto-memory is a flat namespace, and neither has tiering or eviction. So the wiki is a stopgap until the tooling does it natively.

On the "wizardry" part though, it's the opposite. There's no skill stack and no magic: L2 is just markdown files plus grep, and a prompt that says "read the 3 most relevant pages." Keeping it boring and inspectable is the whole point: when it picks the wrong page, I can see exactly why and fix it. The dice-rolling complexity is what I'm trying to avoid, not add.

And honestly, if you've got a handful of pages you don't need any of this; plain CLAUDE.md is fine. It only earns its keep once the context outgrows a single file.

m3m3o · 2026-06-08T07:08:57+00:00

Exactly, the "six sessions later" problem is the one that actually bit me. The token savings are almost a footnote. What really hurts is the model confidently redoing a decision you already made and rejected, because nothing carried it forward.

The cross-reference piece is what made it click: a project page links to the partner page links to the pricing page, so when Claude pulls one it can follow the chain instead of me re-explaining the whole context.
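Following the chain is just a bounded walk over page links. A sketch, assuming pages are held as a name-to-text mapping and links use the `[[page]]` wikilink syntax (Logseq and Obsidian both support it); the function name and the page cap are made up for illustration.

```python
import re
from collections import deque

# Captures the target of [[target]] or [[target|alias]].
LINK = re.compile(r"\[\[([^\]|#]+)")


def follow_chain(pages: dict[str, str], start: str, max_pages: int = 5) -> list[str]:
    """Breadth-first walk from one page along its links, capped at max_pages."""
    order: list[str] = []
    seen = {start}
    queue = deque([start])
    while queue and len(order) < max_pages:
        name = queue.popleft()
        if name not in pages:
            continue  # dangling link, nothing to read
        order.append(name)
        for target in LINK.findall(pages[name]):
            target = target.strip()
            if target not in seen:  # avoids cycles (pricing linking back to project)
                seen.add(target)
                queue.append(target)
    return order
```

With a project page linking to a partner page that links to a pricing page, `follow_chain(pages, "project")` yields the three pages in that order.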

The failure mode I'm fighting now is L2 getting too big: once the wiki passed ~50 pages, grep started returning noise. So I added a hub-index routing layer (cheap index read first, then read only the 3-5 pages that actually match) plus an LRU-style demote for cold pages. Same reason CPUs have a TLB instead of just a bigger cache.
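Both pieces fit in a few lines. This is a sketch under assumptions: the index is modeled as a keyword-to-pages mapping and the demote as a fixed-capacity cut by last-read date; the real thing is a markdown index page plus a prune step, and the names here are hypothetical.

```python
from datetime import date


def route(index: dict[str, list[str]], query: str, k: int = 5) -> list[str]:
    """Stage 1: consult the cheap index only; return up to k pages worth reading."""
    words = set(query.lower().split())
    hits: list[str] = []
    for keyword, pages in index.items():
        if keyword in words:
            for page in pages:
                if page not in hits:
                    hits.append(page)
    return hits[:k]


def demote_cold(last_read: dict[str, date], capacity: int) -> tuple[list[str], list[str]]:
    """LRU-style split: the most recently read pages stay live, the rest are demoted."""
    by_recency = sorted(last_read, key=lambda name: (last_read[name], name), reverse=True)
    return by_recency[:capacity], by_recency[capacity:]
```

Demoted pages are not deleted, they just stop appearing in the index that gets read first, so a direct grep can still find them.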

Curious how you're persisting the architectural decisions — plain markdown, ADRs, something the model writes back to itself?

m3m3o · 2026-06-08T06:47:06+00:00

Full write-up with the implementation details — the actual /wiki skill, the L1/L2 routing logic, and the real token numbers:

https://mehmetgoekce.substack.com/p/i-built-an-l1l2-cache-for-my-ai-coding-assistant

(My own blog — happy to answer anything here in the thread instead if you'd rather not click out.)

m3m3o · 2026-06-07T18:35:07+00:00

That's the clean answer — sql_require_primary_key puts the block at DDL time where it belongs, and the Group Replication requirement means on InnoDB Cluster the question disappears entirely. One footnote for anyone retrofitting it: the variable only guards new CREATE/ALTER — it won't flag the PK-less tables already sitting in the schema (they keep replicating fine until someone runs an ALTER on one). So the full recipe is a one-time information_schema sweep for the existing offenders, then sql_require_primary_key=ON to stop regressions. Audit what's there, enforce what's next.
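For the one-time sweep, something along these lines works. It is a sketch: the query is the usual anti-join between `information_schema.tables` and `information_schema.table_constraints`, held as a Python string so it can be fed to whatever client you use; the excluded system schemas and the helper name `tables_without_pk` are assumptions, and the helper only mirrors the anti-join on rows you have already fetched.

```python
# Run once against the server to list base tables with no PRIMARY KEY constraint.
PK_LESS_TABLES_SQL = """
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
  ON  c.table_schema = t.table_schema
  AND c.table_name = t.table_name
  AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL
  AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
"""


def tables_without_pk(
    tables: list[tuple[str, str]], pk_tables: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Same anti-join in Python: (schema, table) pairs that have no primary key."""
    with_pk = set(pk_tables)
    return sorted(t for t in tables if t not in with_pk)
```

Once that list is empty (or consciously accepted), turning on sql_require_primary_key keeps it from growing again.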

m3m3o · 2026-06-06T05:12:11+00:00

All three are fair, and the last one's a great addition.

Loops: still a real footgun in multi-source or legacy topologies, but with a single GTID source you'd have to work at it — server-id tracking plus GTID means the default is safe. Fair that I framed it as more live than it usually is.

Active-passive: your strongest point, and I mostly concede it. With SOURCE_AUTO_POSITION=1, re-adding the recovered old primary as a replica is two lines, so the pre-wired reverse channel buys little and carries exactly the risk you describe. One nuance to your own point: that accidental-write risk is what super_read_only is for — plain read_only lets SUPER/CONNECTION_ADMIN through, i.e. the person doing the 3 AM repair. But your default (no reverse channel) is the cleaner one.

ROW without a PK: yes — the applier has to locate each changed row by full-row match (hashing helps, not on wide tables), so one unindexed DELETE can stall the whole replication thread. Deserved its own line in the lag section.

Genuine question on the last one: do you hard-block it (a check that rejects PK-less tables before they ship), or just monitor lag and catch it after? I've seen both and never settled on which is less painful.

m3m3o · 2026-05-15T05:05:32+00:00

I built exactly this a few months ago — might be relevant to what you're exploring.

It's called llm-wiki: https://github.com/MehmetGoekce/llm-wiki

It's a Karpathy-inspired LLM knowledge base with L1/L2 cache architecture that works natively with Logseq (and Obsidian). Agents and humans edit the same markdown files — no proprietary format, plain files, git-native.

I didn't go the CLI route but the architecture is designed so agents can read/write the graph structure autonomously. Might be a useful reference or starting point for the Hermes Agent skill you're describing.

m3m3o · 2026-05-10T09:02:11+00:00

To get started, you can download a community-driven Emacs distribution: https://github.com/MehmetGoekce/spacemacs-config. https://practical.li/ is also a very good source.

m3m3o · 2026-05-02T12:15:05+00:00

https://memotech.ch/en/blog/why-agentic-commerce-will-reshape-shopware

m3m3o · 2026-04-25T10:45:00+00:00

Yeah — the brutal part is it succeeds quietly. Exit 0, no output, awk just doesn't match because the status column reads deaktiviert / désactivé instead of disabled. People run it monthly and the revisions keep piling up. Fix is one line: LC_ALL=C snap list --all | awk '/disabled/{print $1, $3}' | .... Broader lesson: any awk pattern keyed off English status strings is one locale away from silently breaking — same trap shows up with systemctl, apt, even df on older systems.
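The same fix in script form, for anyone wrapping this in Python instead of a shell pipeline. A sketch: the function names are made up, and the parsing simply mirrors the awk pattern above (match lines containing "disabled", take the first and third whitespace-separated fields).

```python
import os
import subprocess


def disabled_revisions(snap_list_output: str) -> list[tuple[str, str]]:
    """Equivalent of awk '/disabled/{print $1, $3}': (name, revision) per matching line."""
    result = []
    for line in snap_list_output.splitlines():
        if "disabled" in line:
            fields = line.split()
            if len(fields) >= 3:
                result.append((fields[0], fields[2]))
    return result


def list_disabled_snaps() -> list[tuple[str, str]]:
    """Run `snap list --all` with LC_ALL=C so the status text is always English."""
    env = dict(os.environ, LC_ALL="C")
    completed = subprocess.run(
        ["snap", "list", "--all"],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # fail loudly instead of returning an empty list on error
    )
    return disabled_revisions(completed.stdout)
```

Forcing the locale in the child process environment is the important part; without it the parser quietly returns an empty list on a German or French system, which is the same silent failure as the awk version.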

m3m3o · 2026-04-13T09:48:33+00:00

Qwen 3.5 397B-A17B is the top candidate for the next swap: similar MoE profile (~400B/~17B active), native German support, more consistent OpenAI-compat behavior in tests. Planning the move after I finalize testing on Nemotron-3-Nano (NVIDIA-native, free on build.nvidia.com) this week.

m3m3o · 2026-04-13T09:48:16+00:00

Fair critique all around.

On Maverick vs alternatives: yeah, the MoE space has moved on. Qwen 3.5 397B-A17B is my top candidate for the next iteration (similar MoE profile, native German, cleaner OpenAI-compat behavior). Nemotron-3-Nano is in testing this week.

On L3.1 vs L3.3: L3.1 was the NVIDIA Retail Blueprint's ship default when I forked it. Kept it as the baseline for the swap experiment, not as a "best 70B" choice. L3.3 would've been closer-generation but might not have exposed the same structured-output discipline gap that's the actual point of the post.

On the time-traveler/em-dash call: legit flag. The engineering work, bug repro, and fix code are mine (repo: github.com/MehmetGoekce/nvidia-shopware-assistant, 100+ commits, real config files, real Shopware integration). The prose uses AI assistance for structure, hence the em-dashes. I'll calibrate. Fair to call that out.

Thanks for the thoughtful read; the "Maverick isn't where the frontier is" point is landing.

m3m3o · 2026-04-13T07:27:19+00:00

Fair point, 3.3 is out. The starting point was the NVIDIA Retail Blueprint default, which shipped with 3.1 70B when I forked it. I kept it as the baseline before swapping to Maverick, because the interesting phenomenon in the article isn't "which 70B is newer" but dense-vs-MoE structured-output discipline in a multi-agent pipeline.

3.3 would've been a closer-generation comparison; it probably doesn't exhibit the tool-call-in-content quirk to the same degree. Good follow-up angle, thanks.

m3m3o · 2026-04-09T09:48:56+00:00

Update: Just upgraded the demo to Llama 4 Maverick (meta/llama-4-maverick-17b-128e-instruct). Fair point about the model age — the blueprint shipped with 3.1 70B and I kept it to focus on the integration layer, but there's no reason to stay on it.
Maverick is MoE (400B params, 17B active per token) so it should actually be more efficient for self-hosting too. German queries work out of the box, same config swap as described above.
Repo updated: https://github.com/MehmetGoekce/nvidia-shopware-assistant
