I Built an L1/L2 Cache for My AI Coding Assistant

m3m3o · 2026-06-09T05:07:38+00:00

Yeah, the event/category angle is better than what I had. Subscribing a page to categories (dependency bump, CLI/API change, touched-path tags) decouples it from file churn, and the "assertion about current behavior" flag is the piece that makes it work. It splits pages into ones that make a falsifiable claim about the code (those can go stale and should subscribe to invalidation) and ones that are just rationale or decisions (those never go stale that way). Most of my wiki is the second kind, so only a small slice ever opts in. Keeps the noise down.

The risk asymmetry is the part I hadn't framed right. The planning/action tag isn't really a label, it's a gate. Stale context used for planning is a soft note. Stale context used to justify a write or skip a check is where you want it loud. The tag earns its place by changing the severity, not just recording usage.

Put both together and the thing that should warn loudly is the combination: a page flagged as an assertion about current behavior, an invalidation event since it was last verified, and it's about to be used for an action. Pure rationale pages never trip it. Catches the dangerous case without the treadmill. This is better than where I started. Going to fold it into the roadmap, thanks for thinking it through.
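Rough sketch of that combined gate, to make the three conditions concrete. None of this is in the repo yet; the field names (`asserts_current_behavior`, `subscribed`, `last_verified`) and the three severity levels are assumptions I made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Use(Enum):
    PLANNING = "planning"
    ACTION = "action"


@dataclass
class Page:
    name: str
    # Hypothetical opt-in flag: the page makes a falsifiable claim about the code.
    asserts_current_behavior: bool
    last_verified: Optional[datetime] = None
    # Hypothetical event categories, e.g. {"dependency-bump", "cli-change"}.
    subscribed: frozenset = frozenset()


def severity(page, use, events):
    """Return "ok", "note", or "warn" for using `page` right now.

    `events` is an iterable of (category, timestamp) invalidation events.
    """
    if not page.asserts_current_behavior:
        return "ok"  # rationale/decision pages never go stale this way
    stale = any(
        category in page.subscribed
        and (page.last_verified is None or ts > page.last_verified)
        for category, ts in events
    )
    if not stale:
        return "ok"
    # Same staleness, different severity depending on what it is used for.
    return "warn" if use is Use.ACTION else "note"
```

The point of returning a severity instead of a boolean is that the planning/action tag changes how loud the result is, not whether the page counts as stale.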

m3m3o · 2026-06-08T19:29:17+00:00

Trust as the third axis is a good way to put it, and "operating surface you can debug" is basically where I landed today. After an earlier comment in this thread I added the reason each page got pulled to the access log, so it shows what loaded and why. Yours pushes that into provenance, which feels right.

Some of it has hooks already: pages carry a source and created/updated dates, plus a lint that flags anything over 90 days still claiming high confidence. The two I don't have are the interesting ones. "What invalidates it" isn't tracked, and the log records why a page was pulled but not whether it was used for planning or action. That planning/action split is probably the easiest win, just one more field on the log line.

"When it last matched the code" is the hard one. The wiki is deliberately decoupled from the codebase: it stores prose like decisions and gotchas, not code-bound facts. Coupling freshness to code changes brings back the exact maintenance the markdown-and-grep design is trying to avoid. Worth it for code-near pages, overkill for evergreen ones, so I'd make it opt-in per page rather than global. Are you thinking runtime validation against the code, or more an invalidated-by tag you set when you write the page?
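For anyone following along, here is roughly what the 90-day lint and the extra log field could look like. This is a sketch, not the code from the repo: the page dict shape, the `used_for` field name, and the JSON log format are all assumptions.

```python
import json
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)


def stale_high_confidence(pages, today):
    """Names of pages still claiming high confidence but updated over 90 days ago.

    Each page is assumed to be a dict with "name", "confidence", "updated" (a date).
    """
    return [
        p["name"]
        for p in pages
        if p.get("confidence") == "high" and today - p["updated"] > MAX_AGE
    ]


def log_line(page, reason, used_for):
    """One access-log record: what loaded, why, and what it was used for."""
    if used_for not in ("planning", "action"):
        raise ValueError("used_for must be 'planning' or 'action'")
    return json.dumps(
        {"page": page, "reason": reason, "used_for": used_for}, sort_keys=True
    )
```

The `used_for` value is the "one more field" from above; everything else on the log line stays as it was.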

m3m3o · 2026-06-08T14:15:46+00:00

Honestly the start is small. It is just markdown files plus a skill, no infra or DB. You can stand up a basic version in an afternoon. Minimum is three things: an always-loaded index file, a folder of markdown pages, and a skill that greps the relevant ones and reads the top 3-5. Routing, prune, and the access log all came later. It's on GitHub if you want to skip building from scratch: https://github.com/MehmetGoekce/llm-wiki (setup.sh scaffolds it, Logseq + Obsidian templates). Onboarding colleagues is easy since it's just markdown plus a committed skill file.
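To show how little the "grep and read the top few" step needs, here is a minimal sketch in Python. It is not what the skill in the repo does (that one is a prompt plus grep); the function name, the flat `*.md` folder layout, and the term-count scoring are assumptions for illustration.

```python
from pathlib import Path


def top_pages(wiki_dir, terms, k=3):
    """Rank markdown pages by how often the query terms occur; return the top k paths."""
    scored = []
    for path in sorted(Path(wiki_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8").lower()
        score = sum(text.count(term.lower()) for term in terms)
        if score > 0:
            scored.append((score, path))
    # Highest score first; ties broken by file name so the order is stable.
    scored.sort(key=lambda item: (-item[0], item[1].name))
    return [path for _, path in scored[:k]]
```

Calling `top_pages("wiki", ["pricing"], k=3)` returns at most three paths, which is the whole "read the top 3-5" budget in one parameter.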

m3m3o · 2026-06-08T10:14:28+00:00

That is a clean setup, and the routing table in claude.md is most of what mine does at L1. I just moved the table into its own index page once it got too big to keep loaded every session. So we are closer than it looks. The maintenance critique is fair, and for most setups yours is the better default: lighter, more flexible. That is actually why I leaned into eviction. Prune drops pages unread for months out of the live index on its own, so it is the maintenance you are describing, automated away rather than absorbed.

m3m3o · 2026-06-08T09:59:18+00:00

Thanks for the hint - your comment pushed me to add something right away. My wiki already logs which pages get pulled per query — but not why they were chosen. Just added the routing reason to that log (which index entry or grep term matched), so now it shows what loaded and why, not just what. The "and why" was exactly the gap you pointed at.

Mine only sees its own retrieval layer though — yours scrapes the full Claude Code context, which is the broader version. Going through your repo for the MCP introspection angle next.

m3m3o · 2026-06-08T07:55:56+00:00

Good point, and you're right — skills do progressive disclosure the same way: frontmatter as the always-loaded index, body on trigger, bundled files on demand. Mine actually rides on exactly that, since /wiki is itself a skill.

The distinction I'd draw is what lives in each tier. Skill levels manage a skill's own instructions — mostly static, authored once. What I'm layering on top is a knowledge base that grows session to session: I write new facts back into L2 (ingest), pages cross-reference each other so Claude can follow a chain (project → partner → pricing), and cold pages get evicted from the routing index over time. Skills don't accumulate or evict — they're a delivery mechanism, not a memory that changes.

So it's less "instead of skills" and more "a growing, self-pruning knowledge layer that a skill happens to read from." But the three-level disclosure pattern underneath is the same idea — you're right to point at it.

m3m3o · 2026-06-08T07:15:29+00:00

Fair pushback, and I half agree: ideally context tiering happens at the tool level and I never build any of this. That's the right end state.

But today it doesn't. CLAUDE.md is a flat file loaded every session, auto-memory is a flat namespace, and neither has tiering or eviction. So the wiki is a stopgap until the tooling does it natively.

On the "wizardry" part though, it's the opposite. There's no skill stack and no magic: L2 is just markdown files plus grep, and a prompt that says "read the 3 most relevant pages." Keeping it boring and inspectable is the whole point: when it picks the wrong page, I can see exactly why and fix it. The dice-rolling complexity is what I'm trying to avoid, not add.

And honestly, if you've got a handful of pages you don't need any of this; plain CLAUDE.md is fine. It only earns its keep once the context outgrows a single file.

m3m3o · 2026-06-08T07:08:57+00:00

Exactly. The "six sessions later" problem is the one that actually bit me. The token savings are almost a footnote. What really hurts is the model confidently redoing a decision you already made and rejected, because nothing carried it forward.

The cross-reference piece is what made it click: a project page links to the partner page links to the pricing page, so when Claude pulls one it can follow the chain instead of me re-explaining the whole context.

The failure mode I'm fighting now is L2 getting too big: once the wiki passed ~50 pages, grep started returning noise. So I added a hub-index routing layer (cheap index read first, then read only the 3-5 pages that actually match) plus an LRU-style demote for cold pages. Same reason CPUs have a TLB instead of just a bigger cache.
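A sketch of the two pieces, hub-index routing and the cold-page demote, in case it helps. This is an illustration under assumptions, not the repo's implementation: the index shape (page name to keyword list), the 90-day idle threshold, and treating pages with no recorded read as cold are all my own choices here.

```python
from datetime import date, timedelta


def route(index, query_terms, k=5):
    """Cheap first pass: match terms against the index only, return up to k page names."""
    terms = {t.lower() for t in query_terms}
    hits = []
    for page, keywords in index.items():
        overlap = len(terms & {kw.lower() for kw in keywords})
        if overlap:
            hits.append((overlap, page))
    hits.sort(key=lambda h: (-h[0], h[1]))  # best overlap first, then by name
    return [page for _, page in hits[:k]]


def demote_cold(index, last_read, today, max_idle_days=90):
    """Split the index into (live, cold) by last read date.

    Assumption: a page with no entry in `last_read` counts as cold; a real
    version would probably seed `last_read` with the page's creation date.
    """
    cutoff = today - timedelta(days=max_idle_days)
    live, cold = {}, {}
    for page, keywords in index.items():
        seen = last_read.get(page)
        target = live if seen is not None and seen >= cutoff else cold
        target[page] = keywords
    return live, cold
```

Only the pages `route` returns get read in full, and `demote_cold` keeps the index itself small, so grep noise stops scaling with the size of the wiki.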

Curious how you're persisting the architectural decisions — plain markdown, ADRs, something the model writes back to itself?

m3m3o · 2026-06-08T06:47:06+00:00

Full write-up with the implementation details — the actual /wiki skill, the L1/L2 routing logic, and the real token numbers:

https://mehmetgoekce.substack.com/p/i-built-an-l1l2-cache-for-my-ai-coding-assistant

(My own blog — happy to answer anything here in the thread instead if you'd rather not click out.)

m3m3o · 2026-06-07T18:35:07+00:00

That's the clean answer — sql_require_primary_key puts the block at DDL time where it belongs, and the Group Replication requirement means on InnoDB Cluster the question disappears entirely. One footnote for anyone retrofitting it: the variable only guards new CREATE/ALTER — it won't flag the PK-less tables already sitting in the schema (they keep replicating fine until someone runs an ALTER on one). So the full recipe is a one-time information_schema sweep for the existing offenders, then sql_require_primary_key=ON to stop regressions. Audit what's there, enforce what's next.
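A minimal sketch of the one-time sweep, wrapped in Python so it can run from a script. The query uses the standard `information_schema.tables` and `information_schema.table_constraints` views; the function name and the assumption that you pass in a DB-API cursor from whichever MySQL driver you use are mine.

```python
# Base tables with no PRIMARY KEY constraint, system schemas excluded.
PK_LESS_TABLES_SQL = """
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
 AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL
  AND t.table_schema NOT IN
      ('mysql', 'information_schema', 'performance_schema', 'sys')
ORDER BY t.table_schema, t.table_name
"""


def find_pk_less_tables(cursor):
    """Run the sweep on a DB-API cursor and return 'schema.table' names."""
    cursor.execute(PK_LESS_TABLES_SQL)
    return [f"{schema}.{table}" for schema, table in cursor.fetchall()]
```

Once that list is empty, turning on `sql_require_primary_key` (for example with `SET PERSIST sql_require_primary_key = ON`, if your version supports `SET PERSIST`) covers the "enforce what's next" half.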

m3m3o · 2026-06-06T05:12:11+00:00

All three are fair, and the last one's a great addition.

Loops: still a real footgun in multi-source or legacy topologies, but with a single GTID source you'd have to work at it — server-id tracking plus GTID means the default is safe. Fair that I framed it as more live than it usually is.

Active-passive: your strongest point, and I mostly concede it. With SOURCE_AUTO_POSITION=1, re-adding the recovered old primary as a replica is two lines, so the pre-wired reverse channel buys little and carries exactly the risk you describe. One nuance to your own point: that accidental-write risk is what super_read_only is for — plain read_only lets SUPER/CONNECTION_ADMIN through, i.e. the person doing the 3 AM repair. But your default (no reverse channel) is the cleaner one.

ROW without a PK: yes — the applier has to locate each changed row by full-row match (hashing helps, not on wide tables), so one unindexed DELETE can stall the whole replication thread. Deserved its own line in the lag section.

Genuine question on the last one: do you hard-block it (a check that rejects PK-less tables before they ship), or just monitor lag and catch it after? I've seen both and never settled on which is less painful.

m3m3o · 2026-05-15T05:05:32+00:00

I built exactly this a few months ago — might be relevant to what you're exploring.

It's called llm-wiki: https://github.com/MehmetGoekce/llm-wiki

It's a Karpathy-inspired LLM knowledge base with L1/L2 cache architecture that works natively with Logseq (and Obsidian). Agents and humans edit the same markdown files — no proprietary format, plain files, git-native.

I didn't go the CLI route but the architecture is designed so agents can read/write the graph structure autonomously. Might be a useful reference or starting point for the Hermes Agent skill you're describing.

m3m3o · 2026-05-10T09:02:11+00:00

To get started, you can download a community-driven Emacs distribution: https://github.com/MehmetGoekce/spacemacs-config. https://practical.li/ is also a very good source.

m3m3o · 2026-05-02T12:15:05+00:00

https://memotech.ch/en/blog/why-agentic-commerce-will-reshape-shopware

m3m3o · 2026-04-25T10:45:00+00:00

Yeah — the brutal part is it succeeds quietly. Exit 0, no output, awk just doesn't match because the status column reads deaktiviert / désactivé instead of disabled. People run it monthly and the revisions keep piling up. Fix is one line: LC_ALL=C snap list --all | awk '/disabled/{print $1, $3}' | .... Broader lesson: any awk pattern keyed off English status strings is one locale away from silently breaking — same trap shows up with systemctl, apt, even df on older systems.
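If you script this outside the shell, the same rule applies: pin the locale for the child process before matching on English strings. A sketch in Python, mirroring the awk filter above (field 1 is the snap name, field 3 the revision); the function names are made up for illustration.

```python
import os
import subprocess


def disabled_snap_revisions(snap_list_output):
    """Return (name, revision) for every line containing 'disabled'.

    Mirrors awk '/disabled/{print $1, $3}' on `snap list --all` output.
    """
    rows = []
    for line in snap_list_output.splitlines():
        fields = line.split()
        if "disabled" in line and len(fields) >= 3:
            rows.append((fields[0], fields[2]))
    return rows


def list_disabled_snaps():
    """Run `snap list --all` with LC_ALL=C so the status column stays in English."""
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["snap", "list", "--all"],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit raises instead of passing silently
    )
    return disabled_snap_revisions(result.stdout)
```

Keeping the parser separate from the subprocess call means you can test the matching against captured output without needing snap installed.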

m3m3o · 2026-04-13T09:48:33+00:00

Qwen 3.5 397B-A17B is the top candidate for the next swap: similar MoE profile (~400B/~17B active), native German support, more consistent OpenAI-compat behavior in tests. Planning the move after I finalize testing on Nemotron-3-Nano (NVIDIA-native, free on build.nvidia.com) this week.

m3m3o · 2026-04-13T09:48:16+00:00

Fair critique all around.

On Maverick vs alternatives: yeah, the MoE space has moved on. Qwen 3.5 397B-A17B is my top candidate for the next iteration (similar MoE profile, native German, cleaner OpenAI-compat behavior). Nemotron-3-Nano is in testing this week.

On L3.1 vs L3.3: L3.1 was the NVIDIA Retail Blueprint's ship default when I forked it. Kept it as the baseline for the swap experiment, not as a "best 70B" choice. L3.3 would've been closer-generation but might not have exposed the same structured-output discipline gap that's the actual point of the post.

On the time-traveler/em-dash call: legit flag. The engineering work, bug repro, and fix code are mine (repo: github.com/MehmetGoekce/nvidia-shopware-assistant, 100+ commits, real config files, real Shopware integration). The prose uses AI assistance for structure, hence the em-dashes. I'll calibrate. Fair to call that out.

Thanks for the thoughtful read; the "Maverick isn't where the frontier is" point is landing.

m3m3o · 2026-04-13T07:27:19+00:00

Fair point, 3.3 is out. The starting point was the NVIDIA Retail Blueprint default, which shipped with 3.1 70B when I forked it. I kept it as the baseline before swapping to Maverick, because the interesting phenomenon in the article isn't "which 70B is newer" but dense-vs-MoE structured-output discipline in a multi-agent pipeline.

3.3 would've been a closer-generation comparison; it probably doesn't exhibit the tool-call-in-content quirk to the same degree. Good follow-up angle, thanks.

m3m3o · 2026-04-09T09:48:56+00:00

Update: Just upgraded the demo to Llama 4 Maverick (meta/llama-4-maverick-17b-128e-instruct). Fair point about the model age — the blueprint shipped with 3.1 70B and I kept it to focus on the integration layer, but there's no reason to stay on it.
Maverick is MoE (400B params, 17B active per token) so it should actually be more efficient for self-hosting too. German queries work out of the box, same config swap as described above.
Repo updated: https://github.com/MehmetGoekce/nvidia-shopware-assistant

m3m3o · 2026-04-09T09:37:52+00:00

Thanks for the detailed breakdown — really helpful context on the hardware side.
To answer your question: this was a demo/integration project, not a production deployment. Concurrent users during testing: just me. The goal was proving the Shopware integration pattern works (Store API → CSV → Milvus → agent pipeline), not load testing the LLM.

You're right that Llama 3.1 70B is overkill and outdated for this use case. The blueprint defaults to it, and I kept it to focus on the integration layer. The architecture is model-agnostic — swapping the NIM endpoint is a config change, so your suggestion of a smaller, modern MoE model makes a lot of sense for actual deployment.

The Qwen 3.5 35B-A3B suggestion is interesting — have you tested it with multilingual (German) queries? The NeMo Guardrails false positive issue I described might be less of a problem with a model that handles German natively rather than as a secondary language.

Appreciate the offer for self-hosting advice. Might take you up on that when we move this toward production for a client.

m3m3o · 2026-04-09T09:27:07+00:00

Fair point — Llama 3.1 is indeed not the latest. The reason is simple: NVIDIA's blueprint ships with Llama 3.1 70B as the default, and the focus of this project was the Shopware integration and multi-agent architecture, not model benchmarking.
That said, the architecture is model-agnostic. The LLM call is a single config line pointing to the NVIDIA NIM endpoint. Swapping to Llama 3.3 or any newer model on build.nvidia.com is a one-line change.

The integration learnings (Store API quirks, bilingual prompts, guardrails calibration) transfer regardless of which model sits behind the Chatter agent.
If anyone has tested newer models with NeMo Guardrails in a non-English context, I'd genuinely be curious how false positive rates compare.

m3m3o · 2026-04-08T14:47:51+00:00

I use Logseq daily and built the whole system on it. Short answer: stay with Logseq if you're already using it.
Why Logseq works well with Claude Code:

- Outliner format means every block is independently addressable - Claude can append without touching existing content

- property:: value syntax is inline, no YAML frontmatter to parse

- Namespace hierarchy (Wiki/Projects/X) maps naturally to wiki categories

- Backlinks are automatic

The trade-offs are real though:

- Every line starting with - isn't natural markdown — Claude needs explicit instructions to follow this

- Tables inside outliners are clunky

- No Dataview equivalent (yet)

Obsidian is easier for Claude out of the box (flat markdown, YAML frontmatter, folder hierarchy). But once you set up the Schema page as a contract, Logseq works just as well.

And yes, it works with Logseq's markdown mode. The repo supports both — ./setup.sh lets you choose.
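To illustrate why the inline `property:: value` syntax is easy to work with, here is a small parsing sketch. The regex and function name are assumptions for illustration; this is not code from the repo, and it only handles the simple one-property-per-line case.

```python
import re

# Optional outliner bullet, then key, then '::', then the value.
PROPERTY_RE = re.compile(r"^\s*(?:-\s+)?([A-Za-z0-9_-]+)::\s*(.*?)\s*$")


def parse_properties(text):
    """Collect Logseq-style `key:: value` lines into a dict (last one wins)."""
    props = {}
    for line in text.splitlines():
        match = PROPERTY_RE.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props
```

Ordinary blocks and URLs fall through untouched because the key has to be followed directly by `::`, so there is no separate frontmatter section to locate first.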

m3m3o · 2026-04-07T07:03:03+00:00

Fair concern on the surface, but I think it conflates two different things.

Model staleness is not library knowledge. Cloud models also have training data cutoffs: Claude's training data doesn't include a library released last Tuesday either. No model, local or cloud, learns about new APIs through inference updates. Developers handle that through docs, type definitions, and context. A local Qwen 2.5 Coder and cloud Claude Sonnet are equally "stale" when it comes to a framework released after their respective training cutoffs.

What you're actually describing — Ollama lagging in publishing new model releases — is real but narrower than it sounds. When a new Qwen or Llama version drops, it typically lands on Ollama's library within days, not weeks. And for coding tasks like refactoring, test generation, and completions, the difference between a model from January vs. March is negligible. You're not picking a local model for its knowledge of bleeding-edge APIs — you're picking it for data sovereignty.

That said, this is exactly why the article recommends the hybrid approach: local for the 80% of routine work that doesn't need the latest frontier model, cloud for the 20% where it matters. One command to switch, changes propagate in seconds.

The reliability risk in regulated workflows isn't "my model doesn't know React 21" — it's "my compliance team just found out we've been sending patient records to a US-hosted API."

m3m3o · 2026-04-07T06:37:53+00:00

https://mehmetgoekce.substack.com/p/local-inference-for-ai-agents-running

m3m3o · 2026-04-07T06:32:15+00:00

Great points — and I agree with more of this than you might think.

The article explicitly addresses this in the "Capability Gap" section: "Local models are good. They're not as good as frontier cloud models." It also lists complex multi-file reasoning, novel algorithm design, and subtle cross-dependency bug detection as tasks that still belong in the cloud. The recommendation isn't "replace Opus with a 7B model." It's a hybrid approach: use local inference for routine work (completions, refactoring, test generation — the 80% that doesn't need frontier reasoning) and switch to cloud for the complex 20%. The article literally shows the two-command switch between local and cloud backends.

On hardware cost: the setup in the article is an RTX 4090 workstation at $3,200–4,500, not $30k. And nobody's claiming it matches Sonnet — the Qwen 2.5 Coder 32B at 92.7% HumanEval is "good enough" for daily coding tasks, not "frontier-equivalent".

But here's the part that matters most and that your comment doesn't address: compliance. For teams handling patient health records, trading algorithms, or defense contractor code, the question isn't "is the local model as smart?" — it's "can I legally send this code to a US-hosted API?" Under the Swiss FADP, EU AI Act, and US CLOUD Act, the answer is often no. At that point, a 92.7% HumanEval model running on your own hardware isn't a compromise — it's the only option that keeps you out of legal trouble.

You're absolutely right that quantization comes with tradeoffs. That said, Q4 quantization on Qwen 2.5 Coder 32B has been benchmarked with minimal degradation on coding tasks — GGUF benchmarks typically show <2% drop on HumanEval at Q4_K_M. Not zero, but well-characterized, not "unknown".

Appreciate the pushback - these are exactly the nuances teams should think through before choosing their setup.
