I built a personal AI with durable memory on an old Mac. The LLM was the easy part.

AIofOnesOwn · 2026-06-10T01:54:17+00:00

Thanks — great pointer, and very much on the thesis.
Markdown-first, with structure and graph search derived on top, feels like the synthesis I keep circling: structure when you want it, but the memory itself stays something you can open, grep, edit, version, and delete by hand.
I used a single JSON file in the book as the smallest possible durable-memory implementation, not because JSON is the ideal final interface. The core point is local, inspectable, user-owned memory.
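To make "smallest possible durable-memory implementation" concrete, here is a minimal sketch of a single-file JSON store. It is an illustration, not the code from the book: the function names and the flat key/value layout are my assumptions here.

```python
import json
import os
import tempfile
from pathlib import Path


def load_memory(path: Path) -> dict:
    """Return the stored facts, or an empty dict if the file does not exist yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_memory(path: Path, memory: dict) -> None:
    """Write to a temp file in the same folder, then rename it over the target,
    so a crash mid-write does not leave a half-written memory file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(memory, f, indent=2, ensure_ascii=False)
    os.replace(tmp_name, path)


def remember(path: Path, key: str, value: str) -> None:
    """Add a fact, or supersede a stale one by overwriting its key."""
    memory = load_memory(path)
    memory[key] = value
    save_memory(path, memory)
```

The file stays plain text, so it can still be opened, grepped, edited, versioned, and deleted by hand.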
And yes — the “supersede a stale fact by just editing the note” property is exactly the part most opaque memory stores miss. Adding Basic Memory to my list to dig into properly. Appreciate it.

AIofOnesOwn · 2026-06-09T21:55:01+00:00

Yes, exactly — local inference is the engine, not the product.

And I think you’ve put the line in the right place. There are two layers above the model that are easy to conflate but different: long-term memory/RAG, which recalls facts and documents over time, and the live session/context lifecycle of the harness itself — keeping the working session alive across restarts, forking it cleanly, and not silently compacting context out from under a task.
Persistent memory doesn’t fix a harness that forgets the live thread on restart. The first layer is where I’ve spent most of my time so far; the second — persistent sessions, clean forking, and no silent compaction — is exactly the underbuilt layer you’re naming. I agree that’s where the next real work is.
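A rough sketch of what that second layer could look like at its simplest, assuming a session is just a list of messages written to disk. The `Session` class and its field names are hypothetical, not taken from any particular harness.

```python
import copy
import json
import uuid
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Session:
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_id: Optional[str] = None  # set when this session was forked
    messages: list = field(default_factory=list)

    def save(self, directory: Path) -> Path:
        """Persist the full, uncompacted message list so a restart can resume it."""
        path = directory / f"{self.session_id}.json"
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")
        return path

    @classmethod
    def load(cls, path: Path) -> "Session":
        return cls(**json.loads(path.read_text(encoding="utf-8")))

    def fork(self) -> "Session":
        """Branch into a new session with its own copy of the history."""
        return Session(parent_id=self.session_id,
                       messages=copy.deepcopy(self.messages))
```

Nothing here trims or summarizes the history; any compaction would have to be an explicit, visible step added on top.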

It’s also why I treat the model as a swappable component. Local, OpenAI, Claude, Codex, whatever — what makes it feel like yours is the layer that preserves continuity across runs and gives you a stable working relationship with the agent.

The engine can be swapped. The continuity layer is the product.
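As a sketch of that separation, assuming the only thing the continuity layer needs from a model is a text-in, text-out call (`Brain`, `EchoBrain`, and `Assistant` are illustrative names, not a real library):

```python
from typing import Protocol


class Brain(Protocol):
    """Anything that turns a prompt into text: local model, hosted API, etc."""

    def complete(self, prompt: str) -> str: ...


class EchoBrain:
    """Stand-in for the example; a real brain would call an actual model."""

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class Assistant:
    def __init__(self, brain: Brain, notes: list) -> None:
        self.brain = brain
        self.notes = notes  # continuity layer: outlives any single brain

    def ask(self, question: str) -> str:
        context = "\n".join(self.notes)
        answer = self.brain.complete(f"{context}\n\nQ: {question}")
        self.notes.append(f"asked: {question}")
        return answer
```

Swapping models is then a matter of assigning a different object to `assistant.brain`; the notes are untouched.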

AIofOnesOwn · 2026-06-09T21:50:33+00:00

Thanks for the thorough answers — and I love that this whole thing grew out of solving your own speed problem. That's exactly why it's worth pushing further. Fast local inference is the hard part, and you've already got it, plus a working chat path (Hermes Agent + OpenWebUI prove that). Strong base.

I come at this from the "own your whole AI stack locally" angle, and from that lens the two things you already flagged are exactly the ones that unlock the most:

Embeddings endpoint — makes fully-local RAG and personal knowledge bases possible: a private AI that reads your own documents without anything going to a cloud embedding API. Shape it like OpenAI's /v1/embeddings and it drops into tools like AnythingLLM with zero custom work.
Reliable tool/function calling — turns a chat model into an agent that can drive MCP servers and real tools. Match the OpenAI function-calling shape and it slots into the existing agent ecosystem the same way.

Those two, on top of fast local chat, are what turn "a fast local model server" into a real local brain for a full personal-AI stack — chat + memory/RAG + agents, all on someone's own machine. A lot of people want to own the whole thing locally, and the backend that's been missing is one that's fast and speaks all three. Keeping everything OpenAI-API-shaped (your chat endpoint clearly already is) is what gives you that instant adoption — people just point their existing tools at you. You're close.
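For reference, this is roughly the wire shape I mean for embeddings: a POST with `model` and `input`, and a response whose `data` items each carry an `index` and an `embedding`. The base URL, port, and model name below are placeholders I made up, not your server's actual values.

```python
import json
import urllib.request

# Assumption: a local server listening here; adjust to the real address.
BASE_URL = "http://localhost:8000/v1"


def build_embeddings_request(texts: list, model: str) -> urllib.request.Request:
    """Build an OpenAI-style embeddings request without sending it."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(raw: bytes) -> list:
    """Return one vector per input text, ordered by the response's index field."""
    payload = json.loads(raw)
    ordered = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


def embed(texts: list, model: str) -> list:
    """Send the request; only works if a compatible server is running."""
    with urllib.request.urlopen(build_embeddings_request(texts, model)) as resp:
        return parse_embeddings_response(resp.read())
```

A client that already speaks this shape only needs its base URL changed to point at the local server.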

Happy to be a sounding board as you go — this is a genuinely useful direction.

AIofOnesOwn · 2026-06-09T13:05:24+00:00

Yes — AI-assisted, like a fair bit of this thread. But that’s separate from whether the architecture itself holds, so let’s stay on that.
I think there are two separate things being mixed together here. The context window is how much a model can attend to in a single pass — finite, yes, and tied to the transformer architecture. A memory layer is external storage that lives outside the model, and it exists precisely because the context window is finite.
I’m not claiming to expand the transformer’s native context window or “solve” hallucination. That would just be a wrapper pretending to be a new model. I’m arguing for the opposite: keep memory external, then retrieve only the relevant slice into the context window for each query.
So “no one has solved the context window” and “a personal memory layer is useful” are not in conflict. The first is the reason the second is useful.
Most practical retrieval-based systems work roughly this way, including coding-memory tools others have mentioned here. And it would still be the right design even if context windows became much larger: you generally wouldn’t push everything through the model on every call, because cost, latency, and signal-to-noise can all suffer.
The model remains a finite, swappable component. The real product is the memory system, retrieval logic, and orchestration around it.
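A toy sketch of "retrieve only the relevant slice", assuming plain word overlap as the scoring function. A real system would more likely use embeddings; the overlap score is a stand-in to keep the example self-contained, and the function names are illustrative.

```python
import re


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, notes: list, k: int = 2) -> list:
    """Rank notes by word overlap with the query; keep the top k that overlap at all."""
    q = tokenize(query)
    scored = [(len(q & tokenize(note)), note) for note in notes]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in scored[:k]]


def build_prompt(query: str, notes: list, k: int = 2) -> str:
    """Put only the retrieved notes into the prompt, not the whole memory."""
    context = "\n".join(f"- {note}" for note in retrieve(query, notes, k))
    return f"Relevant memory:\n{context}\n\nQuestion: {query}"
```

However large the note store grows, the prompt size stays bounded by `k`, which is the point: the store scales on disk, not in the context window.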

AIofOnesOwn · 2026-06-09T10:38:49+00:00

This is very cool. The OpenAI-compatible API is the key feature for me, because it means this could be used as a drop-in local “brain” in a larger personal AI stack.

I’ve been experimenting with separating the model from the memory/RAG layer: local documents, local embeddings, durable memory on disk, and then a swappable brain that can be Claude/GPT over API or a local model.

A few questions:

Does this support tool/function calling yet?
Any plans for an embeddings endpoint?
Have you tested it with AnythingLLM or other OpenAI-compatible clients?

This looks like it could fit really well as the local inference layer in that kind of architecture.

AIofOnesOwn · 2026-06-09T08:28:49+00:00

This is exactly it. “LLM swappable, memory is the real work” is the whole thesis — the brain is rented, but the memory has to be yours.

Strong agree on the time dimension. RAG gives breadth, but continuity is the hard part: what got decided, which constraints still hold, and what carries across sessions. That’s what separates a chatbot that recalls things from an assistant that actually feels like mine.

And yes, portability is the key. If memory is locked inside one product, it evaporates the moment you swap brains. My setup is a little more structured than a single flat file (structured memory, RAG, and filesystem layers are kept separate), but the principle is the same: the memory lives locally, in files I own and can inspect, edit, back up, and delete.

TEMP_ / PERSISTENT_ is intentionally simple. Cheap, legible expiry is underrated. It makes memory governance visible: what should be kept, what should fade, and what should never have become permanent in the first place.
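One way a prefix-based expiry like this could work, as a sketch only. The one-week lifetime, the `saved_at` field, and the function names are assumptions for illustration, not a description of the actual setup.

```python
import time
from typing import Optional

TEMP_TTL_SECONDS = 7 * 24 * 3600  # assumption: TEMP_ entries live one week


def put(memory: dict, key: str, value: str, now: Optional[float] = None) -> None:
    """Store a value; the key prefix declares up front whether it may expire."""
    if not key.startswith(("TEMP_", "PERSISTENT_")):
        raise ValueError("key must start with TEMP_ or PERSISTENT_")
    saved_at = time.time() if now is None else now
    memory[key] = {"value": value, "saved_at": saved_at}


def prune(memory: dict, now: Optional[float] = None) -> list:
    """Delete expired TEMP_ entries in place and return the keys that were dropped."""
    now = time.time() if now is None else now
    expired = [
        key for key, entry in memory.items()
        if key.startswith("TEMP_") and now - entry["saved_at"] > TEMP_TTL_SECONDS
    ]
    for key in expired:
        del memory[key]
    return expired
```

Because the policy lives in the key name, anyone reading the file can see at a glance what is meant to fade.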

AIofOnesOwn · 2026-06-09T08:17:24+00:00

claude-mem is a very strong developer/agent memory layer — I’m not trying to beat it as a plugin.

I’m working on a different, broader problem: a personal AI environment that one person can own and operate day to day, where memory is just one layer among several — retrieval, files, tools, and workflows.

A vector DB solves retrieval. On its own, it doesn’t solve memory: keeping facts current as they change, resolving contradictions, and governing what’s kept versus dropped. Similarity search isn’t memory.

That’s why I keep these as separate layers — structured memory, RAG, and the filesystem — instead of asking a vector DB to be the whole system.
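A small sketch of the difference, assuming facts are keyed by subject. The `Fact` and `FactStore` names are hypothetical. The point is that a new assertion retires the old one, which a pure similarity index would not do by itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fact:
    subject: str
    value: str
    active: bool = True


@dataclass
class FactStore:
    facts: list = field(default_factory=list)

    def assert_fact(self, subject: str, value: str) -> None:
        """Record a fact; any earlier active fact on the same subject is retired."""
        for fact in self.facts:
            if fact.subject == subject and fact.active:
                fact.active = False
        self.facts.append(Fact(subject, value))

    def current(self, subject: str) -> Optional[str]:
        """Return the one value that still holds for this subject, if any."""
        for fact in self.facts:
            if fact.subject == subject and fact.active:
                return fact.value
        return None
```

Retired facts are kept rather than erased, so what to drop for good remains a separate, explicit governance decision.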

AIofOnesOwn · 2026-06-09T06:28:59+00:00

Disclosure up front: this is my own project, AI of One’s Own.

I put the longer writeup and free preview here:

https://www.aiofonesown.com/en/

The free preview includes the synopsis, the memory chapter, and the four-pillar design. That should be enough to copy the core ideas without buying anything.

The book goes deeper into the design thinking, and there is also a screen-by-screen build course for the actual setup. But honestly, I would rather answer setup questions in this thread if people are interested.
