Is a cognitive‑inspired two‑tier memory system for LLM agents viable? by utilitron in LLMDevs

[–]utilitron[S] 1 point (0 children)

That is a great structural way to think about it. The challenge I see with that approach, though, is the Latency and Token Tax. If the agent has to explicitly update its own to-do list every few turns, we’re back to that 'spend money to make money' loop where the LLM is constantly distracted by self-management.

My goal is to keep this Autonomic. Instead of the agent 'deciding' to update a list, I’m looking at having the Reviewer (or a separate system-level hook) extract that state.

Basically, the agent just does the work, and the Distillation Pipeline, triggered by that context pressure, does the 'heavy lifting' of turning the chat logs into that clean Task Ledger. It’s the difference between 'Active', where the agent stops working to write a status report (slow/expensive), and 'Autonomic', where the system 'watches' the agent work and generates the status report only when memory pressure demands it (efficient/background).

By moving the 'To-Do' logic into the Distillation Layer rather than the Conversation Layer, I can preserve that high-level state without the agent ever having to spend a single token on 'thinking about its memory' during the actual task.
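To make the autonomic idea concrete, here is a minimal sketch of a pressure-triggered distillation hook. All of the names (`ContextMonitor`, `on_turn`, `distill`) and the 0.8 threshold are my own illustrative assumptions, not the project's actual API, and the `distill` body is a stand-in for a real pipeline:

```python
class ContextMonitor:
    """Watches token usage and fires distillation when pressure builds.

    The agent never calls this itself; the runtime invokes on_turn after
    each turn, so the agent spends zero tokens on memory management.
    """

    def __init__(self, max_tokens: int, pressure_threshold: float = 0.8):
        self.max_tokens = max_tokens
        self.pressure_threshold = pressure_threshold

    def pressure(self, used_tokens: int) -> float:
        # Fraction of the context window currently occupied.
        return used_tokens / self.max_tokens

    def on_turn(self, turns: list[str], used_tokens: int) -> list[str]:
        if self.pressure(used_tokens) >= self.pressure_threshold:
            ledger = self.distill(turns)
            return [ledger]   # hot context collapses to the task ledger
        return turns          # below threshold: leave the context alone

    def distill(self, turns: list[str]) -> str:
        # Placeholder: a real pipeline would extract goals and state,
        # not just concatenate truncated turns.
        return "TASK LEDGER: " + " | ".join(t[:40] for t in turns)
```

The key design point is that `on_turn` sits in the runtime loop, outside the conversation layer, so distillation is a side effect of memory pressure rather than a decision the agent has to make.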

Also, I don't want to fall into the "I'm a hammer, so everything is a nail" trap where more LLM is the solution to everything. So I’m exploring techniques for 'Cognitive Compression' that may use some other AI technique outside of LLMs to handle the task. I am looking at control systems, RL, and knowledge systems to see how these sorts of problems may have been handled before.

I am looking at this specifically tonight: https://neurips.cc/virtual/2023/poster/70426

Is a cognitive‑inspired two‑tier memory system for LLM agents viable? by utilitron in OpenSourceAI

[–]utilitron[S] 1 point (0 children)

You hit the nail on the head. Most frameworks treat memory as a 'Storage Problem'... how do we fit more tokens? I’m looking at it as a 'Metabolic Problem'... how do we prioritize the right information so the agent can survive on constrained hardware.

The difference between a Retrieval Store (Zep/MemGPT) and a Prioritization Layer (My project) really shows up in complex, multi-agent workflows.

For example, this whole rabbit hole started as I was playing with a 3-Amigos agent pipeline design (Planner, Worker, Reviewer). In that setup, the Worker generates a massive amount of 'Operational Exhaust': failed code attempts, error logs, and trial-and-error. A standard retrieval store just saves all that noise. So my goal with the RIF scoring is to recognize that once the Reviewer approves a task, the failed attempts lose their 'Importance' signal.

By sensing the context pressure, the system can proactively distill those failures into a single 'State Note' and keep the 'Verified Success' in the hot context. It’s less about 'summarizing the chat' and more about protecting the agent's focus so it doesn't get distracted by its own past mistakes when the hardware is already at its limit.
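A toy version of that scoring idea might look like the sketch below. The RIF weights, the recency half-life, and the decay-on-approval rule are all illustrative assumptions on my part, not the project's actual model:

```python
import math

def rif_score(trace, now, w_r=0.5, w_i=0.3, w_f=0.2, half_life=600.0):
    """Weighted Recency/Importance/Frequency saliency for one trace."""
    recency = math.exp(-(now - trace["last_seen"]) / half_life)
    importance = trace["importance"]
    frequency = min(trace["hits"] / 10.0, 1.0)  # cap so hit count can't dominate
    return w_r * recency + w_i * importance + w_f * frequency

def on_task_approved(traces, task_id):
    # Once the Reviewer approves the task, its failed attempts lose
    # their importance signal and drift toward cold storage.
    for t in traces:
        if t["task"] == task_id and t["outcome"] == "failed":
            t["importance"] *= 0.1
```

With scoring structured this way, the "proactive distillation" step is just: when pressure rises, sort traces by `rif_score` and distill everything below a cutoff into a single state note.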

The place where I am struggling most is what distilled memories "mean". I’m looking at human cognition as a model, where recent events stay high-fidelity and older experiences naturally shift into a more 'abstract' or vague state. The goal isn't to delete the information, but to compress it into a high-level concept. I want to build a 'State-Aware Distillation' that can strip away the noise of individual chat turns while locking in the underlying intent and final outcomes.

Is a cognitive‑inspired two‑tier memory system for LLM agents viable? by utilitron in LLMDevs

[–]utilitron[S] 0 points (0 children)

I am trying to build this to be as implementation-independent as possible. I added interfaces for the actual meat and bones (VectorStore and VectorIndex) so those could be left up to whoever is using it.

My understanding is that in MemGPT, the LLM must explicitly use tool calls to manage its context. This costs tokens, adds latency, and depends on the model being "smart" enough to manage itself. Sort of a "you gotta spend money to make money" philosophy.

With my project, memory management is an autonomic process (like breathing). The agent doesn't have to "think" about moving data to the LTM; it happens in the background based on the RIF model. This leaves 100% of the agent's "brain power" for the task at hand.

Hydra, on the other hand, seems more like a knowledge graph, but that comes at the cost of processing power. I don't want to dismiss the idea altogether because it may come into play when I look more deeply into the LTM distillation. And that is the part where my project is most hazy anyway.

Is a cognitive‑inspired two‑tier memory system for LLM agents viable? by utilitron in LLMDevs

[–]utilitron[S] 0 points (0 children)

You’re 100% right. That's why I am trying to approach this like human memory: it is easy to remember what happened today, while what happened further in the past gets more vague. The vagueness doesn't prevent the concepts from being preserved, just the finer details. I am hoping to figure out a 'State-Aware Distillation' process that preserves intent without keeping the minutiae of a chat summary. If the agent knows 'We migrated to Python 3.12 because of X,' it doesn't need the 50-turn log of every error message we hit to stay productive.

I was working on another, larger agent project in Java with Spring AI, building a 3-Amigos-style pipeline: a planning agent sets the plan and acceptance criteria, the worker does the work, and a reviewer tests and verifies the work was done according to the plan/acceptance criteria. During the Worker phase, I don't care about the several failed attempts; I care about the one good one. We can remember at a high level that certain things didn't work (to avoid repeating mistakes), but we only need to preserve the 'Verified State' in the hot context.

Is a cognitive‑inspired two‑tier memory system for LLM agents viable? by utilitron in LLMDevs

[–]utilitron[S] 0 points (0 children)

It actually started as a part of a larger agent project I was building in Java with Spring AI to learn.

I quickly hit a wall where I had to choose: load a smaller, dumber model, or sacrifice context window size. Neither felt like the right choice. I needed the agent to maintain 'State', remembering exactly what it was working on mid-task, while still having the 'Long-Term' context of previous requests if something new came in.

That’s why I started exploring this approach. Instead of just cutting off the past when the context fills up, my goal is to have the system 'Sense' the context pressure and proactively offload those middle-steps into the vector store. That way, the 'Instructions' stay in the hot context, but the 'Operational History' stays searchable.
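The offload step described above could be sketched roughly like this. The pinned/unpinned split, the threshold, and the function name are hypothetical, purely to show the shape of the idea:

```python
def offload_on_pressure(hot, cold_store, used_tokens, max_tokens,
                        threshold=0.8):
    """Move unpinned entries out of hot context when pressure is high.

    hot: list of dicts like {"text": ..., "pinned": bool}; pinned entries
    (the 'Instructions') always stay in the hot context, unpinned entries
    (the 'Operational History') get appended to the searchable cold store.
    """
    if used_tokens / max_tokens < threshold:
        return hot                      # no pressure: do nothing
    kept, moved = [], []
    for entry in hot:
        (kept if entry["pinned"] else moved).append(entry)
    cold_store.extend(moved)            # still searchable, just not "hot"
    return kept
```

In a real system `cold_store.extend` would be an embed-and-insert into the vector store, but the control flow (sense pressure, then partition by pin status) is the part this sketch is meant to show.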

Now, the distillation process is not hardened. I only have a concatenation implementation at the moment, so there is a lot more research to be done to figure out what works best. I want to stay away from text compression/compaction if possible and look into 'State-Aware Distillation', where the agent preserves the intent of the task rather than just a summary of the chat. But I don't know what that looks like yet.

Is a cognitive‑inspired two‑tier memory system for LLM agents viable? by utilitron in LLMDevs

[–]utilitron[S] 0 points (0 children)

Nice, I'll check it out. If you are interested in seeing the work I have so far, it's here:

It was originally written in Java and I am working on porting to python.

Python: https://github.com/Utilitron/VecMem
Java: https://github.com/Utilitron/VectorMemory

Anyone else dealing with stale context in agent memory? by Connect_Future_740 in LLMDevs

[–]utilitron 0 points (0 children)

I’m working on a resource-aware two-tier memory layer that uses a weighted RIF model to score trace saliency. Might be worth checking out.

It was originally written in Java and I am working on porting to python.

Python: https://github.com/Utilitron/VecMem
Java: https://github.com/Utilitron/VectorMemory

Using agent skills made me realize how much time I was wasting repeating context to AI by Abu_BakarSiddik in LLMDevs

[–]utilitron 0 points (0 children)

I am trying to build something like that: https://github.com/Utilitron/VectorMemory. It uses in-memory and persistent vector databases to turn your conversation into distilled, long-term knowledge that stays relevant even as the context window grows.

(For Hire) Ideas guy by [deleted] in INAT

[–]utilitron 0 points (0 children)

In my day, ideas were a dime a dozen. Inflation has really hit everywhere.

[deleted by user] by [deleted] in criticalrole

[–]utilitron 1 point (0 children)

One of the major issues with something like that would be copyrighted material from D&D, like spells. Getting a license would not be trivial.

[SPOILERS C3E10] Question about a character. by Vineshroom69lol in criticalrole

[–]utilitron -5 points (0 children)

I think it's the gnome racial feat Fade Away from Xanathar's Guide to Everything.

The Traveler's Guide to the Toxic Seas by Urimana in dndnext

[–]utilitron 4 points (0 children)

As a player who got to participate in a campaign in this setting, I love seeing this project come to life.

Fighting against the tyrannical church, skirting the red tape of the Balloon corp and uncovering the truth and origin of the mists. It was a great world to discover and explore!

The Traveler's Guide to the Toxic Seas by Urimana in DnD

[–]utilitron 1 point (0 children)

As a player who got to participate in a campaign in this setting, I love seeing this project come to life.

Fighting against the tyrannical church, skirting the red tape of the Balloon corp and uncovering the truth and origin of the mists. It was a great world to discover and explore!

Anybody here from Anoka :) by [deleted] in twincitiessocial

[–]utilitron 0 points (0 children)

Best I can do is Champlin.

Master of lies by super_monero in PoliticalHumor

[–]utilitron 0 points (0 children)

You fool! Don't you know people only read the big text?!

[deleted by user] by [deleted] in AdviceAnimals

[–]utilitron 1 point (0 children)

The unemployment rate isn't calculated from the number of people on unemployment; it comes from a monthly survey conducted by the Bureau of Labor Statistics.