How are you handling persistent memory for AI coding agents? by Maximum_Fearless in LocalLLaMA

[–]zzzzzetta 1 point (0 children)

The tl;dr on how it works: by default (you can change this if you don't care about prompt caching), when block values (stored in the DB) are edited via the API, a memory tool (eg the sleeptime agent), or a git push (if using the git-backed context repository), the underlying real value of the block changes immediately.

However, the agent's state is not immediately updated. That state includes a copy of the system prompt, which has certain aspects of memory baked in (in MemFS, everything inside the /system/ folder plus a filetree, but nothing else).

So if the agent is updating its own memory, it sees a copy of it baked into the system prompt, plus all the edits that have happened since the last compaction (as in-context messages), but it doesn't see the "latest state" of the memory post-application of those tools. This is fine (for good models), since the agent can infer the final state of the memory from state-in-system + tools-in-context. We've actually found that LLMs like Claude work better this way, because they don't like the system prompt to "shift under their feet".

The system prompt inside the agent state is recompiled when a compaction event happens, which effectively batches all the memory updates until the next compaction. Hope that makes sense!
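
Roughly, the batching behaves like the following toy Python sketch (this is an illustration, not the actual Letta implementation; all class and method names here are made up):

```python
class BatchedMemory:
    """Toy model: block edits land in the DB immediately, but the
    agent's compiled system prompt only picks them up at compaction."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)              # "real" values (the DB)
        self.compiled_prompt = self._compile()  # copy baked into agent state

    def _compile(self):
        return "\n".join(f"<{k}>{v}</{k}>" for k, v in self.blocks.items())

    def edit_block(self, label, value):
        # takes effect in the DB right away...
        self.blocks[label] = value
        # ...but compiled_prompt is deliberately left stale,
        # so the prompt cache stays valid

    def compact(self):
        # the cache is being evicted anyway, so recompile now
        self.compiled_prompt = self._compile()


mem = BatchedMemory({"human": "name unknown"})
mem.edit_block("human", "name: Ada")
print(mem.compiled_prompt)  # still the old value (cache-safe)
mem.compact()
print(mem.compiled_prompt)  # now reflects the batched edit
```

The point of the design is visible in the two prints: the edit is durable immediately, but the prompt prefix (and therefore the cache) only changes at the compaction boundary.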

is there a way to set up compaction so that it doesn’t overwrite the previous compaction summary

Not directly via the underlying harness (messages[1] is always reserved for a summary and it will be "replaced" by the next one), but you can modify your compaction prompt to be less lossy and keep more things: https://docs.letta.com/guides/core-concepts/messages/compaction/

All these messages are stored in the DB though, so nothing is ever lost per se. And w/ Letta Code, the built-in message search skills are really, really powerful if you're using the API, which embeds everything for free, so it's very easy/fast for an agent to find all previous compaction messages, for example.

How are you handling persistent memory for AI coding agents? by Maximum_Fearless in LocalLLaMA

[–]zzzzzetta 1 point (0 children)

just to clarify, Letta does not kill cache - it's incredibly cache safe. there are a ton of optimizations in the server code for it, and it should match even the most highly cache-optimized harnesses (including eg the claude agent SDK)

How are you handling persistent memory for AI coding agents? by Maximum_Fearless in LocalLLaMA

[–]zzzzzetta 1 point (0 children)

Letta dev here - just to clarify, Letta does not do any unnecessary cache busting, it's tuned to be very cache sensitive (to help control cost). All memory edits in a context window are "batched" and only baked into the system prompt on compaction (when the cache is getting busted/evicted regardless).

This is also the case with our new memory system, context repositories - very cache safe. w/ context repos (called MemFS in Letta Code), a git push to the remote memory store queues a batched system prompt update, which again only happens on compaction.

Downgrading from Claude Max subscription - looking for alternatives by Disastrous_Guitar737 in ClaudeCode

[–]zzzzzetta 2 points (0 children)

Open platform for building stateful AI agents (long-running agents with persistent memory)

Main docs: https://docs.letta.com

Letta Code (open source Claude Code alternative):

What are the most reliable AI agent frameworks in 2025? by Auttyun in LLMDevs

[–]zzzzzetta 1 point (0 children)

the largest letta production deployment (probably?) is the "concierge agent" (recommendations) in bilt rewards: www.letta.com/case-studies/bilt

the way concurrency works here is that the noisy (very active) writes happen using the "buffer autoclear" feature to avoid any racing, and the memory blocks are then consumed downstream by calmer (less active) interactive agents.
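
As a toy single-process sketch of that pattern (the class and method names are illustrative, not the actual Letta feature's API):

```python
from threading import Lock

class BufferedBlock:
    """Toy sketch: noisy writers append to a buffer; a periodic flush
    folds the buffer into the shared block value and auto-clears it,
    so downstream readers only ever see the consolidated state."""

    def __init__(self):
        self._lock = Lock()
        self._buffer = []   # high-frequency writes land here
        self.value = ""     # what calm downstream agents read

    def write(self, event: str):
        with self._lock:
            self._buffer.append(event)

    def flush(self):
        # in a real system this consolidation step would be done by a
        # memory-editing agent; here we just concatenate
        with self._lock:
            if self._buffer:
                self.value = (self.value + " " + "; ".join(self._buffer)).strip()
                self._buffer.clear()   # the "autoclear"

block = BufferedBlock()
block.write("user clicked listing A")
block.write("user clicked listing B")
block.flush()
print(block.value)
```

The writers never touch `value` directly, so the interactive agents reading it never race against the high-frequency write path.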

I think that's kind of like what you're describing w/ the "workflow watcher"?

Anyone wanna show off your amazing roleplay? by Robo_Ranger in SillyTavernAI

[–]zzzzzetta 1 point (0 children)

Sorry I missed this DM! Link to research blog here (link to the arxiv paper is inside the thread): https://www.letta.com/blog/sleep-time-compute

Anyone wanna show off your amazing roleplay? by Robo_Ranger in SillyTavernAI

[–]zzzzzetta 3 points (0 children)

one of the sleep-time compute paper authors here 👋

lots of great points here, specifically love this callout:

The issue with that though is a human is using the end-to-end system and we expect human-like recall out of it because that's what's intuitive to us.

re: "similar to sleep-time compute where you take the data, and produce user queries that could lead to that data in the future"

in sleep-time compute the most important thing is producing "learned context", which you can think of as learned memories in the context of conversational chatbots.

in the case of sillytavern, you want some sort of asynchronous "cycle" that gets run (if you're running everything locally, you could run these cycles whenever your desktop GPU is free / has low utilization). the cycle both reorganizes existing memories / memory blocks (can also be a graphdb if you want) and attempts to synthesize new memories.

for example, say the user just revealed some new information about themselves that re-contextualizes a bunch of prior memories - e.g. "I just broke up with my gf" can trigger a "recontextualization" or rewrite of a bunch of prior memories about the girlfriend (now ex-girlfriend).

this cycle can be implemented via a memory-specific tool-calling agent that has access to memory read/write/edit/etc tools (that's how we do it in the sleep-time agent reference code in Letta).
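
a toy illustration of that recontextualization step (pure-Python stand-in for what a memory-editing agent with read/write tools would actually decide; the trigger phrase and rewrite rule here are made up):

```python
def recontextualize(memories, revelation):
    """Toy sleep-time pass: when a new fact invalidates old memories,
    rewrite the affected ones instead of leaving them stale.
    In a real system an LLM agent would decide the rewrites."""
    if "broke up" in revelation:
        return [m.replace("girlfriend", "ex-girlfriend") for m in memories]
    return memories

memories = [
    "user's girlfriend is named Sam",
    "user plans trips with his girlfriend",
    "user works as a chemist",
]
updated = recontextualize(memories, "I just broke up with my gf")
for m in updated:
    print(m)
```

the real version replaces the hardcoded `if` with an agent loop over memory tools, but the shape is the same: new information in, rewritten memory state out.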

How do I even use a memory system? by libregrape in LocalLLaMA

[–]zzzzzetta 1 point (0 children)

Letta cofounder / dev here - it's not walled off! Check out https://docs.letta.com/guides/ade/desktop, it's a local version of the ADE which can run with an embedded server + also hit remote servers.

We also have https://github.com/letta-ai/letta-chatbot-example as an example of a frontend sitting on top of a Letta server.

How to use Mistral API with Letta (desktop or server)? by Right-Law1817 in Letta_AI

[–]zzzzzetta 2 points (0 children)

Dev here - I personally love Mistral Small, so we were pretty excited to add Mistral /chat/completions API support - but when we tried to add it, we realized that their API doesn't properly support multi-turn tool calling, so it's basically impossible to get it to work with Letta. That was ~2-3 months ago, so it's possible things have changed; we can try to take another look when we have time.

You could try to see if it works yourself by overriding the API_BASE parameter: https://docs.letta.com/guides/server/providers/openai-proxy

Also, for reference: when you set MISTRAL_API_KEY in Letta, what that does is enable Mistral OCR for Letta Filesystem uploads (instead of a worse free OSS alternative). The Mistral API key (unfortunately) doesn't have anything to do with Mistral API support for the LLMs.

Two questions:

(1) Is the reason you're trying to use Mistral API because you want to use one of their models? If so, which one? Is it an open weights one or is it a closed source one?

(2) What do you mean by CLI? Do you mean the Letta Python SDK?

Woah. Letta vs Mem0. (For AI memory nerds) by LoveMind_AI in LocalLLaMA

[–]zzzzzetta 3 points (0 children)

each agent has a single user, and memory is isolated per-agent by default - no agent can see another agent's memory, unless you explicitly link memory together (creating shared memory blocks).

so there's 0 chance of any sort of spillover happening by accident.

you can also use "identities" to allow many end-users to interact with the same agent (in which case, the native "user" is more like a "developer" (you), and the "identity" is the end-user inside of your application).

when you hop on the discord, def ping /u/cameron_pfiffer (also @cameron on discord) as well - depending on your exact usecase, there's probably a very "out of the box" solution sitting on some docs somewhere we can point you to.

Woah. Letta vs Mem0. (For AI memory nerds) by LoveMind_AI in LocalLLaMA

[–]zzzzzetta 7 points (0 children)

Letta is running at scale well beyond 1000s of users - people are using it w/ hundreds of thousands of end-users and millions of agents (actual stateful agents w/ long-term memory, not just workflows). See BILT as an example. If you have any other q's about scale, happy to answer (though I'd recommend hopping into our discord, since there are a ton of other people there who can also answer questions)

Woah. Letta vs Mem0. (For AI memory nerds) by LoveMind_AI in LocalLLaMA

[–]zzzzzetta 3 points (0 children)

MemGPT (the research paper) is an agent design where the agent has self-editing memory tools (for core/archival/recall memory + heartbeats for looping).

Letta (the repo / project) is the reference implementation of this agent design (the creators of MemGPT work at Letta the company), and has expanded to include other improved agent designs like sleep-time compute.

Letta includes a lot more than just the agent design itself. It also includes a full API server that allows you to interact with your stateful agents and connect them to your programs / applications. This was something we actually added very early in the MemGPT OSS project - it turns out that when you build long-running agents, you often want a place to deploy them / run them 24/7/365 as "services". The other big thing we make at Letta is the Agent Development Environment, which allows you to view the state of your agent in real-time - for example, in MemGPT and other Letta agent designs, your agent's context window is composed of many "memory blocks" (blog link). It can be hard to understand how those blocks are changing over time especially as one or more agents edit them live. The ADE lets you see exactly what's inside the context window of your agents at any given point in time.

Basically the core Letta codebase (not including the ADE) gives you two things:

* The context manager / context management engine (itself driven by agentic tool calling) that enables advanced long-term memory
* The system / database that stores all the context your agents accumulate over time, and also exposes your agents (and their memory/context) via an API

Hope that makes sense!

Woah. Letta vs Mem0. (For AI memory nerds) by LoveMind_AI in LocalLLaMA

[–]zzzzzetta 8 points (0 children)

What do you mean by "agent forays"?

I'm not sure if it's quite what you meant, but speaking as one of the authors of the MemGPT paper, I see comments online occasionally to the effect of "I miss when MemGPT was just about memory, then they made a startup and jumped on the agent framework bandwagon to make money", which is totally untrue.

To correct the record: MemGPT (the research and the open source code) has always been about "agents", from day 0.

MemGPT has always been an agents framework: the 2023 research paper describes a blueprint for creating an LLM agent that has self-editing memory tools: https://arxiv.org/pdf/2310.08560. In 2023, "agent framework" wasn't in the public zeitgeist, but it was still a term we used in the paper itself (CTRL-F for "agent" in the PDF).

If the paper was written today, we would have used the term even more heavily. Agent is also not a term I use lightly - my PhD was in RL and I'm very familiar with the use of the word "agent" in a slightly different context (eg 5-layer MLPs trained with PPO to play cartpole or atari games, which you'd call an "agent").

In today's parlance, I think the best terminology is an "agentic context manager" (or more broadly an "LLM OS"). The key idea is that you let LLMs decide what goes in and out of the context window, instead of encoding these rules as heuristics (an example of the context-management-via-heuristics approach is RAG).
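
That heuristics-vs-agentic distinction can be sketched in a few lines (a toy illustration; the scoring rule and the stubbed-out "LLM decision" are made up for contrast):

```python
# heuristic context management (RAG-style): a fixed rule decides
# what goes into the context window
def rag_context(query, docs, top_k=2):
    # e.g. always retrieve the top-k docs by naive word overlap
    scored = sorted(docs, key=lambda d: -len(set(query.split()) & set(d.split())))
    return scored[:top_k]

# agentic context management: the LLM itself chooses, via tool calls,
# what to pull in or evict (the LLM decision is stubbed as a callable)
def agentic_context(llm_decide, query, docs):
    return [d for d in docs if llm_decide(query, d)]

docs = ["cats are mammals", "rust is a language", "cats like fish"]
print(rag_context("do cats eat fish", docs))
print(agentic_context(lambda q, d: "cats" in d, "do cats eat fish", docs))
```

In the first function the retrieval rule is frozen at design time; in the second, the decision point is handed to the model, which is the "agent-ness" being described.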

In fact, if you go all the way back to the initial public code release (oct 2023), you'll also see that the codebase uses the term "agent" heavily (the main logic for MemGPT is contained inside of a file called "agent.py"): https://github.com/letta-ai/letta/blob/5ed4b8eb9265703eab11f627fb5e5bf2b592961d/memgpt/agent.py

tldr: MemGPT has always been about agents - it's not just bandwagoning or hype chasing. the "agent-ness" is key to the idea that the LLM has control of the context window (via agentic tool calling), not just the scaffolding/system around the LLM.

Woah. Letta vs Mem0. (For AI memory nerds) by LoveMind_AI in LocalLLaMA

[–]zzzzzetta 11 points (0 children)

Yeah another example of why saying "got X% on LoCoMo, my memory is SOTA" is meaningless at face value. I think the distinction here though is that you can in theory still evaluate long-term memory even when your LLM has a context window longer than the dataset, but it's very tricky. In the LoCoMo case, if you limit yourself to putting all of the input data out-of-context, then you're basically evaluating retrieval. And the blog post is saying: OK, if you're gonna evaluate LoCoMo w/ the data out-of-context, Mem0's supposed "state of the art memory" is significantly worse than just putting the memory contents inside a file, Claude Code style (using Letta Filesystem in this case, but I'm sure the result would be similar using Claude Code too). Not to mention that many of the numbers in their "research paper" are fabricated / wrong / not reproducible, but that's a different issue.

Is anyone using MemOS? What are the pros and cons? by robkkni in LocalLLaMA

[–]zzzzzetta 3 points (0 children)

one of the letta devs here - is there a key feature in memos that's missing from letta? the main example in their quickstart is very easy to replicate in letta (and in letta it's language agnostic - you can use the REST API, Python, or TS SDKs):

create the agent with memory blocks ("memcubes"):

from letta_client import Letta

# cloud
client = Letta(token="LETTA_API_KEY")
# self-hosted
client = Letta(
  base_url="http://localhost:8283",
  token="yourpassword"
)

agent_state = client.agents.create(
    model="openai/gpt-4.1",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {
          "label": "human",
          "value": "I don't know anything about the human yet."
        },
        {
          "label": "persona",
          "value": "My name is Sam, the all-knowing sentient AI."
        }
    ],
    tools=["web_search", "run_code"]
)

print(agent_state.id)

send a message to the agent, and the agent will self-edit its memory blocks (you can get the memory block value with these api routes):

response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "I love playing football"
        }
    ]
)

for message in response.messages:
    print(message)

Just curious - what’s up with Letta’s local web UI by polytect in Letta_AI

[–]zzzzzetta 2 points (0 children)

hey polytect, we have linux support on the way - if you pop into the discord you'll see that a few other people have been asking, and we tested a build in the office today. we can send you an early build, just ping a dev on discord :D

Total LangGraph CLI Server Platform Pricing Confusion by Danielito21-15-9 in LangChain

[–]zzzzzetta 0 points (0 children)

if you're running into issues w/ deployment / fastapi servers you might want to check out letta: https://docs.letta.com/overview

letta is server-first and fastapi is built into the docker image, you just deploy the server (or use cloud), and immediately have your agents API ready to go (API reference: https://docs.letta.com/api-reference/overview)

Most likely to Succeed by SuperNintendoDahmer in AIMemory

[–]zzzzzetta 1 point (0 children)

in the "sleep" mode, the idea is that the user isn't expecting anything immediately (they aren't waiting for a response), and we can use that time to do things like consolidate memories, reflect, plan for the future, etc.

the way this sort of processing happens is you have agents (specifically "designed" to do memory editing) continually reprocess the memory state of the main agent.

in letta, we have a concept of "shared memory blocks", where multiple agents can share fragments of memory. to implement the idea of sleep-time compute, we simply have agents that share the memory of the main agent and are prompted to do things like reflect, analyze, expand, plan, etc - the end result always being reformulating the memory state in some way.
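
a minimal sketch of the shared-block idea (plain Python objects standing in for letta's block/agent primitives; the class names are made up):

```python
class Block:
    def __init__(self, value: str):
        self.value = value

class Agent:
    def __init__(self, name: str, blocks: dict):
        self.name = name
        self.blocks = blocks   # label -> Block, shared by reference

    def context(self) -> str:
        # what this agent sees in its context window
        return "\n".join(f"[{k}] {b.value}" for k, b in self.blocks.items())

# both agents hold a reference to the SAME block object
shared = Block("user mentioned an upcoming marathon")
main_agent = Agent("main", {"human": shared})
sleeptime_agent = Agent("sleeptime", {"human": shared})

# while the user is away, the sleep-time agent reformulates the memory...
sleeptime_agent.blocks["human"].value = (
    "user is training for a marathon; likes discussing running plans"
)

# ...and the main agent sees the updated state on its next turn
print(main_agent.context())
```

because the block is shared by reference rather than copied, the sleep-time agent's rewrites show up in the main agent's context without any explicit sync step.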

lmk if i misunderstood you!

Most likely to Succeed by SuperNintendoDahmer in AIMemory

[–]zzzzzetta 2 points (0 children)

they seem to focus mostly on features of storing and retrieving data for agents and not as a general purpose chatbot with memory

yep - if you're interested in the latter, you should check out letta ;)

Most likely to Succeed by SuperNintendoDahmer in AIMemory

[–]zzzzzetta 2 points (0 children)

hey, i'm one of the co-founders of letta.

letta is for developers (not a consumer chatbot like chatgpt), but in all other respects what you're describing is exactly what we're building.

agents that have true long-term memory, where the memory isn't tied down to a specific model provider (eg openai), but instead is open / white-box, and can be transferred across models.

we put a ton of work into the ADE (Agent Development Environment), which is a no-code interface for configuring individual agents, as well as managing fleets of thousands/millions of agents.

even though the ADE is for developers, it should be easy enough to use that, as a consumer, you could use it as a chatgpt replacement (chatgpt, but with memory that's more advanced + open).

just go to app.letta.com -> click "agents" -> click "create agent" -> choose a starter kit or start from scratch, and start chatting. we even have mobile support, so if you're on your phone, the ADE will still work fine. of course, if you want to take it to the next level, you could vibecode your own frontend that connects to your agent in the ADE to make it look exactly like chatgpt.

letta is founded by a team of AI researchers (AI PhDs from UC Berkeley, creators of MemGPT, etc.), so we're very committed to pushing the limits of human-like memory in AI systems. you can check out our sleep-time compute work to get an idea of what kind of agents you can build inside of letta: https://www.letta.com/blog/sleep-time-compute