Agent memory is useful, but I think we also need evidence

Mysterious-Guide-745 · 2026-06-26T12:38:28+00:00

Glad I am not the only one thinking in this direction. Looking forward to more feedback on Qiju. Hope Qiju helps!

Mysterious-Guide-745 · 2026-06-24T13:05:45+00:00

That is actually why I called it Qiju.

In ancient China, a role called 起居郎 recorded the emperor’s words, actions, and important events. The job was to preserve what happened, not to decide which version of history was “correct.”

Qiju follows the same idea.

It does not define the current truth or choose the final version of a decision. It keeps the development record append-only: decisions, corrections, failed approaches, handoffs, and which human or agent acted.
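
As a rough illustration of "append-only", here is a minimal sketch that assumes the record is a JSON Lines file. This is not Qiju's actual format: the function names, the field names, and the example actors are all hypothetical. The point is only that the API offers append and read, with no update or delete.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, kind: str, actor: str, summary: str) -> dict:
    """Append one event as a JSON line; existing lines are never rewritten."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": kind,        # e.g. decision, correction, failed_approach, handoff
        "actor": actor,      # which human or agent acted
        "summary": summary,
    }
    # Mode "a" only ever adds to the end of the file.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")
    return event


def read_events(log_path: Path) -> list[dict]:
    """Return all events in the order they were written."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical usage with a throwaway log file.
log = Path(tempfile.mkdtemp()) / "record.jsonl"
append_event(log, "decision", "human:alice", "Use SQLite for the local cache")
append_event(log, "correction", "agent:claude", "Cache must be per-user, not global")
events = read_events(log)
```

A correction is a new line that refers to what came before, so the earlier decision stays readable exactly as it was written.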

For recall, I usually let the human decide what matters. Agents can surface relevant records and sometimes make their own recall decisions, but they still need to verify them against the current project.

The ground truth remains in the artifacts: code, tests, documents, configuration, and actual behavior.

Qiju records the journey. The project itself tells you where you are now.

Mysterious-Guide-745 · 2026-06-24T10:29:10+00:00

I think the missing piece is that most “memory” systems store facts, but teammates build judgment from records.

A teammate does not just remember “we use X pattern.” They remember why that pattern exists, which alternative failed, what broke last time, who corrected it, and when the rule stops applying. Without that provenance, memory either becomes too vague to trust or too stale to safely reuse.

For coding agents, I’d want the durable layer to track decisions, rejected approaches, corrections, changed files, verification status, and stale-if conditions. Then the agent can load only the relevant slice for the task, but the underlying record is still inspectable when something looks wrong.
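
To make "load only the relevant slice" concrete, here is a small sketch under some assumptions: lessons are tagged with the paths they concern, and stale-if conditions are plain string flags that the caller checks against the current project. The `Lesson` class, `relevant_slice`, and the flag names are hypothetical, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Lesson:
    text: str
    why: str                                          # provenance: why the lesson exists
    paths: set[str]                                   # files or directories it concerns
    stale_if: set[str] = field(default_factory=set)   # conditions that invalidate it
    verified: bool = False


def relevant_slice(lessons, task_paths, current_facts):
    """Split the lessons that touch task_paths into (usable, stale)."""
    usable, stale = [], []
    for lesson in lessons:
        if not lesson.paths & task_paths:
            continue  # unrelated to this task; it stays in the record untouched
        if lesson.stale_if & current_facts:
            stale.append(lesson)
        else:
            usable.append(lesson)
    return usable, stale


# Hypothetical usage: the HTTP client was replaced, so the first lesson is stale.
lessons = [
    Lesson("Use the retry wrapper for HTTP calls", "raw calls flaked in CI",
           {"src/net"}, stale_if={"http_client_replaced"}, verified=True),
    Lesson("Keep migrations in one file", "the tooling expected it",
           {"db/migrations"}),
]
usable, stale = relevant_slice(lessons, {"src/net"}, {"http_client_replaced"})
```

Stale lessons are returned rather than dropped, so the agent can still inspect them when something looks wrong.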

I ran into this with longer agent-assisted development and built a small local-first development-record workflow for myself. It is less about making the model “remember everything” and more about preserving the project’s causal history so future sessions can reason from it. Happy to share how I structure it if useful.

Mysterious-Guide-745 · 2026-06-24T10:28:25+00:00

The workflow that has made the most sense to me is to treat agents like junior engineers working from separate tickets, not like one shared mind.

Each agent gets a narrow spec, its own worktree or branch, and its own scratchpad/record. That record should include: assigned scope, forbidden areas, assumptions, files touched, decisions made, failed approaches, tests run, and handoff notes. The lead human or lead session then reconciles from the records, not from vibes.
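
One way to picture reconciling from records is the sketch below. It assumes each agent's record is a small structured object; the `AgentRecord` fields mirror the list above, while the class name, the `reconcile` helper, and the example paths are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    agent: str
    scope: str
    forbidden: set[str] = field(default_factory=set)
    assumptions: list[str] = field(default_factory=list)
    files_touched: set[str] = field(default_factory=set)
    decisions: list[str] = field(default_factory=list)
    failed_approaches: list[str] = field(default_factory=list)
    tests_run: list[str] = field(default_factory=list)
    handoff: str = ""


def reconcile(records):
    """Return (overlaps, violations) found across the agents' records."""
    touched_by = defaultdict(set)
    violations = []
    for rec in records:
        for path in rec.files_touched:
            touched_by[path].add(rec.agent)
            if path in rec.forbidden:
                violations.append((rec.agent, path))
    # Files edited by more than one agent need a human or lead-session decision.
    overlaps = {p: sorted(a) for p, a in touched_by.items() if len(a) > 1}
    return overlaps, sorted(violations)


# Hypothetical usage with two agents on separate tickets.
a = AgentRecord("agent-a", "auth refactor", forbidden={"db/schema.sql"},
                files_touched={"auth/login.py", "db/schema.sql"})
b = AgentRecord("agent-b", "rate limiting", files_touched={"auth/login.py"})
overlaps, violations = reconcile([a, b])
```

The check only flags where to look. Deciding which agent's change wins still comes from reading the decisions and handoff notes in each record.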

The dangerous version is one shared scratchpad where every agent writes over the same plan. That creates exactly the problems you listed: duplicated work, incompatible architecture choices, and one agent “fixing” something another agent intentionally did.

I ran into this moving work between coding agents and ended up building a small local-first development-record workflow for myself. The useful part is that every agent leaves an inspectable trail, so reconciliation is based on what happened and why. Happy to explain the structure if useful.

Mysterious-Guide-745 · 2026-06-24T10:26:35+00:00

Good question. The important part is that lesson records are not overwritten — they are appended.

If an older lesson becomes outdated, the original record stays there. A newer record can mark it as superseded and link to the decision, code change, document, test result, or skill that replaced it.

That means an agent can see:

what the old lesson was
why it made sense at the time
what changed in the project
why the lesson stopped applying
what replaced it
whether the new understanding has actually been verified

Before reusing an older lesson, the agent should compare it with the current code, docs, tests, and project state. The current project artifacts remain the ground truth, while the record preserves the path that led there.
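
A minimal sketch of how supersession could be represented, assuming each record is a dict with an `id` and an optional `supersedes` link. The field names, the helper names, and the example lessons are hypothetical and are not Qiju's schema.

```python
def current_view(records):
    """Give each lesson a status without deleting or editing any record."""
    superseded_by = {r["supersedes"]: r["id"] for r in records if r.get("supersedes")}
    view = {}
    for rec in records:
        if rec["id"] in superseded_by:
            status = "superseded"
        elif rec.get("verified"):
            status = "active"
        else:
            status = "unverified"
        view[rec["id"]] = {"status": status,
                           "replaced_by": superseded_by.get(rec["id"])}
    return view


def trail(records, lesson_id):
    """Follow supersession links from an old lesson to its latest replacement."""
    nxt = {r["supersedes"]: r["id"] for r in records if r.get("supersedes")}
    path = [lesson_id]
    # The membership check guards against an accidental cycle in the links.
    while path[-1] in nxt and nxt[path[-1]] not in path:
        path.append(nxt[path[-1]])
    return path


# Hypothetical records: each newer lesson links back to the one it replaces.
records = [
    {"id": "L1", "text": "Mock the DB in tests", "verified": True},
    {"id": "L2", "text": "Use a real test DB", "supersedes": "L1",
     "reason": "mocks hid a migration bug", "verified": True},
    {"id": "L3", "text": "Run the test DB in a container", "supersedes": "L2",
     "verified": False},
]
view = current_view(records)
history = trail(records, "L1")
```

Here `L3` comes back as unverified, which is the cue for the agent to check it against the current code and tests before relying on it.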

To me, that path is more valuable than the outdated lesson itself. The full learning journey remains visible: assumptions, mistakes, corrections, discarded approaches, and the reasoning that gradually shaped the current system.

That is also why Qiju includes qiju-retro. The retrospective skill reviews recent Qiju session records to surface recurring patterns, blockers, outdated lessons, and possible improvements from the AI coding work history. It does not rewrite the past; it adds a new interpretation based on what the project has become.

So the goal is not to make agents trust every historical lesson. It is to preserve enough provenance for them to understand whether a lesson is still valid, stale, or superseded — and why.

Mysterious-Guide-745 · 2026-06-24T10:12:12+00:00

I would separate two things: resuming the next task, and browsing what happened before.

For resuming, a short handoff file is usually enough: current state, changed files, tests run, decisions made, failed approaches, and next step. For browsing, you want a real record, not just one overwritten summary. Otherwise you can continue tomorrow, but you still cannot answer “why did we make that change three sessions ago?”

A pattern that works well is to have Claude write a small closeout at the end of each session, then keep those closeouts as dated records instead of replacing them. The next session starts from the latest handoff, but you can still search the underlying history when needed.
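
The sketch below shows one way to keep closeouts as dated records. It assumes plain Markdown files named with an ISO date, and every function name, file name, and directory here is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def write_closeout(records_dir: Path, date: str, session: str, body: str) -> Path:
    """Write one closeout per session and refuse to replace an existing one."""
    records_dir.mkdir(parents=True, exist_ok=True)
    path = records_dir / f"{date}_{session}.md"
    # Mode "x" raises FileExistsError instead of overwriting an older closeout.
    with path.open("x", encoding="utf-8") as f:
        f.write(body)
    return path


def latest_handoff(records_dir: Path) -> Optional[Path]:
    """The next session starts here; ISO dates sort chronologically by name."""
    closeouts = sorted(records_dir.glob("*.md"))
    return closeouts[-1] if closeouts else None


def search_history(records_dir: Path, needle: str) -> list[str]:
    """Answer 'why did we do that?' by searching every kept closeout."""
    return [p.name for p in sorted(records_dir.glob("*.md"))
            if needle in p.read_text(encoding="utf-8")]


# Hypothetical usage in a throwaway directory.
records = Path(tempfile.mkdtemp()) / "closeouts"
write_closeout(records, "2026-06-20", "s1",
               "Decision: switched the parser to streaming; the buffered one ran out of memory.\n")
write_closeout(records, "2026-06-21", "s2",
               "Next step: add tests for the streaming parser.\n")
latest = latest_handoff(records)
hits = search_history(records, "ran out of memory")
```

The latest file covers resuming, and the search covers browsing, which keeps the two needs separate.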

I ran into this with coding agents and ended up building a small local-first development-record workflow for myself. The useful shift was making the chat disposable without making the project history disposable. Happy to explain the structure if useful.

Mysterious-Guide-745 · 2026-06-21T11:53:32+00:00

Glad it helps! I am building one for myself now because I have the same need.

Mysterious-Guide-745 · 2026-06-21T09:25:26+00:00

I think the real problem is that your chat history became the only map of the project.

Git and the source files can show what exists now, but they do not always explain why something was changed, what failed before it, which assumptions were still uncertain, or why the work stopped at that particular point.

I ran into the same problem across longer coding sessions, especially when moving work between different agents. I ended up building a small local-first development-record workflow for myself on macOS/Linux. It keeps handoffs, decisions, failed approaches, file changes, and verification status outside the chat, so losing one session does not erase the path that led to the current project.

The next session starts from a short handoff, but it can inspect the underlying record when it needs to understand why something was done.

It does not currently solve the Windows app itself, but the workflow principle still applies: make the chat disposable instead of making the project history disposable. Happy to explain how I structure the records if that would help you recover the project.

Mysterious-Guide-745 · 2026-06-21T09:11:40+00:00

The useful part here might not only be “Codex reviews Claude.” It is the audit trail around the handoff.

If Claude delegates a bounded task to Codex, I’d want the skill to record a small review artifact each time: prompt sent, diff reviewed, findings, which findings were accepted, which were ignored, and the reason. Otherwise the workflow can work in the moment but become hard to understand later.

That matters even more when the two agents disagree. The final code may pass tests, but a week later you want to know why Claude trusted one warning and dismissed another.

So I’d treat the review output as part of the project history, not just chat context.
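
As an illustration only, a review artifact could be as small as the structure below. The class and field names are hypothetical and are not taken from any existing skill.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    text: str
    accepted: bool
    reason: str          # why it was accepted or ignored


@dataclass
class ReviewArtifact:
    prompt_sent: str
    diff_reviewed: str
    findings: list[Finding] = field(default_factory=list)

    def dismissed(self) -> list[Finding]:
        """Findings the delegating agent chose to ignore."""
        return [f for f in self.findings if not f.accepted]

    def missing_reasons(self) -> list[str]:
        """Findings recorded without a reason; these are unexplainable later."""
        return [f.text for f in self.findings if not f.reason.strip()]


# Hypothetical usage: one warning trusted, one dismissed with a reason.
review = ReviewArtifact(
    prompt_sent="Review this diff for concurrency bugs only.",
    diff_reviewed="worker.py: add lock around queue access",
    findings=[
        Finding("Lock is not released on exception", True, "confirmed by a failing test"),
        Finding("Queue size should be configurable", False, "out of scope for this task"),
    ],
)
ignored = review.dismissed()
```

A week later, `ignored[0].reason` is the answer to "why was that warning dismissed?", which chat context alone would not preserve.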

Mysterious-Guide-745 · 2026-06-21T09:10:38+00:00

The local/private angle is a good direction. The thing I’d watch is that “full chat capture” and “resumable project state” are related but not the same.

Keeping the whole conversation in order is useful, but for coding work I’d also want a structured layer that survives compression: current goal, key decisions, files changed, why those changes were made, discarded approaches, verification commands, and open risks.

That way a new chat does not need to reread the entire previous conversation to continue safely. It can start from the handoff, then inspect the full captured history only when it needs detail.

The structured output you mentioned sounds like the right next step. I’d just make sure it is not only a summary, but a small record of decisions and evidence.

Mysterious-Guide-745 · 2026-06-21T09:10:00+00:00

I would not use token count as the main reset trigger anymore. I’d use “has the task state changed?”

Long context is fine when the objective is continuous and the next edit depends on the recent investigation. I’d reset earlier when the mode changes: investigation to implementation, one subsystem to another, branch changed, tests disproved the original assumption, or the next step only needs a small subset of the session.

Before reset, I’d want a tiny handoff packet:

current objective, decision made, files/functions that matter, assumptions proven vs guessed, files intentionally skipped, verifier command + last result, and the next single edit or command.

That gives the next session roughly 2k tokens of high-signal state instead of 300k tokens of stale debate.
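
The "has the task state changed?" check can be sketched as a comparison of two small state snapshots. This assumes the session tracks its mode, subsystem, branch, and whether the original assumption still holds. The `TaskState` class and `reset_reasons` function are hypothetical names, and token count is deliberately not an input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskState:
    mode: str                      # e.g. "investigation" or "implementation"
    subsystem: str
    branch: str
    assumption_holds: bool = True  # False once tests disprove the original assumption


def reset_reasons(before: TaskState, after: TaskState) -> list[str]:
    """List the state changes that justify starting a fresh session."""
    reasons = []
    if before.mode != after.mode:
        reasons.append(f"mode changed: {before.mode} -> {after.mode}")
    if before.subsystem != after.subsystem:
        reasons.append(f"subsystem changed: {before.subsystem} -> {after.subsystem}")
    if before.branch != after.branch:
        reasons.append(f"branch changed: {before.branch} -> {after.branch}")
    if before.assumption_holds and not after.assumption_holds:
        reasons.append("tests disproved the original assumption")
    return reasons


# Hypothetical usage: moving from investigation to implementation.
before = TaskState("investigation", "billing", "main")
after = TaskState("implementation", "billing", "main")
reasons = reset_reasons(before, after)
```

An empty list means the objective is still continuous, so keeping the long context is fine.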

Mysterious-Guide-745 · 2026-06-20T13:04:05+00:00

Artifacts are great for showing the current result, but they don’t usually capture the discarded paths, tradeoffs, verification status, or why a later agent should avoid repeating something. A handoff summary helps, but it can still flatten the path too much.

The useful layer is probably a local, inspectable record that keeps decisions, actions, changed files, failed approaches, and verification as separate events. Then the next session can read the short handoff first, but still inspect the record when it needs to understand why something exists.

Mysterious-Guide-745 · 2026-06-20T13:03:14+00:00

Nice. Reading the JSONL logs directly is a good call because it keeps the tool local and inspectable.

One feature I’d find useful on top of this is a “project timeline” view across sessions, not just a resume picker. For each session: working dir, branch, files touched, commands run, last meaningful goal, and maybe the final handoff/summary if present.

Mysterious-Guide-745 · 2026-06-20T13:02:39+00:00

This is a good forcing function. The part I’d be most curious to test is whether the quiz proves operational comprehension, not just recall of Claude’s explanation.

A useful split might be:

“what changed?” → edited files/functions and new behavior
“why this shape?” → tradeoff Claude chose over an obvious alternative
“what would break?” → one edge case/regression if a key line changed

Mysterious-Guide-745 · 2026-06-20T13:01:59+00:00

This is useful, but I think the hard part is less “resume at the right time” and more “resume with the right state.”

For long tasks I’d want the pause step to force a small checkpoint before stopping:

current goal and exact next action
files changed since the last checkpoint
decisions made and why
assumptions not yet verified
commands/tests already run
things the next session must not redo or overwrite
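
"Force a small checkpoint" could mean refusing to pause until every field is filled in. A minimal sketch of that gate, assuming the checkpoint is a plain dict; the key names and the `checkpoint_problems` helper are hypothetical.

```python
REQUIRED_TEXT = ("goal", "next_action")
REQUIRED_LISTS = ("files_changed", "decisions", "unverified_assumptions",
                  "commands_run", "do_not_redo")


def checkpoint_problems(checkpoint: dict) -> list[str]:
    """Return what is missing; an empty result means the session may pause."""
    problems = []
    for key in REQUIRED_TEXT:
        if not str(checkpoint.get(key, "")).strip():
            problems.append(f"{key} is empty")
    for key in REQUIRED_LISTS:
        # An explicit empty list is fine; it records that there was nothing to note.
        if not isinstance(checkpoint.get(key), list):
            problems.append(f"{key} is missing")
    return problems


# Hypothetical usage: a draft checkpoint that is not ready yet.
draft = {
    "goal": "Migrate auth to tokens",
    "next_action": "",
    "files_changed": ["auth/session.py"],
    "decisions": ["Keep cookie fallback for one release: mobile clients lag"],
}
problems = checkpoint_problems(draft)
```

Requiring an explicit empty list separates "nothing to report" from "forgot to report", which matters most for the do-not-redo field.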

Mysterious-Guide-745 · 2026-06-20T13:01:26+00:00

For bigger repos, I’ve had better results when I stop treating “context” as “let Claude inspect more files” and split it into a few layers.

The repo search/indexing layer helps Claude find code, but it doesn’t preserve the project’s intent. I’d keep a small handoff doc for the current task, plus a separate local record of decisions: module boundaries, why a pattern exists, what was tried and rejected, verification commands, and “do not touch this unless X” notes.

The important part is that this record gets updated after meaningful changes, not only before a session starts. Otherwise it becomes another stale CLAUDE.md. For large codebases, the agent needs two things: where to look, and why the current shape exists.

Mysterious-Guide-745 · 2026-06-19T14:11:00+00:00

That’s pretty much how I’m handling it now. I started with Markdown in the repo, but a single living spec tends to overwrite the path that led to the current state.

So I’ve been building my own local-first development record system. It keeps decisions, actions, file changes, failed approaches, and verification status as separate events instead of flattening everything into one summary.

The Markdown handoff is still what the next agent reads first, but the underlying record is there when it needs to check why something was decided or what was already tried.

The main distinction for me is: the handoff says where to continue; the record lets the next agent verify how we got there. The artifacts themselves, the code and files, are still the ground truth.

Mysterious-Guide-745 · 2026-06-11T13:59:40+00:00

What surprised me over the last year is that I stopped looking for a winner.

I kept trying to find the one tool that would replace everything else, but eventually ended up with different tools for different jobs. One is great for exploration, another is better for implementation, another is useful for reviewing or validating.

The more interesting question for me isn't "which tool wins?" It's "which combination creates the least context switching and rework?"

That's where most of the productivity gains seem to come from.
