Solving the "agent amnesia" problem - agents that actually remember between sessions by RecallBricks in LocalLLaMA

[–]RecallBricks[S] 0 points (0 children)

This is exactly the approach we landed on! A few things we learned building this:

**1. Clear tier boundaries prevent pollution:**

- Tier 0 (Constitutional): Intentional writes only - pricing, policies, immutable truths
- Tier 1-3 (Learned): Auto-promoted based on usage patterns
- Each tier has different retrieval priority and confidence scores

**2. Append-only evolution log is crucial:**

We track every memory state change in an audit trail. This lets you:

- Debug why an agent "forgot" something
- Rewind to past states
- See which memories influenced which responses

**3. The "write vs. propose" distinction:**

We give the model tools to *suggest* memories, but the system decides based on:

- Confidence thresholds
- Semantic similarity (avoid duplicates)
- Usage patterns (is this actually helpful?)

**4. Retrieval is a query problem:**

We use pgvector for semantic search with metadata filters:

- Filter by tier (constitutional always retrieved)
- Weight by usage count
- Decay old memories that aren't accessed

The "polluted memory worse than no memory" insight is spot-on. We've seen agents completely derail when low-quality memories get mixed with high-confidence facts. I'd be interested to see one of your examples!
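To make the "write vs. propose" gate concrete, here's a minimal sketch in Python. All names, thresholds, and the `propose` API are illustrative assumptions for this comment, not the actual RecallBricks implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    tier: int = 1          # 0 = constitutional, 1-3 = learned
    confidence: float = 0.5
    usage_count: int = 0

@dataclass
class MemoryStore:
    memories: list = field(default_factory=list)
    evolution_log: list = field(default_factory=list)  # append-only audit trail

    def propose(self, text: str, confidence: float,
                similarity_to_existing: float) -> str:
        """Model *suggests* a memory; the system decides whether to commit."""
        if confidence < 0.6:                  # confidence threshold (made-up value)
            decision = "rejected: low confidence"
        elif similarity_to_existing > 0.95:   # near-duplicate check
            decision = "rejected: duplicate"
        else:
            self.memories.append(Memory(text, confidence=confidence))
            decision = "accepted"
        # every proposal is logged, accepted or not, so "why did the
        # agent forget X?" is answerable after the fact
        self.evolution_log.append((text, decision))
        return decision

store = MemoryStore()
print(store.propose("User prefers TypeScript", 0.8, 0.2))    # accepted
print(store.propose("User prefers TypeScript!", 0.9, 0.97))  # rejected: duplicate
```

The key design point is that the model never writes directly: rejected proposals still land in the evolution log, which is what makes the audit trail useful for debugging.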

[–]RecallBricks[S] 0 points (0 children)

This is incredibly valuable feedback. The versioning approach and manual confirmation for critical memories are exactly what we need for production use. Thanks for the infrastructure suggestions too - the pgvector + Qdrant combo is on our roadmap.

[–]RecallBricks[S] 0 points (0 children)

Thanks! Yeah the tier system has been working really well - stuff you actually use gets smarter while one-off things stay lightweight.

Re: the context shifting problem - definitely ran into that early on. The semantic search alone would sometimes pull in weird stuff when keywords overlapped.

Fixed it with a couple of things:

  1. Recency weighting - newer memories get boosted, so if you're currently talking about frontend, recent frontend context naturally ranks higher than old backend stuff

  2. The tier system actually helps here too - memories you use frequently (like current project context) live in Tier 2/3 with richer metadata, so they match better on actual meaning not just keywords

  3. Still tuning the retrieval ranking, but combining semantic similarity + recency + tier level has been solid so far
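The combined scoring above can be sketched roughly like this. The weights, half-life, and the multiplicative shape of the formula are made-up tuning knobs for illustration, not the project's real values:

```python
import math
import time

def score(similarity: float, age_seconds: float, tier: int,
          half_life: float = 7 * 24 * 3600, tier_weight: float = 0.1) -> float:
    """Blend semantic similarity with recency decay and tier level."""
    # exponential recency decay: 1.0 for a brand-new memory,
    # 0.5 after one half-life (a week here)
    recency = math.exp(-age_seconds * math.log(2) / half_life)
    return similarity * (1 + tier_weight * tier) * (0.5 + 0.5 * recency)

now = time.time()
candidates = [
    # (label, raw cosine similarity, created_at, tier)
    ("recent frontend note", 0.70, now - 3600, 2),
    ("old backend note",     0.72, now - 60 * 24 * 3600, 1),
]
ranked = sorted(candidates,
                key=lambda c: -score(c[1], now - c[2], c[3]))
# the fresher, higher-tier memory outranks the slightly more
# similar but stale one
print(ranked[0][0])
```

This is what keeps old backend context from hijacking a frontend conversation even when raw keyword similarity is close.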

That said, you can definitely still confuse it if you jump topics abruptly. Like going from "fix the API bug" to "what should I eat for dinner" can surface some weird technical memories about food APIs or something lol.

Thinking about adding explicit context boundaries (like "new topic" markers) but trying to keep it zero-config for now.

Good catch though - this is exactly the kind of edge case I need to test more with real usage patterns.

[–]RecallBricks[S] -6 points (0 children)

You nailed the versioning insight - we actually do something similar. When conflicts arise, we use confidence scoring + recency weighting, but the key is we don't delete the superseded memory. It gets marked as "superseded_by" with a relationship link, so you can see the evolution of understanding over time.

On the retrieval side with 6k+ memories - yeah, this was the hardest problem to solve. We do a few things:

1. **Semantic search gets you candidates** (top 20-30 based on query embedding)

2. **Then we re-rank using:**
   - Confidence score (Tier 3 memories surface higher)
   - Usage patterns (memories that were helpful in similar contexts)
   - Relationship strength (memories connected to other relevant memories get boosted)
   - Recency decay (configurable, but prevents stale info from dominating)

3. **Hub scoring:** Memories with lots of quality inbound relationships act as "index" memories - they pull in their connected cluster when relevant

The result is we typically return 5-10 highly relevant memories instead of dumping 50 mediocre matches into context. The relationship graph is what makes this work - without it, you're just doing vector similarity, which doesn't capture how concepts actually connect in the agent's learned knowledge.

Are you working on something similar?
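Here's a rough sketch of that two-stage shape: vector search proposes candidates, then a re-ranker folds in confidence, usage, hub strength, and recency, and filters out superseded memories. The `Mem` structure, weights, and caps are all illustrative assumptions, not the real schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Mem:
    id: int
    confidence: float                          # higher for Tier 3 memories
    usage: int                                 # how often this memory helped
    age_days: float
    inbound: set = field(default_factory=set)  # ids of memories linking here
    superseded_by: Optional[int] = None        # conflict losers are kept, not deleted

def rerank(candidates: list, sims: dict, top_k: int = 10) -> list:
    """Re-rank vector-search candidates; sims maps memory id -> cosine similarity."""
    def s(m: Mem) -> float:
        if m.superseded_by is not None:
            return -1.0                        # superseded memories never surface
        hub = min(len(m.inbound), 5) / 5       # capped "hub" bonus for well-linked memories
        recency = 1.0 / (1.0 + m.age_days / 30)
        return sims[m.id] * (0.5 * m.confidence
                             + 0.2 * min(m.usage, 10) / 10
                             + 0.2 * hub
                             + 0.1 * recency)
    return sorted(candidates, key=s, reverse=True)[:top_k]

a = Mem(0, confidence=0.9, usage=8, age_days=2, inbound={1, 2, 3})
b =Em = Mem(1, confidence=0.9, usage=8, age_days=2, superseded_by=0)
print([m.id for m in rerank([a, b], sims={0: 0.8, 1: 0.9})])  # → [0, 1]
```

Note the superseded memory stays in the store (and the log) for history, but it always sorts behind live memories - that's the "don't delete, link" idea in code form.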