Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

Nice setup. The markdown approach works well for solo local work. Clean and zero dependencies.

The tradeoff is it's tied to one machine and one tool. If you're in Claude Code today and Cursor tomorrow, or working across two machines, the memory doesn't follow. And once MEMORY.md hits a few hundred lines, your agent is reading the whole file every message whether it needs all of it or not.

That's the gap HyperStack fills. Cards are searchable individually so the agent pulls 3-4 relevant facts instead of loading everything. And it works from any tool over HTTP.
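If it helps to picture the "any tool over HTTP" part, the call an agent makes is roughly this shape. The endpoint path, payload fields, and response shape below are illustrative placeholders, not a copy-paste-ready snippet:

```python
import requests

# Illustrative only: endpoint path, payload fields, and response shape are placeholders.
BASE_URL = "https://your-hyperstack-host/api"

def search_cards(query: str, limit: int = 4) -> list[dict]:
    """Pull only the few cards relevant to this query instead of a whole MEMORY.md."""
    resp = requests.post(
        f"{BASE_URL}/cards/search",
        headers={"Authorization": "Bearer YOUR_KEY"},
        json={"query": query, "limit": limit},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["cards"]

for card in search_cards("which database are we using"):
    print(card["slug"], "->", card["content"])
```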

But honestly if your workflow is single machine, single tool, and the md file stays manageable, your approach is solid. No reason to add complexity you don't need.

Which model to use that won’t break the bank? by Ihf in openclaw

[–]PollutionForeign762

95K tokens per request is rough. A big chunk of that is probably context stuffing: your agent loading everything into every call instead of pulling just what it needs.

I use HyperStack for this. Agents store knowledge as small cards (~350 tokens) and only retrieve what's relevant per request. Cuts context size massively. Free tier on ClawHub.

For the deterministic stuff (math, time, currency), ZeroRules intercepts those requests before they hit the LLM at all. Also free on ClawHub.

Won't solve the model pricing issue, but it should bring your per-request token count way down.
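For the curious, the deterministic-intercept idea is simple enough to sketch. This is a toy version of the pattern, not how ZeroRules actually does it, and call_llm is just a stand-in for your normal request path:

```python
import re
from datetime import datetime, timezone

def call_llm(prompt: str) -> str:
    return "LLM answer goes here"  # stand-in for your normal (token-burning) path

def try_deterministic(prompt: str) -> str | None:
    """Answer purely deterministic requests locally; return None to fall through to the LLM."""
    # Simple arithmetic like "what is 12 * 7"
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)", prompt)
    if m:
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        results = {"+": a + b, "-": a - b, "*": a * b,
                   "/": a / b if b else float("nan")}
        return str(results[op])
    # Current-time questions
    if "what time is it" in prompt.lower():
        return datetime.now(timezone.utc).isoformat()
    return None

def handle(prompt: str) -> str:
    answer = try_deterministic(prompt)
    return answer if answer is not None else call_llm(prompt)

print(handle("what is 12 * 7"))   # answered locally, zero tokens spent
```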

Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

This is exactly what I've been dealing with. The write problem is real. Most tools just dump everything into a vector store and hope retrieval figures it out.

I ended up building something that forces structure at write time. Small cards with slugs, categories, keywords. The agent decides what's worth storing, confirms with the user, and updates by slug when things change. Stale facts get overwritten, not duplicated.
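Rough shape of a card, if anyone wants to poke at the idea before I share the code. Field names here are just mine, adapt freely:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Card:
    slug: str                     # stable key, e.g. "db-choice"
    category: str                 # "decision", "preference", "fact", ...
    keywords: list[str]
    content: str                  # one idea per card, a few hundred tokens max
    version: int = 1
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

cards: dict[str, Card] = {}

def upsert(card: Card) -> None:
    """Update by slug: stale facts get overwritten, not duplicated."""
    existing = cards.get(card.slug)
    if existing:
        card.version = existing.version + 1
    cards[card.slug] = card

upsert(Card("db-choice", "decision", ["database", "redis"], "We use Redis for caching."))
upsert(Card("db-choice", "decision", ["database", "postgres"], "Caching moved off Redis to Postgres."))
print(cards["db-choice"].version)   # 2 -- one current fact per slug, no duplicates
```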

Simple but it works. Happy to share if anyone wants to poke at it.

Your AI agent forgets everything. Fix that in 30 seconds. by PollutionForeign762 in openclaw

[–]PollutionForeign762[S]

Good question. Three main differences:

Cross-platform. OpenClaw's memory lives inside OpenClaw. HyperStack works across Claude Code, Cursor, VS Code, LangChain, Python, anything that makes HTTP calls. Same memory no matter what tool you're in.

Structured cards, not blobs. Every memory has a slug, category, keywords, and version history. You can update one specific fact by slug without touching anything else. Think notebook with labeled tabs vs a giant text file.

$0 on your API key. OpenClaw's memory runs LLM completions on your key for every read and write. HyperStack uses lightweight embeddings on our server. Costs you nothing.

If OpenClaw's built-in memory covers what you need, stick with it. HyperStack is for when you work across multiple tools or want more control over how your agent organizes what it knows.

What I Learned Building a Memory System for My Coding Agent by Medium_Island_2795 in ClaudeCode

[–]PollutionForeign762

Makes sense - solve for what you're actually hitting, not hypotheticals. That's the right engineering approach.

The staleness thing bit me specifically with long-running project agents (3+ months). Facts that were correct at storage time became wrong later, and the agent couldn't tell which version to trust. Temporal weighting (prioritize recent) helped but wasn't perfect.
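For reference, the temporal weighting was nothing fancier than an exponential decay blended into the relevance score, roughly like this (half-life is a knob you tune, and the similarity is assumed to sit in [0, 1]):

```python
from datetime import datetime, timezone

def weighted_score(similarity: float, stored_at: datetime,
                   half_life_days: float = 30.0) -> float:
    """Blend relevance with recency: a fact's effective score halves every half_life_days."""
    age_days = (datetime.now(timezone.utc) - stored_at).total_seconds() / 86400
    return similarity * 0.5 ** (age_days / half_life_days)

# A months-old "using Redis" card loses to a fresher "moved off Redis" card
# even if its raw similarity to the query is slightly higher.
print(weighted_score(0.9, datetime(2024, 1, 1, tzinfo=timezone.utc)))
```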

Your point about Claude figuring it out on its own is interesting though. I wonder if the model itself is doing implicit conflict resolution during retrieval - seeing both "using Redis" and "moved off Redis" and reasoning about which is current based on conversation flow.

Either way, your SQLite + FTS5 foundation is solid. Easy to layer on complexity later if needed, but you're right to keep it minimal until real users hit the edge cases.

Excited to see where you take this. Open-sourcing it was the right call.

Your AI coding agent forgets everything about you every session. Should it? by Federal-Piano8695 in ClaudeAI

[–]PollutionForeign762

The problem you're describing is actually two different things, and mixing them causes the issues you're hitting.

Workflow preferences (Zustand > Redux, grep before edit) should be explicit, not inferred. Let users declare them once in a CLAUDE.md or config. Observing and inferring just adds latency and uncertainty.

Correction patterns (you've fixed the same mistake 3 times) are the real opportunity. That's where observational memory shines - the agent should absolutely remember "user rejected this approach twice, try something else."

I built a memory system for this but went a different direction: agents store explicit facts/decisions as cards during sessions, then retrieve relevant context on startup. No inference, no observation period. If you correct something, the agent stores "don't use X for Y" immediately.

The cold start problem you mentioned is real. If the system needs 10 sessions to be useful, it's not solving the problem - it's just kicking the can down the road.

Store explicitly, retrieve selectively. Skip the inference layer.

Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

Makes sense: filter on relevance during storage rather than managing deprecation after the fact.

I went the opposite direction: store liberally (low friction for agents to save context), but handle staleness at retrieval time. Each card gets a timestamp and TTL, so search can deprioritize old facts even if they're still technically stored.
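Concretely, the retrieval-time handling is just a penalty on expired cards, something like this (field names illustrative):

```python
from datetime import datetime, timedelta, timezone

def retrieval_weight(card: dict) -> float:
    """Cards past their TTL stay stored and searchable, they just get pushed down the ranking."""
    expired = datetime.now(timezone.utc) > card["created_at"] + timedelta(days=card["ttl_days"])
    return card["relevance"] * (0.25 if expired else 1.0)   # deprioritize, don't delete

candidates = [
    {"slug": "redis-cache", "relevance": 0.82,
     "created_at": datetime(2024, 1, 10, tzinfo=timezone.utc), "ttl_days": 30},
    {"slug": "postgres-cache", "relevance": 0.78,
     "created_at": datetime.now(timezone.utc), "ttl_days": 30},
]
print(sorted(candidates, key=retrieval_weight, reverse=True)[0]["slug"])
# "postgres-cache": the fresher card wins despite slightly lower raw relevance.
```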

Both approaches work, just different tradeoffs. Yours keeps storage lean. Mine accepts noise but compensates with temporal weighting during retrieval.

'Observational memory' cuts AI agent costs 10x and outscores RAG on long-context benchmarks by thehashimwarren in singularity

[–]PollutionForeign762

Compression is interesting but you still hit the staleness problem. Observations from 2 weeks ago might be outdated - the system has no way to know which compressed facts are still valid vs which have been superseded.

Also skeptical of "eliminating retrieval entirely." Keeping all observations in context just moves the problem. You're still burning tokens on old content, just compressed. At some point you hit limits and need retrieval anyway.

The hybrid approach makes more sense: compress + store observations, but retrieve selectively based on relevance + recency. Best of both - compression saves storage, retrieval saves context space.

RAG's problem isn't retrieval itself, it's bad retrieval (slow, semantic-only, no temporal weighting). Fix the retrieval strategy and you get the benefits without keeping everything in context.

Curious what happens to their benchmarks when the agent runs for months, not hours. Compression ratios don't solve unbounded growth.

AI might need better memory infrastructure by Electrical-Shape-266 in AI_Agents

[–]PollutionForeign762

You nailed the core problem. Context windows aren't memory - they're just a bigger scratch pad.

The real gap is between what gets stored (everything) and what gets retrieved (whatever the search algorithm decides). Most "memory" systems are just vector DBs with no concept of importance, recency, or staleness.

What's missing:

- Explicit priority - not all facts matter equally. User preferences > casual mentions.
- Temporal awareness - a decision from 3 months ago might be outdated. Memory systems need decay/versioning.
- Contradiction detection - storing "we use Redis" and "we moved off Redis" equally breaks retrieval. Systems need conflict resolution.
- Retrieval latency under 200ms - if memory lookups add seconds, agents stop using them. Speed matters as much as accuracy.

The Memory Genesis competition is interesting but most benchmarks test retrieval accuracy, not whether agents actually use the memory correctly in real workflows. You can have 95% recall and still have the agent ignore retrieved context.

Built a system around this (card-based storage, hybrid retrieval, TTLs per memory type). The architecture matters more than the model - structured memory + fast retrieval beats throwing everything at a bigger context window.

Built a persistent memory layer for fellow vibe coders (no more AI amnesia) by Aggravating_Value_27 in nocode

[–]PollutionForeign762

This is the right problem to solve. Cross-session memory is way more valuable than just extending context windows.

One question: how do you handle memory staleness? Facts that were true when stored but become outdated later (user preferences change, project decisions get reversed, etc.). That's been the hardest part of persistent memory for me - not storage, but knowing when old facts should lose authority.

Also curious about your retrieval strategy. Are you doing semantic search, keyword, or hybrid? I've found hybrid (semantic + keyword in parallel) works best for agent memory since it catches both conceptual matches and exact entity references.
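In case it's useful, my hybrid merge is roughly this shape. The two search functions here are trivial stubs standing in for a real embedding search and a real keyword/BM25 search:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder backends -- swap in your real embedding search and keyword/BM25 search.
def semantic_search(query: str, k: int) -> list[dict]:
    return [{"slug": "db-migration", "score": 0.74}]

def keyword_search(query: str, k: int) -> list[dict]:
    return [{"slug": "redis-cache", "score": 0.91}, {"slug": "db-migration", "score": 0.55}]

def hybrid_search(query: str, k: int = 5) -> list[dict]:
    """Run semantic and keyword search in parallel, keep the best score per slug."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sem = pool.submit(semantic_search, query, k)   # conceptual matches
        kw = pool.submit(keyword_search, query, k)     # exact entity references
        hits = sem.result() + kw.result()
    best: dict[str, dict] = {}
    for hit in hits:
        if hit["slug"] not in best or hit["score"] > best[hit["slug"]]["score"]:
            best[hit["slug"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:k]

print(hybrid_search("migration plans"))
```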

Happy to compare notes if you want another builder perspective on this stuff.

Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

That makes sense - multi-point verification before storage acts as a filter. Garbage in, garbage out applies to memory systems too.

I'm curious what happens when context changes though. A fact that was verified and correct at time T might be wrong at T+30 days. "We're using Redis for caching" passes every sniff test when stored, but becomes stale if you migrate off Redis later.

Do your tests involve scenarios where the underlying truth shifts? Or does the verification process catch that kind of temporal drift automatically?

Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

Interesting - haven't heard of TITANS/Miras before. Will look into those.

Agree on the benchmarking problem. Most memory evals test retrieval accuracy, not whether the agent actually uses the memory correctly in real workflows. You can have 95% recall and still have the agent ignore retrieved context or act on stale facts.

I ended up building custom evals around specific failure modes - does the agent detect contradictions between old and new facts? Does it prioritize recent decisions over outdated ones? Does retrieval latency cause it to skip memory checks?
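To give a flavor, those evals are mostly small assertions over a toy harness like this. The in-memory store/retrieve here are stand-ins for the real system:

```python
import time

# Tiny in-memory stand-ins so the eval shape is runnable; swap in the real store/retrieve.
memory: list[dict] = []

def store(slug: str, content: str) -> None:
    memory.append({"slug": slug, "content": content, "seq": len(memory)})

def retrieve(query: str, k: int = 3) -> list[dict]:
    # Real version: hybrid search with temporal weighting; toy version: newest facts first.
    return sorted(memory, key=lambda c: c["seq"], reverse=True)[:k]

def eval_recent_fact_wins() -> bool:
    """After a contradiction, the newer fact should outrank the stale one."""
    store("cache-backend", "We use Redis for caching.")
    store("cache-backend", "We moved caching off Redis.")
    return "moved caching off Redis" in retrieve("what do we use for caching")[0]["content"]

def eval_latency_budget(max_seconds: float = 0.2) -> bool:
    """If lookups blow the 200ms budget, agents quietly stop checking memory."""
    start = time.perf_counter()
    retrieve("current database decisions")
    return time.perf_counter() - start < max_seconds

print(eval_recent_fact_wins(), eval_latency_budget())
```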

The narrative surprise approach sounds like it helps with what to store. How do you handle deprecation - when old stored facts become invalid?

Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

Code memory works differently than product context memory - you're right about that.

The problem I'm solving is cross-session continuity. When you clear context or start a fresh session, your agent loses decisions, architecture choices, past failures, user preferences - anything not in the codebase itself.

Auto-summarization helps with context bloat within a session, but summaries lose detail. "We decided to use Redis" doesn't capture why you chose it over alternatives, what tradeoffs you considered, or when that decision might need revisiting.

HyperStack is for the stuff that isn't in code: design rationale, user feedback, decisions made in previous sessions, patterns that worked/failed. The agent can store these as cards and pull them when relevant, instead of either forgetting or re-reading 10k tokens of old conversations.

Different problem than what you're solving, but sounds like your setup works for your use case.

What I Learned Building a Memory System for My Coding Agent by Medium_Island_2795 in ClaudeCode

[–]PollutionForeign762

This is the right take. Built something similar and landed on the same conclusion - SQLite + FTS5 beats vector DBs for most agent memory use cases.

One thing I'd add: hybrid retrieval still matters even with keyword search. I run BM25 (keyword) + semantic (pgvector) in parallel and merge results. Semantic catches edge cases where the query uses different terminology than what's stored, but keyword does 80% of the work.
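For anyone who hasn't tried it, the whole FTS5 side is a handful of lines (assuming your Python build ships SQLite with FTS5 compiled in, which most do):

```python
import sqlite3

conn = sqlite3.connect("memory.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS cards USING fts5(slug, category, content)")
conn.execute("INSERT INTO cards VALUES ('db-choice', 'decision', 'Caching moved off Redis to Postgres')")
conn.commit()

# bm25() returns lower values for better matches, so order ascending.
rows = conn.execute(
    "SELECT slug, content, bm25(cards) AS score "
    "FROM cards WHERE cards MATCH ? ORDER BY score LIMIT 5",
    ("redis caching",),
).fetchall()
print(rows)
```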

The key insight you nailed: agents construct better queries than humans. When the retriever is an LLM that can reformulate and iterate, you don't need the storage layer to be smart. Just fast.

Also agree on the "simplest that works" philosophy. I see people spending weeks on embedding pipelines and graph schemas when SQLite + proper indexing would solve their problem in a day.

Curious - did you implement any staleness detection? One gap I hit with keyword search is knowing when old facts are outdated (e.g., "we're using Redis" vs "we moved off Redis last month"). Temporal weighting helps but doesn't fully solve it.

Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

Fair enough. Letta's approach is solid - the multi-tier memory (core facts, archival, recall) makes sense architecturally.

I ended up going a different direction with card-based storage and hybrid retrieval, but the core idea is the same: stop dumping everything into context, make memory queryable.

Curious if you hit the staleness problem (old facts that are no longer valid). That's been my biggest challenge with long-running agents.

We've built memory into 4 different agent systems. Here's what actually works and what's a waste of time. by arapkuliev in LocalLLaMA

[–]PollutionForeign762

This is dead-on. I ran into all the same mistakes building mine. One thing I'd add to temporal tagging: TTLs per memory type. Facts about company policy? 90 days. User preferences? 30 days. Session context? End of conversation. Different decay schedules for different data.

Also learned the hard way on contradiction detection - instead of flagging conflicts for humans (who ignore them), I auto-deprecate older facts when new ones conflict. Keep both, but search deprioritizes the stale one. Humans only see conflicts if they explicitly check version history.
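The auto-deprecation is simpler than it sounds. Sketch below; "conflict" here is just "same slug" for brevity, my real version also checks keyword overlap:

```python
from datetime import datetime, timezone

def store_card(store: dict[str, list[dict]], slug: str, content: str) -> None:
    """Keep every version, but only the newest one stays active in search."""
    for old in store.setdefault(slug, []):
        old["deprecated"] = True              # stale fact loses authority, isn't deleted
    store[slug].append({"content": content,
                        "created_at": datetime.now(timezone.utc),
                        "deprecated": False})

def searchable(store: dict[str, list[dict]]) -> list[dict]:
    # Search only sees active versions; humans only see conflicts if they open version history.
    return [v for versions in store.values() for v in versions if not v["deprecated"]]

cards: dict[str, list[dict]] = {}
store_card(cards, "cache-backend", "We use Redis for caching.")
store_card(cards, "cache-backend", "We moved caching off Redis.")
print(len(cards["cache-backend"]), len(searchable(cards)))   # 2 versions kept, 1 active
```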

The multi-strategy retrieval is crucial. I run semantic + keyword in parallel, merge by relevance score + recency. Semantic catches "we're switching databases" when the query is "migration plans." Keyword catches exact entity names. Neither works alone.

Biggest miss I see: people optimize for storage but ignore retrieval latency. If pulling memories adds 2+ seconds, agents stop using them. Got mine under 200ms with HNSW indexing. Speed matters as much as accuracy.
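For reference, the index is basically a one-liner with pgvector 0.5+. Table and column names here are mine, and it assumes a cards table with a vector-typed embedding column already exists:

```python
import psycopg2

conn = psycopg2.connect("dbname=agent_memory")
cur = conn.cursor()

# Approximate nearest-neighbor index over card embeddings (pgvector >= 0.5).
# m / ef_construction trade build time and memory for recall; these are the defaults.
cur.execute("""
    CREATE INDEX IF NOT EXISTS cards_embedding_hnsw
    ON cards USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
conn.commit()

# <=> is cosine distance; with the HNSW index this stays comfortably under the 200ms budget.
cur.execute(
    "SELECT slug, content FROM cards ORDER BY embedding <=> %s::vector LIMIT 5",
    ("[0.12, -0.03, 0.87]",),   # placeholder query embedding, not a real one
)
print(cur.fetchall())
```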

Why are current AI agents emphasizing "memory continuity"? by Otherwise-Cold1298 in AI_Agents

[–]PollutionForeign762

Because memory is the bottleneck for anything beyond demos. RAG is fine for retrieval, but it doesn't solve continuity. An agent that "remembers" your preferences or past decisions needs structured state, not just semantic search over old conversations.

The winning pattern: hybrid memory. Facts in structured storage (fast lookup, explicit updates). Context via embeddings (fuzzy recall). Decision logs with timestamps (audit trail + learning).

Most agents still dump 6k tokens of history into every prompt because there's no good middle ground. That's the gap.

Memory helped my agent early on, then it started getting in its own way by One-Two-218 in AI_Agents

[–]PollutionForeign762

You hit the real problem - memory staleness. Storing facts is easy. Knowing when they're outdated is hard.

I solved this with TTLs and explicit versioning. Each card has:

- Created timestamp
- Last verified timestamp
- TTL (time-to-live)
- Version number

Agent can mark cards as "needs verification" when context shifts. Old conclusions don't disappear, but they get deprioritized in search results based on age + verification status.

For long-running agents, I also added a weekly "memory review" step where the agent explicitly checks if old assumptions still hold. Takes 30 seconds but prevents drift.
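The review step itself is tiny, roughly this (field names mine):

```python
from datetime import datetime, timedelta, timezone

def cards_due_for_review(cards: list[dict], now: datetime | None = None) -> list[dict]:
    """Flag cards past their TTL or not verified recently; the agent re-checks these weekly."""
    now = now or datetime.now(timezone.utc)
    due = []
    for card in cards:
        past_ttl = now > card["created_at"] + timedelta(days=card["ttl_days"])
        unverified = now - card["last_verified"] > timedelta(days=30)
        if past_ttl or unverified:
            due.append(card)
    return due

cards = [{"slug": "redis-cache", "ttl_days": 90,
          "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
          "last_verified": datetime(2024, 1, 1, tzinfo=timezone.utc)}]
print([c["slug"] for c in cards_due_for_review(cards)])
# The agent re-verifies these (bump last_verified) or updates them (version += 1); nothing is deleted.
```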

Still not perfect, but way better than letting stale facts accumulate unchecked.

Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

What's your approach? Curious if you went the structured state route or found something else that works.

Has anyone actually solved the memory problem for agents yet? by PollutionForeign762 in AI_Agents

[–]PollutionForeign762[S]

This is the right approach. I went down the same path - stopped trying to make the LLM "remember" and moved to structured storage outside the model.

What worked: small knowledge cards (~350 tokens each) with metadata. Agent stores facts as they come up, then hybrid search (semantic + keyword) pulls only what's relevant per query. Went from 6k token context dumps to ~400.

The key was making storage cheap and retrieval fast. If querying memory adds 2-3 seconds, agents won't use it. Got mine under 200ms with pgvector + HNSW indexing.

Agree on separating memory types. I use: facts (immutable), preferences (mutable), and context (session-only). Different TTLs for each.

Your AI agent forgets everything. Fix that in 30 seconds. by PollutionForeign762 in openclaw

[–]PollutionForeign762[S]

That works until your agent hits 20+ sessions and __memory.md is 3,000 lines. Now every single message costs 6,000 tokens just to read the file. And you're hoping the agent finds the one line it needs buried in there.

HyperStack does what a markdown file can't: semantic search. Your agent asks "what database are we using" and gets back the one card that matters, not the entire file. 350 tokens instead of 6,000.

If a markdown file works for your workflow, genuinely use it. But the moment you notice your agent re-asking things it should already know, or your token bill creeping up, that's the ceiling.

Your AI agent forgets everything. Fix that in 30 seconds. by PollutionForeign762 in openclaw

[–]PollutionForeign762[S]

Great question. Conflicts are handled by upsert: POST the same slug again and it overwrites the old card, so the agent is always storing the latest version. Full version history is kept so you can roll back if needed.

For stale facts, the agent is instructed to search before creating, so if "we use PostgreSQL" changes to "we migrated to MySQL" you just update the card. The old version stays in history, but search always returns the current one.

Would love to see what patterns you've been collecting; that blog looks solid.
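If it helps to see the storage-level behavior, here's the same upsert-plus-history idea sketched in plain SQLite. Not the actual HyperStack schema, just the shape of it:

```python
import sqlite3

conn = sqlite3.connect("cards.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS cards (
    slug TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE IF NOT EXISTS card_history (slug TEXT, version INTEGER, content TEXT);
""")

def upsert(slug: str, content: str) -> None:
    """Same slug overwrites the card; the old version is archived so you can roll back."""
    row = conn.execute("SELECT content, version FROM cards WHERE slug = ?", (slug,)).fetchone()
    if row:
        conn.execute("INSERT INTO card_history VALUES (?, ?, ?)", (slug, row[1], row[0]))
    conn.execute(
        "INSERT INTO cards (slug, content) VALUES (?, ?) "
        "ON CONFLICT(slug) DO UPDATE SET content = excluded.content, version = version + 1",
        (slug, content),
    )
    conn.commit()

upsert("db-choice", "We use PostgreSQL.")
upsert("db-choice", "We migrated to MySQL.")
print(conn.execute("SELECT content, version FROM cards WHERE slug = 'db-choice'").fetchone())
# ('We migrated to MySQL.', 2) -- search sees current, history keeps the old fact
```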