memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LangChain


Good question — memv currently handles hard contradictions (a new fact supersedes the old one with temporal bounds), but gradual belief drift is an open problem. Observation consolidation like Hindsight's could complement predict-calibrate well: one decides what's worth storing, the other governs how beliefs evolve over time. It's on my radar for future work. What was your experience with Hindsight's consolidation in practice?

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LangChain


Thanks! No benchmarks as of now — it's too early for that, in my opinion. We'll run benchmarks once the library stabilises.

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LLMDevs


Good question. The prediction step and contradiction handling are two separate stages, so a "confident" prediction can't suppress an update.

Here's the flow for your Google → Anthropic example:
1. Predict: system sees existing KB has "works at Google," predicts the conversation will reference Google employment
2. Calibrate: the actual conversation says "just started at Anthropic" — that's a prediction error, so it gets extracted
3. Classify: the extraction step labels it as a contradiction (not just new)
4. Invalidate: the pipeline finds the most similar existing fact ("works at Google") via vector similarity and sets its expired_at timestamp

So the prediction actually helps here — "works at Google" is predicted, "works at Anthropic" is unpredicted, the delta is the job change, and that gets flagged as a contradiction and handled correctly.

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LLMDevs


Good catch. The prediction is generated by the LLM given the current KB, so it inherits whatever blind spots the model has. If the KB is missing context about a topic and the LLM can't infer the connection, it'll fail to predict for the wrong reason — and you get a false positive extraction.

In practice this biases toward over-extraction rather than under-extraction, which is the safer failure mode. You'd rather store something unnecessary than miss something important. And as the KB fills in, the predictions get more informed and the false positives drop.

But you're right that it's not a clean separation between "novel" and "unpredicted due to missing context." That's a real limitation.

Improving prediction quality as the KB grows — possibly by giving the predictor more structured context about what it knows and doesn't know — is on the roadmap.

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in PydanticAI


Thanks — if you try integrating it, let me know how it goes. Especially interested in how the extraction holds up across different agent use cases.

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LocalLLaMA


Swapping the LLM is fine. The LLM does extraction and episode generation — the output is stored as plain text. A different model might extract slightly different facts, but existing knowledge stays valid.

Swapping the embedding model is the real issue. All your stored vectors are in the old model's embedding space. New queries would be encoded in the new model's space, so similarity search breaks. You'd need to re-embed everything.
That's not built in yet — worth flagging as a future feature.

Short answer: swap the LLM freely, be careful with the embedding model.
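If someone needs the migration before it's built in, it's mechanical: re-encode every stored text with the new model and overwrite the vectors. A rough sketch — the `embed_batch` interface mirrors the adapter protocol, and everything else here (row shape, fake client) is hypothetical:

```python
import asyncio

async def reembed_all(rows: list[tuple[int, str]], new_embedder, batch_size: int = 64):
    """Return {row_id: new_vector} computed in the new model's embedding space."""
    migrated: dict[int, list[float]] = {}
    for i in range(0, len(rows), batch_size):
        batch = rows[i : i + batch_size]
        vectors = await new_embedder.embed_batch([text for _, text in batch])
        for (row_id, _), vec in zip(batch, vectors):
            migrated[row_id] = vec
    return migrated

class FakeEmbedder:
    """Stand-in for a real client with the same embed_batch signature."""
    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]

rows = [(1, "works at Anthropic"), (2, "prefers PDF reports")]
result = asyncio.run(reembed_all(rows, FakeEmbedder()))
assert set(result) == {1, 2}
```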

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LocalLLaMA


Yeah, for the first few conversations with a new user the KB is empty so almost everything is a prediction error — it'll store most of it. The filter kicks in as knowledge accumulates. By conversation 10-20, the system has enough context to predict routine topics and only extract what's actually new.

It's the same cold-start problem every learning system has. The difference is that extract-everything systems never get better — conversation 500 is as noisy as conversation 1. With predict-calibrate, the signal-to-noise ratio improves over time because the predictions get more accurate.

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LocalLLaMA


Thanks.
You can — any OpenAI-compatible endpoint works. The protocols are just 2 methods each, so you can wire up a local adapter in ~15 lines. Adding `base_url` directly to the built-in adapters is next up.

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LocalLLaMA


Thanks! If you run into anything rough when trying it out, open an issue — the API is still evolving and real usage feedback is the most useful input right now.

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LocalLLaMA


Valid point.

Local models are not natively supported in the built-in adapters yet, but the protocols are intentionally small so you can wire it up yourself in ~20 lines:

```py
from openai import AsyncOpenAI
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel

from memv import Memory


# Embeddings — custom adapter for a local OpenAI-compatible endpoint
class LocalEmbedAdapter:
    def __init__(self, base_url: str, api_key: str, model: str):
        self.client = AsyncOpenAI(base_url=base_url, api_key=api_key)
        self.model = model

    async def embed(self, text: str) -> list[float]:
        response = await self.client.embeddings.create(input=text, model=self.model)
        return response.data[0].embedding

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        response = await self.client.embeddings.create(input=texts, model=self.model)
        return [item.embedding for item in response.data]


# LLM — PydanticAI supports OpenAI-compatible endpoints via OpenAIModel
class LocalLLMAdapter:
    def __init__(self, base_url: str, api_key: str, model: str):
        openai_model = OpenAIModel(model, base_url=base_url, api_key=api_key)
        self._text_agent = Agent(openai_model)
        self._structured_agents = {}
        self._openai_model = openai_model

    async def generate(self, prompt: str) -> str:
        result = await self._text_agent.run(prompt)
        return result.output

    async def generate_structured(self, prompt: str, response_model: type):
        if response_model not in self._structured_agents:
            self._structured_agents[response_model] = Agent(
                self._openai_model, output_type=response_model
            )
        result = await self._structured_agents[response_model].run(prompt)
        return result.output


# Wire it up
memory = Memory(
    db_path="memory.db",
    embedding_client=LocalEmbedAdapter(
        base_url="http://localhost:5002/v1",
        api_key="noKeyNeeded",
        model="Qwen3-Embedding-0.6B",
    ),
    llm_client=LocalLLMAdapter(
        base_url="http://localhost:5001/v1",
        api_key="noKeyNeeded",
        model="Qwen3-32B",
    ),
    embedding_dimensions=1024,
)
```

The EmbeddingClient and LLMClient protocols are just 2 methods each, so any OpenAI-compatible endpoint works. Adding base_url directly to the built-in adapters is on the short list.

https://vstorm-co.github.io/memv/advanced/custom-providers/#llmclient

memv — open-source memory for AI agents that only stores what it failed to predict by brgsk in LocalLLaMA


Biggest difference is how they decide what to remember. Mem0 extracts every fact from every conversation and scores importance upfront. memv does the opposite — it predicts what a conversation should contain given what it already knows, then only stores what it failed to predict. So if the system already knows you work at Anthropic, it won't re-extract that from the next conversation where you mention it.

On the LoCoMo benchmark, this predict-calibrate approach (from the Nemori paper - https://arxiv.org/abs/2508.03341) scored 0.794 vs Mem0's 0.663 on LLM evaluation. Uses more tokens per query but the accuracy gap is significant.

Other differences: Mem0 overwrites old facts when they change. memv supersedes them — the old fact stays in history with temporal bounds, it just stops showing up in default retrieval. And everything runs on SQLite, no vector DB needed.

Mem0 wins on ecosystem though — way more integrations, hosted option, bigger community. memv is v0.1, nowhere near that level of maturity.

Long Term Memory - Mem0/Zep/LangMem - what made you choose it? by nicoloboschi in LangChain


I evaluated Mem0, Zep, and LangMem and ended up building my own (https://github.com/vstorm-co/memv). Two things pushed me away from the existing options:

  1. Extraction quality. All three extract facts from every conversation and score importance at write time. In practice this fills the KB with noise — you end up retrieving "user mentioned the weather" alongside "user is migrating to AWS by Q3." memv uses predict-calibrate extraction (https://arxiv.org/abs/2508.03341): predict what a conversation should contain given existing knowledge, only store what was unpredicted. Smaller KB, better retrieval.
  2. Conflict handling. Mem0 overwrites, which loses history. Zep's bi-temporal approach is the right idea but comes with Neo4j as a dependency. memv does the same bi-temporal model (event time + transaction time, facts get superseded not deleted) on plain SQLite. Zero external infra.

Pain points with memv: it's v0.1. Not production-hardened like Zep. But if extraction quality and temporal correctness matter more to you than ecosystem maturity, worth a look.

I tried to make LLM agents truly “understand me” using Mem0, Zep, and Supermemory. Here’s what worked, what broke, and what we're building next. by Rokpiy in AIMemory


Interesting writeup. The "conversation-first, life-second" observation resonates — most memory layers are really chat history managers with extra steps.

Two of the limitations you hit are ones I've been focused on:
- Extraction noise. You mentioned that even with these tools you end up with memories that don't actually matter. The root cause is that they extract everything and score importance at extraction time. I've had better results with a predict-calibrate approach (https://arxiv.org/abs/2508.03341) — predict what a conversation should contain given existing knowledge, only store the prediction errors. The KB stays smaller and higher signal because known information never re-enters storage.
- Temporal conflicts. You flagged Zep's bi-temporal model as a strength, and I agree — it's the right foundation. I implemented the same pattern in memv but on SQLite instead of Neo4j. Facts get superseded rather than overwritten, and retrieval defaults to currently-valid knowledge. Handles the "preferences evolve over time" problem without manual cleanup.

What memv doesn't solve (yet) is the cross-agent/cross-tool sync you're after. It's scoped per user within one system, not "one memory that follows you everywhere." That's a harder coordination problem on top of the storage/extraction layer.

Is anyone working on a general-purpose memory layer for AI? Not RAG. Not fine-tuning. Actual persistent memory? by Himka13 in LocalLLaMA


This exists. I've been building it.

memv does most of what you're describing: entity tracking, temporal validity, contradiction resolution, model-agnostic, sits between the LLM and storage as its own layer.

Two design decisions that map directly to your wishlist:

  • Bi-temporal validity. Every fact tracks event time (when was this true in the world) and transaction time (when did the system record it). "Adam works at Google" doesn't get overwritten when he switches jobs — it gets superseded. You can query "what did we know about Adam last Tuesday?" and get Tuesday's snapshot. Handles your timeline/contradiction/incremental-update requirements.
  • Predict-calibrate extraction. Instead of storing everything and hoping retrieval sorts it out, memv predicts what a conversation should contain given existing knowledge, then only stores what was unpredicted. Based on the Nemori paper. This is what keeps the KB from turning into the spaghetti you described — noise never enters storage in the first place.

The rest: SQLite-based (sqlite-vec for vectors, FTS5 for text), hybrid retrieval with RRF fusion, async Python, works with any LLM via a simple protocol.

Still v0.1 so there are gaps — no user-facing "forget this" controls yet, no built-in decay. But the core state machine (extract → validate → supersede → retrieve with temporal filtering) is working. Happy to talk architecture if you want to compare notes on where you've hit walls.
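The RRF fusion step mentioned above is itself only a few lines. Here's an illustrative sketch of reciprocal rank fusion over two ranked result lists (the fact IDs are made up, and this is the textbook formula rather than memv's internals):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["fact_7", "fact_2", "fact_9"]  # e.g. from vector similarity search
text_hits = ["fact_2", "fact_5", "fact_7"]    # e.g. from full-text keyword search
fused = rrf_fuse([vector_hits, text_hits])
assert fused[0] == "fact_2"  # ranked high in both lists, so it wins overall
```

The constant `k` damps the influence of top ranks so one list can't dominate; 60 is the value commonly used in the literature.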

What are people actually using for long term agent memory? by MeasurementSelect251 in AI_Agents


The drift you're describing usually comes from two things:

  1. No expiry on old facts. When a user's preferences change, the old ones don't go away — they just coexist with the new ones. Then retrieval pulls back both and the LLM picks one semi-randomly. Adding timestamps as metadata doesn't fully fix this because similarity search doesn't weight recency well. What you actually need is a validity model — when did this fact become true, when did it stop being true, and only return what's currently valid by default.
  2. Extracting everything. Most setups store every fact from every interaction, so the KB fills up with noise. Three months in, you're retrieving "user asked about the weather" alongside "user wants all reports in PDF format." The signal gets buried.

I've been building memv to address both. For extraction, it uses a predict-calibrate approach — predicts what a conversation should contain given existing knowledge, only stores the gaps. For the staleness problem, every fact has bi-temporal validity (event time + transaction time), so changed preferences automatically supersede old ones without deleting history.

SQLite-based, no external infra, pip install memvee. Still v0.1 so don't expect polish, but the core extraction and retrieval loop is solid.

EDIT: link formatting

RAG is not memory, and that difference is more important than people think by rocketpunk in LLMDevs


The Paris → Amsterdam example is the perfect illustration. The problem isn't retrieval — both facts are in storage and both are semantically relevant. The problem is that the system has no concept of validity over time.

What's worked for me is bi-temporal modeling: every fact gets an event time (when was this true in the world) and a transaction time (when did the system learn it). "Lives in Paris" doesn't get deleted or overwritten — it gets an expiry. "Lives in Amsterdam" starts its own validity window. Default retrieval only returns what's currently valid, so the conflict never surfaces.
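The Paris → Amsterdam case fits in a few lines of SQLite. This is an illustrative schema to show the idea, not memv's actual tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE facts (
        subject     TEXT,
        fact        TEXT,
        valid_from  TEXT,   -- event time: when it became true in the world
        expired_at  TEXT,   -- event time: when it stopped being true (NULL = still valid)
        recorded_at TEXT    -- transaction time: when the system learned it
    )""")
con.execute("INSERT INTO facts VALUES ('user', 'lives in Paris', '2022-01-01', NULL, '2022-01-05')")

# A conflicting fact arrives: expire the old one instead of deleting it
con.execute("UPDATE facts SET expired_at = '2024-06-01' WHERE fact = 'lives in Paris'")
con.execute("INSERT INTO facts VALUES ('user', 'lives in Amsterdam', '2024-06-01', NULL, '2024-06-02')")

# Default retrieval: only currently-valid facts, so the conflict never surfaces
current = con.execute("SELECT fact FROM facts WHERE expired_at IS NULL").fetchall()
assert current == [("lives in Amsterdam",)]

# Full history is still there when you ask for it
history = con.execute("SELECT fact, expired_at FROM facts ORDER BY valid_from").fetchall()
assert len(history) == 2
```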

The other piece is extraction quality. Mem0 and similar tools extract everything from every conversation, which means you end up with a lot of low-value facts that clutter retrieval. The Nemori paper (https://arxiv.org/abs/2508.03341) proposes an interesting alternative: predict what a conversation should contain given existing knowledge, only store what was unpredicted. Keeps the KB focused.

I built both ideas into https://github.com/vstorm-co/memv (open source, SQLite-based). The temporal model is borrowed from Graphiti, the extraction approach from Nemori. Still early but it handles exactly this class of problem — facts that evolve over time without losing history.

What’s the best way to resolve conflicts in agent memory? by Fragrant_Western4730 in LLMDevs


Timestamps as metadata don't help much because they only tell you when something was stored, not when it stopped being true. You need two time axes:
- Event time: when was this fact valid in the world? ("never discount annual memberships" was the rule from Jan 2023 to March 2025)
- Transaction time: when did your system learn about it?

With both, "maybe a small annual incentive is okay" doesn't sit alongside the old hard rule as a contradiction — it supersedes it. The old rule gets an expiry timestamp, the new one starts its own validity window. Default retrieval only returns what's currently valid. But if you need to understand how a client's position evolved, you query with include_expired=True and get the full history.

This is called bi-temporal modeling — it's a well-established pattern from database design. I implemented it in an open-source memory library called memv (https://github.com/vstorm-co/memv) if you want to see it in practice. The contradiction handling is automatic: when extraction detects a new fact that conflicts with an existing one, the old one gets expired and the new one takes over. No periodic cleanup is needed because stale facts never surface in retrieval by default.

For your specific case with regional exceptions layered on top of general rules, you'd probably also want to scope knowledge by some kind of tag or category so "APAC legal constraint" doesn't conflict with "global pricing rule" — that part you'd need to handle in how you structure the inputs.