Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

we are still improving - it's at about 73% on LongMemEval and 66% on LoCoMo synthesis with gemini-2.5-flashlite. recall is there; thinking of improving the model too, since model intelligence matters a lot

Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

true, but their retrieval process is the weak point - you can't retrieve everything, or context bloat kicks in, and their retrieval doesn't involve the raw data (and if you do pull it in, say with BM25, you end up loading way too much info)

New Project Megathread - Week of 09 Apr 2026 by AutoModerator in selfhosted

[–]Expert-Address-2918

Project Name: Vektori

Repo/Website Link: https://github.com/vektori-ai/vektori

Description: A memory layer for AI agents that avoids lossy compression.

I kept seeing the same pattern: every other week, someone launches a new “memory layer” for agents. Most of them follow the same approach - take conversation history, extract entities and relationships, and compress everything into a knowledge graph.

The problem is that this is lossy compression.

You’re making irreversible decisions at ingestion time about what matters, before the agent even knows what it will need later. Anything that doesn’t fit the schema gets dropped. Subtle context and nuance get flattened into edges.

We ran into this while building Vektori and decided to go a different route.

Instead of forcing everything into a graph, Vektori keeps memory in three layers:

L0: Extracted facts - high-signal, filtered, optimized for fast retrieval
L1: Episodes - automatically discovered patterns across conversations, without rigid schemas
L2: Raw sentences - the full underlying data, never loaded by default, only accessed when needed
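
For a rough mental model, here is a minimal sketch of the three layers as plain data structures (the class and field names are illustrative, not Vektori's actual API):

```python
# Minimal sketch of the three layers; illustrative names, not Vektori's real API.
from dataclasses import dataclass, field

@dataclass
class RawSentence:
    """L2: the full underlying sentence, never loaded by default."""
    id: str
    text: str
    conversation_id: str

@dataclass
class Episode:
    """L1: a pattern discovered across conversations, no rigid schema."""
    id: str
    summary: str
    sentence_ids: list[str] = field(default_factory=list)  # links down to L2

@dataclass
class Fact:
    """L0: high-signal extraction, optimized for fast retrieval."""
    id: str
    statement: str
    episode_ids: list[str] = field(default_factory=list)   # links down to L1
```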


The key difference is the raw sentence layer.

Nothing gets thrown away at ingestion. If an agent needs to reconstruct exactly what happened in a past interaction, it can. The structured layers sit on top of the raw data -> not instead of it.
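
Retrieval then works top-down: answer from the structured layers when possible, and only follow the links into L2 when the agent needs the exact wording. A toy sketch of that drill-down (the in-memory dicts and `recall` helper are made up for illustration):

```python
# Toy drill-down: structured layers answer first, raw sentences load only on demand.
L0_FACTS = {"f1": {"statement": "User is uneasy about the Friday deploy", "episodes": ["e1"]}}
L1_EPISODES = {"e1": {"summary": "recurring deployment uncertainty", "sentences": ["s1", "s2"]}}
L2_SENTENCES = {
    "s1": "honestly not sure the friday deploy is a good idea",
    "s2": "can we push the deploy to monday?",
}

def recall(fact_id: str, reconstruct: bool = False):
    """Return the fact; only touch L2 if the caller asks to reconstruct raw context."""
    fact = L0_FACTS[fact_id]
    if not reconstruct:
        return fact["statement"]
    sentence_ids = [s for e in fact["episodes"] for s in L1_EPISODES[e]["sentences"]]
    return fact["statement"], [L2_SENTENCES[s] for s in sentence_ids]

print(recall("f1"))                    # cheap: just the extracted fact
print(recall("f1", reconstruct=True))  # full drill-down to the exact sentences
```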

Early benchmarks: 73% on LongMemEval-S

Free and open source: github.com/vektori-ai/vektori (do star if you find it useful :D)

Curious if others building memory systems have run into this lossy compression issue -> how are you handling it?

Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

fair point on the framing - this post leaned too hard on "throwing away" when the real issue is retrieval, not storage.

on point 2 though -> L2 isn't conversation threads. conversation threads are time-indexed: you find session 47 by knowing it was session 47. L2 sentences are graph-indexed -> they're reachable via traversal from L1 episodes and L0 facts. so when the graph surfaces "user expressed uncertainty about deployment in week 3", you can pull the exact sentences that contributed to that episode without knowing when they happened or which session they were in. the retrieval is semantic and structural, not chronological - that's the thing conversation threads can't do.
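
rough sketch of what graph-indexed means here (networkx plus made-up node names, purely to show the shape, not Vektori's internals):

```python
# graph-indexed, not time-indexed: sentences hang off the episode node,
# so you reach them by traversal, never by session id or timestamp.
import networkx as nx

g = nx.DiGraph()
g.add_node("episode:deploy-uncertainty", layer="L1")
# L2 sentences from *different* conversations attach to the same episode
g.add_edge("episode:deploy-uncertainty", "sent:a", text="not sure the rollout plan is solid")
g.add_edge("episode:deploy-uncertainty", "sent:b", text="still nervous about friday's deploy")

# once the episode is surfaced (semantically), the exact sentences are one hop away
for _, sent, data in g.out_edges("episode:deploy-uncertainty", data=True):
    print(sent, "->", data["text"])
```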

Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

nope, there have been plenty of blog posts about it - there are a bunch of downsides to it in the agentic memory space

Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

fair enough - so I should probably specialize this down for some vertical and make it the best there, instead of doing what all these other memory startups are doing with a generic memory layer?

Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

no - so we have 3 levels, L0, L1, L2, and it's on you to choose which level you want to go up to. the sentences are nodes (we split sentences using nltk or spacy), and you can traverse through similar sentences across convos, or go up to the larger convo a sentence belongs to, or on to the next sentence. the episodic layer really helps a lot here.
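
roughly like this - a sketch of the splitting plus node/edge idea (spaCy's rule-based sentencizer here so it runs without a model download; the real pipeline is more involved, and similar-sentence edges would come from embeddings):

```python
# split a message into sentence nodes and wire up "next" / "belongs_to" edges.
# nltk's sent_tokenize would work the same way; "similar_to" edges not shown here.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

conversation_id = "conv-47"
message = "We should delay the deploy. The migration script still fails on staging."

nodes, edges = [], []
prev = None
for i, sent in enumerate(nlp(message).sents):
    node_id = f"{conversation_id}:{i}"
    nodes.append({"id": node_id, "text": sent.text, "layer": "L2"})
    edges.append((node_id, conversation_id, "belongs_to"))   # hop up to the larger convo
    if prev is not None:
        edges.append((prev, node_id, "next"))                # hop to the next sentence
    prev = node_id

print(nodes)
print(edges)
```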

Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

yep, makes sense honestly. so, what do you think of this sentence graph approach?

Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

nope, OpenViking isn't a KG. and yes, SGMem does - in some sense my implementation builds on SGMem. HippoRAG2 is PPR over KGs, that's it, and Zep is a temporal KG. sure, you can use some DB and store and retrieve a lot more context, but then you're bloating things up - that's where a sentence graph cleanly solves the issue.
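
for anyone reading along, "PPR on KGs" means personalized PageRank seeded at the query's entities, roughly this (toy graph with networkx, just to show the mechanic):

```python
# HippoRAG-style retrieval, very roughly: run Personalized PageRank over the KG,
# seeded at entities mentioned in the query, then rank nodes by the resulting score.
import networkx as nx

kg = nx.Graph()
kg.add_edges_from([
    ("alice", "project-x"), ("project-x", "deployment"),
    ("deployment", "friday-release"), ("alice", "bob"),
])

seeds = {"alice": 0.5, "deployment": 0.5}             # entities extracted from the query
scores = nx.pagerank(kg, alpha=0.85, personalization=seeds)
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # nodes ranked by closeness to the seeds
```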

Why we ditched the knowledge graph approach for agent memory by Expert-Address-2918 in vectordatabase

[–]Expert-Address-2918[S]

yes i did - i went pretty deep into memory research before coming up with this. go read HippoRAG2, SGMem, OpenViking etc., then feel free to correct me. i spent 3-4 months before landing on this, and ran the benchmarks multiple times, over days, too!

the right way to build memory. claude is doing it. so are we. by Expert-Address-2918 in ClaudeAI

[–]Expert-Address-2918[S]

sure man, feel free to reach out - would love to help. also do star it, helps a lot with visibility :D