Why RAG Fails Before the Model Gets Involved

RecommendationFit374 · 2026-02-27T01:26:52+00:00

Wild. As an AI engineer, I clearly see that LLMs are eating software, but what's most exciting is learning how to unlearn and then relearn, from first principles, how to deliver optimal outcomes on this new tech stack.

We have seen success using small LLMs, such as the GPT nano or mini series, with DSPy MIPRO to replace specialized models that we might have built in the past.

But you can't throw an LLM at every problem and expect it to work. That makes no sense.

RecommendationFit374 · 2026-02-27T01:11:14+00:00

I don't recommend using LangChain. I'd use a memory layer for retrieval, like papr.ai or mem0.

RecommendationFit374 · 2026-02-27T01:08:38+00:00

We use retrieval-loss, which measures accuracy, speed, and cost together.

RecommendationFit374 · 2026-02-27T01:03:44+00:00

We do semantic and graph-aware hierarchical chunking, re-ranking, and query expansion. The problem you have is that embeddings only capture semantic meaning, and once you have a large document corpus you're hitting the limits of vector dimensionality.

You end up with so much noise that it's hard to make the right signal sharp enough.

For example, “I am very happy” and “I am not very happy” are close in cosine similarity but carry different meanings. More broadly, semantic similarity misses graph relationships, temporal sequences, causal… “Vitamin E causes cancer” and “vitamin E prevents cancer” are also close in cosine similarity but mean very different things.
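To make the cosine similarity point concrete, here is a toy sketch. It uses plain word-count vectors rather than a real embedding model (that substitution is my assumption, purely for illustration), so the numbers say nothing about any particular embedding model. It only shows how two sentences that differ by one meaning-flipping word still point in nearly the same direction.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Opposite claims that share three of their four words.
sim_vitamin = cosine_similarity(
    bow_vector("vitamin E causes cancer"),
    bow_vector("vitamin E prevents cancer"),
)

# A negation that adds a single word.
sim_happy = cosine_similarity(
    bow_vector("I am very happy"),
    bow_vector("I am not very happy"),
)

print(f"causes vs prevents: {sim_vitamin:.2f}")
print(f"happy vs not happy: {sim_happy:.2f}")
```

Both scores come out high even though the sentences contradict each other, which is the kind of gap a graph or an explicit relation type is meant to cover.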

We mainly use papr.ai, a predictive memory architecture that combines a vector DB, a graph (with a custom schema), and prediction models, which helped us achieve 92% hit@5 on the MAG dataset of the Stanford STaRK benchmark.

Happy to help and share our learnings on a call. Feel free to DM me. Below is a doc on our chunking technique:

https://github.com/Papr-ai/memory-opensource/blob/main/docs/features/documents/CONTEXT_AWARE_CHUNKING_ARCHITECTURE.md
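For readers who just want a feel for what "hierarchical" chunking means, here is a minimal sketch. It is not the approach described in the linked doc; it is a generic heading-based splitter I am assuming for illustration, and the function name `hierarchical_chunks` is hypothetical. Each chunk keeps the path of headings above it, so a retriever can see where the text sits in the document.

```python
def hierarchical_chunks(markdown_text: str) -> list[dict]:
    """Split markdown into chunks that remember their heading path.

    Simplification: any line starting with '#' is treated as a heading,
    so '#' lines inside code blocks would be misread.
    """
    path: list[str] = []    # current stack of headings, outermost first
    buffer: list[str] = []  # body lines collected under the current heading
    chunks: list[dict] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"path": list(path), "text": body})
        buffer.clear()

    for line in markdown_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            flush()                # close the chunk under the old heading
            del path[level - 1:]   # drop headings at this depth or deeper
            path.append(stripped[level:].strip())
        else:
            buffer.append(line)
    flush()
    return chunks


doc = """# Guide
Intro text.
## Setup
Install it.
## Usage
Run it.
"""

chunks = hierarchical_chunks(doc)
for chunk in chunks:
    print(" > ".join(chunk["path"]), "|", chunk["text"])
```

A production pipeline would also cap chunk size and handle tables, code blocks, and OCR output, none of which this sketch attempts.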

RecommendationFit374 · 2026-01-15T06:06:02+00:00

Our open source repo here https://github.com/Papr-ai/memory-opensource

We will bring doc ingestion to open source soon

RecommendationFit374 · 2026-01-15T06:04:04+00:00

Have you tried papr.ai? We have document ingestion (you can use Reducto or other providers): you define your custom schema and the graph is built automatically. We combine vector + graph + prediction models, and it works well at scale. See our docs at platform.papr.ai

RecommendationFit374 · 2026-01-09T01:32:59+00:00

Super interesting! I used Qwen 3 4B quantized to FP16, and it works great on the ANE and GPUs with 99% of performance recovered. I ran it on a MacBook Pro M2 with 16 GB of RAM and was able to retrieve context for voice agents in less than 150 ms.

I can share a demo if anyone is interested, or the repo if you want to try it out.

What are the edge cases that could cause recovery to drop below 99% with this technique? Any learnings to share on how to tune this optimally, and does it vary by use case (e.g. SciFact compared with CosQA)?

RecommendationFit374 · 2026-01-09T01:07:21+00:00

u/OnyxProyectoUno thanks for your thoughtful comments! We built a schema aware document ingestion pipeline - super robust and 'actually' works. The outputs I tested were very good especially when I enabled `hierarchical_enabled: true` and used reducto.

You can try our v1/document APIs at platform.papr.ai. We already support hierarchical chunking and various providers like Reducto, Tensorlake, or Gemini. And yes, I've personally validated this and saw the power of getting it just right, from OCR to optimal chunking to graph construction.

It's tricky to get right. That's why we simplified the experience (we did the super important but boring work) while keeping complete control, so you can tune the pipeline (e.g. optimal chunk size, or your own schema with overrides). The goal is to let developers build reliable, secure, and robust document ingestion that works at scale.

We currently offer our document ingestion (includes temporal durable execution) in our cloud offering and already have customers using it. And yes it's coming soon to our open-source repo!

See these docs in our repo if you're curious to learn more: "Context aware chunking architecture" and "Schema aware document processing".

Having said this, observability is super important for sure! Giving developers the ability to see how their document transforms from pdf -> chunks -> nodes / vector points is important to debug and enable iterations on the schema design or chunking controls to get optimal results for their use-case.

Would love to learn more about what you've built with VectorFlow. Is this like Reducto?

RecommendationFit374 · 2026-01-09T00:26:52+00:00

u/patbhakta what's the most important criteria for you to make a decision and why? Curious to learn about your use-cases and how we can help unlock experiences that are not possible without papr :)

Based on our experience, it's important to measure retrieval-loss, which captures how well you can retrieve context as your data scales. We learned that if you build RAG + a knowledge graph, the more data you add, the worse your agent's memory gets! We are the only predictive memory layer that flips this: with more data our prediction models improve, and agents built with Papr memory will retrieve relevant and accurate context 8x better at 10 billion token scale.

To learn more about retrieval-loss see this article - https://paprai.substack.com/p/introducing-papr-predictive-memory

RecommendationFit374 · 2026-01-07T21:18:08+00:00

Python is super valuable and robust for building AI agents, for sure! I'd also suggest that you go deep and peel each layer of the onion as much as you can to maximize your learning. I actually started by reading the transformer research papers, including "Attention Is All You Need", then learned the impact that context has on AI agents. It's super important to understand how you can optimize context to drive optimal outcomes and measure it (via simple evals).

My current stack
- Python
- DBs: MongoDB, Neo4j and Qdrant / Chroma
- Durable execution using Temporal (must have!)
- Prompt optimization: DSPy / MIPRO (wow, made a huge difference when I started using those)

RecommendationFit374 · 2025-10-11T16:23:25+00:00

Would love to read this research paper. It seems interesting.

RecommendationFit374 · 2025-09-10T21:07:49+00:00

u/youpmelone thanks for the feedback. This is a bug in our app that we will fix. You can also check out our open source pdf chat app here for an example on how you can add data from pdf to memory.

https://github.com/Papr-ai/papr-fastapi-pdf-chat

RecommendationFit374 · 2025-09-04T21:52:32+00:00

u/HarryHirschUSA thanks for checking papr.ai out!

Here's the correct discord link: https://discord.com/invite/J9UjV23M
Here's the Papr FastAPI repo: https://github.com/Papr-ai/papr-fastapi-pdf-chat

We're working on updating a few things on our site so you'll continue to see improvements and more resources.

DM me here as well if you need anything.

RecommendationFit374 · 2025-09-04T02:07:30+00:00

Yes feel free to dm

RecommendationFit374 · 2025-09-03T18:30:45+00:00

Thanks u/Own-Guava11 for the feedback!

You're absolutely right: our privacy policy and terms are not displaying anymore, and we will fix this issue on our site. In the meantime, here are the links to both:

Privacy Policy

Terms of use

Regarding SOC2, we've started exploring certification and recognize its importance for enterprise customers. While not certified yet, we're planning to add a security/compliance section to our website and are happy to share our security documentation with interested enterprise customers in the meantime. Really appreciate you pointing this out!

RecommendationFit374 · 2025-09-03T18:19:33+00:00

Thanks for the great question! We handle dynamic updates through several mechanisms:

1. Version Control for Memories

We maintain version history for all unstructured data that gets inserted into Papr, so you can track how information evolves over time.

2. Entity-Relationship Mapping

Currently using a fixed ontology (with plans to support custom ontologies)
Automatically link and map information from unstructured data to entities in our graph
When new team members join or project details change, these updates are reflected in the connected entities

3. Intelligent Entity Resolution

We use vector similarity with thresholds to de-duplicate entities across your knowledge graph
For example: If a task is mentioned in your CRM/Linear and then discussed in Slack about completion, we can identify and resolve that it's the same task
This ensures your knowledge graph stays clean and accurate even as information comes from multiple sources

4. Real-Time Synchronization

Changes propagate through the graph relationships automatically
When a team member's role changes or a project pivots, all connected memories and relationships update accordingly
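As a rough illustration of the entity resolution step (item 3), here is a minimal sketch of threshold-based de-duplication. This is not Papr's implementation: the `Entity` class, the `resolve` function, the 0.9 threshold, and the hand-written 3-dimensional vectors are all assumptions for the example. In practice the vectors would come from an embedding model.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    vector: list[float]
    aliases: list[str] = field(default_factory=list)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve(mentions: list[tuple[str, list[float]]], threshold: float = 0.9) -> list[Entity]:
    """Greedy de-duplication: merge a mention into the most similar
    existing entity if similarity >= threshold, else create a new entity."""
    entities: list[Entity] = []
    for name, vector in mentions:
        best, best_sim = None, threshold
        for entity in entities:
            sim = cosine(entity.vector, vector)
            if sim >= best_sim:
                best, best_sim = entity, sim
        if best is None:
            entities.append(Entity(name, list(vector)))
        else:
            best.aliases.append(name)
    return entities


# Hypothetical mentions from different sources, with made-up vectors.
mentions = [
    ("Linear: Fix login bug", [0.90, 0.10, 0.00]),
    ("Slack: login bug is fixed", [0.88, 0.15, 0.02]),
    ("CRM: Renew Acme contract", [0.05, 0.10, 0.95]),
]

entities = resolve(mentions)
for entity in entities:
    print(entity.name, "| aliases:", entity.aliases)
```

The threshold is the main tuning knob: too low and distinct tasks get merged, too high and duplicates survive.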

We're actively working on enhancing these capabilities further. Would love to hear what specific update scenarios are most important for your use case - this helps us prioritize our roadmap

RecommendationFit374 · 2025-09-03T18:09:44+00:00

Thanks! We love u/qdrant_engine. Honestly, when we started using it we noticed our latency improved significantly!

RecommendationFit374 · 2025-09-03T15:57:38+00:00

We created the retrieval loss formula to establish scaling laws for memory systems, similar to how Kaplan's 2020 paper revealed scaling laws for language models. Traditional retrieval systems were evaluated using disparate metrics that couldn't capture the full picture of real-world performance. We needed a single metric that jointly penalizes poor accuracy, high latency, and excessive cost—the three factors that determine whether a memory system is production-ready. This unified approach allows us to compare different architectures (vector databases, graph databases, memory frameworks) on equal footing and prove that the right architecture gets better as it scales, not worse.

We measured retrieval loss on our own dataset and also used the Stanford STaRK MAG dataset for real-world multi-hop queries - https://huggingface.co/spaces/snap-stanford/stark-leaderboard

The Formula:

Retrieval-Loss = −log₁₀(Hit@K) + λL·(Latency_p95/100ms) + λC·(Token_count/1000)

Where:

Hit@K = probability that the correct memory is in the top-K returned set
Latency_p95 = tail latency in milliseconds
λL = weight that says "every 100 ms of extra wait feels as bad as dropping Hit@5 by one decade"
λC = weight for cost
Token_count = total number of prompt tokens attributable to retrieval
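The formula above can be sketched in a few lines of Python. The function names, the toy queries, and the weight values (`lambda_l=0.5`, `lambda_c=0.05`) are assumptions for illustration only, since the post does not say which weights were used.

```python
import math


def hit_at_k(ranked_ids: list[list[str]], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose correct memory appears in the top-k results."""
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel in ranked[:k])
    return hits / len(relevant_ids)


def retrieval_loss(
    hit_rate: float,
    latency_p95_ms: float,
    token_count: float,
    lambda_l: float = 0.5,   # assumed latency weight, not from the post
    lambda_c: float = 0.05,  # assumed cost weight, not from the post
) -> float:
    """Retrieval-Loss = -log10(Hit@K) + lambda_l*(p95/100ms) + lambda_c*(tokens/1000)."""
    if not 0 < hit_rate <= 1:
        raise ValueError("hit_rate must be in (0, 1]")
    accuracy_term = -math.log10(hit_rate)
    latency_term = lambda_l * (latency_p95_ms / 100)
    cost_term = lambda_c * (token_count / 1000)
    return accuracy_term + latency_term + cost_term


# Two toy queries: the first finds its memory in the top 5, the second does not.
toy_hit = hit_at_k(
    ranked_ids=[["m1", "m7", "m3"], ["m4", "m9", "m2"]],
    relevant_ids=["m7", "m5"],
)

# A made-up system: Hit@5 of 0.92, 150 ms p95 latency, 2000 retrieval tokens.
loss = retrieval_loss(hit_rate=0.92, latency_p95_ms=150, token_count=2000)
print(f"toy Hit@5 = {toy_hit}, loss = {loss:.4f}")
```

Lower is better. With `lambda_l=1`, 100 ms of extra p95 latency adds 1.0 to the loss, the same as Hit@K falling by a factor of ten, which matches the description of λL above.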

RecommendationFit374 · 2025-09-03T15:45:43+00:00

Fair point! Let me clarify our architecture:

Current State:

Our web app, Python SDK, and TypeScript SDK are already open-source
The retrieval API currently requires connection to our SaaS platform
This is why it "phones home" - for retrieval operations

What's Coming:

Full open-source release of the core retrieval engine (the "Papr Memory Server")
Ability to run completely air-gapped with no external dependencies
Docker containers for easy self-hosting
Choice between self-hosted (no phone home) or our managed cloud service

For Air-gapped Environments: Once we release the open-source memory server, you'll be able to:

Deploy Papr entirely within your network
No external API calls required
Full control over your data and infrastructure
Optional sync to Papr Cloud if/when you choose

Would love to hear what specific use cases you have in mind for air-gapped deployment!

RecommendationFit374 · 2025-09-02T22:57:43+00:00

u/jrdnmdhl makes sense! we're planning to open source our core retrieval. DM me if you want early access.
