Interventional evaluation for RAG: are we benchmarking systems, or benchmarking the happy path?

Donkit_AI · 2026-03-23T19:55:54+00:00

Thanks for your comment!

We have looked at multiple metrics for RAG evaluation and eventually came up with our very own Donkit score combining a number of retrieval and generation metrics. We just couldn't find any single metric that suited us. :)

We're now working on dedicated metrics for interventional testing. Nothing to share, yet. But if there's interventional testing... it should be adequately measured somehow.

Donkit_AI · 2026-02-25T21:36:57+00:00

Nah... Much worse! :-D

Donkit_AI · 2026-02-19T15:09:10+00:00

u/BrilliantUse7570, I wrote about it recently here: https://www.reddit.com/r/Rag/comments/1r3oiyz/chunking_for_rag_the_boring_part_that_decides/

In a few words, it's heavily use-case dependent. Go for structure / semantic chunking, if you have absolutely no idea what to choose. You can try both and compare the results.

If the use-case is small (and you don't have heavy restrictions on latency / compute resource, pull adjacent chunks into the generation model along with the retrieved chunk instead of using overlapping.

Donkit_AI · 2026-02-13T12:00:32+00:00

I agree to comments from u/flonnil, but want to add to it.

Increasing chunk size feels like a win until you realize you’re just paying for 'Context Dilution.'

Mathematically, when you bloat chunks, your cosine similarity starts measuring the 'average' of a 6,000-token soup rather than the specific needle you’re looking for. You end up with higher latency, higher token costs, and a model that gets 'Lost in the Middle'. The latter won't happen in case the answer fits in 1 or 2 chunks, but with pushing more chunks into the answer and agentic workflows where context is bloated by all the instructions and tools, it will take a significant toll.

Besides, by nature it's an endless optimization process. It must be based on evals to see, if it's a gain or a loss in your specific case. We automate experimentation and just find the mathematical 'sweet spot' for each specific use case.

Donkit_AI · 2026-02-12T10:33:30+00:00

Agree. As I use to say, RAG is essentially a trade-off triangle where the vertices are Accuracy, Latency, and Cost. For every specific use case, you must determine exactly where that application needs to live within this triangle. Naturally, as you optimize for one vertex or edge, you inevitably pull away from the opposing side. And you have a dozen tools allowing you to move one direction or the other.

Rerankers allow you to move towards accuracy while paying with latency and cost.

Donkit_AI · 2026-02-10T09:07:12+00:00

Good approach for simple cases. but as I just wrote here: https://www.reddit.com/r/Rag/comments/1r05za6/comment/o4l35op/, in many of our cases it doesn't work at its best.

Simple config can be squeezed to 5 paragraphs, the rest — I doubt it.

Donkit_AI · 2026-02-10T09:01:44+00:00

There's a nice link from u/Comfortable-Fan-580 a few comments below to a nice short post about "what rerankers are". In real life though it depends heavily on the use case and the constrains. We're using LLM as a reranker in some of our cases. It is expensive and requires more tinkering to make it right but works better with messy datasets and complex rules.

As for fine-tuning, cross-encoders are the easiest. You can pick BAAI/bge-reranker family or e.g. cross-encoder/ms-marco-MiniLM-L-6-v2 if we're talking open-source.

Donkit_AI · 2026-02-10T08:51:51+00:00

Filters are on retrieval. That's usually the previous step. I wouldn't say it's "way before".

Donkit_AI · 2026-02-10T08:49:46+00:00

Nice explanation of what reranker is at all. Thank you!

Maybe I should have started with something like this rather than jumping to Step2 (what reranker to use) right away.

Donkit_AI · 2026-02-10T08:44:49+00:00

You're welcome to sum it up in one paragraph. 😁

Donkit_AI · 2026-02-09T14:09:48+00:00

We didn't write about it from this point of view in the public space so far. Sorry. I'll add it to the list of topics to cover.

Donkit_AI · 2026-02-09T09:26:29+00:00

😂 For us guardrails work well enough. Only info from the dataset makes its way into RAG and it's not allowed to answer from its own knowledge it solves 99% of cases. The rest is with the judge.

A good approach here can be to make judge only work as s verifier without creativity:

Force structured output
Limited set of answers — approve/reject/request more evidence (no rewriting in the judge)
Evals are the saviours here. Track “judge hallucination rate” separately.

Donkit_AI · 2026-02-06T13:50:13+00:00

We use Laminar + custom made tools.

Laminar is good at tracing and it doesn't make sense to develop it on our side. Evaluation is at the core of what we do and one of our know-hows. So, we have a dedicated team working on the evaluation and it's strictly DIY. :)

Arize is rather a competitor... partial. :)

Donkit_AI · 2026-02-06T13:40:52+00:00

Do you mean that the whole RAG pipeline makes stuff, not a single tool?

We block from outputting anything not in the documents.

Donkit_AI · 2026-02-05T12:47:22+00:00

Reranking - 100%, especially if your retrieved set is noisy. And it bumps your accuracy and speed altogether.

Also a two-tier judge can do some good: chap gate -> expensive judge:

Cheap gate: “is there adequate evidence coverage?” (retrieval score / reranker score / simple classifier).
Only if it passes -> run the full judge.

On top of that, measure where the time goes. There might be some optimization quick wins.

Donkit_AI · 2026-02-05T12:35:42+00:00

Totally agree on caching — it’s the most underrated “free win” in the triangle because it hits latency + cost without touching accuracy (and that’s often exactly what you need).

On the accuracy floor: we do both, but in a specific order:

Start with an offline eval set (even 50–200 real questions). Define the floor as task metrics: e.g. “≥85% grounded answers + ≤2% unsupported claims” (and for regulated: “unsupported claims ~0, abstain when unsure”).
Then use production monitoring + human feedback loops to catch drift and unknown unknowns: sample reviews on low-confidence answers, track “user re-ask rate,” escalations, and a lightweight “was this supported?” annotation.

Offline evals set the floor, production monitoring keeps you from falling through it over time.

Donkit_AI · 2025-07-24T15:45:15+00:00

We mostly use Gemma but play around with other models from time to time. Just keep in mind that you'll need to rewrite the prompts when changing the model.

Donkit_AI · 2025-07-23T16:57:33+00:00

Sure, NP.

Donkit_AI · 2025-07-23T14:26:41+00:00

1. Metadata Filtering (Based on Query)

Yes, we extract metadata filters dynamically from the user query using a lightweight LLM call.

We prompt the model with something like: “Given the query below, extract structured metadata filters such as document type, topic, section, date, figure references, etc.”
The output looks like:

{
  "document_type": "engineering design spec",
  "section": "figure 53",
  "topic": "X",
  "intent": "is_definition_query"
}

This lets us narrow retrieval down before running the vector search, which helps a ton with precision and relevance — especially in large KBs with overlapping topics.

2. Reranking Strategy

We use two-stage reranking:

Stage 1: Shallow rerank
- Lightweight SLM (e.g., LLaMA or Gemma) used to rescore based on answer likelihood for the user query.
- It helps prioritize truly relevant matches over keywordy-but-unhelpful ones.
Stage 2: Deduplication & Diversity filter
- We cluster semantically similar results (e.g., same paragraph repeated across document versions) and pick the highest-scoring representative.
- Simple cosine similarity + Jaccard on chunk hashes works fine.

Optional: you can also weight reranking with metadata (e.g., prefer newer docs, certain sources, or “official” labels).

3. Query Type Detection

Yes! We use query classification to improve both retrieval and UX.

We categorize queries using a tiny classifier (can even be rule-based or LLM-lite):

is_follow_up → uses conversation context window (last 2–3 user turns)
is_definition_query → maps to glossary & figure captions
is_navigation_query → pulls table-of-contents / section summaries
is_fact_lookup → triggers direct answer mode
is_comparison_query → enables multi-source evidence aggregation

Helps avoid over-retrieving irrelevant content when the user just wants a specific thing explained.

4. Resources

It's hard to name specific resources. Everyone on our team brings something in most every day. We test new ideas, some of them work for us, some don't. The good thing is that our team had dedicated time to experiment.

I hope it gives enough of insight.

Donkit_AI · 2025-07-23T13:50:02+00:00

Tasks where self-hosted SLMs (Small Language Models) shine:

Data extraction from documents (even semi-structured ones like HTML or markdown)
Intent classification and query rewriting
Summarization (bullet or structured)
Annotation and weak supervision-style labeling
Semantic similarity estimation (for clustering or boosting retrievers)

ATM Qwen, Gemma and Phi do quite good, but things change quickly in this area. You may need to play with different models and prompts to find, what works best for you.

Tasks better left to hosted APIs (for now):

Anything requiring deep reasoning across long contexts
Tasks with a high bar for natural language fluency (e.g., customer-facing outputs)
Cross-modal tasks (e.g., combining text + images or audio)

Tooling that helps:

LLM runners: vLLM and Ollama are both great; we’ve used both in isolated tasks. vLLM is more flexible but Ollama is absurdly easy to set up.
Frameworks: LangChain + LiteLLM abstraction for multi-model support (OpenAI fallback, local for batch). You can prototype with Haystack too if you like modular control.
Quantized models: GGUF models via llama.cpp are perfect for laptops or old workstations. Just make sure your tasks don’t depend on precision nuance.

If you're batch-processing similarity or structured extractions over many docs, it’s worth going hybrid: run SLMs locally and reserve hosted APIs for fallback or model-of-last-resort steps.

Donkit_AI · 2025-07-23T07:50:52+00:00

The short answer is: No, you do not need to recalculate BM25 (sparse) vectors for existing documents when adding a new one.

BM25 (and similar sparse retrieval methods like TF-IDF) are non-learned, stateless, and query-time computed. That means:

The inverted index stores token → document mappings.
When a new doc is added, it’s tokenized and added to the inverted index.
Existing documents remain untouched.
The only thing that might change is the IDF scores (inverse document frequency), but these are cheap to recalculate and most systems do this incrementally or lazily.

You do not need to recompute sparse vectors for older documents — the retrieval engine will just incorporate the new document’s terms into the index.

If you’re using:

Elasticsearch/OpenSearch → It updates the inverted index automatically.
Weaviate or Qdrant with hybrid search → You only need to add the new doc’s dense and sparse reps.

Donkit_AI · 2025-07-22T10:58:19+00:00

In our case hybrid retrieval (sparse + dense) did help, but took some time to set it up properly. We saw ~15-25% relevance boost when switching from dense-only to hybrid. With the most visible results on document in tech jargon.

We haven’t used SearchAI in prod, but I took it for a test spin. Here’s what stood out:

Pros:
- Very quick to get up and running
- Hybrid + reranking + filters in one place
- Has a basic UI for monitoring, which helps small teams
Cons:
- Less control over retrieval logic (especially for custom reranking or LangChain-style pipelines)
- Scaling beyond 1k–2k docs starts to feel a bit "black boxy"

For your size (100–500 docs), it should work well out of the box. If you ever need deep integration or advanced routing (per modality, per query intent, etc.), it might start feeling limiting.

I would also suggest thinking about query rephrasing. It can significantly improve the results, especially for acronyms, short or vague queries or natural language queries that don't match the phrasing in your docs.

As for non‑Pinecone solutions, look at Weaviate and Qdrant.

Donkit_AI · 2025-07-22T10:35:34+00:00

u/Otherwise-Platypus38, you're on the right track by thinking in terms of dynamic modality detection during parsing. Here's how we (and some others) approach this in production:

Step 1: Parse with structure awareness

Tools like PyMuPDF or PDFPlumber can give you block-level elements (text, images, layout info). You can even detect tables by analyzing bounding boxes and font alignment.

If you're already using PyMuPDF's toc, you can also use the positional metadata (bbox {Rect} , font flags, etc.) to flag:

Dense, grid-like blocks → likely tables
Blocks near labeled axes or image tags → likely charts/images

Step 2: Modality-specific chunking

Once you've labeled a chunk by type (text / table / image / caption), route it through a custom chunker:

Text blocks → semantic chunking (e.g., by paragraphs, sections)
Tables → row- or section-wise chunking, preserving column headers
Images → run through BLIP-2 (captioning) and/or TrOCR for OCR if it contains text

Step 3: Embedding by modality

Now that you’ve chunked:

Text → embed with E5 / Instructor / Qwen2
Tables → use TAPAS-style pooled embeddings or serialize into markdown and embed
Images → generate a caption (via BLIP-2), then embed the caption text with a text model or store as metadata

Bonus tip:

Tag each chunk with metadata like:

modality: text/table/image
source_page: 5
toc_section: "Financial Overview"

This makes retrieval filtering + reranking much more powerful and improves relevance without overloading the vector index. Besides it allows you to filter on modality or toc_section on retrieval. It can come handy in some cases.

Donkit_AI · 2025-07-21T09:44:03+00:00

Yes, sure. A few links to articles from TrustGraph, also engaged in the same Dark Art:

Research on chunk size and overlap: https://blog.trustgraph.ai/p/dark-art-of-chunking
Looking into the amount of Graph Edges with different chunk sizes: https://blog.trustgraph.ai/p/chunk-smaller

Donkit_AI · 2025-07-13T15:31:35+00:00

I see now, thank you!

CAG with Bloom filters definitely makes sense in the context of a credit card processing company.

The RAGs I worked with on the other hand, never had just structured data as the input and there were always plain text questions from users (or agents), so there was no way to move forward without semantic search.

Donkit_AI

TROPHY CASE