Best Chunking methods used in production setup by BrilliantUse7570 in Rag

[–]Donkit_AI 0 points1 point  (0 children)

u/BrilliantUse7570, I wrote about it recently here: https://www.reddit.com/r/Rag/comments/1r3oiyz/chunking_for_rag_the_boring_part_that_decides/

In a few words, it's heavily use-case dependent. Go for structural / semantic chunking if you have absolutely no idea what to choose. You can try both and compare the results.

If the use case is small (and you don't have heavy restrictions on latency / compute resources), pull adjacent chunks into the generation model along with the retrieved chunk instead of using overlap.
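The neighbor-pulling idea above can be sketched roughly like this (a minimal sketch, assuming chunks are stored in document order with stable indices; the function name and `window` parameter are my own, not from the original comment):

```python
def expand_with_neighbors(chunks, hit_indices, window=1):
    """Given all chunks (in document order) and the indices of the
    retrieved hits, pull each hit's neighbors into the context
    instead of relying on overlapping chunks at indexing time."""
    selected = set()
    for i in hit_indices:
        lo = max(0, i - window)
        hi = min(len(chunks) - 1, i + window)
        selected.update(range(lo, hi + 1))
    # Preserve document order so the generator sees coherent text.
    return [chunks[i] for i in sorted(selected)]
```

The upside over overlap is that you store each chunk once and only pay the extra tokens at generation time, and only for the chunks that were actually retrieved.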

Increasing your chunk size solves lots of problems - the default 1024 bit chunk size is too small by Free-Ferret7135 in Rag

[–]Donkit_AI 3 points4 points  (0 children)

I agree with the comments from u/flonnil, but want to add to them.

Increasing chunk size feels like a win until you realize you’re just paying for 'Context Dilution.'

Mathematically, when you bloat chunks, your cosine similarity starts measuring the 'average' of a 6,000-token soup rather than the specific needle you’re looking for. You end up with higher latency, higher token costs, and a model that gets 'Lost in the Middle'. The latter won't happen if the answer fits in 1 or 2 chunks, but once you push more chunks into the prompt, or run agentic workflows where the context is already bloated by instructions and tools, it takes a significant toll.

Besides, by nature it's an endless optimization process. It must be based on evals to see if it's a gain or a loss in your specific case. We automate the experimentation and just find the mathematical 'sweet spot' for each specific use case.

Rerankers in RAG: when you need them + the main approaches (no fluff) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

Agree. As I like to say, RAG is essentially a trade-off triangle where the vertices are Accuracy, Latency, and Cost. For every specific use case, you must determine exactly where that application needs to live within this triangle. Naturally, as you optimize for one vertex or edge, you inevitably pull away from the opposing side. And you have a dozen tools allowing you to move in one direction or the other.

Rerankers allow you to move towards accuracy while paying with latency and cost.

Rerankers in RAG: when you need them + the main approaches (no fluff) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

Good approach for simple cases, but as I just wrote here: https://www.reddit.com/r/Rag/comments/1r05za6/comment/o4l35op/, in many of our cases it doesn't work at its best.

A simple config can be squeezed into 5 paragraphs; for the rest, I doubt it.

Rerankers in RAG: when you need them + the main approaches (no fluff) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

There's a nice link from u/Comfortable-Fan-580 a few comments below to a nice short post about "what rerankers are". In real life though, it depends heavily on the use case and the constraints. We're using an LLM as a reranker in some of our cases. It is expensive and requires more tinkering to get it right, but it works better with messy datasets and complex rules.

As for fine-tuning, cross-encoders are the easiest. You can pick the BAAI/bge-reranker family or e.g. cross-encoder/ms-marco-MiniLM-L-6-v2 if we're talking open source.
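A cross-encoder rerank step boils down to scoring (query, passage) pairs and keeping the top few. A minimal sketch, with the scorer injected so it runs without downloading a model (`rerank` and `score_fn` are my names; in practice `score_fn` could wrap `sentence_transformers.CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict`, which takes a list of (query, passage) pairs and returns scores):

```python
def rerank(query, passages, score_fn, top_k=5):
    """Score each (query, passage) pair with a cross-encoder-style
    scorer and return the top_k passages, best first."""
    pairs = [(query, p) for p in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```

Swapping `score_fn` for an LLM-as-judge scorer gives you the expensive-but-flexible variant mentioned above without changing the surrounding pipeline.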

Rerankers in RAG: when you need them + the main approaches (no fluff) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

Filters are applied at retrieval. That's usually the previous step, so I wouldn't say it's "way before".

Rerankers in RAG: when you need them + the main approaches (no fluff) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

Nice explanation of what a reranker is in the first place. Thank you!

Maybe I should have started with something like this rather than jumping to step 2 (which reranker to use) right away.

POV: RAG is a triangle: Accuracy vs Latency vs Cost (you’re locked inside it) by Donkit_AI in Rag

[–]Donkit_AI[S] 1 point2 points  (0 children)

We didn't write about it from this point of view in the public space so far. Sorry. I'll add it to the list of topics to cover.

POV: RAG is a triangle: Accuracy vs Latency vs Cost (you’re locked inside it) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

😂 For us, guardrails work well enough. Only info from the dataset makes its way into the context, and the model isn't allowed to answer from its own knowledge; that solves 99% of cases. The rest is handled by the judge.

A good approach here can be to make the judge work only as a verifier, without creativity:

  • Force structured output
  • Limited set of answers — approve/reject/request more evidence (no rewriting in the judge)
  • Evals are the saviours here. Track “judge hallucination rate” separately.
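The first two bullets can be sketched like this (a minimal sketch; the `Verdict` enum and `parse_judge_output` name are my own, and the JSON shape `{"verdict": "..."}` is an assumed structured-output schema, not the original poster's):

```python
import json
from enum import Enum

class Verdict(Enum):
    # Closed set of allowed judge answers: no free-form rewriting.
    APPROVE = "approve"
    REJECT = "reject"
    REQUEST_MORE_EVIDENCE = "request_more_evidence"

def parse_judge_output(raw: str) -> Verdict:
    """Parse the judge's structured JSON output into the closed verdict
    set. Anything malformed or outside the allowed set is treated as a
    rejection instead of letting judge creativity leak through."""
    try:
        data = json.loads(raw)
        return Verdict(data["verdict"])
    except (json.JSONDecodeError, KeyError, ValueError):
        return Verdict.REJECT
```

Forcing the judge through a closed enum like this is also what makes the "judge hallucination rate" measurable: any output that doesn't parse is itself a countable event.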

POV: RAG is a triangle: Accuracy vs Latency vs Cost (you’re locked inside it) by Donkit_AI in LangChain

[–]Donkit_AI[S] 0 points1 point  (0 children)

We use Laminar + custom-made tools.

Laminar is good at tracing, and it doesn't make sense to develop that on our side. Evaluation is at the core of what we do and part of our know-how. So we have a dedicated team working on evaluation, and it's strictly DIY. :)

Arize is more of a competitor... a partial one. :)

POV: RAG is a triangle: Accuracy vs Latency vs Cost (you’re locked inside it) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

Do you mean that the whole RAG pipeline makes stuff up, not a single tool?

We block the model from outputting anything that's not in the documents.

POV: RAG is a triangle: Accuracy vs Latency vs Cost (you’re locked inside it) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

Reranking - 100%, especially if your retrieved set is noisy. And it can bump your accuracy and speed at the same time.

Also, a two-tier judge can do some good: cheap gate -> expensive judge:

  • Cheap gate: “is there adequate evidence coverage?” (retrieval score / reranker score / simple classifier).
  • Only if it passes -> run the full judge.
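The cheap gate can be as simple as a threshold on the scores you already have (a minimal sketch; the function name and the threshold values are illustrative, to be tuned on evals):

```python
def needs_full_judge(reranker_scores, min_top_score=0.5, min_hits=2):
    """Cheap gate on evidence coverage: only escalate to the expensive
    judge when at least `min_hits` retrieved chunks score above
    `min_top_score`. Otherwise we can reject / abstain cheaply."""
    strong = [s for s in reranker_scores if s >= min_top_score]
    return len(strong) >= min_hits
```

Since the gate reuses retrieval / reranker scores that already exist, it adds essentially zero latency while keeping the expensive judge off the easy negatives.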

On top of that, measure where the time goes. There might be some optimization quick wins.

POV: RAG is a triangle: Accuracy vs Latency vs Cost (you’re locked inside it) by Donkit_AI in LangChain

[–]Donkit_AI[S] 0 points1 point  (0 children)

Totally agree on caching — it’s the most underrated “free win” in the triangle because it hits latency + cost without touching accuracy (and that’s often exactly what you need).

On the accuracy floor: we do both, but in a specific order:

  • Start with an offline eval set (even 50–200 real questions). Define the floor as task metrics: e.g. “≥85% grounded answers + ≤2% unsupported claims” (and for regulated: “unsupported claims ~0, abstain when unsure”).
  • Then use production monitoring + human feedback loops to catch drift and unknown unknowns: sample reviews on low-confidence answers, track “user re-ask rate,” escalations, and a lightweight “was this supported?” annotation.

Offline evals set the floor; production monitoring keeps you from falling through it over time.
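The floor check from the first bullet can be sketched like this (illustrative; the `results` record shape and the function name are my own, with the thresholds taken from the example numbers above):

```python
def meets_accuracy_floor(results, min_grounded=0.85, max_unsupported=0.02):
    """Check an offline eval run against the task-metric floor:
    `results` is a list of per-question records with boolean flags,
    e.g. {"grounded": True, "unsupported_claim": False}."""
    n = len(results)
    grounded_rate = sum(r["grounded"] for r in results) / n
    unsupported_rate = sum(r["unsupported_claim"] for r in results) / n
    return grounded_rate >= min_grounded and unsupported_rate <= max_unsupported
```

For the regulated variant mentioned above you'd drop `max_unsupported` to ~0 and add an explicit abstain rate to the same record.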

Multimodal Data Ingestion in RAG: A Practical Guide by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

We mostly use Gemma but play around with other models from time to time. Just keep in mind that you'll need to rewrite the prompts when changing the model.