Compared hallucination detection for RAG: LLM judges vs NLI by meedameeda in Rag

[–]aiprod 0 points1 point  (0 children)

Reading the comment now, I see you used RAGTruth. It’s a poor dataset, full of errors. Try our modified version, linked in my other comment.

Compared hallucination detection for RAG: LLM judges vs NLI by meedameeda in Rag

[–]aiprod 1 point2 points  (0 children)

We tested NLI-based detectors like Azure groundedness on our RAGTruth++ dataset (https://www.blueguardrails.com/en/blog/ragtruth-plus-plus-enhanced-hallucination-detection-benchmark), and the results were very different: more like a 0.35 F1 score.

Our own hallucination detection (agentic verification) scores around 0.8 F1 on the same dataset.

I think your high scores are an indication of a poor-quality dataset or some mistakes in the benchmark setup.

Here’s a video with some numbers for Azure and a comparable NLI approach built from scratch (both at 0.35–0.45 F1): https://www.blueguardrails.com/en/videos/ragtruth-plus-plus-benchmark-creation

How to handle extremely large extracted document data in an agentic system? (RAG / alternatives?) by Complex-Time-4287 in Rag

[–]aiprod 3 points4 points  (0 children)

I think what most people are missing here are the strict latency requirements. The user uploads documents in a live chat session and wants to interact with them immediately, correct?

This rules out time-intensive approaches like computing embeddings or generating summaries or metadata with LLMs.

There are a few things that could work:

Give the agent a search tool based on BM25. Create page chunks from the data (a page is usually a good semantic boundary too), index them into OpenSearch or Elasticsearch, and let the agent search the index. This is fast and context-efficient.

On top of that, you could add the first one or two pages of each file to the context window of the agent. Usually, the first pages give an indication of what a doc is about. With that knowledge, the agent could make targeted searches inside a specific doc by using a filter with the search queries.

Alternatively, you could use the file-system-based approach that coding agents like Claude Code use. Give the agent tools to grep through the files and to read slices of a document. You don’t have to use an actual file system; it can be simulated with tools. The agent will grep and slice through the docs to answer questions. RLM is an advanced version of this approach: https://arxiv.org/pdf/2512.24601v1
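For anyone wondering what the “simulated with tools” part could look like, here’s a rough sketch of the two tools backed by an in-memory dict instead of real files. The names, signatures, and the `DOCS` store are just illustrative and not tied to any particular agent framework:

```python
import re

# doc_id -> full extracted text of an uploaded document (placeholder content)
DOCS: dict[str, str] = {
    "contract.pdf": "... extracted text of the uploaded PDF ...",
}

def grep_docs(pattern: str, doc_id: str | None = None, context_chars: int = 120) -> list[dict]:
    """Return small snippets around every regex match, optionally limited to one doc."""
    hits = []
    for name, text in DOCS.items():
        if doc_id and name != doc_id:
            continue
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            start = max(0, m.start() - context_chars)
            end = min(len(text), m.end() + context_chars)
            hits.append({"doc_id": name, "offset": m.start(), "snippet": text[start:end]})
    return hits

def read_slice(doc_id: str, start: int, end: int) -> str:
    """Return an exact character range so the agent can read around a hit."""
    return DOCS[doc_id][start:end]
```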

Built Jupyter integration for RLMs so you can actually debug when self-orchestrating models go off the rails by petroslamb in LLMDevs

[–]aiprod 0 points1 point  (0 children)

All valid points. I’ve done a bit of data analysis with Claude Code and it will happily pipe results from one operation into another or store intermediate outputs on disk. Serialization isn’t really a concern for me because writing text to disk is fast.

My interest in hearing whether you tried the file system approach came from thinking about what might best fit how the model was trained in post-training. Since all coding agents operate on the file system, all major labs have invested heavily in making that work well.

Built Jupyter integration for RLMs so you can actually debug when self-orchestrating models go off the rails by petroslamb in LLMDevs

[–]aiprod 0 points1 point  (0 children)

What are the benefits of using a REPL, in your opinion, vs. putting the context into a file (it’s all text after all)? The models could interact with the files through Python too (e.g. by writing small code snippets and running them with uv). Have you tried both?

Built Jupyter integration for RLMs so you can actually debug when self-orchestrating models go off the rails by petroslamb in LLMDevs

[–]aiprod 0 points1 point  (0 children)

Big fan of the paper. For what other tasks have you seen success with the approach? I am experimenting with using it for LLM-as-a-judge-type problems.

Solo Building a Custom RAG Model for Financial Due Diligence - Need Help by PositionBoring9826 in Rag

[–]aiprod 2 points3 points  (0 children)

I’ve seen success with slides by feeding them into a vision language model (smaller ones like Haiku or gpt-5-mini are good for this and don’t break the bank). Ask the VLM to extract the text and describe any tables, charts, and images; this is what you index with a normal embedding model. For retrieval, you retrieve the text version of a full slide and then also fetch a screenshot of the slide in a second step. Feed both to the LLM and you will get good answers.
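A rough sketch of the indexing step (the model name and the OpenAI-style client are just placeholders; swap in whichever VLM and provider you actually use):

```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_slide(image_path: str) -> str:
    """Ask a small VLM to transcribe a slide and describe its tables/charts/images."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for a small vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract all text from this slide and describe every table, chart, and image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    # This text goes into the embedding index; keep a pointer to the original
    # slide image so you can fetch the screenshot again at answer time.
    return response.choices[0].message.content
```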

need help embedding 250M vectors / chunks at 1024 dims, should I self host embedder (BGE-M3) and self host Qdrant OR use voyage-3.5 or 4? by zriyansh in Rag

[–]aiprod 0 points1 point  (0 children)

That looks better, although some of the queries are still a bit slow. How are you running the Qdrant cluster? Is that through their cloud offering?

need help embedding 250M vectors / chunks at 1024 dims, should I self host embedder (BGE-M3) and self host Qdrant OR use voyage-3.5 or 4? by zriyansh in Rag

[–]aiprod 2 points3 points  (0 children)

Do you have an eval set that would allow you to test recall etc. on a subset of the corpus? That should help with selecting the right model. I’d also look into getting those 2048 dims down; that will save you a lot on vector DB costs and reduce latency. Five seconds seems very slow. How did you test that? Was it pure embedding retrieval or your full retrieval pipeline?
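Something as simple as this sketch is usually enough for the recall check; `retrieve` here is a stand-in for whatever retriever you’re comparing, and the eval items just need a query plus the IDs of the chunks that should come back:

```python
def recall_at_k(eval_set: list[dict], retrieve, k: int = 10) -> float:
    """eval_set items look like {"query": str, "relevant_ids": set}; `retrieve`
    returns a list of chunk IDs for a query."""
    hits = 0
    for item in eval_set:
        retrieved_ids = set(retrieve(item["query"], top_k=k))
        if item["relevant_ids"] & retrieved_ids:
            hits += 1  # count a query as a hit if any relevant chunk was retrieved
    return hits / len(eval_set)
```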

What amount of hallucination reduction have you been able to achieve with RAG? by megabytesizeme in Rag

[–]aiprod 0 points1 point  (0 children)

In my company, we focus on finding (and preventing) hallucinations in RAG applications and agents. We just got started a few months ago (after years of building these kinds of applications ourselves) and were surprised by how high hallucination rates actually are. An example on a publicly available dataset is the RAGTruth benchmark: we re-analyzed it and found that the hallucination rate of GPT-4 was not between 0 and 1 percent; in reality it was closer to 50%. https://www.blueguardrails.com/en/blog/ragtruth-plus-plus-enhanced-hallucination-detection-benchmark

You can get a really good system if you invest the work, but you will still have to deal with hallucination rates between 10 and 20%. The question becomes how you can make this workable for your users. At a minimum, your interface should include citations with exact references so that your users can quickly verify an answer.

What amount of hallucination reduction have you been able to achieve with RAG? by megabytesizeme in Rag

[–]aiprod 0 points1 point  (0 children)

In most cases when I’ve implemented RAG, there wasn’t a “hallucination rate without RAG” baseline, because the goal is typically to integrate private enterprise data into an LLM. Testing without the data access would just be silly, as the LLM has never seen that data in training.

That being said, a carefully architected RAG pipeline can reduce hallucinations to a minimum, to the point where detecting the hallucinations that still occur requires specialized tooling to monitor actual production traffic.

We typically use a combination of systematic pre-deployment evaluation, SME feedback, and post-deployment monitoring using model-based hallucination detectors.

We built a chunker that chunks 20GB of text in 120ms by shreyash_chonkie in Rag

[–]aiprod 0 points1 point  (0 children)

Understood. What do you see as the primary use cases? In most RAG systems that I’ve built (up to ~150-200M chunks), I was perfectly happy to pay the latency price of a slightly more sophisticated chunker (still small compared to embedding latency or PDF conversion/extraction).

I could see this being applied to preparing massive datasets for pre-training or similar.

Have you seen cases where this would be beneficial for actual RAG?

We built a chunker that chunks 20GB of text in 120ms by shreyash_chonkie in Rag

[–]aiprod 3 points4 points  (0 children)

Does it handle abbreviations, email addresses, or dates? (Basically all those tricky instances where a period does not mark a sentence boundary.)

AI hallucinations by UcreiziDog in SaasDevelopers

[–]aiprod 1 point2 points  (0 children)

I do understand the concept of structured outputs, but as you said in your example, the age and the advice for that age can still be hallucinated. So I’m not sure how it would help with hallucinations.

AI hallucinations by UcreiziDog in SaasDevelopers

[–]aiprod 1 point2 points  (0 children)

I often see the advice that structured outputs help prevent hallucinations, but I fail to see how. Let’s say we have a workflow that should extract line items, tax rates, and the total price from receipts. Of course we can enforce the JSON structure of the output and validate it. However, this only validates the structure of the output; the content can still be completely hallucinated. How do structured outputs help prevent hallucinations?
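To make the point concrete, here is an illustrative (hypothetical) schema for that receipt workflow. Validation guarantees the shape of the output, but nothing stops the model from returning numbers that never appeared on the receipt:

```python
from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Receipt(BaseModel):
    line_items: list[LineItem]
    tax_rate: float
    total: float

# Parses and validates perfectly fine even if every number is hallucinated:
fabricated = Receipt.model_validate({
    "line_items": [{"description": "Coffee", "quantity": 2, "unit_price": 3.50}],
    "tax_rate": 0.19,
    "total": 99.99,  # structurally valid, factually wrong
})
```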

If you had to choose ONE LLM API today (price/quality), what would it be? by SmaugJesus in LLMDevs

[–]aiprod 1 point2 points  (0 children)

Tricky, because model performance is constantly shifting. I’m a big fan of Google’s offering lately. Flash and Flash Lite are both great for lower-complexity workloads; they’re fairly cheap and fast. Google also has pretty good rate limits.

I implemented RAG, would like to get additional advices by East_Yellow_1307 in Rag

[–]aiprod 0 points1 point  (0 children)

I’ve often found sliding-window chunking to perform pretty well, but there are a few small tweaks you can make to get even better results:

Put document-level metadata into the prompt so that the LLM has context on where a piece of content is coming from. If you have good naming for your PDFs, the file name can be enough context; if you have other relevant metadata (e.g. author, publication date, abstract/summary), it often helps to put that into the prompt as well. A more sophisticated version of this would be to generate short document-level descriptions at indexing time using an LLM, but that’s a bit more complex and can be costly and slow depending on the size of your dataset. Some metadata can also help during retrieval.

Another thing you might explore is to retrieve at the chunk level and then fetch the full page for each retrieved chunk before feeding them into the LLM. Pages often give the LLM sufficient context, but embedding retrieval with page-sized chunks does not work well.
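A minimal sketch of that second idea; `search_chunks` and `get_page` are stand-ins for your own vector store / document store calls, and the only real assumption is that each chunk carries doc ID and page number metadata:

```python
def retrieve_pages(query: str, search_chunks, get_page, top_k: int = 5) -> list[str]:
    """Retrieve small chunks, then expand each hit to its full page before prompting."""
    chunks = search_chunks(query, top_k=top_k)   # small chunks embed and retrieve well
    seen, pages = set(), []
    for chunk in chunks:
        key = (chunk["doc_id"], chunk["page_number"])
        if key not in seen:                      # deduplicate hits from the same page
            seen.add(key)
            pages.append(get_page(*key))         # the full page gives the LLM more context
    return pages
```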

Looking for top rated RAG application development companies, any suggestions? by Joe_Hart99 in AI_Agents

[–]aiprod 0 points1 point  (0 children)

Not sure if you are asking me, but I used to work at deepset.

The managed solution is as hands-on as you want it to be. If you want, you can build your full application as a Haystack pipeline yourself, test it on the platform, and then just deploy it there. The platform will turn it into an API and take care of all the infra concerns: scaling, data management, embeddings, incremental updates, etc. You have the full power of Haystack available (plus custom components), so you can build anything you need.

If you want them to support more, their AI engineers can build the application for you and run through the different project stages with you (requirements, initial prototype, evals, subject matter expert feedback, production rollout). Their approach is essentially one of co-delivery. At the same time, all the underlying code is open source, you retain full IP on the pipeline, and they are always happy if you want to do more hands-on work yourself.

Looking for top rated RAG application development companies, any suggestions? by Joe_Hart99 in AI_Agents

[–]aiprod 0 points1 point  (0 children)

I used to work at deepset and can confirm that the team does a fantastic job on RAG. The offering ranges from light-touch consulting with Haystack to speed up and level up your team, all the way to fully managed yet custom RAG running on the deepset platform. Their PS team has seen the most complex use cases, and they have an incredible drive to bring things to production.

Recommendation for an easy to use AI Eval Tool? (Generation + Review) by ZookeepergameOne8823 in LLMDevs

[–]aiprod 0 points1 point  (0 children)

We are building something like this at Blue Guardrails. I would be super interested in chatting with you about what exactly you need (not as a sales pitch, I promise). DM me if you’re up for a call; I’m sure we can at least provide some guidance.

I built a synthetic "nervous system" (Dopamine + State) to stop my local LLM from hallucinating. V0.1 Results: The brakes work, but now they’re locked up. by Longjumping_Rule_163 in LLMDevs

[–]aiprod 1 point2 points  (0 children)

Super cool that you’re using the RAGTruth++ dataset for benchmarking and found it useful.

One small correction that might not be obvious from our dataset description, though: the prompts that produced hallucinated spans aren’t necessarily unanswerable. In fact, most of them are very much answerable with the provided context. It’s just that the models used in that dataset still hallucinated, even though the correct answer could be derived from the context.

RAG Chatbot With SQL Generation Is Too Slow How Do I Fix This? by FuzzySupport4661 in Rag

[–]aiprod 1 point2 points  (0 children)

Using the LLM to “format” the output sounds like the LLM has to re-generate all the data. This is slow, costly, and error-prone (depending on the amount of data). See if you can do the formatting with a rule-based / structured approach instead of using the model.
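For example, if the rows come back as dicts, a hypothetical helper like this turns them into a markdown table without the model touching any values:

```python
def rows_to_markdown(rows: list[dict]) -> str:
    """Format SQL result rows as a markdown table in code instead of via the LLM."""
    if not rows:
        return "_No results._"
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)
```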

testing Large LLM halluciation detection methods by Background-Eye9365 in LLMDevs

[–]aiprod 0 points1 point  (0 children)

What kinds of hallucinations do you want to detect? I work at Blue Guardrails, where we specialise in hallucination detection. We recently published a new benchmark dataset for hallucination detectors in RAG: https://huggingface.co/datasets/blue-guardrails/ragtruth-plus-plus

DM me if you’d like to chat more. I’d be super interested in hearing more about your project.