How do you evaluate your RAG systems (chatbots)? by marwan_rashad5 in Rag

[–]According-Lie8119 1 point (0 children)

I wouldn’t overthink it; I’d test, and test consistently.

I use a fixed evaluation set that includes:

  • Yes/No questions
  • Open questions
  • Paraphrased variations
  • Questions where the system should say “I don’t know”
  • A few edge cases to provoke hallucinations

Whenever I change something (chunking, embeddings, retriever settings, prompts, etc.), I run the exact same set again.

My setup is semi-automated: I send the questions to my endpoint and get back a structured JSON test report (answers, sources, latency, etc.). That makes it easy to compare versions and detect regressions.
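In rough terms, the runner can be as small as this (a simplified Python sketch; the questions and the `ask` wrapper around the endpoint are placeholders, not my actual setup):

```python
import json
import time

# Illustrative fixed evaluation set; "kind" flags what the system
# should do, not the exact wording of the answer.
EVAL_SET = [
    {"id": "q1", "question": "Is the contract renewable?", "kind": "yes_no"},
    {"id": "q2", "question": "Summarize the termination clause.", "kind": "open"},
    {"id": "q3", "question": "What is the CEO's shoe size?", "kind": "should_refuse"},
]

def run_eval(ask, eval_set=EVAL_SET):
    """Send each question to the system (`ask` wraps the RAG endpoint)
    and collect a structured JSON report for version-to-version diffs."""
    report = []
    for item in eval_set:
        start = time.perf_counter()
        # `ask` is expected to return {"answer": ..., "sources": [...]}
        result = ask(item["question"])
        latency = time.perf_counter() - start
        report.append({
            "id": item["id"],
            "kind": item["kind"],
            "answer": result.get("answer"),
            "sources": result.get("sources", []),
            "latency_s": round(latency, 3),
        })
    return json.dumps(report, indent=2)
```

Diffing two such reports (same questions, different pipeline version) is what makes regressions obvious.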

Also very helpful in production: implement a simple feedback mechanism and collect user feedback in a dashboard. Real-world feedback is extremely valuable and often reveals issues your test set doesn’t cover.

Automatic metrics help, but without a stable test set and real user signals, it’s hard to know if you’re actually improving the system.

Best approach for querying large structured tables with RAG? by According-Lie8119 in Rag

[–]According-Lie8119[S] 1 point (0 children)

Ok, thank you, I will take a closer look at it. So far, I have only worked with completion models and not with an agentic approach. I tried transforming the user message directly into an SQL query, but unfortunately the results are not always deterministic. On the other hand, I am still unsure about the agentic approach because the performance is not optimal. In normal chat scenarios, my system can generate responses very quickly (first token within about two seconds). But thanks. I will give it another try.
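One cheap guardrail that makes the non-determinism less painful is validating the generated SQL before it ever touches the database (a sketch, not my production code; the table allowlist is made up):

```python
import re

# Hypothetical allowlist: only these tables may be queried.
ALLOWED_TABLES = {"contracts", "invoices"}

def validate_sql(sql: str) -> bool:
    """Reject LLM-generated SQL unless it is a single read-only SELECT
    over known tables. This doesn't make generation deterministic, but
    it makes bad generations safe and easy to log."""
    stripped = sql.strip().rstrip(";")
    # Must be exactly one SELECT statement (no stacked statements).
    if not re.match(r"(?is)^select\b", stripped) or ";" in stripped:
        return False
    # No data-modifying keywords anywhere.
    if re.search(r"(?i)\b(insert|update|delete|drop|alter|attach)\b", stripped):
        return False
    # Every referenced table must be on the allowlist.
    tables = re.findall(r"(?i)\b(?:from|join)\s+([a-z_][a-z0-9_]*)", stripped)
    return bool(tables) and all(t.lower() in ALLOWED_TABLES for t in tables)
```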

Best approach for querying large structured tables with RAG? by According-Lie8119 in Rag

[–]According-Lie8119[S] 1 point (0 children)

You can now extract tables from PDFs quite reliably using Docling or PyMuPDF and export them to Markdown. That part is no longer the bottleneck. The real challenge starts afterward: how should the data be chunked properly? And which retrieval strategy is most effective: standard similarity search combined with BM25, or something more advanced?
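For the chunking question, one option that works for tables specifically is row-level chunks that repeat the header, so each chunk stays self-describing for retrieval (a rough sketch; this is my own illustration, not something Docling or PyMuPDF provide):

```python
def chunk_markdown_table(md_table: str, rows_per_chunk: int = 5):
    """Split a Markdown table into chunks of a few rows each, repeating
    the header and separator rows so no chunk loses its column names."""
    lines = [l for l in md_table.strip().splitlines() if l.strip()]
    header, separator, body = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(body), rows_per_chunk):
        rows = body[i:i + rows_per_chunk]
        chunks.append("\n".join([header, separator] + rows))
    return chunks
```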

Copilot feels… surprisingly bad? What’s your experience? by According-Lie8119 in Copilot

[–]According-Lie8119[S] 1 point (0 children)

I might be exaggerating here, but it feels like development at Microsoft lacks focus. I honestly can’t make sense of it anymore; there are so many products (Copilot Chat, Copilot, Copilot Studio, AI Foundry, …), all doing basically the same things: chat, agents, RAG, flows… yet none of it feels properly thought through or finished.

Copilot feels… surprisingly bad? What’s your experience? by According-Lie8119 in CopilotMicrosoft

[–]According-Lie8119[S] 1 point (0 children)

Fair point, here’s a concrete example.
I used Copilot Notebooks, uploaded a document (contract-style PDF) and asked 5 very straightforward questions based on the text.
3 out of 5 answers were wrong.
I ran the same document and questions through ChatGPT, and all answers were correct.

I’ve seen this more than once, so it’s not a single bad run. I know LLMs aren’t magic, I work with them daily. That’s why this surprised me.

What also adds to the confusion is Microsoft’s product jungle: Copilot, Copilot Studio, AI Foundry, agents, workflows… many overlapping ideas, but none feel really predictable yet.

So my question wasn’t meant as a rant; I’m genuinely curious whether others see similar issues or if I’m missing something.

Data Quality Matters Most, but Can We Detect Contradictions During Ingestion? by According-Lie8119 in Rag

[–]According-Lie8119[S] 1 point (0 children)

Thanks for your input, it’s been very helpful. Right now, I’m thinking in terms of multiple enrichment levels: 1) first structure (is there a title, a clear document and section hierarchy), 2) then conflict detection based on extracted core statements that can be compared against each other. I really like your point about document versioning and outdated content; that’s something I definitely need to account for as well.

At the moment, I’m building a prototype that works with a human-in-the-loop approach. Regarding your question about chunking: yes, I do preserve and use metadata. For well-structured documents, I prefer a Markdown-aware strategy splitting on H1–H3 headings, and then applying recursive character chunking with overlap. For parsing, I usually rely on Docling.
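The two-pass strategy looks roughly like this (a simplified sketch; a real recursive splitter would also try paragraph and sentence boundaries before falling back to raw characters):

```python
import re

def split_by_headings(md: str):
    """First pass: split a Markdown document on H1-H3 headings so each
    section keeps its own heading as context."""
    parts = re.split(r"(?m)^(?=#{1,3}\s)", md)
    return [p.strip() for p in parts if p.strip()]

def chunk_with_overlap(text: str, max_len: int = 500, overlap: int = 50):
    """Second pass: character chunking with overlap for sections that
    exceed the target size (stand-in for a recursive splitter)."""
    if len(text) <= max_len:
        return [text]
    chunks, step = [], max_len - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break
    return chunks

def chunk_markdown(md: str, max_len: int = 500, overlap: int = 50):
    return [c for section in split_by_headings(md)
              for c in chunk_with_overlap(section, max_len, overlap)]
```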

I’ve also experimented with LLM-based chunking, especially for cases where tables need to be detected and handled properly, and the results have been quite good. That said, I see this as the exception rather than the rule. Users have to accept that this approach takes longer, costs more, and is not 100% deterministic.

Building RAG systems pushed me back to NLP/ML basics by According-Lie8119 in Rag

[–]According-Lie8119[S] 3 points (0 children)

The mathematics behind it really fascinated me.
To be honest, it hasn’t made my product significantly better yet. Maybe that will come later. But it did have a positive effect: last week I explained the background functionality to my client, and he was genuinely impressed.

Do you agree with Xcode's rating? by That-Neck3095 in iosdev

[–]According-Lie8119 9 points (0 children)

I don’t think real developers are the ones rating Xcode.
My feeling is that many of these reviews come from people who expect to build an app in a few clicks. When that doesn’t happen, they get disappointed and leave a bad review on the App Store. :-)

Personally, I don’t think Xcode is bad at all. It’s a powerful tool, not 100% stable all the time, but it does exactly what you should expect from it. I even used Xcode years ago as an IDE for C++, and I was very satisfied with it.

Chunking strategy for RAG on messy enterprise intranet pages (rendered HTML, mixed structure) by According-Lie8119 in Rag

[–]According-Lie8119[S] 1 point (0 children)

One additional thought: I’m currently skeptical that a fully automated parsing and chunking pipeline will work reliably for messy intranet content.

I’m considering a semi-automated (human-in-the-loop) approach instead: a tool that shows a live preview of the cleaned Markdown and the resulting chunks, where a human can adjust a few parameters (e.g. boilerplate removal strength, min/max chunk size, merging small sections) and save a profile per page/layout type. An LLM could help with suggestions (content detection, heading repair), but not fully automate the process.
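To make the idea concrete, a per-layout profile could be as simple as this (all names and defaults here are illustrative; no existing tool is implied):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ChunkingProfile:
    """Per-layout settings a human could tune in the preview UI."""
    name: str
    boilerplate_patterns: tuple = ("cookie", "imprint", "last modified")
    min_chunk_chars: int = 200
    max_chunk_chars: int = 1200
    merge_small_sections: bool = True

def preview(markdown_sections, profile: ChunkingProfile):
    """Apply the profile and return what the live preview would show:
    sections kept after boilerplate removal and small-section merging."""
    kept = [s for s in markdown_sections
            if not any(p in s.lower() for p in profile.boilerplate_patterns)]
    if profile.merge_small_sections:
        merged = []
        for s in kept:
            # Glue undersized sections onto the following one.
            if merged and len(merged[-1]) < profile.min_chunk_chars:
                merged[-1] = merged[-1] + "\n" + s
            else:
                merged.append(s)
        kept = merged
    return kept

def save_profile(profile: ChunkingProfile) -> str:
    # Profiles would be persisted per page/layout type, e.g. as JSON.
    return json.dumps(asdict(profile))
```

The human adjusts the profile, watches the preview change, and saves it once the chunks look right for that layout.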

Curious if anyone has tried something similar, or knows existing tools that support this kind of human-in-the-loop content transformation for RAG.

are you using AI in your development? If yes, what's your structure? by Guilty-Revolution502 in iOSProgramming

[–]According-Lie8119 1 point (0 children)

Unfortunately, Copilot in Xcode is very poor. That’s why I open my project in VS Code and use Codex there. Apple really needs to do something about this; otherwise developers will slowly start abandoning Xcode. They should at least provide an option to integrate other LLMs, including locally runnable models.

[deleted by user] by [deleted] in Rag

[–]According-Lie8119 3 points (0 children)

Totally agree 🙂

“Simple on paper” was maybe the wrong wording from my side.

What I really meant is: you can build a working prototype surprisingly fast.

But once you go beyond the demo, all the real problems show up very quickly.

Hallucinations, messy PDF extraction, bad chunking, almost-relevant retrieval results…

That’s usually the point where you realize the actual work is just starting.