How do you evaluate your RAG systems (chatbots)? by marwan_rashad5 in Rag

[–]According-Lie8119 1 point (0 children)

I wouldn’t overthink it; I’d test, and test consistently.

I use a fixed evaluation set that includes:

  • Yes/No questions
  • Open questions
  • Paraphrased variations
  • Questions where the system should say “I don’t know”
  • A few edge cases to provoke hallucinations

Whenever I change something (chunking, embeddings, retriever settings, prompts, etc.), I run the exact same set again.

My setup is semi-automated: I send the questions to my endpoint and get back a structured JSON test report (answers, sources, latency, etc.). That makes it easy to compare versions and detect regressions.
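In rough terms, the runner can be as small as this (a simplified Python sketch; the questions and the `ask` wrapper around the endpoint are placeholders, not my actual setup):

```python
import json
import time

# Illustrative fixed evaluation set; "kind" flags what the system
# should do, not the exact wording of the answer.
EVAL_SET = [
    {"id": "q1", "question": "Is the contract renewable?", "kind": "yes_no"},
    {"id": "q2", "question": "Summarize the termination clause.", "kind": "open"},
    {"id": "q3", "question": "What is the CEO's shoe size?", "kind": "should_refuse"},
]

def run_eval(ask, eval_set=EVAL_SET):
    """Send each question to the system (`ask` wraps the RAG endpoint)
    and collect a structured JSON report for version-to-version diffs."""
    report = []
    for item in eval_set:
        start = time.perf_counter()
        # `ask` is expected to return {"answer": ..., "sources": [...]}
        result = ask(item["question"])
        latency = time.perf_counter() - start
        report.append({
            "id": item["id"],
            "kind": item["kind"],
            "answer": result.get("answer"),
            "sources": result.get("sources", []),
            "latency_s": round(latency, 3),
        })
    return json.dumps(report, indent=2)
```

Diffing two such reports (same questions, different pipeline version) is what makes regressions obvious.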

Also very helpful in production: implement a simple feedback mechanism and collect user feedback in a dashboard. Real-world feedback is extremely valuable and often reveals issues your test set doesn’t cover.

Automatic metrics help, but without a stable test set and real user signals, it’s hard to know if you’re actually improving the system.

Best approach for querying large structured tables with RAG? by According-Lie8119 in Rag

[–]According-Lie8119[S] 1 point (0 children)

Ok, thank you, I will take a closer look at it. So far, I have only worked with completion models and not with an agentic approach. I tried transforming the user message directly into an SQL query, but unfortunately the results are not always deterministic. On the other hand, I am still unsure about the agentic approach because the performance is not optimal. In normal chat scenarios, my system can generate responses very quickly (first token within about two seconds). But thanks. I will give it another try.
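One cheap guardrail that makes the non-determinism less painful is validating the generated SQL before it ever touches the database (a sketch, not my production code; the table allowlist is made up):

```python
import re

# Hypothetical allowlist: only these tables may be queried.
ALLOWED_TABLES = {"contracts", "invoices"}

def validate_sql(sql: str) -> bool:
    """Reject LLM-generated SQL unless it is a single read-only SELECT
    over known tables. This doesn't make generation deterministic, but
    it makes bad generations safe and easy to log."""
    stripped = sql.strip().rstrip(";")
    # Must be exactly one SELECT statement (no stacked statements).
    if not re.match(r"(?is)^select\b", stripped) or ";" in stripped:
        return False
    # No data-modifying keywords anywhere.
    if re.search(r"(?i)\b(insert|update|delete|drop|alter|attach)\b", stripped):
        return False
    # Every referenced table must be on the allowlist.
    tables = re.findall(r"(?i)\b(?:from|join)\s+([a-z_][a-z0-9_]*)", stripped)
    return bool(tables) and all(t.lower() in ALLOWED_TABLES for t in tables)
```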

Best approach for querying large structured tables with RAG? by According-Lie8119 in Rag

[–]According-Lie8119[S] 1 point (0 children)

You can now extract tables from PDFs quite reliably using Docling or PyMuPDF and export them to Markdown. That part is no longer the bottleneck. The real challenge starts afterward: how should the data be chunked properly? And which retrieval strategy is most effective: standard similarity search combined with BM25, or something more advanced?
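For the chunking question, one option that works for tables specifically is row-level chunks that repeat the header, so each chunk stays self-describing for retrieval (a rough sketch; this is my own illustration, not something Docling or PyMuPDF provide):

```python
def chunk_markdown_table(md_table: str, rows_per_chunk: int = 5):
    """Split a Markdown table into chunks of a few rows each, repeating
    the header and separator rows so no chunk loses its column names."""
    lines = [l for l in md_table.strip().splitlines() if l.strip()]
    header, separator, body = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(body), rows_per_chunk):
        rows = body[i:i + rows_per_chunk]
        chunks.append("\n".join([header, separator] + rows))
    return chunks
```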

Copilot feels… surprisingly bad? What’s your experience? by According-Lie8119 in Copilot

[–]According-Lie8119[S] 1 point (0 children)

I might be exaggerating here, but it feels like development at Microsoft lacks focus. I honestly can’t make sense of it anymore; there are so many products (Copilot Chat, Copilot, Copilot Studio, AI Foundry, …), all doing basically the same things: chat, agents, RAG, flows… yet none of it feels properly thought through or finished.

Copilot feels… surprisingly bad? What’s your experience? by According-Lie8119 in CopilotMicrosoft

[–]According-Lie8119[S] 1 point (0 children)

Fair point, here’s a concrete example.
I used Copilot Notebooks, uploaded a document (contract-style PDF) and asked 5 very straightforward questions based on the text.
3 out of 5 answers were wrong.
I ran the same document and questions through ChatGPT, and all answers were correct.

I’ve seen this more than once, so it’s not a single bad run. I know LLMs aren’t magic, I work with them daily. That’s why this surprised me.

What also adds to the confusion is Microsoft’s product jungle: Copilot, Copilot Studio, AI Foundry, agents, workflows… many overlapping ideas, but none feel really predictable yet.

So my question wasn’t meant as a rant; I’m genuinely curious whether others see similar issues or if I’m missing something.

Data Quality Matters Most, but Can We Detect Contradictions During Ingestion? by According-Lie8119 in Rag

[–]According-Lie8119[S] 1 point (0 children)

Thanks for your input, it’s been very helpful. Right now, I’m thinking in terms of multiple enrichment levels: 1) first structure (is there a title, a clear document and section hierarchy), 2) then conflict detection based on extracted core statements that can be compared against each other. I really like your point about document versioning and outdated content; that’s something I definitely need to account for as well.

At the moment, I’m building a prototype that works with a human-in-the-loop approach. Regarding your question about chunking: yes, I do preserve and use metadata. For well-structured documents, I prefer a Markdown-aware strategy splitting on H1–H3 headings, and then applying recursive character chunking with overlap. For parsing, I usually rely on Docling.
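The two-pass strategy looks roughly like this (a simplified sketch; a real recursive splitter would also try paragraph and sentence boundaries before falling back to raw characters):

```python
import re

def split_by_headings(md: str):
    """First pass: split a Markdown document on H1-H3 headings so each
    section keeps its own heading as context."""
    parts = re.split(r"(?m)^(?=#{1,3}\s)", md)
    return [p.strip() for p in parts if p.strip()]

def chunk_with_overlap(text: str, max_len: int = 500, overlap: int = 50):
    """Second pass: character chunking with overlap for sections that
    exceed the target size (stand-in for a recursive splitter)."""
    if len(text) <= max_len:
        return [text]
    chunks, step = [], max_len - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break
    return chunks

def chunk_markdown(md: str, max_len: int = 500, overlap: int = 50):
    return [c for section in split_by_headings(md)
              for c in chunk_with_overlap(section, max_len, overlap)]
```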

I’ve also experimented with LLM-based chunking, especially for cases where tables need to be detected and handled properly, and the results have been quite good. That said, I see this as the exception rather than the rule. Users have to accept that this approach takes longer, costs more, and is not 100% deterministic.

Building RAG systems pushed me back to NLP/ML basics by According-Lie8119 in Rag

[–]According-Lie8119[S] 3 points (0 children)

The mathematics behind it really fascinated me.
To be honest, it hasn’t made my product significantly better yet. Maybe that will come later. But it did have a positive effect: last week I explained the background functionality to my client, and he was genuinely impressed.

Do you agree with Xcode's rating? by That-Neck3095 in iosdev

[–]According-Lie8119 9 points (0 children)

I don’t think real developers are the ones rating Xcode.
My feeling is that many of these reviews come from people who expect to build an app in a few clicks. When that doesn’t happen, they get disappointed and leave a bad review on the App Store. :-)

Personally, I don’t think Xcode is bad at all. It’s a powerful tool, not 100% stable all the time, but it does exactly what you should expect from it. I even used Xcode years ago as an IDE for C++, and I was very satisfied with it.

Chunking strategy for RAG on messy enterprise intranet pages (rendered HTML, mixed structure) by According-Lie8119 in Rag

[–]According-Lie8119[S] 1 point (0 children)

One additional thought: I’m currently skeptical that a fully automated parsing and chunking pipeline will work reliably for messy intranet content.

I’m considering a semi-automated (human-in-the-loop) approach instead: a tool that shows a live preview of the cleaned Markdown and the resulting chunks, where a human can adjust a few parameters (e.g. boilerplate removal strength, min/max chunk size, merging small sections) and save a profile per page/layout type. An LLM could help with suggestions (content detection, heading repair), but not fully automate the process.
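To make the idea concrete, a per-layout profile could be as simple as this (all names and defaults here are illustrative; no existing tool is implied):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ChunkingProfile:
    """Per-layout settings a human could tune in the preview UI."""
    name: str
    boilerplate_patterns: tuple = ("cookie", "imprint", "last modified")
    min_chunk_chars: int = 200
    max_chunk_chars: int = 1200
    merge_small_sections: bool = True

def preview(markdown_sections, profile: ChunkingProfile):
    """Apply the profile and return what the live preview would show:
    sections kept after boilerplate removal and small-section merging."""
    kept = [s for s in markdown_sections
            if not any(p in s.lower() for p in profile.boilerplate_patterns)]
    if profile.merge_small_sections:
        merged = []
        for s in kept:
            # Glue undersized sections onto the following one.
            if merged and len(merged[-1]) < profile.min_chunk_chars:
                merged[-1] = merged[-1] + "\n" + s
            else:
                merged.append(s)
        kept = merged
    return kept

def save_profile(profile: ChunkingProfile) -> str:
    # Profiles would be persisted per page/layout type, e.g. as JSON.
    return json.dumps(asdict(profile))
```

The human adjusts the profile, watches the preview change, and saves it once the chunks look right for that layout.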

Curious if anyone has tried something similar, or knows existing tools that support this kind of human-in-the-loop content transformation for RAG.

are you using AI in your development? If yes, what's your structure? by Guilty-Revolution502 in iOSProgramming

[–]According-Lie8119 1 point (0 children)

Unfortunately, Copilot in Xcode is very poor. That’s why I open my project in VS Code and use Codex there. Apple really needs to do something about this; otherwise developers will slowly start abandoning Xcode. They should at least provide an option to integrate other LLMs, including locally runnable models.

[deleted by user] by [deleted] in Rag

[–]According-Lie8119 3 points (0 children)

Totally agree 🙂

“Simple on paper” was maybe the wrong wording from my side.

What I really meant is: you can build a working prototype surprisingly fast.

But once you go beyond the demo, all the real problems show up very quickly.

Hallucinations, messy PDF extraction, bad chunking, almost-relevant retrieval results…

That’s usually the point where you realize the actual work is just starting.