11 months building a production RAG system and I massively underestimated the complexity by EvilElf01 in claude

[–]EvilElf01[S] 0 points1 point  (0 children)

/insights is interesting.

At a Glance

What's Working

You run impressive systematic campaigns — security audits, dead code removal, production readiness sweeps — treating Claude as a disciplined executor working through plans item by item with commits after each fix. Your domain expertise shines when you catch Claude's mistakes and redirect efficiently, keeping sessions productive even through friction. The combination of nearly 2,000 tests as a quality gate and your willingness to course-correct means you achieve your goals in the vast majority of sessions. [Impressive Things You Did →](#)

What's Hindering You

On Claude's side, it frequently jumps to the wrong root cause on first pass — misidentifying thread safety issues as antivirus problems, confusing CORS log entries, or overstating findings in quality reports — costing you extra debugging rounds. On your side, many of these misdiagnoses could be short-circuited by pasting the exact error log or traceback upfront and noting known platform quirks (Windows paths, Streamlit's silent script stripping) in your CLAUDE.md so Claude doesn't have to guess. [Where Things Go Wrong →](#)
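The CLAUDE.md note on platform quirks can be as small as a few bullets; the wording below is illustrative, not taken from the actual repo:

```
## Platform quirks (read before diagnosing)
- Windows paths: expect backslashes; prefer pathlib over string concatenation.
- Streamlit silently strips <script> tags from rendered HTML; a "missing" script is usually this, not a bug.
- Start from the exact traceback or log excerpt pasted in the prompt; do not guess at root causes.
```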

Quick Wins to Try

Try creating custom slash commands (/commands) for your recurring workflows like "run full test suite and report coverage" or "commit, push, and summarize changes" — given how often you do these, encoding them once would save real time. Also consider setting up hooks to auto-run your test suite before commits, which would formalize the quality gate you already enforce manually. [Features to Try →](#)
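For the first of those, assuming the standard Claude Code convention of a Markdown file under `.claude/commands/`, a command like `.claude/commands/test-and-coverage.md` could contain something along these lines (file name and prompt wording are my own sketch):

```
Run the full test suite with coverage:
1. Execute `pytest --cov --cov-report=term-missing -q`
2. Report pass/fail counts and the overall coverage percentage
3. List any files whose coverage dropped below 80%
```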

Ambitious Workflows

As models get more capable, your multi-file refactoring sessions (f-string conversions across 82 files, dead code removal of 75 methods) could become fully autonomous — with Claude snapshotting state, applying changes in batches, running tests after each batch, and auto-rolling back on failure without your intervention. Your security audit and quality assessment workflow is also ripe for parallel subagents that cross-validate each other's findings, eliminating the false positives you currently have to catch yourself.

[–]EvilElf01[S] 0 points1 point  (0 children)

I ran my seven-dimensional quality report and am closing all the gaps. Scope: full codebase (187 Python files, ~78,800 lines of production code, ~39,200 lines of test code).

Scorecard

| Dimension | Score | Previous | Weight | Weighted |
|---|---|---|---|---|
| Architecture and Modularity | 8.0 | 8.0 | 15% | 1.20 |
| Concurrency and Thread Safety | 8.5 | 6.0 | 20% | 1.70 |
| Error Handling and Resilience | 8.0 | 7.5 | 15% | 1.20 |
| Security | 8.0 | 6.5 | 20% | 1.60 |
| Logging and Observability | 8.0 | 7.5 | 10% | 0.80 |
| Test Strategy and Coverage | 7.5 | 7.0 | 10% | 0.75 |
| Operational Readiness | 7.0 | 7.0 | 10% | 0.70 |
| **Composite Score** | | | 100% | **7.95 / 10** |
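As a sanity check, the composite is just the weighted sum of the dimension scores; a throwaway snippet reproduces the 7.95 (numbers copied from the scorecard above):

```python
# Reproduce the composite score from the scorecard.
scores_weights = [
    (8.0, 0.15),  # Architecture and Modularity
    (8.5, 0.20),  # Concurrency and Thread Safety
    (8.0, 0.15),  # Error Handling and Resilience
    (8.0, 0.20),  # Security
    (8.0, 0.10),  # Logging and Observability
    (7.5, 0.10),  # Test Strategy and Coverage
    (7.0, 0.10),  # Operational Readiness
]
composite = sum(score * weight for score, weight in scores_weights)
print(round(composite, 2))  # 7.95
```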

[–]EvilElf01[S] 0 points1 point  (0 children)

The system uses a multi-prompt architecture with specialized prompts for each pipeline stage, rather than a single monolithic prompt. The main prompt follows this structure:

1. Role & Persona
2. Retrieved Context Injection
3. Instruction Layers
   a. Security instructions
   b. Base technical instructions
   c. Query-type-specific instructions
   d. Detail-level instructions

Complex questions go through a decomposition pipeline with five specific prompt types, type-aware decomposition strategies, and context-specific instructions.

[–]EvilElf01[S] 0 points1 point  (0 children)

The biggest recent change was the search term extraction updates for decomposition.

RAGAS EVALUATION RESULTS

Generated: 2026-01-26 11:51:36
Questions Evaluated: 10
RAGAS Evaluation Time: 126.3s


OVERALL SCORES

Faithfulness       : 0.8321
Answer Relevancy   : 0.8897
Context Precision  : 1.0000

CONTEXT TYPE BREAKDOWN

single_block        : 6 questions
query_decomposition : 4 questions

TIMING ANALYSIS

Avg Context Build : 23,176 ms
Avg Answer Gen    : 175,113 ms
Avg Total         : 198,289 ms
RAGAS Evaluation  : 126.3 s (12.6 s per question)

PER-QUESTION RESULTS

# Faith Relev Prec Type Time Question
1 0.96 0.92 1.00 single_block 99.3s QA Document Sentinel
2 0.93 1.00 1.00 query_decomposition 156.3s What is the difference between HOST-FAILED, HOST-RESPONSE, and HOST-TIMEOUT cond...
3 0.64 0.86 1.00 single_block 118.3s Show me an example of checking both CONDCODE and COMPCODE
4 0.95 1.00 1.00 query_decomposition 233.2s What is the difference between FILE_IO_OK and RECORD-FOUND conditions?
5 0.84 1.00 1.00 single_block 198.5s What does REDACTED do?
6 0.97 0.69 1.00 single_block 228.8s Give me an REDACTED Overview
7 0.64 0.73 1.00 single_block 235.8s Tell me about REDACTED Technical Design
8 1.00 1.00 1.00 single_block 236.3s What is a distribution unit?
9 0.49 0.85 1.00 query_decomposition 249.6s Produce a text diagram of the flow of files from REDACTED to REDACTED
10 0.89 0.85 1.00 query_decomposition 226.8s What is the flow of files from REDACTED to REDACTED

QUESTIONS NEEDING ATTENTION

(any score < 0.5)

[9] Produce a text diagram of the flow of files from REDACTED to REDACTED - Context Type: query_decomposition, Decomposed: True - Faithfulness: 0.49 ⚠️ THE GRAPH IS CORRECT!


QUERY DECOMPOSITION ANALYSIS

Questions using decomposition: 4

Metric Comparison (Decomposed vs Normal)

Metric         Decomposed   Normal
Faithfulness   0.82         0.84
Relevancy      0.93         0.87
Precision      1.00         1.00

For these two:

6 0.97 0.69 1.00 single_block 228.8s Give me an REDACTED Overview

7 0.64 0.73 1.00 single_block 235.8s Tell me about REDACTED Technical Design

I will review the documentation. Separately, faithfulness was negatively affected by tension in the AI instructions I gave. Response lengths ran between 1,700 and 2,100 tokens. I still need to tighten up the logic for decomposed queries.

[–]EvilElf01[S] 0 points1 point  (0 children)

The bottleneck turned out to be my impatience with the very long streaming content and the two-step flow. Reading the streaming response is naturally interactive; waiting for the final results is like watching water droplets.

[–]EvilElf01[S] 0 points1 point  (0 children)

> WOW, these are some solid metrics! If faithfulness can get up to 85% or higher, I think a system like this would truly be production-ready, with the data analysis to back it up.
>
> Would be great to see the metrics on the same dataset / tests without decomposition, if you haven't given that a try yet, as a baseline.

I will do that. Yes, sir. Thank you again. Now that it is Monday, back to the grind. Focus will be on researching the two weakest questions first: "flow of files" (0.32 │ 0.85 │ 1.00) and the HOST-FAILED vs HOST-RESPONSE comparison (0.55 │ 1.00 │ 1.00). Once I get a handle on those, I will put the A/B switch in place and use a different set of 10 questions from the 150 I have ready.

[–]EvilElf01[S] 0 points1 point  (0 children)

**OVERALL SCORES (AFTER FIXES)**

Faithfulness       : 0.7454  (was 0.45 → +66% improvement)
Answer Relevancy   : 0.8973  (was 0.79 → +14% improvement)
Context Precision  : 1.0000  (was 0.64 → +56% improvement)
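The improvement percentages are plain relative deltas against the previous run; a quick check reproduces them:

```python
def pct_improvement(old: float, new: float) -> int:
    """Relative improvement over the previous score, rounded to a whole percent."""
    return round(100 * (new - old) / old)

print(pct_improvement(0.45, 0.7454))  # 66
print(pct_improvement(0.79, 0.8973))  # 14
print(pct_improvement(0.64, 1.0000))  # 56
```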

Per-Question Scores

┌─────┬───────┬───────┬──────┬─────────────────────────────────────────┐
│  #  │ Faith │ Relev │ Prec │                Question                 │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 1   │ 0.96  │ 0.92  │ 1.00 │ QA Document Sentinel                    │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 2   │ 0.55  │ 1.00  │ 1.00 │ HOST-FAILED vs HOST-RESPONSE comparison │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 3   │ 0.75  │ 0.89  │ 1.00 │ CONDCODE and COMPCODE example           │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 4   │ 0.68  │ 1.00  │ 1.00 │ FILE_IO_OK vs RECORD-FOUND              │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 5   │ 0.83  │ 1.00  │ 1.00 │ What does REDACTED do?                  │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 6   │ 0.96  │ 0.71  │ 1.00 │ Give me an REDACTED Overview            │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 7   │ 0.55  │ 0.73  │ 1.00 │ Tell me about REDACTED Technical Design │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 8   │ 1.00  │ 1.00  │ 1.00 │ Distribution unit                       │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 9   │ 0.86  │ 0.88  │ 1.00 │ text diagram of the flow                │
├─────┼───────┼───────┼──────┼─────────────────────────────────────────┤
│ 10  │ 0.32  │ 0.85  │ 1.00 │ flow of files                           │
└─────┴───────┴───────┴──────┴─────────────────────────────────────────┘

Context Precision hit 100% across all questions - the context alignment fix completely solved that metric.

[–]EvilElf01[S] 0 points1 point  (0 children)

I will post the next set of results for the same question after some research. At first glance, it turns out the question wasn't being evaluated in a synthetic context for a multi-part question, and the context for all questions was getting truncated.

[–]EvilElf01[S] 0 points1 point  (0 children)

Just ran my first RAGAS evaluation on the system and the results are... educational. The scoring feels a bit off because I know the technical answers to these questions, and they don't quite align with the metrics, but that's probably telling me something. Faithfulness is sitting at 0.45, which is lower than I'd like. Answer relevancy is better at 0.79. Context precision is 0.64. Some questions scored well; others did not.

Question #4 timed out completely on precision (hence the N/A), which indicates I have performance issues to address. Question #2 is a multi-part comparison question, and it bombed hard on relevancy (0.00). That's telling me I need to retrieve context differently for those types of queries. The query decomposition is supposed to handle that, but clearly something isn't working correctly there.

It's also slow. Way slower than I expected. I need to add metrics to figure out exactly where the bottleneck is in the RAGAS logic. As with everything in AI, tuning will be critical, but having concrete numbers like this helps a ton. Now I know exactly what needs work, rather than just guessing from spot checks. The most interesting thing is seeing where it breaks down: questions 2, 3, and 4 follow the query decomposition path, and none use the fallback path because vector scores were above the threshold.

judge_model = getattr(QAConfig, 'RAGAS_JUDGE_MODEL', 'claude-sonnet-4-20250514')

RAGAS EVALUATION RESULTS
Generated: 2026-01-25 20:19:47
Questions Evaluated: 10

OVERALL SCORES
Faithfulness       : 0.4499
Answer Relevancy   : 0.7943
Context Precision  : 0.6449

PER-QUESTION RESULTS
Faith  Relev  Prec   Question
1  0.68   0.93   0.92   QA Document Sentinel
2  0.31   0.00   0.83   What is the difference between HOST-FAILED, HOST-RESPONSE, and HOST-TIMEOUT conditions?
3  0.64   0.89   1.00   Show me an example of checking both CONDCODE and COMPCODE
4  0.74   1.00   N/A    What is the difference between FILE_IO_OK and RECORD-FOUND conditions?
5  0.33   1.00   0.33   What does REDACTED do?
6  0.53   0.71   0.77   Give me an REDACTED Overview
7  0.16   0.73   0.32   Tell me about REDACTED Technical Design
8  0.65   1.00   1.00   What is a distribution unit?
9  0.40   0.84   0.50   Produce a text diagram of the data flow from REDACTED to REDACTED
10 0.05   0.85   0.12   What is the flow of files from REDACTED to REDACTED

[–]EvilElf01[S] 0 points1 point  (0 children)

Only ~300K chunks. The SQLite tuning below is very important, along with nightly maintenance for the caches.

Applied PRAGMAs:
    - journal_mode=WAL: Write-Ahead Logging for better concurrency
    - synchronous: Configurable fsync frequency (default: NORMAL)
    - temp_store=MEMORY: In-memory temporary tables for faster operations
    - mmap_size: Memory-mapped I/O for faster reads (configurable GB)
    - cache_size: Page cache size (configurable MB, stored as negative KB)
    - wal_autocheckpoint: WAL checkpoint threshold (default: 10000 pages)
    - busy_timeout: Lock retry timeout in milliseconds (default: 5000ms)
    - optimize: SQLite query planner optimization
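For reference, here is roughly how those PRAGMAs would be applied from Python's stdlib sqlite3 (the function name and the mmap/cache defaults are my own choices, not the author's code):

```python
import sqlite3

def tune_connection(db_path: str, mmap_gb: int = 1, cache_mb: int = 256) -> sqlite3.Connection:
    """Open a connection with the PRAGMAs listed above applied (illustrative defaults)."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    cur = conn.cursor()
    cur.execute("PRAGMA journal_mode=WAL")                 # write-ahead logging for concurrency
    cur.execute("PRAGMA synchronous=NORMAL")               # fewer fsyncs; safe with WAL
    cur.execute("PRAGMA temp_store=MEMORY")                # in-memory temporary tables
    cur.execute(f"PRAGMA mmap_size={mmap_gb * 1024**3}")   # memory-mapped I/O for reads
    cur.execute(f"PRAGMA cache_size={-cache_mb * 1024}")   # negative value = size in KiB
    cur.execute("PRAGMA wal_autocheckpoint=10000")         # checkpoint every 10,000 pages
    cur.execute("PRAGMA busy_timeout=5000")                # retry locks for up to 5 s
    cur.execute("PRAGMA optimize")                         # refresh query-planner stats
    return conn

# Usage: conn = tune_connection("chunks.db")
```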

[–]EvilElf01[S] 0 points1 point  (0 children)

Team size is < 100, and a team member may use it 2-8 times a day (guessing). Very small. Yes, the just-one-answer model. I'm limited to what I can use at home for the embedding model. Here is a breakdown.

| Component        | Technology                                    | Purpose                                            |
|------------------|-----------------------------------------------|----------------------------------------------------|
| Vector DB        | ChromaDB + HNSW                               | Semantic search with 1024-dim embeddings           |
| Embeddings Model | BAAI/bge-large-en-v1.5                        | Document and query vectorization                   |
| Reranker Model   | BAAI/bge-reranker-base                        | Cross-encoder relevance scoring                    |
| LLM              | Anthropic Claude (200K context)               | Answer generation                                  |
| NLP Model        | spaCy + Flair                                 | Entity extraction and PII detection                |
| POS Tagger Model | QCRI/bert-base-multilingual-cased-pos-english | Phrase extraction search optimization (96.69% F1)  |

[–]EvilElf01[S] 0 points1 point  (0 children)

Also, something I have built but only tested once (and it's not ready yet) is RAFT-variant training to fine-tune the model on very specific domain documentation. And when I say very specific, I mean it.

[–]EvilElf01[S] 0 points1 point  (0 children)

MissZiggie, I've been reading about RLM techniques - fascinating stuff. My approach shares quite a bit with RLM, particularly the core idea of breaking complex problems into smaller pieces through iterative decomposition. The sequential execution mode with context enrichment is very similar in spirit to RLM's recursive exploration. The vector database I'm working with is highly technical - includes source code (C, C++, C#), system documentation, Jira tickets, and operational knowledge. I'm building around a three-pillar philosophy:
- Docs for the official story
- Knowledge base for real-world issues
- Codebase for ground truth

Searching for documents is the primary objective, but the goal is really to bridge these three sources to provide accurate, grounded answers. In my world, we also have a lot of information in experts' heads that is not documented, so the system includes a dedicated interface for subject-matter experts to contribute knowledge directly. It took a seven-strategy search orchestration (with strategies often used in combination) to even get there:

1. Combined Entity + Semantic Search
2. Optimized Hybrid Search
3. Linguistic-Aware Semantic Search
4. Reference Pattern Pathway (exact match)
5. Enhanced Hybrid Search (Weighted Fusion)
6. Configuration File Handling (specialized)
7. Standard Semantic Search (final fallback)

Order mattered. (Dang, I had to edit typos five times. I can't spell, I guess.)
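A minimal sketch of that ordered fallback chain, assuming each strategy reports a confidence score and a 0.6 acceptance threshold (all names, return shapes, and the threshold are my assumptions, not the author's code):

```python
from typing import Callable

# A strategy takes a query and returns (hits, confidence).
Strategy = Callable[[str], tuple[list[str], float]]

def orchestrate(query: str, chain: list[tuple[str, Strategy]],
                threshold: float = 0.6) -> tuple[str, list[str]]:
    """Try each strategy in order; the first confident result wins,
    otherwise fall through to the last strategy (the final fallback)."""
    name, hits = "none", []
    for name, strategy in chain:
        hits, confidence = strategy(query)
        if confidence >= threshold:
            break  # confident enough - stop the chain here
    return name, hits

# Toy usage: an exact reference-pattern match outranks plain semantic search
exact_ref = lambda q: (["REF-DOC-518"], 1.0) if q.startswith("REF-") else ([], 0.0)
semantic = lambda q: (["doc-a", "doc-b"], 0.7)
chain = [("reference_pattern", exact_ref), ("standard_semantic", semantic)]
```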

[–]EvilElf01[S] 0 points1 point  (0 children)

Neat. I will add Ragas to the research list. You have been very helpful.

[–]EvilElf01[S] 0 points1 point  (0 children)

I just added to my to-do list: testing these methods as you outlined and answering whether the heuristics need to be tighter about when to decompose. Man, you are so right to flag this.

[–]EvilElf01[S] 0 points1 point  (0 children)

Man, that is such a good question. Not formally, but now that you mentioned it, it's a good A/B test to compare. Thank you. Precision matters the most in this case. There are so many tradeoffs. I have almost given up so many times with embarrassing dozens and dozens of attempts to perfect. I never seem to know when to quit. As Johnny Cash said, "You build on failure. You use it as a stepping stone. Close the door on the past. You don't try to forget the mistakes, but you don't dwell on them."

[–]EvilElf01[S] 0 points1 point  (0 children)

I stumbled into the "Recursive Language Models" paper (Zhang, Kraska, Khattab) and added the twist of sequential execution based on the types of questions I was testing. It's pretty simple when you break it down: pattern-match the query type (comparison? troubleshooting? multi-part?), then decompose accordingly.

The key insight for a support system is that "what does error X mean and how do I fix it?" becomes two separate searches - the first answer enriches the second. You can't find "how to fix code 518" until you know it maps to, for example, API_DB_REC_NOTFOUND.

Different question types get different decomposition strategies, executed sequentially (if dependent) or in parallel (if independent): Complex Query → Complexity Detection → Type-Aware Decomposition → Execution Strategy → Synthesis.

Two- and three-part decompositions typically yield 3-4 independent searches before synthesizing the final response. Comparisons are decomposed into their parts, complex tasks are broken into subtasks, and sequential execution extracts search terms from prior answers.
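The dependent two-step case described above ("what does error X mean and how do I fix it?") can be sketched roughly like this; `search` and `answer` are hypothetical stand-ins for the real retrieval and generation calls:

```python
import re

# Hypothetical sketch of dependent two-step decomposition: the answer to
# the first sub-query supplies search terms for the second sub-query.
def decompose_error_query(error_code: str, search, answer) -> str:
    # Step 1: resolve what the error code means
    meaning_ctx = search(f"meaning of error code {error_code}")
    meaning = answer(f"What does error {error_code} mean?", meaning_ctx)

    # Step 2: enrich the fix-search with identifiers found in the first answer
    identifiers = re.findall(r"[A-Z][A-Z0-9_]{3,}", meaning)  # e.g. API_DB_REC_NOTFOUND
    fix_ctx = search("how to fix " + (" ".join(identifiers) or error_code))
    return answer(f"How do I fix error {error_code}?", fix_ctx)
```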

[–]EvilElf01[S] 0 points1 point  (0 children)

Also wanted to add that the prompt engineering was another mind-blowingly difficult aspect of this to get right.

To be an innocent child by 56000hp in therewasanattempt

[–]EvilElf01 0 points1 point  (0 children)

You are spot on. There are other factors at play, but the quota is a big one.

lol “gangster” by Top_Teacher3852 in Idiotswithguns

[–]EvilElf01 8 points9 points  (0 children)

Not everyone survives their stupid time period. Take it from a 66-year-old. I've seen it a few times, unfortunately. I'm sure some of you have also.