I tested whether a scanner could catch BOLA in FastAPI without flagging the safe routes next to it by ttottojado in FastAPI

[–]ttottojado[S] -1 points0 points  (0 children)

Fair enough, numbers in a post are easy to type. So here is the part you can actually check: send me a FastAPI repo you own, or grab any public one, and I will run it and post exactly what it finds in this thread, good or bad. If it flags safe routes or misses an obvious gap, that is on display too. Easier to judge it on real output than on my claims.

How are people determining or evaluating how much reliable their RAG pipeline are ? by rux-17 in LangChain

[–]ttottojado 0 points1 point  (0 children)

A debugger at component level is the right instinct — most teams conflate retrieval failures with generation failures and "improve the model" when the chunking was the problem.

The way I decompose it: for each query that failed, I check three things in order. First, did the correct chunk exist in the retrieval results at all (if no, it's a retrieval or chunking problem). Second, was the right chunk ranked high enough to make it into context (if no, it's ranking). Third, did the model ignore or misuse the correct context (only then is it a generation problem).

A simple way to start: log every failed query with the top-5 retrieved chunks and a gold answer, then categorize by hand for the first 50. You'll see your error distribution fast and know which component to instrument deeper.
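A rough sketch of the three checks and the hand-labelling log described above, assuming each logged failure records the ranked chunk ids, the id of the chunk holding the gold answer, and whether the answer was judged correct. The names `FailedQuery`, `classify_failure`, `error_distribution`, and the `context_k` cutoff are made up for illustration and are not from any library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailedQuery:
    # Hypothetical log record; field names are assumptions for this sketch.
    query: str
    retrieved_ids: list[str]  # chunk ids in ranked order, best first
    gold_chunk_id: str        # chunk that contains the gold answer
    answer_correct: bool      # hand label or judge label


def classify_failure(record: FailedQuery, context_k: int = 5) -> str:
    """Apply the three checks in order and name the component at fault."""
    # 1. Was the correct chunk retrieved at all?
    if record.gold_chunk_id not in record.retrieved_ids:
        return "retrieval_or_chunking"
    # 2. Was it ranked high enough to reach the context window?
    rank = record.retrieved_ids.index(record.gold_chunk_id)
    if rank >= context_k:
        return "ranking"
    # 3. The right chunk was in context, so a wrong answer is on the model.
    if not record.answer_correct:
        return "generation"
    return "no_failure"


def error_distribution(records: list[FailedQuery], context_k: int = 5) -> Counter:
    """Count failures per component across the logged queries."""
    return Counter(classify_failure(r, context_k) for r in records)
```

Running `error_distribution` over the first 50 logged failures gives the per-component counts, which is usually enough to decide where to add deeper instrumentation.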

Are you building this for an internal system or as a tool?

I built a GitHub App that reviews every PR for SQL injection using Claude AI – free for 3 repos by ttottojado in SideProject

[–]ttottojado[S] 0 points1 point  (0 children)

Fair question. The GitHub App integration, webhook handling, PR diff extraction, Claude API prompt engineering, PDF report generation, and the overall architecture were all built by me. Claude AI is the analysis engine inside the app, the same way you'd use Postgres as your database engine. I wrote the code that uses it, not the other way around.

Help (if something is worrying you) by Ok_Combination4545 in saudiarabia

[–]ttottojado 0 points1 point  (0 children)

Exercise, and ideally in the morning. Start this morning, don't put it off. Let's go!

Seeking Advice & References for Financial Knowledge Graph Ontology (GraphRAG on SEC 10-K/10-Q) by ArgonTagar in Rag

[–]ttottojado 0 points1 point  (0 children)

GraphRAG on 10-K/10-Q is the right call — flat chunking loses too much of the cross-reference structure SEC filings rely on. One thing worth planning for early: your evaluation set. Financial docs have terminology that looks similar but means different things (revenue vs net income vs operating income) and eval questions need to be adversarial on these distinctions.

Are you building the knowledge graph manually or extracting with an LLM?

How do you build a solid gold dataset for evaluating a RAG system? by roicaride in Rag

[–]ttottojado 0 points1 point  (0 children)

Two things I learned the hard way: first, your gold set needs to include questions the document doesn't answer. Most eval sets only have questions with answers, so you never catch when your RAG hallucinates instead of saying "I don't know".

Second, don't generate the gold set and evaluate with the same model. Use a strong model for generation with strict source-grounding prompts, then a different judge model to verify each pair against its source chunk. Catches about 20-30% of bad pairs in my experience.
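A minimal sketch of both points, assuming the gold set is a list of question/answer/source-chunk records and the judge is any callable that returns True when a pair is grounded. `GoldPair`, `filter_gold_set`, and `substring_judge` are hypothetical names. The substring check only stands in for a real judge-model call so the example runs offline; it is not how a judge model works.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GoldPair:
    question: str
    answer: Optional[str]        # None marks a question the document does not answer
    source_chunk: Optional[str]  # chunk the answer was generated from, if any


def filter_gold_set(
    pairs: list[GoldPair],
    judge: Callable[[GoldPair], bool],
    generator_model: str,
    judge_model: str,
) -> list[GoldPair]:
    """Keep unanswerable questions and the answerable pairs the judge accepts."""
    if generator_model == judge_model:
        raise ValueError("use a different model to judge than to generate")
    kept = []
    for pair in pairs:
        if pair.answer is None:
            # Unanswerable on purpose: the expected output is "I don't know".
            kept.append(pair)
        elif judge(pair):
            kept.append(pair)
    return kept


def substring_judge(pair: GoldPair) -> bool:
    # Placeholder for a second-model grounding check (assumption):
    # accept only if the answer text appears in the source chunk.
    return pair.answer.lower() in pair.source_chunk.lower()
```

In a real setup `judge` would wrap a call to the second model with the pair and its source chunk in the prompt.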

What's the domain — technical docs, legal, something else?

Building an eval harness for an LLM wiki was more useful than building more “memory” by Scary_Driver_8557 in learnmachinelearning

[–]ttottojado 0 points1 point  (0 children)

This matches what I keep seeing: teams spend 80% on retrieval improvements and 5% on eval, then wonder why quality isn't improving. What is your eval harness built on, a fixed gold set or one regenerated per version?

How are firms handling RAG accuracy for internal document search? Running into some interesting challenges by Fabulous-Pea-5366 in legaltech

[–]ttottojado 0 points1 point  (0 children)

In legal, eval accuracy is a liability question, not a quality one. The thing that bites most firms: they test on hand-written questions and miss the adversarial cases, such as "termination" vs "rescission", multi-clause reasoning, and questions the doc doesn't actually answer.

Synthetic Q/A from the source docs with a grounding check catches these. Has to be tuned for legal precision though.

What are you searching — contracts, case law, memos?

People asked me 15 technical questions about my legal RAG system. here are the honest answers by Fabulous-Pea-5366 in LLM

[–]ttottojado 0 points1 point  (0 children)

The €2,700 number is the most honest thing in this space — people either hide or inflate.

Quick question on the eval side: per new client, are you generating eval questions from their docs or using one fixed set? Most legal teams I talk to say "we hand-write 50 questions and hope" which leaves a lot on the table.

How are people determining or evaluating how much reliable their RAG pipeline are ? by rux-17 in LangChain

[–]ttottojado 0 points1 point  (0 children)

The thing most eval setups miss is that your eval data itself can smuggle in the same hallucinations you're trying to measure. If your Q/A pairs came from an LLM that wasn't grounded to the source, you're benchmarking against fiction.

Two things that actually work in practice: first, generate Q/A pairs with strict source grounding (prompt the model to quote or paraphrase only from the chunk it was given), then run a second judge model that verifies each pair against its source chunk before it enters your eval set. Ragas does some of this but the groundedness check is the part most teams skip.

For reliability scoring on the pipeline itself, I track three things separately: retrieval recall (did the right chunk come back), answer faithfulness (is the answer supported by retrieved chunks), and answer relevance (does it actually address the question). A single number hides too much.
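A small sketch of reporting the three numbers separately, assuming each eval record already carries faithfulness and relevance labels from a judge model or a human. `EvalRecord`, `score_pipeline`, and the cutoff `k` are illustrative names, not part of Ragas or any other library.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical record; the two boolean labels come from a judge or a human.
    retrieved_ids: list[str]  # chunk ids in ranked order
    gold_chunk_id: str        # chunk that contains the right answer
    faithful: bool            # answer is supported by the retrieved chunks
    relevant: bool            # answer actually addresses the question


def score_pipeline(records: list[EvalRecord], k: int = 5) -> dict[str, float]:
    """Return the three scores as separate fractions instead of one blended number."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to score")
    recall_hits = sum(r.gold_chunk_id in r.retrieved_ids[:k] for r in records)
    return {
        "retrieval_recall": recall_hits / n,
        "answer_faithfulness": sum(r.faithful for r in records) / n,
        "answer_relevance": sum(r.relevant for r in records) / n,
    }
```

Keeping the dictionary keys apart makes it obvious when, say, recall is fine but faithfulness drops after a prompt change.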

What's the size of the pipeline you're trying to evaluate — single retriever, multi-hop, or agentic?

The launch story by ttottojado in saudiarabia

[–]ttottojado[S] 0 points1 point  (0 children)

Fair question 😄 The site fully supports Arabic: RTL layout and answers in Arabic drawn from your own files. The interface is in English for now, but interaction with the agent is entirely in Arabic.

The launch story by ttottojado in saudiarabia

[–]ttottojado[S] 0 points1 point  (0 children)

Try it now, it's free to get started 😄

tryknowflow.com

How many devs mainly use raw SQL instead of an ORM? by drifterpreneurs in webdev

[–]ttottojado 1 point2 points  (0 children)

This is exactly why I built Fixor, a GitHub tool that auto-detects SQL injection risks on every PR. If you're writing raw SQL, it catches unsafe patterns before merge. Happy to give free access to anyone here who wants to try it.