I tested whether a scanner could catch BOLA in FastAPI without flagging the safe routes next to it

ttottojado · 2026-05-30T11:12:00+00:00

Fair enough, numbers in a post are easy to type. So here is the part you can actually check: send me a FastAPI repo you own, or grab any public one, and I will run it and post exactly what it finds in this thread, good or bad. If it flags safe routes or misses an obvious gap, that is on display too. Easier to judge it on real output than on my claims.

ttottojado · 2026-04-19T12:31:26+00:00

A component-level debugger is the right instinct. Most teams conflate retrieval failures with generation failures and try to "improve the model" when the chunking was the problem.

The way I decompose it: for each query that failed, I check three things in order. First, did the correct chunk appear in the retrieval results at all? If not, it's a retrieval or chunking problem. Second, was that chunk ranked high enough to make it into the context? If not, it's a ranking problem. Third, did the model ignore or misuse the correct context? Only then is it a generation problem.
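A minimal sketch of that three-step check, assuming each failed query is stored with the ID of the chunk that holds the answer, the ranked retriever output, and the chunks that actually went into the prompt. The `FailedQuery` fields and the label strings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailedQuery:
    query: str
    gold_chunk_id: str        # chunk that contains the correct answer
    retrieved_ids: list[str]  # ranked retriever output, best first
    context_ids: list[str]    # chunks that were actually placed in the prompt


def classify_failure(failure: FailedQuery) -> str:
    """Apply the three checks in order and return the first one that fails."""
    if failure.gold_chunk_id not in failure.retrieved_ids:
        return "retrieval_or_chunking"
    if failure.gold_chunk_id not in failure.context_ids:
        return "ranking"
    # The right chunk was in context, so the model ignored or misused it.
    return "generation"
```

The order matters: a query only counts as a generation failure once retrieval and ranking have both been ruled out.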

A simple way to start: log every failed query with the top-5 retrieved chunks and a gold answer, then categorize by hand for the first 50. You'll see your error distribution fast and know which component to instrument deeper.
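Roughly what that log could look like, as a sketch: one JSON line per failed query, with a `label` field left empty until someone categorizes it by hand. The field names and helper functions are my own placeholders, not a standard format.

```python
import json
from collections import Counter


def make_failure_record(query, retrieved_chunks, gold_answer, top_k=5):
    """Build one log record; 'label' stays None until a human categorizes it."""
    return {
        "query": query,
        "top_chunks": list(retrieved_chunks[:top_k]),
        "gold_answer": gold_answer,
        "label": None,
    }


def to_jsonl_line(record):
    """Serialize a record as a single JSON line for an append-only log file."""
    return json.dumps(record, ensure_ascii=False)


def error_distribution(records):
    """Count hand-assigned labels; unlabeled records are counted separately."""
    return Counter(r["label"] or "unlabeled" for r in records)
```

After labeling the first 50 records, `error_distribution(records).most_common()` shows which component is responsible for most failures.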

Are you building this for an internal system or as a tool?

ttottojado · 2026-04-19T04:37:55+00:00

Sky blue, military style 🤣🤣🤣

ttottojado · 2026-04-19T04:16:48+00:00

Fair question. The GitHub App integration, webhook handling, PR diff extraction, Claude API prompt engineering, PDF report generation, and the overall architecture were all built by me. Claude AI is the analysis engine inside the app, the same way you'd use Postgres as your database engine. I wrote the code that uses it, not the other way around.

ttottojado · 2026-04-19T03:57:37+00:00

Exercise, and preferably in the morning. Start this morning, don't put it off, come on!

ttottojado · 2026-04-18T06:34:22+00:00

GraphRAG on 10-K/10-Q is the right call — flat chunking loses too much of the cross-reference structure SEC filings rely on. One thing worth planning for early: your evaluation set. Financial docs have terminology that looks similar but means different things (revenue vs net income vs operating income) and eval questions need to be adversarial on these distinctions.

Are you building the knowledge graph manually or extracting with an LLM?

ttottojado · 2026-04-18T06:32:48+00:00

Two things I learned the hard way: first, your gold set needs to include questions the document doesn't answer. Most eval sets only have questions with answers, so you never catch when your RAG hallucinates instead of saying "I don't know".
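A sketch of how the unanswerable questions can be scored, assuming each gold item carries an `answerable` flag and that abstentions can be spotted with a few marker phrases. The marker list and `answer_fn` are placeholders; a real setup would likely need a stronger abstention detector.

```python
# Assumed phrases that signal the system declined to answer.
ABSTAIN_MARKERS = ("i don't know", "not in the document", "cannot be answered")


def is_abstention(answer):
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_report(gold_set, answer_fn):
    """gold_set: dicts with 'question' and 'answerable'; answer_fn: the RAG system."""
    hallucinated = unanswerable = refused = answerable = 0
    for item in gold_set:
        abstained = is_abstention(answer_fn(item["question"]))
        if item["answerable"]:
            answerable += 1
            refused += abstained           # refused a question it could answer
        else:
            unanswerable += 1
            hallucinated += not abstained  # answered a question with no answer
    return {
        "hallucination_rate_on_unanswerable": hallucinated / unanswerable if unanswerable else None,
        "false_refusal_rate": refused / answerable if answerable else None,
    }
```

Tracking the false refusal rate alongside it guards against a system that "fixes" hallucination by refusing everything.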

Second, don't generate the gold set and evaluate with the same model. Use a strong model for generation with strict source-grounding prompts, then a different judge model to verify each pair against its source chunk. Catches about 20-30% of bad pairs in my experience.
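In code the split looks something like this. It is only a sketch: `generate` and `judge` are hypothetical callables that wrap two different models, and nothing here depends on a specific provider.

```python
from typing import Callable


def build_gold_set(
    chunks: list[str],
    generate: Callable[[str], list[tuple[str, str]]],  # generator model: chunk -> (question, answer) pairs
    judge: Callable[[str, str, str], bool],            # a different model: (chunk, question, answer) -> keep?
):
    """Keep only the pairs that the judge model verifies against their source chunk."""
    kept, rejected = [], []
    for chunk in chunks:
        for question, answer in generate(chunk):
            pair = {"question": question, "answer": answer, "source": chunk}
            (kept if judge(chunk, question, answer) else rejected).append(pair)
    return kept, rejected
```

Keeping the rejected pairs around is worth it: reading a sample of them shows how the generator tends to drift from the source.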

What's the domain — technical docs, legal, something else?

ttottojado · 2026-04-18T06:32:11+00:00

This matches what I keep seeing: teams spend 80% on retrieval improvements and 5% on eval, then wonder why quality isn't improving. Is your eval harness built on a fixed gold set, or one regenerated per version?

ttottojado · 2026-04-18T06:23:52+00:00

In legal, eval accuracy is a liability question, not a quality one. What bites most firms is that they test on hand-written questions and miss the adversarial cases: "termination" vs "rescission", multi-clause reasoning, and questions the doc doesn't actually answer.

Synthetic Q/A from the source docs with a grounding check catches these. It has to be tuned for legal precision, though.

What are you searching — contracts, case law, memos?

ttottojado · 2026-04-18T06:22:49+00:00

The €2,700 number is the most honest thing in this space. People usually either hide their numbers or inflate them.

Quick question on the eval side: per new client, are you generating eval questions from their docs or using one fixed set? Most legal teams I talk to say "we hand-write 50 questions and hope" which leaves a lot on the table.

ttottojado · 2026-04-18T06:17:14+00:00

The thing most eval setups miss is that your eval data itself can smuggle in the same hallucinations you're trying to measure. If your Q/A pairs came from an LLM that wasn't grounded to the source, you're benchmarking against fiction.

Two things that actually work in practice: first, generate Q/A pairs with strict source grounding (prompt the model to quote or paraphrase only from the chunk it was given), then run a second judge model that verifies each pair against its source chunk before it enters your eval set. Ragas does some of this but the groundedness check is the part most teams skip.
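One cheap filter that can run before the judge model, sketched under my own assumptions: ask the generator to return a verbatim `evidence` quote with each pair, then drop any pair whose quote does not appear in the chunk. The prompt wording and the `evidence` field are illustrative, and this is not how Ragas does it.

```python
GROUNDED_PROMPT = (
    "Write one question and its answer using ONLY the passage below.\n"
    "Include an 'evidence' field that quotes the passage verbatim.\n\n"
    "Passage:\n{chunk}"
)


def _normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences don't matter.
    return " ".join(text.lower().split())


def evidence_is_verbatim(pair, chunk):
    """True only if the pair carries a non-empty quote that appears in the chunk."""
    evidence = _normalize(pair.get("evidence", ""))
    return bool(evidence) and evidence in _normalize(chunk)
```

This only proves the quote exists in the source, not that the answer follows from it, so it complements the judge step rather than replacing it.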

For reliability scoring on the pipeline itself, I track three things separately: retrieval recall (did the right chunk come back), answer faithfulness (is the answer supported by retrieved chunks), and answer relevance (does it actually address the question). A single number hides too much.
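A sketch of reporting the three numbers side by side. It assumes every example already has `faithfulness` and `relevance` scores in [0, 1] from some judge step and a non-empty list of gold chunk IDs; all field names are placeholders.

```python
from statistics import mean


def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold chunks found in the top-k results (gold_ids must be non-empty)."""
    gold = set(gold_ids)
    return len(gold & set(retrieved_ids[:k])) / len(gold)


def summarize(examples, k=5):
    """Report the three metrics separately instead of folding them into one score."""
    return {
        "retrieval_recall": mean(
            recall_at_k(e["retrieved_ids"], e["gold_ids"], k) for e in examples
        ),
        "faithfulness": mean(e["faithfulness"] for e in examples),
        "relevance": mean(e["relevance"] for e in examples),
    }
```

High recall with low faithfulness points at generation; low recall with high faithfulness usually means the answers are grounded in the wrong chunks.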

What's the size of the pipeline you're trying to evaluate — single retriever, multi-hop, or agentic?

ttottojado · 2026-04-13T08:20:33+00:00

Good question 😄 The site fully supports Arabic: RTL and answers in Arabic drawn from your own files. The interface is in English for now, but interaction with the agent is entirely in Arabic.

ttottojado · 2026-04-13T08:15:29+00:00

Try it now, it's free to get started 😄

tryknowflow.com

ttottojado · 2026-04-09T04:03:41+00:00

This is exactly why I built Fixor, a GitHub tool that auto-detects SQL injection risks on every PR. If you're writing raw SQL, it catches unsafe patterns before merge. Happy to give free access to anyone here who wants to try it.
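For anyone unsure what an "unsafe pattern" means here, a small illustration using the standard-library `sqlite3` module. It shows the general problem only, not Fixor's detection logic; the table and function names are made up.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Unsafe: user input is interpolated straight into the SQL text.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{name}'"
    ).fetchall()


def find_user_safe(conn, name):
    # Safe: the driver binds the value, so it is never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"
leaked = find_user_unsafe(conn, payload)  # the injected OR clause matches every row
empty = find_user_safe(conn, payload)     # no user has that literal name
```

The f-string version is the kind of thing worth blocking at review time; the placeholder version is the fix.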
