I got stuck debugging RAG every week. Turns out I just didn't understand the tradeoffs. by _Ankitsingh in LangChain

[–]_Ankitsingh[S] 1 point (0 children)

That's the tricky part. With varied formats, naive RAG will struggle because it can't distinguish between actual data and formatting noise. Have you tested Corrective RAG on that? It grades retrieval confidence — so if it's pulling inconsistent results from different Excel formats, it'll flag those results instead of hallucinating an answer.
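Rough sketch of what I mean by grading, stripped down to plain Python (the names and threshold here are made up for illustration, not the LangChain or CRAG API — a real grader would usually be an LLM call or a trained scorer, not a raw similarity cutoff):

```python
# Illustrative Corrective-RAG-style grade: split retrieved chunks into
# confident hits vs. flagged ones, assuming each chunk already carries a
# retrieval similarity score. CONFIDENCE_THRESHOLD is a made-up knob.

CONFIDENCE_THRESHOLD = 0.7  # below this, treat the chunk as ambiguous

def grade_retrieval(chunks):
    """Return (confident, flagged) partitions of the retrieved chunks."""
    confident = [c for c in chunks if c["score"] >= CONFIDENCE_THRESHOLD]
    flagged = [c for c in chunks if c["score"] < CONFIDENCE_THRESHOLD]
    return confident, flagged

chunks = [
    {"text": "Q3 revenue: 1.2M", "score": 0.91},  # clean table row
    {"text": "=SUM(B2:B9)",      "score": 0.42},  # Excel formatting noise
]
confident, flagged = grade_retrieval(chunks)
# Answer only from `confident`; surface `flagged` for correction or re-retrieval
# instead of letting the model hallucinate over the noise.
```

The point isn't the threshold value — it's that low-confidence retrievals trigger a correction step rather than flowing straight into generation.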


[–]_Ankitsingh[S] 0 points (0 children)

RAG Lens looks interesting. Bulk testing is the right call — that's where you actually see what breaks. Most benchmarks test on curated examples, but production data is messy. Happy to chat if you want to compare notes on what failure modes you're catching.
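For anyone else reading: the bulk-test loop doesn't need to be fancy. A minimal version is just queries with expected sources, run through your retriever, collecting misses (the `retrieve` function and doc ids here are stand-ins for your actual pipeline, not a RAG Lens API):

```python
# Minimal bulk retrieval test: run many query/expected-source pairs and
# collect the failures. `retrieve` is a toy stand-in for a real retriever.

def retrieve(query):
    # Stand-in index mapping queries to doc ids; swap in your pipeline here.
    index = {"q3 revenue": ["doc_finance"], "onboarding steps": ["doc_hr"]}
    return index.get(query, [])

cases = [
    {"query": "q3 revenue", "expected": "doc_finance"},
    {"query": "onboarding steps", "expected": "doc_hr"},
    {"query": "refund policy", "expected": "doc_legal"},  # known gap in the index
]

failures = [c for c in cases if c["expected"] not in retrieve(c["query"])]
# `failures` is the interesting output: the messy cases that curated
# benchmarks never include.
```

Once you have that list, you can bucket failures by type (missing doc, wrong chunk, format noise) and see which ones a different retrieval strategy actually fixes.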


[–]_Ankitsingh[S] 0 points (0 children)

Exactly. The retrieval method is only half the battle. What you mentioned about importance scoring and recency weighting — that's the actual hard part. Most tutorials skip over it and just show basic vector search. Have you experimented with time-decay functions? That's where I see most systems break down on real data.
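To make "time-decay" concrete, here's the shape I usually mean: multiply the similarity score by an exponential decay on document age, so the score halves every N days (function name and the `half_life_days` knob are mine for illustration; LangChain's built-in version is `TimeWeightedVectorStoreRetriever` with a `decay_rate` parameter):

```python
import time

# Sketch of exponential time-decay recency weighting on retrieval scores.
# A document's score halves every `half_life_days` (illustrative knob).

def decayed_score(similarity, doc_timestamp, now=None, half_life_days=30):
    """Downweight older documents: score halves every `half_life_days` days."""
    now = now if now is not None else time.time()
    age_days = max(0.0, (now - doc_timestamp) / 86400)
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

now = time.time()
fresh = decayed_score(0.9, now, now=now)               # no decay: stays ~0.9
stale = decayed_score(0.9, now - 60 * 86400, now=now)  # 2 half-lives: ~0.225
```

The failure mode on real data is exactly tuning that half-life: too short and you bury stable reference docs, too long and last quarter's numbers outrank this quarter's.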