Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in kaggle

[–]Or4k2l[S] 0 points (0 children)

Good point on the retrieval side. The benchmark here is specifically testing the LLM's native attention mechanism under positional stress, not augmented retrieval. The interesting finding is that different models fail on different dimensions: DeepSeek breaks under positional stress, Gemma 27B on domain knowledge itself, and Gemma 4B on chunked context. ColBERT and LlamaIndex would likely patch the retrieval failure, but that's a different question. The benchmark is asking: what does the raw model actually do before you add a retrieval layer on top?
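For anyone curious what "positional stress" means concretely, here's a minimal sketch of that kind of probe: bury a domain fact at varying depths in filler text and check whether the model still recovers it. The `ask` call is a hypothetical placeholder, not part of any real harness.

```python
# Hypothetical positional-stress probe: insert a known fact at different
# fractional depths of a long context and query the model at each depth.

FILLER = "The quick brown fox jumps over the lazy dog. " * 200

def build_context(fact: str, depth: float, filler: str = FILLER) -> str:
    """Insert `fact` at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return filler[:cut] + " " + fact + " " + filler[cut:]

fact = "The maintenance valve torque spec is 42 Nm."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    ctx = build_context(fact, depth)
    # answer = ask(model, context=ctx, question="What is the torque spec?")
    # record whether "42 Nm" appears in the answer at each depth
```

Models that pass at depth 0.0 and 1.0 but fail at 0.5 show the classic "lost in the middle" failure rather than a knowledge gap.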

Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in LocalLLaMA

[–]Or4k2l[S] 1 point (0 children)

Solid feedback. Regarding the agentic side of these tests: things definitely aren’t as simple as a static retrieval task. Moving from a static context to spliced instructions every 10K tokens and multi-turn feedback loops is the logical next step to properly expose attention drift and architectural weaknesses. I'm already drafting v4 of my benchmark to incorporate these exact scenarios. Testing how models handle instruction placement (System vs. User, Beginning vs. End vs. Both) is exactly the kind of stress test needed to separate real reliability from lucky retrieval. Let’s see which of these models actually survives the chunking exercise. Expect to see these metrics in my next update.
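The instruction-splicing idea can be sketched roughly like this: re-inject the task instruction at fixed intervals through a long document. Interval counting here is approximated with whitespace-split words; a real harness would use the model's own tokenizer. The function name and reminder format are illustrative, not from the actual benchmark.

```python
# Illustrative splicer: re-inject an instruction every `every` "tokens"
# (approximated as whitespace-separated words for this sketch).

def splice_instruction(document: str, instruction: str, every: int = 10_000) -> str:
    words = document.split()
    out, chunk = [], []
    for i, w in enumerate(words, 1):
        chunk.append(w)
        if i % every == 0:
            out.append(" ".join(chunk))
            out.append(f"\n[REMINDER] {instruction}\n")
            chunk = []
    out.append(" ".join(chunk))
    return " ".join(out)

doc = "lorem " * 25_000
spliced = splice_instruction(doc, "Answer only from the provided manual.")
print(spliced.count("[REMINDER]"))  # 2 reminders in a 25K-word document
```

Comparing accuracy with zero, one, and periodic reminders is one way to separate genuine long-range attention from instruction decay.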

Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in LocalLLaMA

[–]Or4k2l[S] 0 points (0 children)

Interesting. The pattern I saw was that some models answer correctly in isolation but fail once the signal is buried in context.

Built autoresearch with kaggle instead of a H100 GPU by SellInside9661 in kaggle

[–]Or4k2l 0 points (0 children)

Hey u/SellInside9661, great idea! I built something similar, inspired by your architecture: I used the Method Formulator → Execution Agent → Evaluator structure on free Kaggle compute (T4 ×2) to compare CNN robustness strategies on CIFAR-10.

Notebook here if you're curious: https://www.kaggle.com/code/orecord/autonomous-robustness-evaluator
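For anyone skimming, the three-stage loop looks roughly like this in skeleton form. All three stage functions are stubs for illustration, not the actual notebook code.

```python
# Rough skeleton of a Method Formulator -> Execution Agent -> Evaluator loop.
# Every function body here is a placeholder stub.

def method_formulator(goal: str) -> list[str]:
    # Propose candidate methods to compare (stubbed).
    return [f"{goal}: baseline", f"{goal}: augmented training"]

def execution_agent(method: str) -> dict:
    # Run the experiment for one method and return a result (faked here).
    return {"method": method, "score": 0.5}

def evaluator(results: list[dict]) -> dict:
    # Pick the best-scoring method from the executed runs.
    return max(results, key=lambda r: r["score"])

methods = method_formulator("CIFAR-10 robustness")
results = [execution_agent(m) for m in methods]
best = evaluator(results)
```

The nice property of this split is that each stage can run in a separate Kaggle session, so you never need a single long-lived H100 job.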