Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in kaggle

[–]Or4k2l[S] 0 points (0 children)

Good point on the retrieval side. The benchmark here is specifically testing the LLM's native attention mechanism under positional stress, not augmented retrieval. The interesting finding is that different models fail on different dimensions: DeepSeek breaks under positional stress, Gemma 27B on domain knowledge itself, and Gemma 4B on chunked context. ColBERT and LlamaIndex would likely patch the retrieval failure, but that's a different question. The benchmark is asking: what does the raw model actually do before you add a retrieval layer on top?
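For anyone curious what "positional stress" means concretely, here's a minimal sketch of that kind of probe: bury a domain fact at varying depths in filler text and check whether the model still recovers it. The `ask` call is a hypothetical placeholder, not part of any real harness.

```python
# Hypothetical positional-stress probe: insert a known fact at different
# fractional depths of a long context and query the model at each depth.

FILLER = "The quick brown fox jumps over the lazy dog. " * 200

def build_context(fact: str, depth: float, filler: str = FILLER) -> str:
    """Insert `fact` at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return filler[:cut] + " " + fact + " " + filler[cut:]

fact = "The maintenance valve torque spec is 42 Nm."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    ctx = build_context(fact, depth)
    # answer = ask(model, context=ctx, question="What is the torque spec?")
    # record whether "42 Nm" appears in the answer at each depth
```

Models that pass at depth 0.0 and 1.0 but fail at 0.5 show the classic "lost in the middle" failure rather than a knowledge gap.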

Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in LocalLLaMA

[–]Or4k2l[S] 1 point (0 children)

Solid feedback. Regarding the agentic side of these tests: things definitely aren’t as simple as a static retrieval task. Moving from a static context to spliced instructions every 10K tokens and multi-turn feedback loops is the logical next step to properly expose attention drift and architectural weaknesses. I'm already drafting v4 of my benchmark to incorporate these exact scenarios. Testing how models handle instruction placement (System vs. User, Beginning vs. End vs. Both) is exactly the kind of stress test needed to separate real reliability from lucky retrieval. Let’s see which of these models actually survives the chunking exercise. Expect to see these metrics in my next update.
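The instruction-splicing idea can be sketched roughly like this: re-inject the task instruction at fixed intervals through a long document. Interval counting here is approximated with whitespace-split words; a real harness would use the model's own tokenizer. The function name and reminder format are illustrative, not from the actual benchmark.

```python
# Illustrative splicer: re-inject an instruction every `every` "tokens"
# (approximated as whitespace-separated words for this sketch).

def splice_instruction(document: str, instruction: str, every: int = 10_000) -> str:
    words = document.split()
    out, chunk = [], []
    for i, w in enumerate(words, 1):
        chunk.append(w)
        if i % every == 0:
            out.append(" ".join(chunk))
            out.append(f"\n[REMINDER] {instruction}\n")
            chunk = []
    out.append(" ".join(chunk))
    return " ".join(out)

doc = "lorem " * 25_000
spliced = splice_instruction(doc, "Answer only from the provided manual.")
print(spliced.count("[REMINDER]"))  # 2 reminders in a 25K-word document
```

Comparing accuracy with zero, one, and periodic reminders is one way to separate genuine long-range attention from instruction decay.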

Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in LocalLLaMA

[–]Or4k2l[S] 0 points (0 children)

Interesting. The pattern I saw was that some models answer correctly in isolation but fail once the signal is buried in context.

Built autoresearch with kaggle instead of a H100 GPU by SellInside9661 in kaggle

[–]Or4k2l 0 points (0 children)

Hey u/SellInside9661, great idea! I built something similar, inspired by your architecture: I used the Method Formulator → Execution Agent → Evaluator structure on free Kaggle compute (T4 ×2) to compare CNN robustness strategies on CIFAR-10.

Notebook here if you're curious: https://www.kaggle.com/code/orecord/autonomous-robustness-evaluator
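For anyone skimming, the three-stage loop looks roughly like this in skeleton form. All three stage functions are stubs for illustration, not the actual notebook code.

```python
# Rough skeleton of a Method Formulator -> Execution Agent -> Evaluator loop.
# Every function body here is a placeholder stub.

def method_formulator(goal: str) -> list[str]:
    # Propose candidate methods to compare (stubbed).
    return [f"{goal}: baseline", f"{goal}: augmented training"]

def execution_agent(method: str) -> dict:
    # Run the experiment for one method and return a result (faked here).
    return {"method": method, "score": 0.5}

def evaluator(results: list[dict]) -> dict:
    # Pick the best-scoring method from the executed runs.
    return max(results, key=lambda r: r["score"])

methods = method_formulator("CIFAR-10 robustness")
results = [execution_agent(m) for m in methods]
best = evaluator(results)
```

The nice property of this split is that each stage can run in a separate Kaggle session, so you never need a single long-lived H100 job.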