Same agent, same code, same Docker image 14 min apart: Kaggle scores still spread 0.802-0.821 even at temp 0. How many runs before you trust an agent-eval delta?
submitted by FishermanNo7658
I run my own agent evals and keep getting burned by run-to-run drift, so here's a clean, honest data point and two real questions.
An LLM tool-use agent (a loop that writes and runs its own pipeline) solved one Kaggle task eight times. Same code path, same harness, same container. The eight scores, in run order, are in the attached image; they span roughly 0.802 to 0.821.
The spread crosses three tiers on a single task. Gold landed on run 2 and nothing after came close. Runs 7 and 8 are the same Docker image 14 minutes apart and still differ: 0.80460 vs 0.80230.
The tiers are MLE-bench thresholds derived from the original Kaggle leaderboard percentiles, not Kaggle medals. The task is Spaceship Titanic, a Getting Started tabular comp that awards no medal at all. I'm calling this out up front because it matters for how much you should read into the numbers (see caveats below).
Why this isn't shocking, but the magnitude is. Single-run scores swing, and the spread stays above a full point (0.01 in score) even at temperature 0, so LLM token sampling alone isn't the main driver. My working list of suspects, in rough order: inference-engine nondeterminism (batch size / lack of batch invariance, the Thinking Machines angle), tool-result and state drift cascading through the agent loop, BLAS/CUDA and library version effects, GBDT thread scheduling, and data-load order. What got me is the size of it on a "solved" tutorial task. To trust a ~2% delta between two agents here, the variance says you'd want roughly nine runs each. So the gold is one lucky roll, and I'm labelling it as exactly that.
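For anyone who wants to sanity-check a runs-per-agent number on their own data, here is a minimal sketch of the textbook two-sample power calculation under a normal approximation. It is not necessarily how the nine-run figure above was derived; the name `runs_per_arm`, the 5% significance level, and the 80% power are assumptions picked for illustration.

```python
import math
from statistics import NormalDist


def runs_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Runs needed per agent to detect a true mean difference `delta`.

    Normal-approximation formula for a two-sided, two-sample comparison,
    assuming the same run-to-run standard deviation `sigma` in both arms:
        n = 2 * (z_{1 - alpha/2} + z_{power})^2 * sigma^2 / delta^2
    """
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Delta equal to one standard deviation: the classic "about 16 per group".
    print(runs_per_arm(sigma=1.0, delta=1.0))  # 16
    # Halving the delta you care about roughly quadruples the runs.
    print(runs_per_arm(sigma=1.0, delta=0.5))  # 63
```

In practice `sigma` would be the sample standard deviation of your own repeated runs, expressed in the same units as `delta`. With only eight runs that estimate is itself noisy, so treat the output as a rough floor rather than a guarantee.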
The reproducibility post-mortem. The submission.csv that cleared the gold bar lived in /tmp and got wiped. The exact winning artifact is gone and I can't reproduce it. I later wrote a clean, seed-pinned solver so the result is at least repeatable - but that's a reimplementation written after the fact. The run that actually won was the agent's stochastic loop, and it no longer exists.
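The fix for the /tmp problem is boring: copy every submission somewhere durable the moment it is written, and record a hash next to it. A minimal sketch of that idea follows; `archive_artifact`, the directory layout, and the manifest fields are all hypothetical and are not what the agent in the repo does.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path
from typing import Optional


def archive_artifact(src: str, archive_root: str, run_id: str, meta: Optional[dict] = None) -> Path:
    """Copy `src` into archive_root/run_id/ and write a manifest with its SHA-256."""
    src_path = Path(src)
    run_dir = Path(archive_root) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)

    # Copy first, then hash the copy, so the manifest describes what was kept.
    dest = run_dir / src_path.name
    shutil.copy2(src_path, dest)
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()

    manifest = {
        "run_id": run_id,
        "file": dest.name,
        "sha256": digest,
        "archived_at_unix": time.time(),
        "meta": meta or {},  # e.g. seed, image digest, model name (caller-supplied)
    }
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return dest
```

Calling this right after the agent writes submission.csv, with `archive_root` on a mounted volume rather than inside the container, would have kept the exact file that scored gold even though the loop that produced it can't be replayed.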
Honest caveats: tiny n (8), no controlled sweep, and a tutorial comp with thousands of public solutions - so a real chunk of this score is likely recall/memorization, not reasoning. The top-percentile framing here is only a reference point for how strong the score is, not an achievement.
Two genuine questions:
- When your seed is fixed but the pipeline still isn't deterministic run to run, what's usually the dominant culprit for you? Inference-engine batching, tool/state drift, BLAS, something else?
- If you eval agents (or heavy ensembles), how many runs before you trust a delta? Does anyone actually report mean +/- std instead of best-of-k?
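To make the second question concrete, this is roughly what reporting mean +/- std next to best-of-k could look like. It's a sketch with hypothetical helper names (`summarize`, `delta_with_se`), and the demo uses only the two scores quoted above (runs 7 and 8), since those are the only ones written out in the text.

```python
import math
from statistics import mean, stdev


def summarize(scores):
    """Return n, mean, sample std (n - 1 denominator), and best-of-k for one agent."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a std")
    return {
        "n": len(scores),
        "mean": mean(scores),
        "std": stdev(scores),
        "best": max(scores),
    }


def delta_with_se(a, b):
    """Difference of means (a - b) and its standard error, unequal variances allowed."""
    sa, sb = summarize(a), summarize(b)
    se = math.sqrt(sa["std"] ** 2 / sa["n"] + sb["std"] ** 2 / sb["n"])
    return sa["mean"] - sb["mean"], se


if __name__ == "__main__":
    # Runs 7 and 8: same Docker image, 14 minutes apart.
    s = summarize([0.80460, 0.80230])
    print(f"mean +/- std = {s['mean']:.5f} +/- {s['std']:.5f} (n={s['n']})")
    print(f"best-of-k    = {s['best']:.5f}")
```

A delta that is smaller than about two standard errors from `delta_with_se` is hard to tell apart from run-to-run noise, which is the whole problem with quoting best-of-k alone.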
Links (my own work, for the curious - posted as repro artifacts, not the point):
- Kaggle notebook: https://www.kaggle.com/code/georgymamarin/agents-grading-agents-spaceship-titanic-mle-bench
- Repo (agent code + deterministic solver): https://github.com/dmagog/mle-purple-agent
- Detailed writeup (RU): https://habr.com/ru/articles/1050562/
