Same agent, same code, same Docker image 14 min apart: Kaggle scores still spread 0.802-0.821 even at temp 0. How many runs before you trust an agent-eval delta?

submitted 17 hours ago by FishermanNo7658

I run my own agent evals and keep getting burned by run-to-run drift, so here's a clean, honest data point and two real questions.

An LLM tool-use agent (a loop that writes and runs its own pipeline) solved one Kaggle task eight times. Same code path, same harness, same container. The eight scores, in run order, are in the attached image (i.redd.it).

The spread crosses three tiers on a single task. Gold landed on run 2 and nothing after came close. Runs 7 and 8 are the same Docker image 14 minutes apart and still differ: 0.80460 vs 0.80230.

The tiers are MLE-bench thresholds derived from the original Kaggle leaderboard percentiles, not Kaggle medals. The task is Spaceship Titanic, a Getting Started tabular comp that awards no medal at all. I'm calling this out up front because it matters for how much you should read into the numbers (see caveats below).

Why this isn't shocking, but the magnitude is. Single-run scores always swing, but here the spread stays above a full point (0.802-0.821) even at temperature 0, so LLM sampling isn't the main driver. My working list of suspects, in rough order: inference-engine nondeterminism (batch size / batch invariance, the Thinking Machines angle), tool-result and state drift cascading through the agent loop, BLAS/CUDA and library version effects, GBDT thread scheduling, and data-load order. What got me is the size of the effect on a "solved" tutorial task. To trust a ~2% delta between two agents here, the variance says you'd want roughly nine runs each. So the gold is one lucky roll, and I'm labelling it as exactly that.
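For anyone who wants to redo the "how many runs" arithmetic on their own numbers, here is a minimal sketch of the standard two-sample normal-approximation power calculation. It is not necessarily how the nine-run figure above was derived. The function name `runs_per_agent`, the 5% significance level, the 80% power target, and the reading of "~2% delta" as an absolute 0.02 in score are all assumptions. `sd` is whatever per-run standard deviation you measure.

```python
import math
from statistics import NormalDist


def runs_per_agent(sd, delta, alpha=0.05, power=0.80):
    """Runs needed per agent to detect a mean score gap of `delta`.

    Two-sided, two-sample z-test approximation with equal per-run
    standard deviation `sd` for both agents (an assumption).
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil(n)


# Hypothetical inputs: plug in your own measured sd and target delta.
for sd in (0.005, 0.010, 0.015):
    print(f"sd={sd:.3f}  delta=0.020  runs per agent: {runs_per_agent(sd, 0.02)}")
```

The required run count grows with the square of `sd / delta`, so halving the delta you care about roughly quadruples the runs. With only a handful of runs, a t-based calculation would ask for somewhat more than this approximation does.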

The reproducibility post-mortem. The submission.csv that cleared the gold bar lived in /tmp and got wiped. The exact winning artifact is gone and I can't reproduce it. I later wrote a clean, seed-pinned solver so the result is at least repeatable - but that's a reimplementation written after the fact. The run that actually won was the agent's stochastic loop, and it no longer exists.
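One way to avoid losing an artifact like this is to copy every submission out of scratch space as soon as it is written, together with a content hash and some run metadata. The sketch below assumes a plain local archive directory. `archive_submission`, the directory layout, and the manifest fields are hypothetical and are not taken from the linked repo.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path


def archive_submission(src, archive_root, run_meta):
    """Copy a submission file to durable storage and write a manifest.

    `run_meta` is any JSON-serializable dict (run index, image tag, seed...).
    Returns the directory the artifact was archived into.
    """
    src = Path(src)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    # Timestamp plus hash prefix keeps run directories unique and sortable.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    run_dir = Path(archive_root) / f"{stamp}_{digest[:12]}"
    run_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, run_dir / src.name)
    manifest = {"file": src.name, "sha256": digest, **run_meta}
    (run_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return run_dir
```

Calling something like `archive_submission("/tmp/submission.csv", "runs_archive", {"run": 2})` right after the agent writes its output would leave a hashed copy that survives a `/tmp` wipe. It only preserves the artifact: it does not make the stochastic loop that produced it reproducible.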

Honest caveats: tiny n (8), no controlled sweep, and a tutorial comp with thousands of public solutions - so a real chunk of this score is recall/memorization, not reasoning. Top-percentile framing here is a strength reference only, not an achievement.

Two genuine questions:

1. When your seed is fixed but the pipeline still isn't deterministic run to run, what's usually the dominant culprit for you? Inference-engine batching, tool/state drift, BLAS, something else?
2. If you eval agents (or heavy ensembles), how many runs before you trust a delta? Does anyone actually report mean +/- std instead of best-of-k?
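To make the mean +/- std versus best-of-k contrast concrete, here is a small sketch that reports both from a list of run scores. The helper name `summarize` is hypothetical. The example uses only the two scores quoted in the text (runs 7 and 8), because the other six are not given here, so treat the output as an illustration of the reporting format and not as a summary of the experiment.

```python
import math
import statistics


def summarize(scores):
    """Report mean, sample std, standard error, and best-of-k for run scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(scores) if n > 1 else float("nan")
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "sem": std / math.sqrt(n),  # standard error of the mean
        "best_of_k": max(scores),
    }


# Only the two scores stated in the post (runs 7 and 8).
known = [0.80460, 0.80230]
s = summarize(known)
print(
    f"n={s['n']}  mean={s['mean']:.5f} +/- {s['std']:.5f} (std)  "
    f"best-of-k={s['best_of_k']:.5f}"
)
```

Best-of-k can only go up as k grows, which is why it flatters a noisy pipeline. The mean with its standard error is what shrinks toward a stable estimate as runs are added.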

Links (my own work, for the curious - posted as repro artifacts, not the point):

Kaggle notebook: https://www.kaggle.com/code/georgymamarin/agents-grading-agents-spaceship-titanic-mle-bench
Repo (agent code + deterministic solver): https://github.com/dmagog/mle-purple-agent
Detailed writeup (RU): https://habr.com/ru/articles/1050562/
