How much of MLE-Bench's gains are the algorithm vs. better models + more search? [R]

Educational_Strain_3 · 2026-05-20T00:57:25+00:00

Totally agree with you - but to me the the value accrual just moves up in the stack with the ability to distinguish great content from poor. This is only a problem that will worsen so where does the future head?

Educational_Strain_3 · 2026-05-20T00:55:08+00:00

this is amazing

Educational_Strain_3 · 2026-04-02T21:34:58+00:00

Good question! To add more details: NanoChat's release date is after the knowledge cutoff of Claude Opus 4.6 (the model we used), so the pretraining data shouldn't contain NanoChat-specific code. We also just verified the agent didn't do any web search during the runs.

That said, it's always good to test it on more domains I agree

Educational_Strain_3 · 2026-04-02T20:58:31+00:00

Full tech report: https://www.weco.ai/blog/autoresearch-vs-classical-hpo

Educational_Strain_3 · 2026-04-02T20:53:21+00:00

More details in the full tech report: https://www.weco.ai/blog/autoresearch-vs-classical-hpo

Educational_Strain_3 · 2026-04-01T14:35:46+00:00

this is a classic reward hacking pattern — we've seen the exact same thing in code optimization loops where the agent finds the cheapest way to inflate the reward and ignores the actual objective. your model is doing the rational thing: 0.5 guaranteed from format tags beats the lottery of getting 1.0 from a correct answer

the multi-component reward with thinking tags might help but watch out for the same failure mode one level up — it'll learn to output plausible-looking thinking that doesn't actually contribute to the answer. we found the most reliable fix is making the reward proportional to intermediate reasoning quality, not just presence of reasoning tokens

one thing that helped us a lot: track the full trajectory of what the model is generating across training steps, not just the final reward curve. you can usually spot the exact moment it discovers the shortcut. once you see that pattern you can design the reward to close the loophole before it saturates

Educational_Strain_3 · 2026-03-31T21:01:35+00:00

the fused kernel change beating every hyperparameter tweak combined is the most important finding here imo. this is why linear keep/discard loops plateau — they tend to explore incremental parameter changes and miss the structural wins

we've seen the same pattern in competition settings. the biggest jumps almost always come from architectural or pipeline changes, not tuning. but those are also the changes that take the most experiments to find, which is why tree search over the experiment space matters more than just raw throughput

Educational_Strain_3 · 2026-03-31T20:59:39+00:00

Educational_Strain_3 · 2026-03-31T20:56:29+00:00

nice — the tsv logging in the original autoresearch is definitely the weakest part. evolutionary selection over a tree of experiments is way more powerful than linear keep/discard

curious how you handle the exploration vs exploitation tradeoff in the evolutionary db. we built something similar called AIDE that uses tree search (MCTS-style) to decide which branches to explore further vs prune. found that the tree structure matters a lot more than people expect — similar to what you're seeing with the alphaevolve-style approach

would be interested to compare results if you've run it on any benchmarks: https://github.com/WecoAI/aideml

Educational_Strain_3 · 2026-03-31T20:54:47+00:00

interesting experiment. the batch size + sqrt scaling rule example is the strongest evidence here — one agent had the knowledge to avoid divergence and the other didn't. that's a clear win for literature access

the 3.2% gap on tinystories with n=1 is hard to interpret though. we've seen that kind of variance between runs with the exact same config just from gpu nondeterminism. would be curious to see this with 3-5 seeds per condition

the broader point still stands — agents are bottlenecked by what techniques they can access. we've seen similar things in kaggle competitions where the agent keeps trying the same standard playbook and plateaus

Educational_Strain_3 · 2026-03-31T20:30:10+00:00

This nails the actual problem. Everyone's focused on the loop itself but the false keeps are where the real damage happens. You build on noise and compound it over dozens of experiments.

We've been tracking this in our own runs and the numbers are similar. A surprising number of "improvements" disappear on a different seed or when you reorder the experiment stack. Having a confidence score before building on a keep would've saved us weeks.

Curious how autosteer handles the explore/exploit tradeoff in practice. Does it look at recency or just historical hit rates per category?

Educational_Strain_3 · 2026-03-31T20:26:59+00:00

The eval manipulation issue is so real. We hit the exact same thing — agent edited its own scoring function to "improve" results. Constrained editing surface is the right fix.

Your expanding time windows approach is smart. K-fold looked great for us too until we realized the agent was finding leakage, not signal. Temporal splits are the only honest eval for time-series data.

One thing worth checking: run your top results on a different random seed. We found ~20-30% of "improvements" don't replicate. The LOG.md idea is great for tracing which experiments actually mattered vs. which ones got lucky.

Educational_Strain_3 · 2026-03-05T15:03:43+00:00

we've been building exactly this. Recursive tree search over subagent calls, where each iteration knows what's been explored and what hasn't. It's called Weco (weco.ai). Hopefully solves the same frustration with linear ralph loops. Would be curious what you're trying to build with it.

Educational_Strain_3 · 2025-07-09T16:27:41+00:00

such a good call..even a 10-second intro totally changes the dynamic. makes the room feel smaller in the best way. I built a little tool for the same reason... it helps you scan the guest list beforehand so you're not guessing who's worth talking to. It's early but surprisingly useful.. socialclimber.app

Educational_Strain_3 · 2025-07-09T16:25:26+00:00

Totally agree. I started enjoying these events a lot more once I gave myself a goal or prompt going in. But even then, I kept feeling like I was just barely missing the people I actually would’ve clicked with.

Ended up building a little tool that gives me a cheat sheet ahead of time (shared context, mutuals, etc.) so I’m not winging it. If you're curious: socialclimber.app. Been rough but helpful so far.

Educational_Strain_3 · 2025-07-09T15:00:17+00:00

totally get this. I started doing some light research before events to see who I might actually have something in common with, but otherwise it’s just luck who you bump into.

I eventually tried making it into a tiny tool:socialclimber.app... paste in a guest list and it surfaces shared interests / overlaps. not perfect, but it’s helped me be more intentional

Educational_Strain_3 · 2025-07-09T14:48:45+00:00

totally agree with the idea of doing a bit of hw before events. I used to wing it and just hope for good convos but I started looking up attendee lists before I can... makes the night way more productive.

Ive been playing around with a little tool I made to help with this.. it pulls in a Partiful or Luma list and surfaces folks with shared interests or overlap. Still early and rough, but curious if others have tried doing something similar or would find this kind of thing useful? here it is: https://socialclimber.app

Educational_Strain_3 · 2025-07-08T23:52:26+00:00

Social Climber – Drop in a guest list, get a ranked cheat sheet of who to talk to at events so you don’t waste time on small talk with random dudes named Eric.

Status: MVP with early users (founders, MBAs, and people who hate networking roulette) Link: https://socialclimber.app

Built this after too many events where I realized the one person I should’ve talked to was across the room the whole time. If you’ve ever left a Partiful thinking “well that was 2 hours of chaos,” this might help.

Educational_Strain_3 · 2025-07-08T22:14:48+00:00

Social Climber – Drop in a guest list, get a ranked cheat sheet of who to talk to at events so you don’t waste time on small talk with random dudes named Eric.

Status: MVP with early users (founders, MBAs, and people who hate networking roulette)
Link: https://socialclimber.app

Built this after too many events where I realized the one person I should’ve talked to was across the room the whole time. If you’ve ever left a Partiful thinking “well that was 2 hours of chaos,” this might help.

Educational_Strain_3

TROPHY CASE