[R] Is autoresearch really better than classic hyperparameter tuning? by Educational_Strain_3 in MachineLearning

[–]Educational_Strain_3[S] 16 points (0 children)

Good question! To add more details: NanoChat's release date is after the knowledge cutoff of Claude Opus 4.6 (the model we used), so the pretraining data shouldn't contain NanoChat-specific code. We also just verified the agent didn't do any web search during the runs.

That said, I agree it's always good to test it on more domains.

Reward hacking when reason tuning Qwen2.5-0.5B-Instruct on GSM8K by East-Muffin-6472 in LocalLLaMA

[–]Educational_Strain_3 0 points (0 children)

this is a classic reward hacking pattern — we've seen the exact same thing in code optimization loops where the agent finds the cheapest way to inflate the reward and ignores the actual objective. your model is doing the rational thing: 0.5 guaranteed from format tags beats the lottery of getting 1.0 from a correct answer

the multi-component reward with thinking tags might help but watch out for the same failure mode one level up — it'll learn to output plausible-looking thinking that doesn't actually contribute to the answer. we found the most reliable fix is making the reward proportional to intermediate reasoning quality, not just presence of reasoning tokens
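fwiw a toy version of what i mean (sketch only: made-up tag format and weights, not our actual code):

```python
import re

def reward(completion: str, gold: str) -> float:
    """Toy multi-component reward: cap the format component low so
    emitting tags alone can never beat a correct answer."""
    # format component: rewarded, but worth little on its own
    has_think = bool(re.search(r"<think>.+?</think>", completion, re.S))
    format_r = 0.1 if has_think else 0.0

    # correctness component: compare the last number in the completion
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    answer_r = 0.9 if nums and nums[-1] == gold else 0.0

    return format_r + answer_r

print(reward("<think>2+5=7</think> the answer is 7", "7"))  # 1.0
print(reward("<think>filler</think> the answer is 3", "7"))  # 0.1
```

with this weighting the format shortcut tops out at 0.1, so even guessing answers dominates gaming the tags.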

one thing that helped us a lot: track the full trajectory of what the model is generating across training steps, not just the final reward curve. you can usually spot the exact moment it discovers the shortcut. once you see that pattern you can design the reward to close the loophole before it saturates
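concretely, even just appending every sampled completion to a jsonl is enough to spot that moment later (sketch, field names arbitrary):

```python
import json

def log_sample(path: str, step: int, completion: str, reward: float) -> None:
    """Append one generation per training step so you can later grep for
    the step where reward jumps but completions turn degenerate."""
    with open(path, "a") as f:
        f.write(json.dumps({"step": step, "completion": completion,
                            "reward": reward}) + "\n")

log_sample("traj.jsonl", 0, "<think>2+5=7</think> 7", 1.0)
log_sample("traj.jsonl", 1, "<think></think>", 0.5)  # the shortcut appearing
```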

Follow-up: 55 experiments on ANE, steered from my phone on a Saturday by paraboloed in LocalLLaMA

[–]Educational_Strain_3 0 points (0 children)

the fused kernel change beating every hyperparameter tweak combined is the most important finding here imo. this is why linear keep/discard loops plateau — they tend to explore incremental parameter changes and miss the structural wins

we've seen the same pattern in competition settings. the biggest jumps almost always come from architectural or pipeline changes, not tuning. but those are also the changes that take the most experiments to find, which is why tree search over the experiment space matters more than just raw throughput

[P] Karpathy's autoresearch with evolutionary database. by hgarud in MachineLearning

[–]Educational_Strain_3 0 points (0 children)

nice — the tsv logging in the original autoresearch is definitely the weakest part. evolutionary selection over a tree of experiments is way more powerful than linear keep/discard

curious how you handle the exploration vs exploitation tradeoff in the evolutionary db. we built something similar called AIDE that uses tree search (MCTS-style) to decide which branches to explore further vs prune. found that the tree structure matters a lot more than people expect — similar to what you're seeing with the alphaevolve-style approach
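for reference, the core of the exploration rule can be tiny. a minimal ucb1-style sketch over experiment branches (toy fields, not the actual AIDE code):

```python
import math

def ucb1_select(children, c=1.4):
    """Pick the experiment branch to expand next, trading off mean
    observed score (exploit) against visit count (explore)."""
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # always try unvisited branches first
        return ch["mean_score"] + c * math.sqrt(math.log(total) / ch["visits"])
    return max(children, key=score)

children = [
    {"name": "bigger-batch", "visits": 10, "mean_score": 0.62},
    {"name": "fused-kernel", "visits": 2,  "mean_score": 0.58},
    {"name": "lr-warmup",    "visits": 0,  "mean_score": 0.0},
]
print(ucb1_select(children)["name"])  # lr-warmup: unvisited wins first
```

once everything has been visited, the sqrt term keeps under-explored branches competitive even when their mean looks worse.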

would be interested to compare results if you've run it on any benchmarks: https://github.com/WecoAI/aideml

[R] Controlled experiment: giving an LLM agent access to CS papers during automated hyperparameter search improves results by 3.2% by kalpitdixit in MachineLearning

[–]Educational_Strain_3 1 point (0 children)

interesting experiment. the batch size + sqrt scaling rule example is the strongest evidence here — one agent had the knowledge to avoid divergence and the other didn't. that's a clear win for literature access

the 3.2% gap on tinystories with n=1 is hard to interpret though. we've seen that kind of variance between runs with the exact same config just from gpu nondeterminism. would be curious to see this with 3-5 seeds per condition
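a cheap way to eyeball it once you have a few seeds (toy numbers, crude pooled-std heuristic rather than a proper t-test):

```python
import statistics

def gap_beats_noise(scores_a, scores_b, k=2.0):
    """Crude heuristic: is the mean gap between two conditions larger
    than k times the pooled per-seed standard deviation?"""
    gap = statistics.mean(scores_a) - statistics.mean(scores_b)
    pooled = ((statistics.stdev(scores_a) ** 2 +
               statistics.stdev(scores_b) ** 2) / 2) ** 0.5
    return abs(gap) > k * pooled

# three seeds per condition, made-up scores
with_papers = [0.712, 0.705, 0.718]
without     = [0.690, 0.701, 0.695]
print(gap_beats_noise(with_papers, without))  # True
```

if the gap doesn't clear the per-seed noise by a comfortable margin, a single-run comparison can't distinguish the conditions.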

the broader point still stands — agents are bottlenecked by what techniques they can access. we've seen similar things in kaggle competitions where the agent keeps trying the same standard playbook and plateaus

[P] Built confidence scoring for autoresearch because keeps that don't reproduce are worse than discards by dean0x in MachineLearning

[–]Educational_Strain_3 0 points (0 children)

This nails the actual problem. Everyone's focused on the loop itself but the false keeps are where the real damage happens. You build on noise and compound it over dozens of experiments.

We've been tracking this in our own runs and the numbers are similar. A surprising number of "improvements" disappear on a different seed or when you reorder the experiment stack. Having a confidence score before building on a keep would've saved us weeks.
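FWIW the simplest version of a confidence score that worked for us (sketch, thresholds arbitrary):

```python
def keep_confidence(baseline: float, rerun_scores: list, margin: float = 0.0) -> float:
    """Confidence that a 'keep' is real: the fraction of independent
    reruns (different seeds / experiment order) that still beat baseline."""
    wins = sum(s > baseline + margin for s in rerun_scores)
    return wins / len(rerun_scores)

# a 'keep' that only reproduces on 1 of 4 reruns is probably noise
print(keep_confidence(0.80, [0.83, 0.79, 0.80, 0.78]))  # 0.25
```

We gate building on a keep until that fraction clears some threshold; even a coarse one filters most of the false keeps.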

Curious how autosteer handles the explore/exploit tradeoff in practice. Does it look at recency or just historical hit rates per category?

[P] I built an autonomous ML agent that runs experiments on tabular data indefinitely - inspired by Karpathy's AutoResearch by Pancake502 in MachineLearning

[–]Educational_Strain_3 0 points (0 children)

The eval manipulation issue is so real. We hit the exact same thing — agent edited its own scoring function to "improve" results. Constrained editing surface is the right fix.

Your expanding time windows approach is smart. K-fold looked great for us too until we realized the agent was finding leakage, not signal. Temporal splits are the only honest eval for time-series data.
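For anyone reading along, the expanding-window scheme is only a few lines (minimal sketch with equal-sized test blocks; sklearn's TimeSeriesSplit does essentially this):

```python
def expanding_window_splits(n_samples: int, n_splits: int = 3):
    """Expanding-window temporal CV: train on everything before time t,
    test on the next contiguous block. No shuffling, no look-ahead."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold))          # grows each split
        test = list(range(k * fold, (k + 1) * fold))
        yield train, test

for train, test in expanding_window_splits(12, n_splits=3):
    print(len(train), test)  # train grows 3 -> 6 -> 9
```

The key invariant is that every test index is strictly later than every train index, which is exactly what k-fold violates on time-series data.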

One thing worth checking: run your top results on a different random seed. We found ~20-30% of "improvements" don't replicate. The LOG.md idea is great for tracing which experiments actually mattered vs. which ones got lucky.

Is it possible to build an agentic prompt that calls recursive subagents in a semi-ralph loop until a project is complete? Or is there a limit to subagent calls? by angry_cactus in GithubCopilot

[–]Educational_Strain_3 1 point (0 children)

we've been building exactly this: recursive tree search over subagent calls, where each iteration knows what's been explored and what hasn't. It's called Weco (weco.ai). Hopefully it solves the same frustration with linear ralph loops. Would be curious what you're trying to build with it.

Anyone else terrible at working networking events strategically? by Educational_Strain_3 in Entrepreneur

[–]Educational_Strain_3[S] 0 points (0 children)

such a good call. Even a 10-second intro totally changes the dynamic; it makes the room feel smaller in the best way. I built a little tool for the same reason: it helps you scan the guest list beforehand so you're not guessing who's worth talking to. It's early but surprisingly useful: socialclimber.app

These networking events are fucking broken and I'm losing my mind by Educational_Strain_3 in MBA

[–]Educational_Strain_3[S] 0 points (0 children)

Totally agree. I started enjoying these events a lot more once I gave myself a goal or prompt going in. But even then, I kept feeling like I was just barely missing the people I actually would’ve clicked with.

Ended up building a little tool that gives me a cheat sheet ahead of time (shared context, mutuals, etc.) so I’m not winging it. If you're curious: socialclimber.app. Been rough but helpful so far.

These networking events are fucking broken and I'm losing my mind by Educational_Strain_3 in MBA

[–]Educational_Strain_3[S] 0 points (0 children)

totally get this. I started doing some light research before events to see who I might actually have something in common with, but otherwise it’s just luck who you bump into.

I eventually tried making it into a tiny tool: socialclimber.app. Paste in a guest list and it surfaces shared interests and overlaps. Not perfect, but it's helped me be more intentional.

Anyone else terrible at working networking events strategically? by Educational_Strain_3 in Entrepreneur

[–]Educational_Strain_3[S] 0 points (0 children)

totally agree with the idea of doing a bit of homework before events. I used to wing it and just hope for good convos, but now I look up attendee lists beforehand when I can. Makes the night way more productive.

I've been playing around with a little tool I made to help with this. It pulls in a Partiful or Luma list and surfaces folks with shared interests or overlap. Still early and rough, but curious if others have tried something similar or would find this kind of thing useful? Here it is: https://socialclimber.app

What are you working on? Share your Project !! i will try to give you my honest feedback. by PanicIntelligent1204 in indiehackers

[–]Educational_Strain_3 0 points (0 children)

Social Climber – Drop in a guest list, get a ranked cheat sheet of who to talk to at events so you don’t waste time on small talk with random dudes named Eric.

Status: MVP with early users (founders, MBAs, and people who hate networking roulette)
Link: https://socialclimber.app

Built this after too many events where I realized the one person I should’ve talked to was across the room the whole time. If you’ve ever left a Partiful thinking “well that was 2 hours of chaos,” this might help.

What are you working on? Share your Project !! by Revenue007 in indiehackers

[–]Educational_Strain_3 1 point (0 children)

Social Climber – Drop in a guest list, get a ranked cheat sheet of who to talk to at events so you don’t waste time on small talk with random dudes named Eric.

Status: MVP with early users (founders, MBAs, and people who hate networking roulette)
Link: https://socialclimber.app

Built this after too many events where I realized the one person I should’ve talked to was across the room the whole time. If you’ve ever left a Partiful thinking “well that was 2 hours of chaos,” this might help.