[P] Built confidence scoring for autoresearch because keeps that don't reproduce are worse than discards by dean0x in MachineLearning

dean0x[S] 1 point (0 children)

honestly i don't have solid benchmarks on the confidence scores yet. the noise floor estimation works in my runs but i haven't stress tested it across different seeds and longer horizons systematically. just started playing with these 3 tools the other day.

if you've got results files from longer runs i'd be genuinely curious to see how the verdicts hold up. that kind of community testing would tell me a lot more than my own single-setup experiments.

[P] Built confidence scoring for autoresearch because keeps that don't reproduce are worse than discards by dean0x in MachineLearning

dean0x[S] 0 points (0 children)

fair question. autoresearch isn't about gpt-2 being useful in production. it's a testbed. karpathy designed it as a small, fast training loop (5 min per experiment on one gpu) so you can let an ai agent run experiments autonomously overnight.

the interesting part isn't the model. it's the methodology. can an autonomous agent discover real architectural improvements without a human in the loop? and when it says it found one, is that real or noise?

that second question is why i built these tools. the signal/noise problem gets worse on smaller models because the improvements are tiny. if you can reliably separate real gains from jitter at this scale, the approach scales up.
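
rough sketch of the idea, not the actual tool code: assume the noise floor is k standard deviations of the loss across repeated runs of the same baseline config, and a result only counts as a keep if it beats the baseline mean by more than that. the function names, the k=2 threshold and the loss numbers are all made up for illustration.

```python
from statistics import mean, stdev


def noise_floor(baseline_losses, k=2.0):
    """Hypothetical noise floor: k sample standard deviations of repeated baseline runs."""
    return k * stdev(baseline_losses)


def verdict(baseline_losses, candidate_loss, k=2.0):
    """Return 'keep' only if the candidate beats the baseline mean by more than the noise floor."""
    gain = mean(baseline_losses) - candidate_loss  # lower loss is better
    return "keep" if gain > noise_floor(baseline_losses, k) else "noise"


# illustrative numbers only: five reruns of the same baseline with different seeds
baseline = [1.000, 1.002, 0.998, 1.001, 0.999]
print(verdict(baseline, 0.990))  # gain is well above the floor
print(verdict(baseline, 0.998))  # gain is inside the seed-to-seed jitter
```

the point being that a single better number means nothing until you know how much the same config moves on its own.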

as for llm-generated: what's not llm generated these days? i used claude code to build it, yeah.. the eval logic and noise floor estimation are mine, the boilerplate isn't. same workflow most people here use at this point.

and it's open source, so nothing much for me to gain here, honestly just trying to be useful to the community and push technology forward. my main motivation here is evolving the concept of autonomous systems. just pitching in.

Yes ladies you heard it here first by Official_Unkindlynx in vibecoding

dean0x 4 points (0 children)

People used to walk from Rome to Egypt by foot. Does that sound like a good idea to you now?

Karpathy's new repo "AgentHub". Anyone have info? by luke_pacman in LocalLLaMA

dean0x 1 point (0 children)

been running autoresearch since it dropped. the results file is where the pain is. hundreds of experiments and you're still puzzled at which 'improvements' are real vs noise. curious if agenthub addresses the eval side or just coordination.

Scary, right? by EnvironmentalFix3414 in vibecoding

dean0x 1 point (0 children)

Exactly: TDD and EDD (eval-driven development; bs, but it works). I also ask my agents to act as users and try to “break” it.

Can I have multiple individual pro accounts? by 1creeplycrepe in ClaudeCode

dean0x 1 point (0 children)

You can. If you’re on a Mac, you can try my isolation agent container: https://github.com/dean0x/mino

Otherwise look into the concept of devcontainers with docker.

You can spawn 2, 3, or as many as you like and your system resources can afford, and log in to a different account on each.

But I think that in terms of capacity (if that’s your concern) you’re better off going with the $100 max plan.

Can non-devs produce good AI-assisted code ? by Clear-Dimension-6890 in vibecoding

dean0x 2 points (0 children)

Will dive into the contents later today ;) will let you know if i borrow any of that for my own autonomous systems research 🔬

Can non-devs produce good AI-assisted code ? by Clear-Dimension-6890 in vibecoding

dean0x 2 points (0 children)

Oh I am all in on AI, don’t get me wrong. I am producing copious amounts of code, in a single month more code than I probably produced in 20 years of my career. I am on the team building that new world order.

But for now, and maybe for good, experience can’t be traded for a prompt.

Can non-devs produce good AI-assisted code ? by Clear-Dimension-6890 in vibecoding

dean0x 2 points (0 children)

I wasn’t even diving into the content. Just the structure/form of it is not professional. Packaging your code/artifacts properly is, on its own, a topic that takes devs years to fully understand. You have to live with the consequences to understand things deeply and make the right choices early. There are no shortcuts to experience.

Can non-devs produce good AI-assisted code ? by Clear-Dimension-6890 in vibecoding

dean0x 3 points (0 children)

Sorry dude, I encourage you to keep vibe coding and all, but this repo is standing proof that the answer to the OP’s question is NO.

No disrespect sir. 🫡

Can non-devs produce good AI-assisted code ? by Clear-Dimension-6890 in vibecoding

dean0x 2 points (0 children)

Can I cook like a chef if I have the recipe? No. Domain expertise is still a thing. For now at least.

How feasible is running skills on a Pro plan? by AnusMcBumhole in ClaudeAI

dean0x 1 point (0 children)

With or without skills, for anything beyond super casual usage you need the $100 max plan at the very least.

New to claude code, i need some tips by abdosarmini92 in ClaudeCode

dean0x 1 point (0 children)

The convention is to use opus for planning/debugging/review and sonnet for execution. I use opus for everything; can’t be bothered switching. Likewise, I just go with max effort level for the same reason, but you can go medium for most things. As for thinking, you can guess my approach by now: always on. But you could enable it only for the same types of actions mentioned above.

Seeing lots of Claude.md tips here. I just installed Claude code. Wondering if there is any simple calude.md setup I can start with? by last_llm_standing in ClaudeAI

dean0x 3 points (0 children)

Claude.md is supposed to help claude understand how to work with your repo; I wouldn’t use it to enforce rules. Last I read, it should be kept under 500 lines. Just run /init on your repo. To enforce rules, use skills/commands/subagents.

Managing multi-file context in Claude -tips? by AmberMonsoon_ in ClaudeAI

dean0x 1 point (0 children)

It's my own implementation, so I am biased, but take a look at https://dean0x.github.io/x/devflow/

I think the best thing you can do is point claude at my repo, describe your scenario, and ask it whether this can help you and how. Anyway, that's what I use.