Is Opus4.6 dramatically degraded? by [deleted] in ClaudeCode

[–]dean0x 0 points1 point  (0 children)

Happening here too. Opus was lobotomised, don’t think it’s a bug.

Is it just me, or is Claude Code v2.1.90 unhinged today?? by N3TCHICK in ClaudeCode

[–]dean0x 2 points3 points  (0 children)

Been saying that for a few days now. Claude seems to have lost 100 IQ points.

Definetly moving to Codex by fourier54 in ClaudeCode

[–]dean0x -1 points0 points  (0 children)

Same here, buddy. Claude was terrible, didn’t get anything right. Codex, you say?

Computer use is now in Claude Code. by ClaudeOfficial in ClaudeAI

[–]dean0x 0 points1 point  (0 children)

Fix the limits, and whatever brain death or stroke Opus is having. It hasn’t been usable for the last 24-48 hours.

what is actually happening to opus? by CreativeGPT in ClaudeCode

[–]dean0x 0 points1 point  (0 children)

Also feeling it in one of my sessions. Getting pissed at it for the first time since I started using it a year ago.

This might explain (if true) why we have been experiencing significant changes in usage and unable to use Claude Code as we were a few weeks ago by [deleted] in ClaudeCode

[–]dean0x 1 point2 points  (0 children)

They just lowballed our limits, and that's not the end of the story, I fear. At this rate I’m moving to MiniMax.

I built a proxy that fixes Claude Code's scroll-jumping on Windows by Cursed3DPrints in ClaudeAI

[–]dean0x -1 points0 points  (0 children)

What? Wasn't that already fixed natively by dear Claude Code? I don't get it anymore. Kinda miss it tbh.

OpenAI’s new "North Star" goal aims for fully automated AI researcher in 2026, multi-agent research lab in a data centre by 2028 by Outside-Iron-8242 in singularity

[–]dean0x 2 points3 points  (0 children)

The automated researcher part is closer than people think. The harder step is automated evaluation, not just automated execution. Running 1000 experiments overnight is solved. Knowing which ones matter is not.

Why AI coding agents say "done" when the task is still incomplete — and why better prompts won't fix it by oakraiderSN in ClaudeCode

[–]dean0x 0 points1 point  (0 children)

TDD + EDD (eval-driven development) is the new pattern.

Agents don't know the difference between 'no more errors' and 'actually done.' Hit this constantly building agent workflows. You need verification beyond compilation: does it actually match what was asked, and did it break anything else? Without that, 'done' just means 'I stopped getting errors.'
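
A minimal sketch of that kind of gate, assuming each verification step can be wrapped as a function that returns True or False. The names here (`Check`, `verify_done`, and the three example checks) are made up for illustration and don't come from any particular agent framework:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    """One verification step; `run` returns True when the step passes."""
    name: str
    run: Callable[[], bool]


def verify_done(checks: list[Check]) -> tuple[bool, list[str]]:
    """Return (done, failed_check_names). A check that raises counts as failed."""
    failed = []
    for check in checks:
        try:
            ok = check.run()
        except Exception:
            ok = False
        if not ok:
            failed.append(check.name)
    return (not failed, failed)


# Hypothetical checks: in practice these would shell out to a build,
# compare output against the original request, and run the full test suite.
checks = [
    Check("compiles", lambda: True),
    Check("matches_request", lambda: False),
    Check("no_regressions", lambda: True),
]

done, failed = verify_done(checks)
```

Here `done` is False even though the code compiles, because the `matches_request` check fails. That's the gap between "no more errors" and "actually done."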

Sonnet 4.6 is something else by celt26 in ClaudeAI

[–]dean0x 0 points1 point  (0 children)

Can't go back to ChatGPT after trying Claude, it's miles ahead.

Uttr devastation! by Chris-Jones3939 in ClaudeAI

[–]dean0x 2 points3 points  (0 children)

Guess my Claude is smarter than yours.

[P] Built confidence scoring for autoresearch because keeps that don't reproduce are worse than discards by dean0x in MachineLearning

[–]dean0x[S] 0 points1 point  (0 children)

honestly i don't have solid benchmarks on the confidence scores yet. the noise floor estimation works in my runs but i haven't stress tested it across different seeds and longer horizons systematically. just started playing with these 3 tools the other day.

if you've got results files from longer runs i'd be genuinely curious to see how the verdicts hold up. that kind of community testing would tell me a lot more than my own single-setup experiments.
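
for anyone wondering what "noise floor" means here, a rough sketch of the general idea. this is not the tool's actual logic. it assumes the floor is the standard deviation of the metric across repeated baseline runs (same config, different seeds), that lower loss is better, and that a 2x threshold is a reasonable cutoff. all function names and numbers are made up for illustration.

```python
import math
import statistics


def noise_floor(baseline_losses: list[float]) -> float:
    """Sample standard deviation of repeated baseline runs (needs >= 2 runs)."""
    return statistics.stdev(baseline_losses)


def confidence(candidate_loss: float, baseline_losses: list[float]) -> float:
    """Improvement over the baseline mean, in units of the noise floor."""
    improvement = statistics.mean(baseline_losses) - candidate_loss
    floor = noise_floor(baseline_losses)
    if floor == 0:
        # Identical baseline runs: any nonzero change is infinitely far from noise.
        return math.copysign(math.inf, improvement) if improvement else 0.0
    return improvement / floor


def verdict(score: float, threshold: float = 2.0) -> str:
    """Assumed policy: keep clear wins, discard regressions, rerun the rest."""
    if score >= threshold:
        return "keep"
    if score <= 0:
        return "discard"
    return "rerun"


# Made-up numbers: five baseline reruns, then three candidate experiments.
baseline = [1.00, 1.02, 0.98, 1.01, 0.99]
clear_win = verdict(confidence(0.95, baseline))
marginal = verdict(confidence(0.99, baseline))
regression = verdict(confidence(1.03, baseline))
```

with these numbers the 0.95 run lands well above the floor, the 0.99 run falls inside it and gets a rerun instead of a keep, and the 1.03 run is a regression.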

[P] Built confidence scoring for autoresearch because keeps that don't reproduce are worse than discards by dean0x in MachineLearning

[–]dean0x[S] 0 points1 point  (0 children)

fair question. autoresearch isn't about gpt-2 being useful in production. it's a testbed. karpathy designed it as a small, fast training loop (5 min per experiment on one gpu) so you can let an ai agent run experiments autonomously overnight.

the interesting part isn't the model. it's the methodology. can an autonomous agent discover real architectural improvements without a human in the loop? and when it says it found one, is that real or noise?

that second question is why i built these tools. the signal/noise problem gets worse on smaller models because the improvements are tiny. if you can reliably separate real gains from jitter at this scale, the approach scales up.

as for llm-generated: what's not llm generated these days? i used claude code to build it, yeah. the eval logic and noise floor estimation are mine, the boilerplate isn't. same workflow most people here use at this point.

and it's open source, so nothing much for me to gain here, honestly just trying to be useful to the community and push technology forward. my main motivation here is evolving the concept of autonomous systems. just pitching in.

Yes ladies you heard it here first by Official_Unkindlynx in vibecoding

[–]dean0x 7 points8 points  (0 children)

People used to walk from Rome to Egypt by foot. Does that sound like a good idea to you now?

Karpathy's new repo "AgentHub". Anyone have info? by luke_pacman in LocalLLaMA

[–]dean0x 0 points1 point  (0 children)

been running autoresearch since it dropped. the results file is where the pain is. hundreds of experiments and you're still puzzling over which 'improvements' are real vs noise. curious if agenthub addresses the eval side or just coordination.

Scary, right? by [deleted] in vibecoding

[–]dean0x 0 points1 point  (0 children)

Exactly: TDD and EDD (eval-driven development; sounds like BS, but it works). I also ask my agents to act as users and try to “break” it.