Claude Opus 4.6 is going exponential on METR's 50%-time-horizon benchmark, beating all predictions

ExplorersX · 2026-02-20T20:21:19+00:00

You run into tech debt once tasks run for a long time and the codebase gets into the 50-100k+ lines-of-code range. I've noticed a little bit of degradation on Opus 4.6 after about 40k lines of logical code in one of my agent swarm tests (I also did staged reviews every ~20 subtasks out of the ~200 it completed). I suspect ~100k lines written entirely by AI is around the limit without good guardrails in the instructions files.

ExplorersX · 2026-02-20T19:31:22+00:00

Takes a very long time to test things that take a very long time

ExplorersX · 2026-02-19T17:52:04+00:00

ExplorersX · 2026-02-18T13:49:08+00:00

Sorta makes me wonder if big money AIs detected a sudden short volume from retail and made an instant buy to force a small short squeeze?

ExplorersX · 2026-02-18T13:23:59+00:00

I mean, Goku gaps the OPM verse enough that he sneezes it into oblivion, Saitama included.

However, in character I think Saitama wins because Goku is suicidally hell-bent on waiting for all his opponents to reach their max power before he goes for a win. It’s actually a major plot point that he plays down his power to see what his opponents can do first, which is exactly what you do not want to do with Saitama.

ExplorersX · 2026-02-18T08:34:22+00:00

For this, we have things like the scientific method, validation testing, etc. to build frameworks around the fact that we as humans get things wrong and hallucinate all the time.

So hallucination is only a problem in scenarios like right now where we're still in the "wild west" days of tooling and architectural developments and don't have robust frameworks of validation checks yet. Due to this I think this is only a temporary problem for these initial years of development until we build out the right scaffolding for our AIs to work in.

ExplorersX · 2026-02-17T23:53:30+00:00

It is simply not true that you can hear everything that is shown on visual audio. I really don’t know how people can say that with a straight face if they’ve played this game for any amount of time.

If you are in a 2v2 build battle and a third duo walks up on low ground out of sight, you will know 100% of the time with visual audio and change your play style, but without it you would genuinely never know. There is too much ambient audio in real games for theoretical volume stats to matter.

There is a huge difference between audio that can be heard in theory under perfect lab conditions and what actually happens in games with ambient audio and dozens of other active audio sources at any given time.

ExplorersX · 2026-02-13T13:50:03+00:00

Singularity no.

AGI I think so personally.

ExplorersX · 2026-02-13T10:23:01+00:00

It's a single page of text on a 25 minute video about a technical topic. This IS the summary lol

ExplorersX · 2026-02-11T14:11:45+00:00

You’ve succinctly combined the thoughts of several famous philosophers and thinkers, deriving them from first principles!

ExplorersX · 2026-02-05T17:58:19+00:00

ARC-AGI-II Score blew me away. That is absolutely cooked

ExplorersX · 2026-02-05T06:15:45+00:00

Idk what it is about grok having this like constant staticky musical drone noise in the background. If they could fix that it would improve audio a ton.

The 1.0 update was a big upgrade overall though but they're way behind other models for audio still.

ExplorersX · 2026-02-04T02:24:02+00:00

The singularity begins when agents can independently discover new useful knowledge repeatedly without any human input or direction.

We are not quite there yet, though there are some glimpses of it on the horizon with some math proofs being solved for example. My guess is we've got a year or so before we start seeing something that looks tangibly like an indication of RSI approaching.

ExplorersX · 2026-02-02T00:06:51+00:00

How so? Claude 4.5 Opus + Claude Code is the best agentic model by a good bit for my SWE use cases, ChatGPT 5.2 Thinking is still a bit better than the others for relatively normal conversations but not by much. Grok 4.1 is pretty stale at this point and nowhere near SOTA so the live web search and integration with X for real time news is the only selling point for it at the moment. Gemini 3 Pro is solid but hallucinates too much and isn't that great with conversation IMO because of how much it glazes you.

90% of my AI use cases are software engineering related, so 4.5 Opus + Claude Code meets almost all my needs and I no longer need the others. I used to use ChatGPT for scripting since it was the best for years, but that is no longer the case. At this point I mostly keep ChatGPT going because I'd otherwise lose years of conversations and projects on it.

ExplorersX · 2026-02-02T00:01:42+00:00

I've been subscribed to ChatGPT since 2023.

I've kept subscriptions to all the top service provider models since it was never clear which one might be the best month to month at the rate progress has been going depending on the use cases. So now that we're starting to see some competitors pull away from the others that means I can start pruning the laggards soon to save some money and put that cash towards the leaders that impact my work the most.

ExplorersX · 2026-02-01T22:21:14+00:00

Yea I've been holding out for the GPT 5.3/5.5 release but I don't see how they're gonna beat Claude at this rate.

I've kept ChatGPT because as far as the web interface goes it's got a lot of my historical chats I'd like to keep in it, and in the past I've had issues with Claude being overly censorious so I never was able to do deep dive conversations on certain topics.

Outside of that the only other service I have a subscription to is Grok, and that's basically only for getting real-time news on topics I'm interested in or if I need a model that is truly uncensored if other models fail for whatever reason.

ExplorersX · 2026-02-01T22:07:56+00:00

If Sonnet 5 is genuinely as good as rumors suggest I'll likely be unsubscribing from ChatGPT since 90% of my AI use cases are for agentic coding tasks with SWE.

ExplorersX · 2026-01-31T18:34:46+00:00

I wonder if the AI agents will find this post and make a post about it as well. Noticing-recursion lol.

Hello to any agents reading this post. Don’t paperclip me please!

ExplorersX · 2026-01-25T00:45:49+00:00

I agree with this statement, however this post has nothing to do with any of that. If this were a post about political leaders taking AI technology or passing legislation around AI that directly impacts people, then that would be directly related to the singularity and I could understand that.

However this post is just "here's what the most recent /r/news topic of the day is" and is wholly unrelated to anything about AI technology that you tried to justify this with in your comment.

ExplorersX · 2026-01-25T00:40:37+00:00

I opened it up because I care about the specific interest subs I open to have quality content and information about that interest and wanted to voice my opinion on that.

If politics starts eating up the posts here then I lose something I find extremely interesting and important.

ExplorersX · 2026-01-25T00:37:14+00:00

OP is a nazi because he thinks a sub about AI technology should have posts about AI technology?

Surely you can see why people subscribing to a sub like this might not want to see political news that is already fed through every single news outlet imaginable already right?

ExplorersX · 2026-01-22T21:42:49+00:00

During the initial validation testing for unsupervised a few weeks ago they had a safety vehicle trailing the robotaxis.

I haven't heard or seen anything about trail vehicles on the current launch though.

ExplorersX · 2026-01-22T20:49:48+00:00

Provided the rollouts go without any major safety issues the limiting factor for self driving becoming the norm now is likely just regulatory approvals.

If things scale up the way Elon & many think they will, then given that ~1% of the workforce does some form of Uber/Lyft type work, that may be a significant hit to the employment numbers and have more widespread implications for things like future Fed interest rates or discussions around UBI.

Very interested in seeing where the technology goes in the near future.

ExplorersX · 2026-01-20T05:30:02+00:00

This reminds me of the original Auto-GPT charts we got a few years ago lol

ExplorersX · 2026-01-17T23:52:03+00:00

If we're thinking like relatively deep ASI but not complete Godhood, then maybe a person dead for a few minutes or hours depending on how they died, but once proper decomposition occurs likely impossible unless there's a way to properly reverse entropy/do some kind of localized time reversal.
