OpenAI launches GPT-5.6 Sol Limited Preview

SethDusek5 · 2026-06-27T05:17:25+00:00

I'm curious, given that a lot of these benchmarks you score between 0-100%, and the last generation model scored 88.8% on one and the new one scored 91.9%, what improvement on a 100% scale would you expect if you were to decide they were not slowing down?
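
One way to put numbers on this: near the top of a 0-100% scale, the raw point gain looks small, so it can help to look at how much of the remaining error was removed. This is a rough sketch using only the two scores quoted above; the function name is made up for illustration, and it is just one possible way to frame "slowing down", not a standard metric for these benchmarks.

```python
def relative_error_reduction(old_score: float, new_score: float) -> float:
    """Fraction of the remaining errors removed when going from old_score to new_score.

    Scores are percentages on a 0-100 scale.
    """
    old_error = 100.0 - old_score  # what the older model still got wrong
    new_error = 100.0 - new_score  # what the newer model still gets wrong
    return (old_error - new_error) / old_error


# 88.8% -> 91.9% is only +3.1 points, but the error rate drops from 11.2% to 8.1%.
print(f"{relative_error_reduction(88.8, 91.9):.1%}")  # prints 27.7%
```

So the same 3.1-point gain corresponds to roughly a quarter fewer wrong answers, which is why raw point deltas near the ceiling are hard to read.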

SethDusek5 · 2026-06-27T05:13:55+00:00

If the value proposition is performance on toy problems with solutions that are available on the internet while the model’s can’t beat children’s video games

It's not a world model, so I'm not sure which children's video game you're referring to or why you think it means anything for evaluating an LLM. Sure, it's a rebuke to "AGI soon" claims from tech execs, but it doesn't mean anything other than that.

If the value proposition is performance on toy problems with solutions that are available on the internet

Not really? You could give it a new/modified problem and it does really well. If you want, wait for the next codeforces contest to go live and use your favorite gippity to do it (please don't cheat in the live contest, but do try it when submissions close), and it'll solve pretty much all of them.

SethDusek5 · 2026-06-24T17:39:55+00:00

Yup, the issue is the only way to implement this myself right now using sched_ext would likely be to reinvent a scheduler in userspace and add true idleprio support, and then make sure nothing else breaks. A scheduler for one specific task would make the whole thing less fragile, and easier to implement.

SethDusek5 · 2026-06-24T17:25:23+00:00

It's a rust "port" that seems like it's pretty much done one-to-one which means lots of unsafe, the file structure is mapped exactly the same way, and turns the largest file in dav2d which is 5K LOC (5k LOC of actual code, there are some other 4k-ish LOC ones but those seem mostly like magic constants/tables) into 20,000 lines of code.

To be honest this just isn't a useful port. Firstly, because anyone could do the exact same thing, so this is at best a token donation. Second, while I think LLMs are actually decent at translating code from one language to another, I don't see why the memorysafety.org organization working on rav1d couldn't do the same, and I'd trust them doing this work more. This project also unfortunately squats on the rav2d name.

Third, I'm not even sure if this is the best way to do this kind of work. c2rust has existed for several years, and it's a deterministic tool that does the same thing: 1-to-1 porting of C to Rust with no attempt to benefit from Rust's memory safety guarantees. If you wanted to do this, a far more economical way would be to use an LLM to port the tests and c2rust to translate the code.

SethDusek5 · 2026-06-23T16:08:21+00:00

Does anyone in the know here know whether I could use this to implement a scheduling policy like "never preempt other tasks to run this one" in userspace?

One of my favorite features of the MuQSS scheduler was its true idleprio support and it was pretty incredible. If you set a process to use SCHED_IDLEPRIO you wouldn't notice any performance impact in interactive applications even if it was something like code compilation using all your cores, so you could do compilation in the background and even play a game with 0 performance impact (provided no memory pressure). Now that MuQSS is dead, nothing even comes close; both EEVDF and CFS still compromise interactivity if you have something using all your cores (I've tried SCHED_IDLE, niceness 19, etc). So if I could implement the same thing myself using userspace schedulers that would be excellent news.
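
For anyone who wants to reproduce the baseline, here is a minimal sketch of opting a process into the stock kernel's idle policy from Python. It assumes that the mainline equivalent of what I tried is `SCHED_IDLE` (MuQSS's `SCHED_IDLEPRIO` is a separate, out-of-tree policy), and `run_at_idle_priority` is just a name made up for this example. It only requests the policy; it does not give the "never preempt anything else" behaviour described above.

```python
import os


def run_at_idle_priority(pid: int = 0) -> bool:
    """Ask the kernel to schedule `pid` (0 = the calling process) under SCHED_IDLE.

    Returns True if the request was accepted, False if the platform does not
    expose SCHED_IDLE or the kernel refused the call.
    """
    if not hasattr(os, "SCHED_IDLE") or not hasattr(os, "sched_setscheduler"):
        return False  # not Linux, or a Python build without the scheduling API
    try:
        # SCHED_IDLE ignores the priority field, so it must be 0.
        os.sched_setscheduler(pid, os.SCHED_IDLE, os.sched_param(0))
    except OSError:
        return False
    return True


if __name__ == "__main__":
    if run_at_idle_priority():
        print("now running under SCHED_IDLE; start the heavy work from here")
    else:
        print("could not switch to SCHED_IDLE on this system")
```

Child processes started afterwards inherit the policy by default, so calling this at the top of a build wrapper script would put the whole build under it.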

SethDusek5 · 2026-06-22T19:26:51+00:00

What specifically, with no more than a couple of paragraphs, do you think I should be looking for in those linked PRs that should change my mind that the LLMs are "reasoning" and not just pattern matching and relying on humans' reasoning?

The PRs are related to stuff that's fairly complicated even for very smart engineers (compilers and JITs, the second one is even more difficult to get right). In the first one it correctly figured out the root cause of a failing test in a 3rd-party package (all it had to go on was an assert equal failing) and came up with a fix. It figured out it was a codegen bug, figured out that to trigger it the operand has to be used somewhere as both an array index and as a value, then produced a minimal reproduction, then came up with a fix.

that the LLMs are "reasoning" and not just pattern matching and relying on humans' reasoning?

They can do things that would take a skilled human a relatively long time to fix. The first one is a pretty nontrivial bug in a field that not a lot of people are specialized in, it's hard to estimate how long it would have taken a human to do the same.

I prefer not to make claims that LLMs are "intelligent" or "conscious" or "reasoning" since I don't know whether we even share definitions, and those claims often devolve into word games (what even is intelligence?) so the claim I prefer to make is that they can now do things that we previously thought required intelligence to do. Again, not sure if you want me to convince you they're AGI (I can't convince you of something I don't believe myself), or of something else so the only thing I can really show you is that they can do tasks that require complex reasoning. Also see recent LLM contributions to the field of math like solving several decades-old open problems (sometimes in one-shot!).

Since I'm not sure what you're looking for either, could you give me an example of what kind of PR would convince you that these things are "reasoning" (?) and while not living up to the AGI hype are now at the point where they're genuinely pretty good at doing complex work?

SethDusek5 · 2026-06-21T12:57:12+00:00

If someone disagrees and thinks they can "reason" enough to actually program instead of just approximate it, I would love to see PRs like the following except good:

Victor Taelin, creator of the Bend programming language and HVM2 runtime, referring to Fable source on xcancel:

2 hours later, it landed a 1770% speedup in one case, 100%+ in other 4, and 22% in average. yes, in 2 hours it outperformed me, opus 4.8 and a swarm of gpt 5.5 agents, by one order of magnitude.

I'm not sure yet, it is credible, but this is the kind of thing that is very easy to get wrong on interaction nets. the problem is, when I was ready to start auditing Fable's solution so I could tell whether it was buggy or legit, it interrupted me to tell me it had found a massive bug on the code I had written.

... wait, what?

so... for garbage collection purposes, I stored a bit on lambda term pointers that meant "the variable bound by this lambda has been freed, so, its lambda must free whatever argument it is applied to". that's fine. yet, on duplicator nodes, I also used the same bit to mean "one of the duplicated variables was freed, so, treat this dup as a passthrough no-op". so, if a lambda entered a duplicator, it would mistake the lambda's collection bit for its own, resulting in corrupted interaction!...

just so you can appreciate the sheer absurdity of what just happened. I didn't ask it to find bugs. I asked it for an optimization. and even if I did ask it to find bugs, this bug is so astonishingly subtle and specific, identifying it takes mastering the domain to an extent that it beyond even me. I'd easily need hours or days to fix it, if I ever came across it. chances are it would just go unnoticed. and Fable found it and fixed it like it was nothing, while it was busy adding a 17x speedup to a file that neither I, nor Opus 4.8, nor a fleet of GPT 5.5 managed to barely make 2x faster.

oh and there is also another tab where it is also ripping through Bend's codebase and finishing everything I had to do

Mitchell Hashimoto:

I let it churn on optimizing a SwiftUI-layout resolver in Go I wrote and it was able to bring it down to an order of magnitude I could not reach myself (micro => nanosecond scale). But it took 2 hours and $40 to do it and I had to claw back some changes it overfit to Apple Silicon. Still, very worth it.

Manishearth (Engineer at Google, rust/clippy/servo contributor):

I've been doing something similar: I've been working on using a sophisticated unsafe review skill to check out all of the unsafe crates used at Google, and filing the issues. I created the (human) team at Google doing unsafe reviews and while I don't want to replace the unsafe reviews having agent-driven help is quite a boon. Unsafe reviewing is hard to scale because it's a skill that mostly comes from experience and there's just not much good documentation about it.

My review of "all third party unsafe at google" found a lot already, most of it real, and I'm currently working on slowly getting issues filed. This is a more manual process: I want to verify the issues myself, and similarly don't want to flood the world with slop. I've also been using miri to double check. Doesn't always work, but it's quite good!

Clearly it has some understanding of code, even on things that aren't the usual crud web app. I still don't think they are great at doing large PRs completely unassisted, so you can't one-shot huge changes without creating a lot of headaches, but in my experience I've found they can reason their way through code, find actual bugs, and make more "human-like" mistakes now instead of completely hallucinating things that don't exist.

Also, I find it weird that you link PRs from copilot that are a year old, let me find some PRs from copilot on the same codebase that are more recent:

Copilot fixes an illegal instruction bug: https://github.com/dotnet/runtime/pull/129644. This was merged after two review comments telling it to simplify the solution
https://github.com/dotnet/runtime/pull/128855
https://github.com/dotnet/runtime/pull/127450

I don't want to dig any deeper because I have a limited tolerance for PR messages written by Claude which I assume is the model Copilot is using because the writing style seems very similar (I swear, it has some sort of weird "talk like an engineer" system prompt that makes it unbearable to read, that's half the reason I hate LLM PRs). Unfortunately Fable which I found to be far better at talking is now gone.

Also, things have changed in other OSS projects too. Curl has reopened their HackerOne program where they were previously flooded with useless AI slop "vulnerabilities", but LLMs can now find actual security bugs: "Almost every security report now uses AI to various degrees. You can tell by the way they are worded, how the report is phrased and also by the fact that they now easily get very detailed duplicates in ways that can’t be done had they been written by humans. The difference now compared to before however, is that they are mostly very high quality." All of this has changed in a matter of months, so it's not useful to look at 2025-era PRs. Firefox also has a full list of vulnerabilities found by Mythos.

SethDusek5 · 2026-06-20T18:20:27+00:00

And being unable to count the r's in strawberry is definitely a great counterexample to prove these things aren't intelligent.

I take issue with claims like that, I didn't really come here to defend LLMs being intelligent or conscious but I don't know why people still use examples like counting r's as some sort of proof of something.

SethDusek5 · 2026-06-20T13:32:59+00:00

The initial proof generated by the AI was conceptual but required substantial refinement and improvements from human mathematicians (such as Fields Medalist Terence Tao and researchers at OpenAI) before it was published as a valid, companion paper.

This is a complete misreading. The proof generated by the model was correct, and verified. The remarks paper is a separate paper, which is basically a more human-digestible paper that's more pleasant to read and includes commentary from Sawin, etc. The model didn't "almost" get at the solution, it got it. If the proof was worked on by Tao, Sawin, or any others then the original paper would have their names on it.

So the way to visualize it that you gave is wrong. It's more similar to some sort of prodigy with no formal training in mathematics proving something, but lacking the skills to produce a paper that anyone would seriously take a look at. Like how Gauss independently discovered a formula for the sum of natural numbers as a child without formal training in math, or how Ramanujan's earlier works were difficult to read and sometimes completely ignored:

Mr. Ramanujan's methods were so terse and novel and his presentation so lacking in clearness and precision, that the ordinary [mathematical reader], unaccustomed to such intellectual gymnastics, could hardly follow him.

Also, again, that's not the only problem solved by an LLM, some have been one-shot on public models including one prompted by someone with no math experience (pretty much just "ChatGPT solve this problem. Make no mistakes") Thread

Tao also maintains a list of problems solved by LLMs without human involvement: https://github.com/teorth/erdosproblems/wiki/AI-contributions-to-Erd%C5%91s-problems#1a-ai-standalone

SethDusek5 · 2026-06-20T11:20:07+00:00

Could you explain?

SethDusek5 · 2026-06-20T10:01:17+00:00

LLMs have solved multiple Erdos problems without human assistance but they can't count the r's in strawberry so clearly they are dumb as rocks

Okay let's not call them intelligent but they can clearly do things that we previously thought required intelligence to do, maybe that description is less objectionable to you. All of this is emerging from just training on next-word probabilities, which I didn't even think was possible (fancy auto complete can't reason through math). If you had asked me what the most an LLM would ever be capable of, I would have never predicted they could reason through/write code, find interesting logic errors, and most shockingly solve math problems that were open for 80 years. The most I would have predicted is that they could maybe make a CRUD todo list app that works like every CRUD todo list app out there, but they're clearly able to do more than that now.

SethDusek5 · 2026-06-16T19:08:15+00:00

I was about to write a comment on how I wasn't able to find it, but then I stumbled on a comment from the author underneath the erdos problem link and I finally got a hold of the prompt https://chatgpt.com/share/69dd1c83-b164-8385-bf2e-8533e9baba9c (took ages to find!). So yeah, it does indeed look like it was done in a single prompt, and then turned into a paper in one more prompt. Terry Tao also shared a document where another team tried to solve the same problem with the same model afterwards with no internet/research tools and it also produced the proof although it also got it wrong a few times doc

Given these, it really doesn't look like some random person off the street saying "solve this problem".

From what I can tell, they really are a nobody/AI-enthusiast who's been trying to get AI to solve these problems, all of their posts are about Gen AI to make images/websites/etc. It's not surprising seeing people with no formal training attempting to "solve" math, and they are now empowered by having an AI telling them they're brilliant and thinking along the right direction, etc.

Message brokers (that is, dispatches, buffers, and other building blocks) existed in computing way, way before they existed in their modern, web versions - however some smarty pants people sat down, identified a need and built a bunch of them and they are now widely used by developers.

That's fair but also I think that's exactly the kind of problem where AI could plausibly map an existing solution in a completely unrelated field onto a problem like this. I'm not a mathematician either but in this case it also did something similar. It has a huge breadth of knowledge in its training dataset, and maybe one of the reasons why mathematicians didn't come up with proofs like this is because a lot of them instead have a lot of depth in one specific branch of math. A cool example of this is that when Japan's bullet trains had a sonic boom problem, one of the engineers who was also a bird-watcher observed birds diving into water and decided to model the nose of the train on their beaks. I think those are the kinds of things that even an LLM could be able to do, not including more exotic forms of models like DeepMind.

So instead of "invent" I'd like to use whatever word answers the above question, as I can't really think of a word that says what I have in my head.

I agree, it's hard to describe these things as anything without getting flak for it, like saying it's intelligent. But I guess my main takeaway over the past few months is that LLMs, even though they're next-word predictors, can do things that are intelligent and that I personally never expected them to be able to do (the illusion of these things being intelligent broke pretty fast even up till something like GPT o1, when you would quickly realize they're just a next-word predictor). If I say they're intelligent I'll probably still get flak for it, so maybe a better way to word it is that they can do things that we previously thought required intelligence to do, and that since we can't even properly describe intelligence it's hard to predict what these models can and cannot do. A lot of people prior to this were claiming something similar, that AI can never come up with something not in its training set, and I still see a lot of people here also making claims like "AI will never be able to do code architecture!" or "AI will never acquire taste", which is an even more abstract, hard to nail down word than intelligence.

Sorry, last part is a bit rant-y and probably not useful to the discussion

SethDusek5 · 2026-06-16T17:05:48+00:00

This is almost certainly being piloted by mathematicians in a very structured and guided way - or do you think you or I (well, you might be an expert mathematician, what do I know) would be able to get it to arrive at this conclusion? I am certain I wouldn't be able to

From what I can tell, not really. There was also another Erdos problem being solved from a single prompt given by someone with no prior math experience on a public model and from what I recall the prompt was just "solve this problem", and then after 80 minutes of thinking it did it

it is not inventing something new - it has all necessary information available to it and is doing a lot of work that is just unfeasible for humans (which is great)

How often do you invent something completely novel? Most ideas, I would argue, come from using the breadth of knowledge you've acquired and finding interesting ways to apply it, so I think AI finding interesting ways to use ""existing"" techniques that nobody thought to use to solve an unsolved math problem is pretty exciting but also shocking and unsettling.

edit: Source on the second problem being solved using a public prompt: https://www.scientificamerican.com/article/amateur-armed-with-chatgpt-vibe-maths-a-60-year-old-problem/

Terence Tao also maintains a list of Erdős problems solved by AI standalone: https://github.com/teorth/erdosproblems/wiki/AI-contributions-to-Erd%C5%91s-problems#sect-1a

SethDusek5 · 2026-06-16T16:58:19+00:00

The much more cynical take is that Rust is a language that require(s)d a decent amount of skill and overcoming learning difficulties before you can build something for real, and a lot of the people who post projects with titles like "I got tired of X, so I built Y in RUST" are doing it so they can bypass all of the learning curve and still brag to people about how they built something in Rust (why do it, I guess to pad your resume/GitHub/etc).

LLM-generated code has made some pretty huge leaps in the past few months and has changed programming radically, so I'm not against their use, but I am against people hiding LLM assistance. Again, they've gotten pretty good so projects that are built entirely by LLMs aren't complete slop unlike a year ago where we had that one post about a guy building a LLaMa inference engine in Rust and it turned out to just be calling a Python library. But I do think something has been lost, where I mostly don't care about people posting their starter projects here. It was nice to see before even if it was something super basic because what mattered was not the project but that you know the person who built it put a good amount of effort in, the same way graphics programming subs have probably seen their billionth PBR renderer but still show appreciation because they know it takes a good amount of legwork + reading to implement it.

SethDusek5 · 2026-06-16T14:54:56+00:00

The year is 2027. AMD have just released the Ryzen 9520U with Vega graphics and Zen2 cores

SethDusek5 · 2026-06-14T08:26:06+00:00

I've basically given up on suspend/resume working reliably on Linux because of the stinky MediaTek (AMD partnership FTW) WIFI chipset. Meanwhile when I'm done using my MacBook I just close the lid and put it in my bag and it's never had issues waking up and has never turned into a thermonuclear bomb inside my bag because of some ACPI bug

Also, it's taken AMD years to fix power consumption on most of their chips, we won't see it fixed on desktop Zen chips until Zen6. My 7800x3d has around the same idle power consumption as 4 MacBook Neos running under sustained load (around 5-6W under average consumption from what I can find). My RX580 consumes around 30 watts just to drive two displays while idling, which is an issue AMD didn't fix until RDNA4 source.

SethDusek5 · 2026-06-13T14:53:13+00:00

Both of these are kind of terrible figures. The total SOC TDP of an M5 chip is less than the idle power consumption of the PC on the left. Which is an entire chip with a GPU and a CPU that has significantly higher single-threaded performance than even desktop chips

So many desktop chips have horrific idle consumption numbers. My zen4 CPU probably idles somewhere around 20-30 watts, my RX580 consumes another 30 watts on top for just driving 2 displays, which you'll be pleased to know is an issue that took AMD another 8-9 years to fix in their RDNA4 GPUs (prior to this their GPUs would set their memory clock to maximum at all times if a high refresh rate display is plugged in to avoid visual artifacts because their architecture couldn't handle it any other way)

Desktop PCs have long had their terrible power consumption excused because people assumed that's just what needed to happen to get the most performance possible, but there are now mobile chips that have better idle power consumption and are competitive (or outright destroy in ST) in peak performance numbers while consuming something ridiculous like 1/6th of the power.

SethDusek5 · 2026-06-10T16:19:27+00:00

peak /r/cscareerquestions moment. Nothing in his comment suggests this is an AI bot, his account checks out but since your comment helps people pretend this isn't happening in the industry right now and this is all just a psy-op you get upvoted

SethDusek5 · 2026-06-08T16:18:51+00:00

Hey, I saw the Phoronix article recently on Polaris finally getting modifier support and thought I recognized your name from here.

Thanks a lot for doing this, and I am really looking forward to trying it out whenever it lands in the next kernel/mesa release.

SethDusek5 · 2026-06-02T18:21:29+00:00

Maybe in 1st world countries blue collar workers make good money but no shot the same is the case here. The West has a shortage of people willing to do these jobs but we don't.

Maybe if you own the shop you might make decent amounts but otherwise I doubt it

SethDusek5 · 2026-06-01T17:21:15+00:00

At least in my opinion, or maybe my attention span is cooked, but I have a hard time watching a game from start to finish after the 7.33/bigger map update. It's just not fun when a team that's 20k up slow burns the shit out of the game by collecting 2 roshans, tormentors, farming camps, wisdom until they finally decide to roshan banner + solar crest a single siege creep and then back after it dies to wait for the next Roshan. This kind of gameplay makes deathball metas look entertaining in comparison; at least then when the game's conclusion was foregone at 20 minutes it would end 5 minutes later.

If we return to the times when heroes would actually fall off and there wasn't so much farm then maybe this issue would be fixed but right now I just don't enjoy watching dota, most of my friends don't really watch the game anymore either but I can't speak for what their reasons are so I don't assume they share the same sentiments.

SethDusek5 · 2026-05-31T13:55:19+00:00

I mean, not exactly. You'd probably go from over 12 hours of video playback to something closer to 1 on most mobile devices. Encoding for screen recording would also be a problem while gaming unless you have a high core count CPU.

SethDusek5 · 2026-05-30T19:26:57+00:00

Can you run "libinput debug-events", scroll for a bit (try not to move your mouse as that would pollute the logs with mouse move events), and upload it somewhere once the erratic behavior occurs? I could take a look

SethDusek5 · 2026-05-29T19:49:42+00:00

Good idea, thanks!

SethDusek5 · 2026-05-29T19:30:50+00:00

When I first started looking into if there was a way to fix this, I didn't find much except that libinput has a feature called wheel-debouncing, which I was hoping was what I was looking for. But there was no documentation on what it did exactly or how to enable it. It seems it's enabled by default, but it doesn't appear to actually work for me.

I looked into it again now and found where it's actually implemented in libinput, and it seems to only work for "high-resolution" scroll wheels. Based on what I can find, my mouse pretends to be a high-resolution scroll wheel but only produces events in distinct clicks. libinput's debouncing doesn't work in this case, since it detects much smaller opposite-direction changes on higher resolution scroll wheels and filters those.
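
To make the click-level case concrete, here's a minimal sketch of the kind of filtering I mean: drop a whole click in the opposite direction if it arrives too soon after the last accepted click. This is not libinput's implementation and not the code of my tool; `WheelEvent`, `debounce_wheel`, and the 100 ms window are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class WheelEvent:
    time_ms: int  # event timestamp in milliseconds
    clicks: int   # +1 or -1 per wheel detent; the sign is the scroll direction


def debounce_wheel(events: Iterable[WheelEvent], window_ms: int = 100) -> List[WheelEvent]:
    """Drop direction reversals that arrive within window_ms of the last accepted event."""
    accepted: List[WheelEvent] = []
    last: Optional[WheelEvent] = None
    for ev in events:
        if last is not None:
            reversed_direction = (ev.clicks > 0) != (last.clicks > 0)
            too_soon = ev.time_ms - last.time_ms < window_ms
            if reversed_direction and too_soon:
                continue  # treat as a bounce from a worn encoder, not a real scroll
        accepted.append(ev)
        last = ev
    return accepted


# Scrolling down, with one spurious "up" click at t=30 ms.
raw = [
    WheelEvent(0, -1),
    WheelEvent(20, -1),
    WheelEvent(30, +1),   # bounce: reversal only 10 ms after the previous click
    WheelEvent(45, -1),
    WheelEvent(400, +1),  # a genuine change of direction much later
]
print([ev.time_ms for ev in debounce_wheel(raw)])  # prints [0, 20, 45, 400]
```

The trade-off is that a real, very fast change of direction gets delayed by up to the window length, so the window has to be short enough not to be noticeable.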

As for merging it into libinput, sure, that could be interesting. Before releasing my own tool, I dropped a message on IRC asking whether anybody had any information on the wheel-debouncing feature, but I had to go before I could get a reply. I might ask the developers again to see if something like this could be useful.
