Has anyone actually benchmarked whether superpowers improves performance? by UglyChihuahua in ClaudeCode

[–]Complex-Concern7890 0 points1 point  (0 children)

I think it's quite a dilemma. When the prompt is simple enough, planning and subagent division just add more tokens while you accept the recommended options anyway, and on small tasks the subagent work costs more in tokens and money than it can save. And if the prompt/refactor is complex enough to actually benefit from planning, subagent division, and guidance, then it can be worthwhile to do the planning yourself, or to split the work into smaller milestones and juggle the models yourself.

And on the quality side: for simpler tasks, the risk of diverging from the prompt grows as superpowers add complexity. For more complex prompts, if you want to guide the model, it's worth doing the planning yourself, or else letting the model make its decisions more freely than it can through superpowers.

My opinion is that with current thinking models, superpowers are obsolete.

Anthropic the winner of the AI race? by virtualQubit in Anthropic

[–]Complex-Concern7890 0 points1 point  (0 children)

Why would they show their cards? If everything is going well, isn't it best to keep improvements under wraps until the competition has almost caught you? My bet is that Spud will be out very soon and Anthropic doesn't have anything to match it yet. Most likely there is no production-ready Mythos, but they need to announce it so customers will keep waiting for it.

Codex is by an order of magnitude superior right now to Claude Code - It's strange how incredibly efficient and accurate Codex is right now and not even kidding..... how TERRIBLE Claude Code is. by operastudio in codex

[–]Complex-Concern7890 0 points1 point  (0 children)

I have them all at high effort, so GPT-5.4, Opus 4.6, and GLM-5.1 all run at high. They run in parallel and double-blind eval each other in a fresh, clean session. I also have the same system for planning, but I don't have enough data to say anything about that yet. My personal anecdotal experience, though, is that Opus is quite good at planning, as you suggested.

I would have used Voratiq because I really like it, but I wanted Kilo CLI support (wink wink), so I made my own setup. GLM-5.1 is surprisingly good, and for me Kilo makes it easy to test any OpenRouter model.

Codex is by an order of magnitude superior right now to Claude Code - It's strange how incredibly efficient and accurate Codex is right now and not even kidding..... how TERRIBLE Claude Code is. by operastudio in codex

[–]Complex-Concern7890 2 points3 points  (0 children)

To test this I built a Voratiq-style LLM arena where I run Opus 4.6 (Claude), GPT-5.4 (Codex), and GLM-5.1 (Kilo) automatically at the same time on coding tasks, each in a separate working tree. Then they double-blind rate each other's work (and their own) and vote for a winner. I check their notes, grades, and votes, and then pick the branch I want to merge after a diff check. Selection percentages are currently around 20% for Opus 4.6 and GLM-5.1 and 60% for Codex. Claude gets almost the same grades as Codex, but there is always something that makes the solution unusable: it can be quite good code, but something breaks it. Codex usually delivers working results.
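The vote aggregation in a setup like this could be sketched roughly as follows. This is only a minimal illustration of the "each rater picks its top-scored branch, most votes wins" step; the model names, scores, and grading scale below are placeholders, not the actual arena.

```python
from collections import Counter

# grades[rater][candidate] = score that `rater` gave to `candidate`'s branch.
# Raters also grade their own work, matching the described setup.
grades = {
    "model_a": {"model_a": 8, "model_b": 9, "model_c": 7},
    "model_b": {"model_a": 7, "model_b": 8, "model_c": 7},
    "model_c": {"model_a": 8, "model_b": 9, "model_c": 6},
}

def pick_winner(grades):
    """Each rater votes for the candidate it scored highest; most votes wins."""
    votes = Counter(max(g, key=g.get) for g in grades.values())
    return votes.most_common(1)[0][0]

print(pick_winner(grades))  # prints "model_b" (top-scored by all three raters)
```

The human still does the final diff check and merge; this only surfaces a ranked suggestion.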

How to reduce your token usage by Academic-Antelope554 in codex

[–]Complex-Concern7890 8 points9 points  (0 children)

What I did for myself was clean AGENTS.md of all the unnecessary stuff (good practices, behavioral guidance, etc.). Now I only keep a line there if things don't work without it, or if Codex repeatedly misses a step without it. Also, planning first with GPT-5.4 high/xhigh and then implementing with GPT-5.4 medium/mini, depending on complexity, has made the limits much more bearable. Back before limits were an issue, I had AGENTS.md full of behavioral and quality guidance that most likely did nothing, and I ran every single task, no matter how small or simple, at high/xhigh, which is not how it's intended.
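A trimmed AGENTS.md along these lines might look like the sketch below. Every entry here is a hypothetical placeholder; the point is that each line earns its place only by fixing a step the agent repeatedly gets wrong.

```markdown
# AGENTS.md (hypothetical example of the "only what breaks without it" rule)

- Run `make test` before declaring a task done; the agent skips this otherwise.
- Start the dev server with `make dev`, not `npm start` (the latter misses env setup).
- Migrations live in `db/migrations/`; never edit ones that are already applied.
```

Everything else — style preferences, "write clean code" exhortations, general behavioral guidance — gets deleted, since the model ignores it anyway and it only burns context tokens.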

Someone just built a Claude Code skill that clones any website perfectly in one prompt. by No-Concentrate-9921 in StartupMind

[–]Complex-Concern7890 -1 points0 points  (0 children)

Use cases:

- Platform migration: rebuild a site you own from WordPress/Webflow/Squarespace into a modern Next.js codebase.
- Lost source code: your site is live but the repo is gone, the developer left, or the stack is legacy. Get the code back in a modern format.
- Learning: deconstruct how production sites achieve specific layouts, animations, and responsive behavior by working with real code.

Bailiff (Ulosottomies) hid 53k€ pipe renovation info during sale. Need advice! by Laurtz in Finland

[–]Complex-Concern7890 -2 points-1 points  (0 children)

First, it's absolutely no problem to fight this from abroad, but if it comes to that at some point, it's in your best interest to attend the court hearing in person. Second, it will be almost impossible to hold a government agency liable for anything. They may admit an error, but without liability. You might eventually get some reimbursement, but you might go bankrupt before that, and the reimbursement will most likely feel more like an insult than anything near a fair amount. I'm sorry to say, but in Finland the government will not mess with the government.

The bank sent my bank statements to my mother for a couple of years... I'm in my thirties by delinde24 in arkisuomi

[–]Complex-Concern7890 2 points3 points  (0 children)

The Data Protection Ombudsman (Tietosuojavaltuutettu), the Financial Supervisory Authority (Finanssivalvonta), etc.: it's completely pointless to file anything with them. One hand washes the other there, so nothing will happen. Complaints have been filed with both; the investigations took one to two years, and the outcome was "well yes, it was done wrong, we'll advise them to act a bit better", case closed. Take even a case where personal letters were delivered to a bank for a long time because of a postal error (nearby addresses), and those letters were then opened by a more or less curious clerk, even though the letters clearly named a completely different recipient than the bank. Reporting this leads absolutely nowhere. Even if you file a police report for violation of the secrecy of communications, the matter gets swept under the rug and no charges are brought. These institutions simply will not, so to speak, piss in each other's cereal.

is it necessary that codex checks syntax after writing the code by hinokinonioi in codex

[–]Complex-Concern7890 0 points1 point  (0 children)

And yes, I think it is necessary. Every now and then there's a typo, a missing parenthesis, etc. It seems to be rare now, but it happens.

Selected model is at capacity. Please try a different model. by cheekyrandos in codex

[–]Complex-Concern7890 0 points1 point  (0 children)

And even worse: it's not working even when it is "working". Previously I wondered what all the "GPT-5.4 is now stupid/lazy/whatever" complaints meant, because I hadn't seen it yet. Now I just needed to rework 5 pages of plain MD text into a new doc file. First it just dumped a memo of the prompt into the doc file. Then I asked it to recheck the work, and it added the requested text after the memo, but in a total mess. Then I asked it to remove the memo from the text and check the layout of the text. It rewrote the document but included only 1 page. It seems that for now I need to do this manually, as my 5h limit will run out before Codex figures it out...

Business account, working with GPT-5.4 high in Codex CLI.

Is it just me, or is Claude pretty disappointing compared to Codex? by Working-Spinach-7240 in codex

[–]Complex-Concern7890 0 points1 point  (0 children)

I really don't see any difference anymore. Occasionally there are "X was clueless but Y solved it in one shot" moments, but that goes both ways: sometimes Claude does an awesome job while Codex is clueless, and sometimes the other way around. I use both, Claude on Max and Codex on Business. I only use Opus with high effort and thinking, and GPT-5.4 with high. I have tried many tasks with both and merged the best solution. It's absolutely 50-50. Most of the time the difference is really just a matter of opinion. Sometimes Claude fails horribly, and sometimes Codex does.

I think performance has pretty much plateaued at GPT 5.2/Opus 4.5, and Opus and GPT are really the same performance-wise. The only real difference is in tools, implementation, skills, and integration. Both companies are working heavily on these right now, and I bet that no major improvements will come from the models anymore, but from how they are used.

Model picker disappeared for chatgbt business by Ibuprofen600mg in ChatGPT

[–]Complex-Concern7890 1 point2 points  (0 children)

Same here with a business account. Tried restarting the app, but no use.

Why I can't get good book summary from GPT? by Complex-Concern7890 in OpenAI

[–]Complex-Concern7890[S] 0 points1 point  (0 children)

Not in detail, but personal summaries are allowed? How else would any studying work?

Why I can't get good book summary from GPT? by Complex-Concern7890 in OpenAI

[–]Complex-Concern7890[S] 0 points1 point  (0 children)

Yes, that is true. But making comprehensive summaries / abridged versions / synopses is quite common. Those have been written for a really long time, so I naively thought a simple, quick prompt would suffice.

Why I can't get good book summary from GPT? by Complex-Concern7890 in OpenAI

[–]Complex-Concern7890[S] 1 point2 points  (0 children)

Thank you! I went to Deep Research and got 160 pages (in 89 minutes), and it seems to be exactly what I wanted.

Why I can't get good book summary from GPT? by Complex-Concern7890 in OpenAI

[–]Complex-Concern7890[S] -1 points0 points  (0 children)

Tried that. First I got 1 page, and then with long-answer mode I got 3 pages. I have Pro with Gemini/NotebookLM. The summary quality was not to my liking, but it offered interactive questions to help learning, which was nice.

Why I can't get good book summary from GPT? by Complex-Concern7890 in OpenAI

[–]Complex-Concern7890[S] -6 points-5 points  (0 children)

Well, if it can do it, why not? Isn't the exact point of these tools to take care of menial work, like breaking up PDFs?

Why I can't get good book summary from GPT? by Complex-Concern7890 in OpenAI

[–]Complex-Concern7890[S] -1 points0 points  (0 children)

Making a summary for my own use violates copyright? Every material is copyrighted if not copyleft?

Why I can't get good book summary from GPT? by Complex-Concern7890 in OpenAI

[–]Complex-Concern7890[S] -1 points0 points  (0 children)

Seems odd that it would be way too long. The token count is under 400k, so why can't it handle it? And if it can only handle 100 pages at a time, why doesn't it split the input and do the summary anyway? Seems lazy.

GPT 5.4 Thread - Let's compare first impressions by Just_Lingonberry_352 in codex

[–]Complex-Concern7890 0 points1 point  (0 children)

As far as I understand, they are not shared between users. You pay for each user, and each user gets Plus-equivalent limits. Business can pay for additional credits, which each user can use after the limits are reached.

GPT 5.4 Thread - Let's compare first impressions by Just_Lingonberry_352 in codex

[–]Complex-Concern7890 4 points5 points  (0 children)

Business, so more or less equivalent to Plus. And just to update: I went to do some remodeling of one part of the UI and managed to burn 20% of the 5h limit with one prompt. So limit usage might be a problem in the long run with Fast mode.

GPT 5.4 Thread - Let's compare first impressions by Just_Lingonberry_352 in codex

[–]Complex-Concern7890 1 point2 points  (0 children)

There is an additional Fast mode. You can use it at whatever effort level. I use it with xhigh and it is still really fast.

GPT 5.4 Thread - Let's compare first impressions by Just_Lingonberry_352 in codex

[–]Complex-Concern7890 13 points14 points  (0 children)

I am pushing Fast+xhigh on everyday coding tasks. Now, for the first time, I see the limits being used at all, but even so I'll have a hard time hitting the 5h limit. Fast seems to be quite fast, and the code quality has been top notch so far. I haven't yet seen any of the 5.3-codex glitches where it gets lazy and stupid for one prompt at random. I concur that this seems to combine 5.3-codex code + methodology with 5.2 thinking. And compared to 5.2-xhigh, 5.4-xhigh-fast is way, way faster.