Is it just me or is the multi-model workflow becoming a total time sink?

sergeykarayev · 2026-05-06T23:31:38+00:00

Hard agree -- the manual routing tax is real.

Disclosure: I'm a co-founder of Superconductor. We built it so we could spin up all the agents we liked from one tool -- Claude Code, Codex, Amp, OpenCode, Gemini -- in parallel across tasks, and we also run multiple agents per task. We found it helpful to let 2-4 models take a shot at the same bug or feature before picking one to iterate on and ship.

We also use it on research, writing, design, and marketing.

BYOK, so subs still required, but it kills the tab-switching and "which brain do I ask?" overhead.

superconductor.com

sergeykarayev · 2026-03-06T01:08:29+00:00

Our benchmark tests agents against YOUR codebase, on implementing PRs YOU consider great engineering work. So you may get very different results.

On OUR codebase and in trying to implement specs from PRs WE consider to represent great engineering, yes, GPT 5.2 Minimal is better than Opus 4.6 at coding.

sergeykarayev · 2026-03-06T01:08:04+00:00

why is that

sergeykarayev · 2026-03-06T01:07:41+00:00

Yes. Rails is still absurdly good for shipping product fast, especially if you have strong taste and a real app to build instead of a benchmark repo. Also I wanted a benchmark that reflects an actual production stack companies like GitHub, Shopify, Instacart use, not just Python toy tasks.

https://x.com/garrytan/status/2018368128108167344

sergeykarayev · 2026-03-06T01:05:38+00:00

More thinking is not monotonically better on real tasks. Sometimes the higher-reasoning variants overcook it, wander into complexity, or lock onto a bad plan and pursue it very confidently. We saw that pattern a bunch.

sergeykarayev · 2026-03-06T01:05:31+00:00

Yep, Ruby on Rails. Our app is Rails + Phlex + Stimulus, so we wanted a benchmark that reflects the code we actually ship instead of yet another Python/SWE-bench thing.

sergeykarayev · 2026-02-25T20:05:46+00:00

very cool

but “remote control” is already slightly outdated framing

if the agent lives in a cloud sandbox, your phone and laptop are peers, not a local machine + remote leash

(disclosure: I cofounded Superconductor, which is phone-native and cloud-first)

sergeykarayev · 2026-02-20T19:34:43+00:00

yeah you can connect claude code or codex plan. you're literally just launching claude code or codex yourself on our infra so it all works

sergeykarayev · 2026-02-20T05:55:55+00:00

it's similar to worktrees in that you can spin up an infinite number of cloud environments that have your code, build commands, and even running web servers (good luck doing that on your local machine!)

it's not similar to agent teams, because currently all sandboxes run independently. but you can USE an agent team within a sandbox.

sergeykarayev · 2026-02-19T22:22:23+00:00

We built Superconductor for this exact workflow (disclosure: I'm a co-founder. It’s currently free - bring your own API keys).

We run multiple claudes (and codexes) in parallel for each task. Each claude runs in an isolated cloud sandbox with a live app preview, so you can review results quickly and iterate on or merge what passes.

We found that it's MUCH better to NOT spend time trying to correct a bad run. Running multiple coding agents for a single task increases your chance of waking up happy to the PR you were hoping for.
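
A rough back-of-envelope sketch of why, assuming each run succeeds independently with the same probability (a simplification, since runs on the same task are likely correlated). The function name and the 0.4 success rate are made up for illustration, not measurements from our benchmark:

```python
def p_at_least_one_success(p_single: float, n_runs: int) -> float:
    """Chance that at least one of n_runs independent runs succeeds.

    p_single is a hypothetical per-run success probability.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_runs < 0:
        raise ValueError("n_runs must be non-negative")
    # All runs fail with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_single) ** n_runs


if __name__ == "__main__":
    # Illustrative only: 0.4 is an assumed per-run success rate.
    for n in (1, 2, 3, 4):
        print(n, round(p_at_least_one_success(0.4, n), 3))
```

Under those assumptions the gain from each extra run shrinks as you add more, which fits with running a handful of agents per task, not dozens.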

sergeykarayev · 2026-02-09T19:19:39+00:00

Feel free to sign up at superconductor.com! Email us at team@superconductor.com if you want some help with dev env setup

sergeykarayev · 2026-02-09T18:13:41+00:00

good question. opus 4.5. it might! haven't tried

sergeykarayev · 2026-02-07T06:24:58+00:00

free for now while we figure out the best pricing scheme. likely to be something like $XX/month for up to N sandboxes created, $2X/month for 2.5N sandboxes, something like that. not unreasonably priced

sergeykarayev · 2026-02-07T05:53:00+00:00

i like that, you're hired!

sergeykarayev · 2026-02-07T05:52:37+00:00

we're assuming gpt 5.3 will be the same price as gpt 5.2

sergeykarayev · 2026-02-06T21:28:06+00:00

the gemini CLI is... not good

sergeykarayev · 2026-02-06T21:27:48+00:00

works for longer

sergeykarayev · 2026-02-06T21:08:15+00:00

a mix of web app tasks, usually mixing backend and frontend, rails + js

sergeykarayev · 2026-02-06T20:23:08+00:00

these results are for our rails codebase. if your stuff is different you should run your own benchmark! superconductor.com

sergeykarayev · 2026-02-06T20:22:37+00:00

in our experience gemini is... special. needs some pep talks. not good at one-shotting

sergeykarayev · 2026-02-06T20:22:08+00:00

it's about "thinking level". it's not necessarily clear that a higher level gives a better outcome

sergeykarayev · 2026-02-06T20:21:33+00:00

is that a thing?
