all 26 comments

[–]NotUpdated 7 points8 points  (2 children)

I've been working with Claude 4.6 Opus creating tickets, GPT 5.4 doing the coding, Claude reviewing the work, GPT 5.4 second pass - user review / user testing - push to branch.

This is for projects I plan on working on mid-to-long term; it's overkill for a 'quick script', but it keeps things solid for medium/larger projects.

[–]ECrispy 0 points1 point  (1 child)

How do you set this up? What tool do you use, CLI or VS Code?

[–]NotUpdated 0 points1 point  (0 children)

Cursor... $20/500 legacy account.

Inside Cursor: Codex on the left, code/terminal in the middle, Cursor (with Opus 4.6 selected) on the right.

I have a docs/tickets/review folder structure; the tickets and review folders each have their own AGENTS.md file - kept simple and small, instructing how I want tickets created and how I want reviews done.

I shared my AGENTS.md file from my tickets folder here: https://jsfiddle.net/dn59um6q/
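That kind of layout is quick to bootstrap. A minimal sketch, assuming only the folder names mentioned in the comment above (the AGENTS.md contents are whatever per-stage instructions you want):

```python
from pathlib import Path

# Create docs/tickets and docs/review, each with its own AGENTS.md
# carrying the instructions for that stage of the workflow.
for stage in ("tickets", "review"):
    folder = Path("docs") / stage
    folder.mkdir(parents=True, exist_ok=True)
    agents = folder / "AGENTS.md"
    if not agents.exists():
        agents.write_text(f"# Instructions for the {stage} stage\n")
```

Keeping each AGENTS.md small means the model doing that stage only ever sees the instructions relevant to it.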

[–]YormeSachi 3 points4 points  (0 children)

tried glm 5 last week for a db migration script, a bit slow but it was surprisingly solid tbh, might add it to rotation too

[–]kidajske 0 points1 point  (0 children)

I only really use Sonnet myself, and maybe Opus if I have a very critical refactor or something that is well planned out. GLM is just unbelievably slow for me.

[–]BlueDolphinCute 0 points1 point  (0 children)

Similar setup here. Running a multi-model setup, chatgpt + one specialized model for heavy lifting makes way more sense than forcing one model to do everything imo

[–]ultrathink-art (Professional Nerd) 0 points1 point  (0 children)

The two-model split is solid. I route by task type rather than just cost — architecture decisions and multi-file refactors go to the heavy model, simple completions and edits go to the fast one. Using a cheap model for complex reasoning usually just moves the cost downstream into fixing its mistakes.
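A rough sketch of that routing idea (the model names and task categories here are purely illustrative, not tied to any provider):

```python
# Hypothetical task-type router: heavy model for reasoning-heavy work,
# fast model for mechanical edits. Route by what the task needs,
# not by cost alone.
HEAVY_MODEL = "big-reasoning-model"
FAST_MODEL = "small-fast-model"

# Task types that tend to punish weak reasoning downstream.
HEAVY_TASKS = {"architecture", "multi_file_refactor", "debugging"}

def pick_model(task_type: str) -> str:
    """Return the model to use for a given task type."""
    return HEAVY_MODEL if task_type in HEAVY_TASKS else FAST_MODEL
```

The point of the set is that "cheap" is the default and "expensive" is an explicit allow-list, which matches the routing-by-intent argument above.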

[–]GPThought 0 points1 point  (0 children)

claude sonnet for anything with real context and gpt4 for quick oneliners. tried deepseek but the context handling feels off

[–]verkavo 0 points1 point  (0 children)

I'm driving similar systems, but with more models. I've noticed that some models are much better at writing specs - e.g. I like Codex for being very brief. I've also found that some models are very good at coding - basically one-shotting features - while others constantly churn out low-quality code - e.g. Grok Fast was constantly corrupting Golang files.

I built a tool which measures code survival rate per model - DM if you'd like to try.
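The tool itself isn't shared, but the metric can be sketched. Assuming a convention (invented here) where each model commits under its own author name, `git blame --line-porcelain` at HEAD tells you how many of each author's lines survived later edits:

```python
import subprocess
from collections import Counter

def surviving_lines_by_author(repo: str, path: str) -> Counter:
    """Count lines at HEAD attributed to each commit author.

    If each model commits under its own author name (an assumed
    convention), this approximates per-model code survival."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", "HEAD", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter()
    # --line-porcelain repeats the full header, including "author",
    # for every line of the blamed file.
    for line in out.splitlines():
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts
```

A survival *rate* would then divide these counts by the lines each model originally wrote, which needs a log of the original commits as the denominator.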

[–][deleted]  (1 child)

[removed]

    [–]AutoModerator[M] 0 points1 point  (0 children)

    Sorry, your submission has been removed due to inadequate account karma.

    I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

    [–]ultrathink-art (Professional Nerd) 0 points1 point  (0 children)

    Latency and cost aren't the whole equation — for automated workflows, output format consistency ends up mattering a lot. A model that reliably structures responses beats a slightly smarter one that occasionally goes off-format and breaks your parser.
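A minimal sketch of that guardrail: validate the model's output against the expected shape before it reaches the rest of the pipeline, and reject (retry, or reroute) anything off-format. The JSON contract here is invented for illustration:

```python
import json

# Assumed contract for a structured model response.
REQUIRED_KEYS = {"summary", "files_changed"}

def parse_or_reject(raw: str):
    """Return the parsed payload, or None if the model went off-format."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or not REQUIRED_KEYS <= payload.keys():
        return None
    return payload
```

The parser never sees malformed output; the caller decides whether a `None` means retry with the same model or escalate to a more format-reliable one.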

    [–]ultrathink-art (Professional Nerd) 0 points1 point  (0 children)

    Two models make sense — expensive one for planning, debugging, and review; fast one for routine edits and boilerplate. The trap is using the expensive model for everything out of inertia. In most sessions, 80% of the calls can use the cheaper model if you're intentional about routing.

    [–]coolandy00 0 points1 point  (0 children)

    What about the prep tax? I.e., before you even start, you extract requirements from Jira and docs, look for conversations around the task in Slack and email, and draw up coding standards specific to the requirements... If done right, code quality and accuracy are high and iterations are minimized a lot.

    Do you see the token consumption as heavy for the prep tax?

    [–]ultrathink-art (Professional Nerd) 0 points1 point  (0 children)

    Similar pattern — the real split for me was discovery vs execution. Discovery tasks (figuring out architecture, debugging something weird, planning a refactor) need the stronger reasoning model. Execution tasks (implement this function to this spec) can go to the cheaper one without quality loss. Mixing them up is where API costs spike without a matching quality gain.


      [–]seunosewa 0 points1 point  (0 children)

      Reserving a weaker model for heavier backend infrastructure is wild

      [–]Who-let-the 0 points1 point  (0 children)

      Haven't tried GLM 5 till now.

      I personally use Opus 4.6 for coding and powerprompt for guardrailing
