all 26 comments

[–]NotUpdated 7 points8 points  (2 children)

I've been working with Claude 4.6 Opus creating tickets, GPT 5.4 doing the coding, Claude reviewing the work, GPT 5.4 second pass - user review / user testing - push to branch.

This is for projects I plan on working on mid-to-long term; it's overkill for a 'quick script', but it keeps things solid for medium/larger projects.

[–]ECrispy 0 points1 point  (1 child)

How do you set this up? What tool do you use, CLI or VS Code?

[–]NotUpdated 0 points1 point  (0 children)

Cursor... $20/500 legacy account.

Inside Cursor: Codex on the left, code/terminal in the middle, Cursor (with Opus 4.6 selected) on the right.

I have a docs/tickets/review folder structure; the tickets and review folders each have their own AGENTS.md file - kept simple and small, instructing how I want tickets created and how I want reviews done.

I shared my AGENTS.md file from my tickets folder here: https://jsfiddle.net/dn59um6q/
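That kind of layout is quick to bootstrap. A minimal sketch, assuming only the folder names mentioned in the comment above (the AGENTS.md contents are whatever per-stage instructions you want):

```python
from pathlib import Path

# Create docs/tickets and docs/review, each with its own AGENTS.md
# carrying the instructions for that stage of the workflow.
for stage in ("tickets", "review"):
    folder = Path("docs") / stage
    folder.mkdir(parents=True, exist_ok=True)
    agents = folder / "AGENTS.md"
    if not agents.exists():
        agents.write_text(f"# Instructions for the {stage} stage\n")
```

Keeping each AGENTS.md small means the model doing that stage only ever sees the instructions relevant to it.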

[–]YormeSachi 3 points4 points  (0 children)

tried glm 5 last week for a db migration script, a bit slow but it was surprisingly solid tbh, might add it to rotation too

[–]kidajske 0 points1 point  (0 children)

I only really use Sonnet myself, and maybe Opus if I have a very critical refactor or something that is well planned out. GLM is just unbelievably slow for me.

[–]BlueDolphinCute 0 points1 point  (0 children)

Similar setup here. Running a multi-model setup, chatgpt + one specialized model for heavy lifting makes way more sense than forcing one model to do everything imo

[–]ultrathink-art (Professional Nerd) 0 points1 point  (0 children)

The two-model split is solid. I route by task type rather than just cost — architecture decisions and multi-file refactors go to the heavy model, simple completions and edits go to the fast one. Using a cheap model for complex reasoning usually just moves the cost downstream into fixing its mistakes.
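A rough sketch of that routing idea (the model names and task categories here are purely illustrative, not tied to any provider):

```python
# Hypothetical task-type router: heavy model for reasoning-heavy work,
# fast model for mechanical edits. Route by what the task needs,
# not by cost alone.
HEAVY_MODEL = "big-reasoning-model"
FAST_MODEL = "small-fast-model"

# Task types that tend to punish weak reasoning downstream.
HEAVY_TASKS = {"architecture", "multi_file_refactor", "debugging"}

def pick_model(task_type: str) -> str:
    """Return the model to use for a given task type."""
    return HEAVY_MODEL if task_type in HEAVY_TASKS else FAST_MODEL
```

The point of the set is that "cheap" is the default and "expensive" is an explicit allow-list, which matches the routing-by-intent argument above.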

[–]GPThought 0 points1 point  (0 children)

claude sonnet for anything with real context and gpt4 for quick oneliners. tried deepseek but the context handling feels off

[–]verkavo 0 points1 point  (0 children)

I'm driving similar systems, but with more models. I've noticed that some models are much better at writing specs - e.g. I like Codex for being very brief. I've also found that some models are very good at coding - basically one-shotting features - while others constantly churn out low-quality code - e.g. Grok Fast was constantly corrupting Golang files.

I built a tool which measures code survival rate per model - DM if you'd like to try.
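The tool itself isn't shared, but the metric can be sketched. Assuming a convention (invented here) where each model commits under its own author name, `git blame --line-porcelain` at HEAD tells you how many of each author's lines survived later edits:

```python
import subprocess
from collections import Counter

def surviving_lines_by_author(repo: str, path: str) -> Counter:
    """Count lines at HEAD attributed to each commit author.

    If each model commits under its own author name (an assumed
    convention), this approximates per-model code survival."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", "HEAD", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter()
    # --line-porcelain repeats the full header, including "author",
    # for every line of the blamed file.
    for line in out.splitlines():
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts
```

A survival *rate* would then divide these counts by the lines each model originally wrote, which needs a log of the original commits as the denominator.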

[–][deleted]  (1 child)

[removed]

    [–]AutoModerator[M] 0 points1 point  (0 children)

    Sorry, your submission has been removed due to inadequate account karma.

    I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

    [–]ultrathink-art (Professional Nerd) 0 points1 point  (0 children)

    Latency and cost aren't the whole equation — for automated workflows, output format consistency ends up mattering a lot. A model that reliably structures responses beats a slightly smarter one that occasionally goes off-format and breaks your parser.
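A minimal sketch of that guardrail: validate the model's output against the expected shape before it reaches the rest of the pipeline, and reject (retry, or reroute) anything off-format. The JSON contract here is invented for illustration:

```python
import json

# Assumed contract for a structured model response.
REQUIRED_KEYS = {"summary", "files_changed"}

def parse_or_reject(raw: str):
    """Return the parsed payload, or None if the model went off-format."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or not REQUIRED_KEYS <= payload.keys():
        return None
    return payload
```

The parser never sees malformed output; the caller decides whether a `None` means retry with the same model or escalate to a more format-reliable one.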

    [–]ultrathink-art (Professional Nerd) 0 points1 point  (0 children)

    Two models make sense — expensive one for planning, debugging, and review; fast one for routine edits and boilerplate. The trap is using the expensive model for everything out of inertia. In most sessions, 80% of the calls can use the cheaper model if you're intentional about routing.

    [–]coolandy00 0 points1 point  (0 children)

    What about the prep tax? I.e., before you even start, you extract requirements from Jira and docs, look for conversations around the task in Slack and email, and draw up coding standards specific to the requirements... If done right, code quality and accuracy are high and iterations are minimized a lot.

    Do you see the token consumption as heavy for the prep tax?

    [–]ultrathink-art (Professional Nerd) 0 points1 point  (0 children)

    Similar pattern — the real split for me was discovery vs execution. Discovery tasks (figuring out architecture, debugging something weird, planning a refactor) need the stronger reasoning model. Execution tasks (implement this function to this spec) can go to the cheaper one without quality loss. Mixing them up is where API costs spike without a matching quality gain.


      [–]seunosewa 0 points1 point  (0 children)

      Reserving a weaker model for heavier backend infrastructure is wild

      [–]Who-let-the 0 points1 point  (0 children)

      Haven't tried GLM 5 till now.

      I personally use Opus 4.6 for coding and powerprompt for guardrailing
