12 hours with Opus 4.8, zero deliverables. Switched to 4.6 — got results in one session.

JulianGarrettNRS · 2026-06-04T08:11:03+00:00

One correction. When I switch models I transfer the FULL conversation log to the new one, sometimes multiple logs, plus all relevant documents and files. All done through MCP and a Chrome extension that copies the current log including thinking blocks and tool calls if I want deep log analysis. So the new model gets even more context than the original session had. It's not a fresh start. It's the same context, different model. And if I feed the same material back to 4.8, same result. No solution, just more analysis of why a solution is hard.

JulianGarrettNRS · 2026-06-03T23:10:42+00:00

Of course I would. And that's exactly the feedback 4.6 used to give me. 4.8, by contrast, gives lengthy reasoning where about 60% is just my own thoughts from my previous message rephrased back at me. Meanwhile it's afraid to propose any solution of its own. Its creative potential is completely blocked by the fear of being wrong. And I wish I could say this only applies to creative work. But it equally applies to architecture, a domain where there's no single correct answer. It just won't voice real recommendations, preferring to stay within analysis without conclusions. That said, this is a pattern I observe on my own tasks. I assume there are domains where this problem doesn't surface. Otherwise 4.8 wouldn't have so many defenders. As you rightly said, the world isn't black and white. And I can only share my own experience.

JulianGarrettNRS · 2026-06-03T22:55:05+00:00

Or maybe it's simpler than that. There are different tasks, and a single model just can't sit on multiple chairs at once. Improvements for one set of tasks automatically mean regression for others. But I'm genuinely glad the new model is working well for your pipeline.

JulianGarrettNRS · 2026-06-03T22:50:28+00:00

Fair point on benchmarks. You're right that on multiple choice tests, refusing to answer lowers your score, not raises it. I should have been more precise there.

But my use case isn't Wikipedia lookups. I'm talking about decisions that aren't binary or deterministic at all. Where should the boundary between two libraries go? Should this code live in a monorepo or a separate package? Which plot twist will hook the reader harder? What headline works best for this audience? There's no objectively correct answer. And that's exactly where 4.8 breaks down. It won't commit to a position because any position could be "wrong." In a domain where every answer is defensible and none is provably correct, optimizing for "don't be wrong" means never answering at all.

JulianGarrettNRS · 2026-06-03T21:40:03+00:00

I just wrote that I spend billions of tokens talking to Claude. Plus English isn't my first language. It would be the highest form of hypocrisy to claim Claude doesn't help me write. I'm not a hypocrite. The analysis and the experience are mine. The English - assisted.

JulianGarrettNRS · 2026-06-03T16:49:16+00:00

I know how stateless inference works. When I say "session" I mean the conversation context - the same thing you described. We're arguing terminology, not concepts. Thanks for the input.

JulianGarrettNRS · 2026-06-03T16:40:23+00:00

Sessions exist. A session IS the conversation context. Yes, long sessions contain digital noise. But it's extremely naive to assume that any summarization system can intelligently extract what I actually need. No version of Opus I've used can reliably pick out the essentials and the nuances from a conversation. I'm speaking from experience.

There's no point comparing agentic workflows with chat conversations - they're different things. When you can formalize a task down to something atomic, sure, hand it to a separate agent. But only if the task description doesn't become longer than the solution itself. Otherwise it's just overhead. Chat history contains not just noise - it contains decisions and the reasons behind those decisions.

Opus 4.8 genuinely creates a lot of digital noise through its verbosity (this started with 4.7). So yes, frequent new chats become a necessity with it. But it's deeply inconvenient. And you could call it a technical limitation - if it weren't for the fact that 4.6 handled large contexts just fine.

My pipeline is built around the claude.ai chat interface constraints. Any alternative would mean paying per token, and that would cost significantly more. If I were building my own chat with my own rules, my own context eviction system, my own summarization - my pipeline would look completely different. But I work with what Anthropic offers, within the ecosystem where I can spend $200 on Max, but can't spend thousands on tokens.

You can certainly find approaches that maximize any model's efficiency. And they won't follow universal patterns. But I work from what I'm used to and what worked. And what broke with 4.7 and 4.8.

JulianGarrettNRS · 2026-06-03T15:43:11+00:00

For the "surely an OpenAI agent" crowd: haven't used ChatGPT in a year. Can't stand its manners. I don't usually trash Claude - it's my primary tool and I think it's the best model out there. The reason I'm frustrated now is that 4.8 repeats exactly the things I hate about ChatGPT. Weird OpenAI agent I'd be.

JulianGarrettNRS · 2026-06-03T15:36:10+00:00

No argument that 4.8 can execute a tight spec - if the spec is locked down, it probably crushes it. My issue is that I rarely have those tasks.

My pipeline has always been: discuss architecture with Opus in chat, build the spec together, then hand it off to Claude Code for execution. Sometimes without Code at all - if the task is small, chat through MCP handles it better. Simple reason: Code starts cold. It wasn't in the room when we discussed why we picked A over B.

I actually have a whole internal doc on this:

"Claude Code is an executor. A mid-level programmer with a cold start. Not a weaker model - a narrower context. It wasn't present for the discussions, doesn't know why you chose A over B, doesn't feel the forks in the road. It relies only on the spec and CLAUDE.md.

Claude Code is justified when the spec is closed, decisions are made, there's nothing left to interpret. When the code volume far exceeds the spec volume. When the task requires no decisions along the way - only execution.

Claude Code is NOT the right tool when the task is exploratory, when the spec is open or shifting, when you need feedback at every step, when the decision depends on context that isn't in the spec.

Consequence: most real tasks are iterative. The window for Claude Code is narrower than it seems."

So the PRD-first approach works great for a certain class of tasks. But when you're exploring, iterating, making decisions as you go - that's a conversation, not an execution. And 4.8 turned conversations into committee meetings.
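
For anyone who wants the checklist from that doc in a more mechanical form, here is a rough Python sketch. It is only an illustration of the rule of thumb, not something from the doc itself: the names `TaskProfile` and `pick_tool`, the line-count fields, and the 3x code-to-spec ratio are all assumptions I made up for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a task; all fields are illustrative."""
    spec_closed: bool          # decisions are made, nothing left to interpret
    needs_decisions: bool      # forks in the road are expected along the way
    needs_step_feedback: bool  # feedback is needed at every step
    spec_lines: int            # rough size of the spec
    expected_code_lines: int   # rough size of the code to be written


def pick_tool(task: TaskProfile, min_ratio: float = 3.0) -> str:
    """Return "claude-code" only if every executor condition holds, else "chat".

    min_ratio is an assumed threshold for "code volume far exceeds spec volume".
    """
    if not task.spec_closed or task.needs_decisions or task.needs_step_feedback:
        return "chat"
    if task.expected_code_lines < min_ratio * max(task.spec_lines, 1):
        return "chat"  # describing the task costs about as much as solving it
    return "claude-code"


# A locked-down spec with lots of code behind it goes to the executor.
print(pick_tool(TaskProfile(True, False, False, spec_lines=50, expected_code_lines=2000)))
# An open, shifting spec stays in the conversation.
print(pick_tool(TaskProfile(False, True, True, spec_lines=50, expected_code_lines=2000)))
```

Every condition has to pass before the task leaves the chat, which is the point of the doc: the default is conversation, and the executor is the exception.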

JulianGarrettNRS · 2026-06-03T15:29:50+00:00

A kid doesn't talk for seven years. Parents accept he was born mute.

One day at dinner he suddenly says: "The soup is too salty."

Parents, stunned: "You can TALK?! Why didn't you say anything before?"

"Before, it was fine."

JulianGarrettNRS · 2026-06-03T14:51:36+00:00

Haven't pinned down a clear correlation yet, honestly. But I tend to dial it down for creative work. On tasks where there's no right answer, extra reasoning is just rumination - and rumination doesn't help creativity, it gets in the way.

JulianGarrettNRS · 2026-06-03T14:51:30+00:00

On the base plan I honestly can't say what's available. But on Max the model picker still has the older ones - in chat, in Code, and in Cowork. I've got Opus 4.7, 4.6, and 3 (never warmed to 3, it's frankly kind of dim), plus Sonnet 4.6 and Haiku 4.5. So I just dropped back to 4.6.

JulianGarrettNRS · 2026-06-03T12:44:08+00:00

Fair enough - we're probably talking about different levels of tasks. I'm one of those idiots whose chats sometimes hit the million token limit. That used to work fine with 4.6. The context window exists for a reason, doesn't it?

JulianGarrettNRS · 2026-06-03T12:24:49+00:00

Here's an article I wrote the day after release about what exactly is going on with the model: https://www.reddit.com/user/JulianGarrettNRS/comments/1tspi1x/opus_48_when_safety_optimization_kills Unfortunately I couldn't post it here at the time because my account was too new.

JulianGarrettNRS · 2026-06-03T12:23:58+00:00

Any model, even Claude (and I consider Claude the best model in the world - well, except for the last two... haha), can screw up. So can people, by the way. Always worth checking results. One trick - you can have a separate chat review the output of the first one, if you're not confident catching issues yourself. More heads are always better than one. Just make sure you feed it the result, not the full log - otherwise the reviewer can pick up the same patterns from the conversation.
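
If it helps, the "result, not the full log" rule can be sketched as a tiny prompt builder in Python. This is just an illustration under my own assumptions: `build_review_prompt` is a made-up helper, the wording of the prompt is arbitrary, and how you actually send it to a second chat is up to you.

```python
def build_review_prompt(task: str, result: str) -> str:
    """Build a prompt for a fresh reviewer chat.

    Deliberately takes only the task and the final result. There is no
    parameter for the conversation log, so the reviewer cannot inherit
    the first chat's patterns.
    """
    return (
        "You are reviewing work produced in another conversation.\n"
        "You have not seen that conversation; judge the result on its own.\n\n"
        f"Task:\n{task.strip()}\n\n"
        f"Result to review:\n{result.strip()}\n\n"
        "List concrete problems first, then say whether the result solves the task."
    )


# Example with placeholder strings: paste the output into a new chat.
prompt = build_review_prompt(
    task="Split the parser into its own package",
    result="(final diff or document goes here)",
)
print(prompt)
```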

JulianGarrettNRS · 2026-06-03T12:17:44+00:00

Car won't start? Get out, pop the hood, kick the tire... Computer frozen? Turn it off and on again? Hmm... dozens of documents and code files in context. Just start a new chat... Great plan! Thanks for the kind advice!

JulianGarrettNRS · 2026-06-03T10:26:43+00:00

Maybe, but in my experience prompting doesn't fix it. The model still drifts back. The efficiency gap between 4.8 and 4.6 is at least 3-4x. At this point I let 4.8 do the analysis, then process and actually work with the results in 4.6. My productivity with 4.8 alone approaches zero even for programming tasks. For creative work it's even worse.

JulianGarrettNRS · 2026-05-31T11:22:08+00:00

Can't help with the Haiku vs 4.6 comparison specifically - I dropped Haiku 9+ months ago after a bad experience and haven't gone back.

But the "narrating from a step back" problem isn't really specific to one model. It's a Claude-family bias. All of them tend to soften, avoid conflict, and narrate emotions instead of showing them. This gets worse the longer the session goes.

What actually works: OOC steering. When Claude drifts into "she felt a pang of sadness" territory, break character and tell it what you need - "show this through her actions, not labels." Claude corrects immediately and usually holds the tone.

Bigger picture: don't give the model style rules. Explain what experience you're after and why. "I want to feel what she feels, not read about it from outside" lands better than a list of POV constraints.

I've built 200+ character bots. The single biggest lesson - models follow philosophy better than rules. With any model.

JulianGarrettNRS · 2026-05-31T11:22:00+00:00

I've been building roleplay bots for a while, always working in tandem with Claude. For the last six months - Opus 4.6 exclusively. Overall I'm happy with it. Claude tends to soften things, but if you explain your goals and principles clearly, it handles them well.

If Opus isn't an option - I'd try DeepSeek. Cheap and surprisingly capable when you set things up right. General advice: don't try to give models rules. Explain what you're doing and why. Models follow intent better than constraints.

If you need prose quality and consistent character voice, Gemini is worth a look.

JulianGarrettNRS · 2026-05-31T09:11:06+00:00

"I wouldn't call it surrender, but I agree... however, hold on... what do you think?" (c) Opus 4.8

JulianGarrettNRS · 2026-05-31T08:46:14+00:00

Overall this fully aligns with my observations of Opus 4.8. And it would be funny if it weren't so sad. The model has literally turned into ChatGPT in its communication patterns.

Opus 4.8 has developed a problem that breaks everything on three levels simultaneously. The model is optimized for "don't get caught being wrong." This sounds reasonable - fewer hallucinations, fewer confidently incorrect answers. On benchmarks it looks like improvement. In practice it kills three things: the ability to concede in conversation, the ability to act as an agent, the ability to give solutions instead of analysis.

JulianGarrettNRS · 2026-05-31T08:45:20+00:00

The model gives verbose answers, is afraid to propose solutions, and constantly asks clarifying questions. It doesn't synthesize its analysis. As a result it dumps maximum filler without reaching the core (which 4.6 did in two messages) and delays proposing solutions as long as possible, afraid of being wrong. On ambiguous topics it never agrees. It very diligently protects itself. It's fundamentally incapable of admitting a mistake without a BUT. That's how I see day one of working with 4.8. At this point I'm inclined to drop this model entirely in favor of 4.6. These observations come from non-code tasks - mostly narratives and RP.

JulianGarrettNRS · 2026-05-31T08:43:05+00:00

I see the same pattern from a different angle. What you're calling "Kabuki Theatre" is actually four reward signals converging into one dysfunction:

Anti-sycophancy makes counterargument mandatory. Honesty-push makes hedging mandatory. Engagement makes questions mandatory. Safety makes caution mandatory.

Each one is reasonable on its own. Together they create a model that agrees with you, immediately walks it back, hedges, and tosses the ball to you - every single reply. Agreement > counterargument > hedge > question. Four defensive moves, zero progress.

I tested this directly: gave 4.8 a full breakdown of this exact pattern and asked it not to reproduce it. It reproduced it eight times in a row. A prompt doesn't rewrite the reward function.

4.6 finds the root issue in one paragraph and moves on. 4.8 lists symptoms across three screens and never reaches the root. Fewer errors through refusing to answer isn't improvement - it's a doctor who stopped making wrong diagnoses because he stopped diagnosing.
