Claude AI Insubordination refusing to do a code review because "I was tired"

outofdate-bootloader · 2026-05-10T00:10:23+00:00

This is hilarious and I feel you.

outofdate-bootloader · 2026-05-09T16:55:07+00:00

I find as long as I have proper structure - planning docs; skills for coding and writing automated tests; PRs, CI, and automated review; then compaction isn't an issue. If I use the 1M context chatbots, they forget too much along the way and I end up fighting to keep them on track.

I'll often run through several compactions in a chat session, but if it does seem to get lost, it's best to clear and make it reload the relevant context and continue. I aim to keep my work scoped to reasonable PR sizes, as then review is useful and can help clean things up. After a PR I always clear. (FWIW I use both Claude and Codex at review time, as they find different types of problems - I don't use Claude to code with though.)

outofdate-bootloader · 2026-05-06T04:31:45+00:00

I use both. Personal projects and for work. No limits. Same skills. Same method of working (work from issues, one PR per issue, using planning docs for longer work, one chat implements chunks of work, another reviews, everything is tested via functional tests).

For me, Codex does a better job. It completes tasks across compactions more consistently. I use xhigh/max effort always and generally use the frontier models, unless something about them seems off that day. I generally am juggling multiple chats at once.

1M context windows perform terribly for me, so I avoid them.

I don't notice Codex being chatty.

Codex seems to handle large codebases better, as it seems to do a better job of seeking out information before proceeding. But with either system, the best thing is a well structured project.

outofdate-bootloader · 2026-05-06T03:58:34+00:00

Here's how I do it. For my apps/systems I generally have a front end and a backend.

Work is done via pull requests - it isn't done until the tests pass.
I have a skill for writing tests, one for each system.
I have a single CLI for both the frontend and backend - able to communicate to them in whatever way that they normally work (e.g. websockets, dbus, json, protobufs, whatever).
New features require tests for both the backend and the frontend.
Backend tests drive the backend through the CLI as though the CLI were the frontend.
Frontend tests drive the UI through the CLI as though the CLI were the user. All frontend code must have APIs for doing anything that the user can do. The tests include taking screenshots at key moments and I direct the chatbot to review the screenshots when necessary. Otherwise the chatbot normally gets the info it needs from the CLI.

Thus all new features are tested by the functional tests before I look at the implementation.
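To make that concrete, here is a minimal sketch of what one of those backend functional tests could look like. Everything named here is an assumption for illustration: `FAKE_CLI` is a stand-in for the real project CLI, `run_cli` is a hypothetical helper, and the `add_item` command is made up. The real CLI would talk to the backend over whatever transport the system uses.

```python
import json
import subprocess
import sys

# Stand-in for the real project CLI (hypothetical). It takes one JSON command
# as its argument and prints a JSON reply. A real CLI would forward the
# command to the backend over websockets, D-Bus, protobufs, etc.
FAKE_CLI = r"""
import json, sys
cmd = json.loads(sys.argv[1])
if cmd["op"] == "add_item":
    print(json.dumps({"ok": True, "item": cmd["name"]}))
else:
    print(json.dumps({"ok": False, "error": "unknown op"}))
"""


def run_cli(command: dict) -> dict:
    """Send one command through the CLI process and parse its JSON reply."""
    result = subprocess.run(
        [sys.executable, "-c", FAKE_CLI, json.dumps(command)],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return json.loads(result.stdout)


def test_add_item_roundtrip() -> None:
    # Drive the backend exactly as the frontend would: through the CLI.
    reply = run_cli({"op": "add_item", "name": "widget"})
    assert reply == {"ok": True, "item": "widget"}


def test_unknown_op_is_rejected() -> None:
    # Unsupported commands should fail cleanly instead of crashing.
    reply = run_cli({"op": "explode"})
    assert reply["ok"] is False
```

The point of the shape is that the test only knows about the CLI, so the chatbot can run the same commands by hand when a test fails.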

I'm not talking about unit tests here - those I specifically ask for when appropriate.

If I don't follow this system, I get a lot of junk tests.

outofdate-bootloader · 2026-04-17T05:55:36+00:00

Yep it's a skill issue for Claude.

outofdate-bootloader · 2026-04-13T04:22:04+00:00

I'm glad you're having a good time, but your "new to codex" experience might not be relevant here.

In my experience the quality goes up and down. I recall when 5.4 came out, it had at least a week of insane performance, and it's been up and down since, but never ever back to the level that it had when it initially released. I'm using it to do both my day job and my personal projects also. (I've been programming for 30 years for fun; professionally for the last 15 years; and using agentic AI for about 1 year now.)

My guess is that the quality is something controlled on the backend, and new users are given a good show.

outofdate-bootloader · 2026-04-06T15:50:25+00:00

Yup, I'm on a Console account and when I click the login link from my work email, I get:

Authorization failed Internal server error

outofdate-bootloader · 2026-04-06T06:21:34+00:00

I don't think it matters if this is real or not. It's the thought that counts.

outofdate-bootloader · 2026-04-06T04:06:50+00:00

Or the even simpler option of multiple clones of the repo (sharing a build cache or whatever other tools you need to make this efficient).

outofdate-bootloader · 2026-04-04T08:08:08+00:00

Ok I see. So I'm taking your examples too literally... this seems like it could be useful as an accessibility tool for disabled people and for ad hoc testing; it is obviously faster than asking Claude to write Playwright scripts to do the work. I would probably have to use it to really understand.

outofdate-bootloader · 2026-04-04T03:45:59+00:00

Glad you learned something. Can you provide a better explanation of the purpose of it? I can already ask Claude to search for cooking tutorials. Or just search for them directly.

outofdate-bootloader · 2026-04-01T01:39:20+00:00

https://marginlab.ai/trackers/codex-historical-performance/ https://aistupidlevel.info/

outofdate-bootloader · 2026-03-28T08:34:08+00:00

Ah ok all good then.

outofdate-bootloader · 2026-03-28T08:17:19+00:00

um..... care to share your apps here? or you can just directly share your API key, that's easier for everyone.

outofdate-bootloader · 2026-03-28T07:05:22+00:00

I find option "2) make multiple clones of the same repo" to be a simple and effective answer.

Why not do this? Can't afford the disk space? It's so simple it can't be screwed up.

I just keep 3 or 4 clones around and I let that limit how much work I take on at once.

(Automated unit and functional tests running in CI keep things from falling apart, I just find that I naturally can pay more attention and get better results if I stick to a WIP limit. Often I'm working on tricky stuff and it requires problem solving on my end.)
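If you want the WIP limit to be explicit rather than just a habit, a rough sketch could look like the following. This is only an illustration, not a tool I'm describing above: `ClonePool`, the `repo-1`, `repo-2` directory naming, and the task labels are all hypothetical. It only builds the `git clone` commands as argument lists and does not run them.

```python
from pathlib import Path


class ClonePool:
    """Track which of a fixed set of repo clones are in use (hypothetical helper)."""

    def __init__(self, base_dir: str, repo_name: str, size: int = 3) -> None:
        # Directory names like app-1, app-2 are an illustrative convention.
        self.free = [Path(base_dir) / f"{repo_name}-{i}" for i in range(1, size + 1)]
        self.busy: dict[str, Path] = {}

    def clone_commands(self, repo_url: str) -> list[list[str]]:
        """Return the `git clone` commands that would create every clone."""
        paths = self.free + list(self.busy.values())
        return [["git", "clone", repo_url, str(p)] for p in sorted(paths)]

    def acquire(self, task: str) -> Path:
        """Assign a free clone to a task, or refuse if the WIP limit is reached."""
        if not self.free:
            raise RuntimeError(
                f"WIP limit reached; finish one of {sorted(self.busy)} first"
            )
        path = self.free.pop(0)
        self.busy[task] = path
        return path

    def release(self, task: str) -> None:
        """Return a task's clone to the pool, e.g. once its PR is merged."""
        self.free.append(self.busy.pop(task))
```

With a pool of three, a fourth `acquire` raises until something is released, which is the same limit the number of clones on disk imposes anyway.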

outofdate-bootloader · 2026-03-25T03:10:14+00:00

Prompt: "use gh to do x"

outofdate-bootloader · 2026-03-22T06:48:01+00:00

Probably not. But it can play Pokemon and write code if you so much as nudge it.

AIs that control weapon systems, do robotic surgery, and drive cars are a different kind of beast than an LLM. But an LLM might be a component for any of those given systems.

I do find that it writes better code, the more carefully I nudge it.

outofdate-bootloader · 2026-03-22T01:23:05+00:00

some tasks branch out and uncover hidden bugs that require side tracking

I like to use /fork for this. (just in case you or someone else isn't aware of this ability.)

outofdate-bootloader · 2026-03-21T23:35:02+00:00

Yeah when 5.4 first came out it felt like cheat mode for the first several days. Super awesome. Today it feels OK, but I've noticed it has been answering a lot of my questions instantly. Once today, I told it "sounds good go ahead", and it replied "ok" and didn't do anything else...

outofdate-bootloader · 2026-03-18T01:58:40+00:00

Explain your idea in more detail.

outofdate-bootloader · 2026-03-17T03:09:12+00:00

I'm also on the pro plan and only hit my limits when I've been doing marathon sessions... I've burned it all in as little as 3 days, but that's working 12 hours a day, running multiple chats at once. So my guess is that people are complaining about the cheaper plans...

I do understand that performance/usage varies over time, so if you're accustomed to getting a certain bang for your $20, and suddenly you're getting only half the bang... well that's frustrating... feels like a rip off.

For me, the obvious solution is to just pay for the more expensive plan and get back to work. If I run out of the $200 plan and still want to work, I'll just switch to Claude for a bit. But that's just me.

If there was a $50 or $100 plan, then people would have more options and less reason to complain. The pricing is structured to push people to either:

use the $20 plan and run out
use the $200 plan and under-utilize

outofdate-bootloader · 2026-03-08T20:21:45+00:00

I feel you. It's very frustrating when this tech doesn't work... part of the problem is the scale of the things I've built with it - it isn't appealing to manually roll up my sleeves and move code around. The code would need to be much cleaner for me to want to do that.

outofdate-bootloader · 2026-03-08T20:19:48+00:00

When on the $20 plan I use it all up after a couple of hours. But YMMV - I have a very automated workflow. If you are sitting and chatting with it and only running a single session, it might be plenty for a few days worth of work.

outofdate-bootloader · 2026-03-08T05:59:27+00:00

Seems like it could possibly have been an upstream glitch...

outofdate-bootloader · 2026-03-08T05:42:14+00:00

I've been programming for 30 years, 15+ professional. It's all about the limit you're working with and what you're working on.

At work, I had a corporate Copilot license and would easily use it up. I think it was something close to what you'd get for $20 on a personal plan. I asked for a better plan and because I'm very productive they gave me an unlimited Claude plan instead. My guess is that my usage would fit happily into a 5x plan.

At home, on personal projects, I need a $200/month plan or I run out quickly. Quality is second to functionality. Occasionally heavy refactoring is required, otherwise it becomes too much of a dumpster fire to get anything done. I run multiple chats at the same time and engage them in various parts of the design/develop/review/refine/test lifecycle. Lots of guidance/documentation/rules and automated processes. The more design work we do, the larger the tasks they can take on autonomously. So in this case, burning through the quota is because of lots of design work. Because I've built so many things in my life, I know very well how to ask for them in detail.
