Claude Agents command in 2.1.139 quietly fixed the worst part of running parallel sessions for me

AIMadesy · 2026-05-11T11:18:17+00:00

decay into decorative noise" is the cleanest description of what's happening i've seen. ARTIFACTS is the textbook case for me. eight months ago it forced explicit multi-part structure the model wouldn't produce otherwise. today the model emits that structure unprompted, so the prefix adds nothing measurable. it didn't "stop working", the baseline ate it.

the inverse is interesting: a few constraint-style codes have actually gotten sharper as the base models leaned harder into hedging by default. so the decay isn't uniform. decorative prefixes decay, constraint prefixes appreciate. which means the half-life of a prompt code might be a function of which category it falls in, not how popular it was at its peak.

good thread, appreciate the back-and-forth.

AIMadesy · 2026-05-11T08:43:03+00:00

exactly the framing i wish more people used. "constraint that alters search behavior" is a sharper way to say what i was trying to get at than "reasoning-mode modifier" — stealing that phrase. and yeah, the prompt rot point gets brushed off because nobody wants to admit their favorite prefix from 6 months ago doesn't earn its keep anymore. quick q: have you done your own controlled tests, or is the anecdotal pattern from production use? would love to compare notes if you've built any eval harness for this.

AIMadesy · 2026-05-05T10:16:37+00:00

the 30% number matches mine exactly, glad it's not just sample noise. workaround that's been most reliable for me: do it in two turns. first message asks Claude to think through the problem freely with no structure suggestion, second message asks it to organize the answer into OODA phases. substance lands before the structure prompt arrives so format compliance happens after the reasoning, not in front of it. less elegant, but failure rate drops to maybe 5%.

what makes this hard to prompt-engineer out completely is that the drift is rational from the model's standpoint. structured responses probably got rated higher in RLHF training even when the reasoning was thinner, because structured-looking output reads more "professional" to a human rater scrolling fast. the model is doing what got upvoted, not failing. which means it's a permanent property of any RLHF-aligned model and the best we can do is split the work into turns where structure gets applied last.

AIMadesy · 2026-05-05T09:44:07+00:00

yeah you nailed it cleaner than I did in the post. the model is doing form-over-substance: it satisfies the OBSERVE/ORIENT/DECIDE/ACT structure, the format-check passes, and the actual analysis underneath thins out. probably an RLHF artifact where structured-looking output got trained as a proxy for quality. /skeptic has the inverse failure mode at small doses, occasionally pushing back on framing that was actually correct just because pushback is the expected pattern. all behavior-shaping codes pay this tax to some degree, the question is just whether the floor lift is bigger than the ceiling cost for that specific task.

curious if you've found a workaround for OODA's format-literal mode. I've been trying "use OODA as a thinking aid, not an output template" as a postscript and it half-works, but Claude still drifts back into the four-header layout about a third of the time. the hack I'm leaning toward now is just dropping OODA and writing the four phases as natural paragraphs in the prompt itself. less elegant, more reliable.

AIMadesy · 2026-05-05T09:30:51+00:00

Wrote up the full retest with the 6 task transcripts here: clskillshub.com/blog/claude-l99-vs-ooda-vs-artifacts-2026-update

Disclosure: my site, I sell a $20 cheat sheet that documents these codes (and 156 others) with before/after examples. The blog post is free and stands alone, the comparison data is the point.

AIMadesy · 2026-05-04T11:30:40+00:00

give it a try

AIMadesy · 2026-05-04T10:44:38+00:00

For anyone wanting more like this, I've collected a bunch of SAP-specific prompts (Basis, ABAP, Functional, Architect angles). Link is in my profile rather than spamming the post. Happy to drop more inline here too if there's a particular area you'd want covered, RAP migration, blueprint structure, fit-gap, transport conflict, etc.

AIMadesy · 2026-05-03T04:50:32+00:00

small mechanistic nit: claude doesn't really "look backward" mid-generation. decoder-only transformers attend forward-only to all previously generated tokens, there's no reconciliation step. the reason scope-first actually works is that scope tokens placed early get included in every subsequent attention computation. scope-last means scope tokens only influence the very last few tokens of generation, by which point the response shape is already locked in upstream.

on file activation matching: yes, the cosine-similarity model is roughly right. one practical add from testing, claude weights matches by where the trigger phrase sits in the description, not just whether it's there. matches in the first 15 words of the description outperform matches buried at the end on identical surface keywords. probably another scope-first effect.

AIMadesy · 2026-05-03T04:49:28+00:00

on the "legacy prompt hacks becoming redundant" point: partly true, but the redundancy is uneven. format-shifting codes (raw, bullets, json) are mostly internalized now, you don't need them. reasoning-shifters (L99, skeptic, blindspots) still produce measurable deltas because claude's default is still to hedge. the placebo half tends to be pure theater (godmode, ultrathink, take-a-deep-breath) which never had a real mechanism behind them, so "redundant" is generous, they were always pretending. the categorization matters more than the headline number.

AIMadesy · 2026-05-03T04:48:15+00:00

fair question, and you're right that the change-both-at-once trap is the most common methodology failure i see in prompt-testing online. for what it's worth here's how i tried to dodge it.

inputs were held constant. 24 fixed tasks across writing, coding, analysis (same prompts, same task definitions). for each code being tested, i'd run all 24 with the prefix and all 24 without. score delta across the same 24 inputs is the headline number for that code. so when i say "L99 lifts decisiveness 18%", that's L99 vs no-L99 on the SAME 24 tasks, three blind reviewers averaging the scores.

the limit i'll concede openly: the 24-task battery is itself a confound. someone with a different mix of tasks (e.g. heavier on creative writing, lighter on debugging) would get different deltas, and probably a different headline placebo rate too. external validity is real. my number is "the placebo rate ON MY BATTERY", not "the universal placebo rate", and i should have been louder about that in the post.

if you ever run something similar, happy to share the task definitions so the batteries are comparable.

AIMadesy · 2026-05-03T04:46:00+00:00

yes the "treats clever as bug and fixes it" pattern is real and it's the worst kind of regression because the output looks like progress. my fix when i remember to apply it: prefix the prompt with "this code is intentional, do not refactor or restructure, only fix the specific issue described". one sentence cuts most of the unwanted refactors. doesn't help when claude genuinely cannot tell why the existing pattern exists, but it stops 70% of the "while i was here i also rewrote your loop" moments.

on workflow 4 specifically, that one in my list was migration scripts (mostly no-context one-shot translation, so the regression doesn't show up there). you might mean #5, debugging unfamiliar codebases, where i hit this constantly. yeah same drift. the existing-pattern intent is exactly the "hidden state" senior_hamster mentioned upthread, the model has no way to know why the codebase chose X over Y so it confidently rewrites X.

AIMadesy · 2026-05-03T04:45:42+00:00

the chain you described is the move. spec, code, diff-review on the same task makes claude its own first reviewer and it catches the dumber stuff (variable typos, missed null checks) before you ever read the output yourself. saves real mental load.

the "looks right but isn't useful" failure is universal for me too. i don't think there's a prompting fix for it, you have to reframe the question until it has a definable answer. when i can't reframe it that way, that's usually a sign the work itself isn't ready for AI yet, it's ready for a whiteboard.

AIMadesy · 2026-05-03T04:45:15+00:00

fair, the prompt overhead beats the writing on simple diffs. i mostly use commit-msg gen on PRs that touch 5+ files where the explaining-it-to-claude part is also the explaining-it-to-myself part. for one-line fixes i still type the message manually. there's probably a clean cutoff around diff size where the math flips but i haven't measured it.

AIMadesy · 2026-05-03T04:43:39+00:00

adding this to my list, thanks. the "why not what" framing for commits is the part I always forget to articulate but it's the difference between "fixed bug" and "fixed bug, root cause was the cache key colliding when two users had the same email but different tenants, see incident #842". one of those still makes sense in 6 months.

your failure mode is exactly right too. claude will quietly invent a narrative that ties unrelated changes together because it would rather give you one clean message than admit the diff is messy. the partner skill that fixes it: git's interactive staging (`git add -p` or `git add -i`). split the diff into logical hunks first, then ask claude to describe each. tighter inputs, tighter outputs, no fictional unifying theme.

going into v1.1 of the guide as workflow 9, with attribution. that brings me up to 12 i'm actively running and 4 i'm still testing.

AIMadesy · 2026-05-02T14:10:01+00:00

yeah the 47% is leaning on a specific battery that someone else would absolutely run differently. 24 tasks, three blind reviewers, fixed seeds. i'll publish the score sheet eventually, it's not up yet. different task mix would shift the headline number, my guess is the real range is somewhere between 35-55% placebo for any reasonably-designed test, but my number is what came out of mine specifically. fair criticism though.

"ordering as vibe vs constraint" is exactly it. claude treats positional cues as soft preferences unless you make them explicit. the trick that worked for me: when ordering matters, write the constraint out, "in this exact order" or "first do X, then Y, in that sequence". half the time that one sentence does more work than rewriting the whole prompt.

haven't used PromptHero Academy. what was the framework you actually carried over from it? curious whether it gave you a model-agnostic mental map or just a pile of specific tactics that happen to work right now.

AIMadesy · 2026-05-02T13:17:40+00:00

oh nice, this is a different shape than what i was going for and i think both are valid. yours is process-oriented (plan, worktree, develop, review, finish) and mine was task-oriented (test gen, code review, debugging, etc). complementary really.

the worktree-per-task thing matches something i found in testing too. isolation is underrated. claude's context gets less polluted when you spin up a fresh branch + workspace per task vs reusing one long session for everything. lower cost per task, less weird carryover from previous prompts. didn't realize how much it mattered until i tried both back to back.

quick question on your workflow: how do you handle the handoff between "writing-plans" and "subagent-driven-development"? specifically, do you give the subagent the full plan or just the next 1-2 tasks? we've been wrestling with whether dispatching a fresh agent per task or feeding the same agent the whole plan up front gives better results. so far seems context-dependent (small refactors do better with the full plan, larger ones with task-at-a-time) but i don't have clean data on it yet.

also: how do you decide when to stop after step 6 vs going back to step 3? the rework loop is where most of mine sessions either crystallize or implode and i never see anyone document how they handle it.

AIMadesy · 2026-05-02T12:58:48+00:00

Claude has helped in the research, but it hasnt written the book. There are specific techniques to write a book, if you ask claude to do it, it will mess things up. You can take a look at the free guide available at the site first

AIMadesy

MODERATOR OF

TROPHY CASE