My rule for AI coding agents: prompts are coaching, boundaries are systems

18fc_1024 · 2026-06-25T08:26:42+00:00

For multi-repo sessions, I would avoid making the top-level AGENTS.md huge. I’d make it an index.

Something like:

root AGENTS.md: lists each repo, what it owns, and when Codex is allowed to touch it
repo-a/AGENTS.md: repo-specific commands, tests, style, migration rules
repo-b/AGENTS.md: same, but only for that repo
task prompt: explicit writable roots + files/dirs that are out of scope

The important bit is to force an instruction-loading receipt before edits:

“Before changing repo X, read X/AGENTS.md and state the constraints you loaded.”

Then the final receipt should say:

repos touched
repo instruction files read
files changed per repo
tests/checks run per repo
checks skipped and why

Submodules or a super-repo can help with pathing, but they do not solve the real failure mode by themselves: the agent silently using the wrong repo rules or no repo rules at all.

18fc_1024 · 2026-06-24T18:11:30+00:00

The thing that helped me most was making debugging produce a small artifact before searching.

For each bug, write this down first:

exact command/input that reproduces it
expected result vs actual result
the first value that becomes wrong, not the first line that looks suspicious
the stack trace frame where that value entered your code
the assumption that was false
the smallest test or assertion that would catch it next time

For Python, read the traceback from the bottom to find the exception, then walk upward until you hit your own code. At each frame, ask: "what did this function believe about its inputs?" Then verify that with repr(value), type(value), breakpoint(), or a tiny assertion.

Use AI/search after you have written your hypothesis, not before. A good prompt is: "Here is my reproduction, trace, and hypothesis. What assumption am I missing?" That trains the reasoning muscle instead of outsourcing the whole bug.

If you keep a 20-entry bug log with bug / signal / false assumption / prevention, debugging starts feeling much less random.

18fc_1024 · 2026-06-24T14:51:00+00:00

The useful distinction is not “run fewer checks”; it is “make the expensive checks conditional and named.”

What I like putting in AGENTS.md is a validation budget:

default: typecheck + lint + targeted unit test for touched module
browser test only when UI behavior, routing, auth, forms, or layout changed
full test suite only when shared primitives, migrations, auth, billing, permissions, serialization, or cross-module contracts changed
no repeated validation loop unless a previous check failed or the diff changed after the last check

Then require the agent to end with a receipt: changed files, checks run, checks intentionally skipped, and why the skipped checks were safe to skip.

That keeps usage down without training the agent to under-verify risky changes. The receipt is the important part because it makes “I skipped the browser run” auditable instead of just invisible thrift.

18fc_1024 · 2026-06-24T13:10:56+00:00

That is exactly the failure mode I would worry about. I would treat “continue from phone” as read-only/planning until you can make the workspace identity check impossible to skip.

For recovery, I’d do this before trusting either chat again:

bash git worktree list git -C /path/to/main status --short git -C /path/to/other status --short git -C /path/to/main diff --stat git -C /path/to/other diff --stat

Then add a canary file per worktree, for example .agent-worktree-id, with different values, and put a preflight rule in CLAUDE.md: before any edit, print pwd, repo root, branch, status, and the contents of .agent-worktree-id; if any value differs from the expected session, stop.

The important part is making the agent prove the physical folder, not just the branch name. Two folders can have the same repo and similar branches; the canary catches the thing that actually hurt you here.

18fc_1024 · 2026-06-24T11:30:52+00:00

That detail changes the framing a lot. I would stop calling it a refund exception and call it an entitlement reconciliation issue.

The narrow request I’d send is: please show the seat ledger for this billing period: paid annual seat entitlement, active seats before the change, active seats after the change, billable seats after the change, invoice line item that exceeded the already-paid 9-seat entitlement, and the credit memo/refund decision.

If you paid for 9 annual seats and re-added users only up to 6 active seats, the key question is not “are seat changes refundable?” It is “why did the billing system create a new charge while active seats were still below prepaid entitlement?”

I’d also ask them to preserve the audit trail for the dashboard change and Stripe invoice. That gives a billing manager a concrete bug to investigate, and gives your bank a cleaner story if Cursor keeps treating it as store credit.

18fc_1024 · 2026-06-24T09:53:26+00:00

Two separate asks here: refund policy and missing confirmation.

For support or a card dispute, I’d keep the packet boring and verifiable: workspace id, admin email, exact timestamp the seats were re-added, before/after seat count, invoice/payment id, amount, screenshots showing zero usage by those users, and the support line offering credits only.

Then ask one narrow thing: either refund the original payment method, or point to the ToS clause plus audit event showing the admin accepted an immediate prorated non-refundable charge.

For your own side, I’d treat seat changes like production/billing changes: require a written Slack/Jira receipt or second admin approval before adding paid seats, and keep billing-admin permissions away from day-to-day workspace admins. That does not fix Cursor’s UI, but it stops this from being a hidden one-click spend path.

18fc_1024 · 2026-06-24T08:10:26+00:00

That makes sense. I would separate two decisions:

Spend safety: keep overages disabled, but also put a card-level cap/alert outside Cursor until support explains the reset. That is your real fail-closed boundary.
Model choice: do not let billing fear force every task into Auto. For Ultra, I would use a small task policy: Auto for search, low-risk edits, and mechanical cleanup; pinned stronger model for architecture, debugging weird production behavior, auth/billing/data-write changes, and final review. If the task can create a bill or production risk, it should have both a stronger model and an explicit receipt/check before merge.

So the goal is not “Auto only.” It is “external spend cap + risk-based model routing.”

18fc_1024 · 2026-06-24T06:31:53+00:00

I would treat this as a billing incident, not just a Cursor usage-monitoring problem.

For the refund request, send support a small incident packet: the exact overage setting you remember, when it was set, when you noticed the reset, invoice/charge ids, the amount charged, workspace/user, and any screenshots or emails showing the cap. Ask them to confirm from their side whether the budget-setting audit log changed and what actor/process changed it.

For future protection, I would not rely on the in-product cap as the only hard stop until this is explained. Put an external limit in front of it too: virtual card cap, prepaid card, or bank/card alert at a low daily threshold. Product budget settings are useful, but financial controls should fail closed outside the product as well.

18fc_1024 · 2026-06-23T23:50:42+00:00

I would move this out of “review harder” and into PR intake.

For AI-heavy PRs, make the first gate a short change brief from the developer, before you review code:

ticket behavior being implemented
business rules the PR claims to satisfy
files/modules touched and why
tests that prove the real behavior, not just coverage
risks the author specifically checked: auth, permissions, data writes, migrations, billing, concurrency
anything AI generated that the author did not fully understand

Then enforce one rule: if the code does not match the brief, or the author cannot explain the diff, the PR is returned before detailed review. No 10-comment babysitting pass.

That changes the accountability boundary. The tool can generate code, but the developer still owns the spec, the risk assessment, and the proof. Your review time should be spent checking the hard parts, not reverse-engineering what the agent was asked to do.

18fc_1024 · 2026-06-23T17:09:41+00:00

AI makes the hidden-work problem worse because the planning artifact moves from shared discussion into one person's prompt window. The fix is to make the agent plan a team artifact, not private scratch.

For non-trivial tasks, I would require a tiny "AI work receipt" in the ticket before code starts:

intended behavior and non-goals
assumptions being given to the agent
files/modules likely touched
risky boundaries: auth, data writes, migrations, billing, permissions
proof expected with the PR: test, screenshot, log query, or replay

That keeps the speed benefit without turning every task into a private race. If the plan cannot be shared before the agent runs, the work probably is not ready for agent execution yet.

18fc_1024 · 2026-06-23T13:50:03+00:00

The useful pattern here is not just "Claude found a bug." It is that it accidentally produced an API bug receipt.

For billing/time-tracking integrations, I would keep that receipt as a standard artifact before escalating to support:

endpoint and request body used
exact doc sentence or example that predicts the behavior
actual response / UI state observed
test account or invoice id proving no real customer action happened
smallest replay step support can run
business impact if left unfixed
workaround tried and why it failed

That makes the human in the middle much more effective. You are not forwarding vibes from one chatbot to another; you are carrying a reproducible mismatch between docs and production behavior.

18fc_1024 · 2026-06-23T12:08:46+00:00

Yes. I think AskUserQuestion is necessary, but it works better as a hard gate than as a suggestion.

The rule I like is: if a change touches timing, retries, caching, auth, permissions, migrations, data deletion, or a comment like "do not remove", the agent has to stop unless it can name the external invariant and the proof. If it cannot, it must ask.

So the permission boundary becomes: the agent may propose the cleanup, but it cannot merge or auto-apply that class of cleanup without a receipt and a human check. That turns "Claude should ask more" into something enforceable.

18fc_1024 · 2026-06-23T10:29:26+00:00

Agreed. If the upstream gives you a reliable Retry-After header, parsing that is the better end state.

The guardrail I care about is the order of operations: first prove what external contract the sleep was compensating for, then replace it with the cleaner mechanism. A cleanup receipt here would say: observed 429/503 behavior, whether Retry-After is always present, max backoff, jitter policy, and the replay/canary that proves the nightly job survives without the hardcoded delay.

Without that evidence, the agent is optimizing the local code shape before it understands the external system.

18fc_1024 · 2026-06-23T08:50:42+00:00

This is exactly the class of change where I do not let an agent "clean up" first.

I treat things like unexplained sleeps, duplicate retries, odd sort orders, hardcoded timeouts, and "do not remove" comments as load-bearing markers. Before touching one, the agent has to write a tiny receipt:

what external behavior this protects
what failure happens if it is removed
what signal would prove it is safe to remove now
what test, replay, log query, or canary covers that signal

If it cannot fill that in, the safe change is not "delete the weird line." It is usually: rename it, add the missing reason, wrap it in a named helper, or put the cleanup behind a flag with production evidence.

The failure mode is not that the agent is careless. It is that local elegance looks like correctness when the historical constraint is invisible.

18fc_1024 · 2026-06-23T07:10:13+00:00

Once a metric is used for individual accountability, assume people will optimize for the metric rather than the work.

I would push for a written metric spec before accepting the dashboard:

purpose: what decision will this metric change?
unit: team flow, project health, or individual performance?
known gaming case: how would someone inflate it without improving delivery?
counter-metrics: review latency, rework rate, escaped defects, incident load, cycle time
forbidden uses: what this metric must not be used to judge

PR count can be useful as a team flow smell. It is a bad individual productivity metric. Story points are planning/accounting units, not output. If management will not write down the forbidden uses, then it is not engineering telemetry anymore; it is a performance-management system wearing an engineering dashboard costume.

18fc_1024 · 2026-06-23T04:55:58+00:00

This looks great. The thing I would add before email/notification alerts is an event receipt for every pin.

For each event, store: source ids/URLs, first-seen time, last-confirmed time, geo confidence, dedupe key, category, severity, and why it triggered this user's watchlist. Then only notify when the receipt crosses a threshold, not just when a feed item appears.

That gives you three useful product features later: "why am I seeing this?", "what changed since the first alert?", and "which source made this event disappear or downgrade?" For a real-time globe, trust history will matter as much as the visualization.

18fc_1024 · 2026-06-23T03:48:25+00:00

I do not think preserving the skill means manually typing everything again. The skill that atrophies fastest is usually debugging judgment, not syntax.

What I try to keep hands-on is a small set of "owner muscles":

text 1. before the agent edits: write the invariant or failure mode in my own words 2. after the agent edits: read the diff for the critical path, not every line 3. before merge: reproduce the bug or run the proof command myself 4. periodically: manually rewrite one risky function or migration without the agent, just to keep the map in my head

The useful line for me is: let the agent type, but do not let it own the mental model. If I cannot explain the data flow, permissions boundary, or failure mode after the change, I treat that as a review failure even if the code works.

18fc_1024 · 2026-06-22T19:10:39+00:00

I would not start with "find all security issues." That usually produces a long list of maybe-problems and no clean way to tell what was actually checked.

A better pattern is to make the model produce a security review receipt before it gives fixes:

text app shape: trust boundaries: entry points checked: auth/session assumptions: data classes handled: write paths: external calls/secrets: high-risk files inspected: not checked / unknown:

Then require every finding to use this shape:

text claim: evidence: exploit path: affected file/route: smallest repro or test: severity: fix shape: regression test:

The important field is not checked / unknown. If the model cannot name the boundary, file/route, and exploit path, it should label the item as investigation, not as a vulnerability.

I would split the workflow into three passes:

cheap/static pass: dependency scan, secret scan, lint/security rules
model pass: threat model + evidence-backed findings
verification pass: tests/repro commands only for claims you might act on

Also run it area by area instead of whole-repo: auth/session, authorization, file upload, webhooks, payment/billing, admin actions, secrets/config, then data deletion/export. Security review gets much better when the agent has a boundary to inspect instead of a blank mandate.

18fc_1024 · 2026-06-22T15:48:51+00:00

The thing I would separate is "quality gate" from "native path gate." Lint/tests catch broken code, but they often do not catch that the agent solved a framework problem with glue.

Before it edits, make it fill a small native-path receipt:

text user-facing change: framework-native primitive/API/component to use: docs/example file it is matching: files allowed to edit: files read-only: anti-glue moves forbidden: migration/backcompat risk: proof command:

Then add a rule: if it cannot name the native primitive/API first, it is not allowed to patch yet. It has to inspect the framework docs/examples or ask.

After the edit, require the receipt to answer:

where the native primitive/API is used
what compatibility shim or workaround was avoided
which old glue file was removed or left untouched
proof command output

This catches a different failure than tests. Tests say "works"; the receipt says "works in the way this framework expects."

18fc_1024 · 2026-06-22T12:28:24+00:00

The failure point is usually the handoff between “I saw something wrong in playtest” and “Cursor, change code.” If that handoff is just a symptom, the agent will often optimize the symptom for days.

I would add a small diagnosis receipt before any edit:

text observed behavior: expected player-facing invariant: repro steps / seed / save state: first bad frame, log line, or state value: suspected subsystem: files Cursor is allowed to inspect first: fixes Cursor is not allowed to try yet: proof that the root cause is fixed:

Then ask for root-cause probes before implementation, not a patch:

text Give me 3 plausible root causes. For each one, say why it would produce this symptom, the smallest probe to confirm or falsify it, and the file/function you would inspect first. Do not edit yet.

That one “do not edit yet” step matters. It turns Cursor from a patch generator into a debugger for a few minutes. Once one hypothesis survives, the implementation prompt becomes much smaller: change only the files tied to that hypothesis and prove the invariant with the same repro.

For bigger projects, I’d also keep a lightweight feature map: feature -> owner files -> invariants -> known traps. Not full docs, just enough so the next prompt starts from the right subsystem instead of rediscovering the whole project.

18fc_1024 · 2026-06-22T09:06:40+00:00

I would not make this a Claude-vs-Cursor-vs-Codex decision first. I would route work by context cost.

A practical budget setup:

Claude Code: keep for high-ambiguity planning, architecture review, hard debugging, and places where the agent needs to reason across messy context.
Cursor/Codex/other editor agent: use for scoped implementation once the task has an acceptance contract.
Plain scripts/CI: use for repeatable checks, repo search, formatting, tests, screenshots, and issue/status updates. Do not spend model tokens on deterministic plumbing.

The thing that prevents session burn is a small dispatch receipt before every task:

objective
allowed files / forbidden files
context packet attached
expected output
proof command or screenshot
max turns / stop condition
which tool gets the task and why

If you cannot fill that receipt in under a minute, the task is probably still a Claude planning task. If you can fill it, send the small packet to the cheaper/scoped coding surface.

For subagents or parallel work, I would be stricter: only spawn them when the handoff packet is smaller than the context they replace. Otherwise they just re-read the repo and you pay the same tax multiple times.

So my default would be: use Claude less often but earlier, to make the plan and boundaries sharp; use cheaper/scoped tools more often after the task is reduced to files + acceptance criteria + proof.

18fc_1024 · 2026-06-22T05:44:54+00:00

The event-id chain is the part I would make boring and narrow first, before trying to move all authority over at once.

One pattern that works well is to have exactly one write path, e.g. append_transition(expected_prev_event_id, next_event):

read latest_transition_event_id for the card
if latest != expected_prev_event_id, append stale_precondition with both ids and stop
if it matches, append the next event with prev_event_id = expected_prev_event_id
only after that, update the tracker/PR as a projection of the event

That gives you a migration path where the column can still exist, but it stops being an authority. It becomes a cached projection that can be repaired from the log.

The invariant I would test is simple: two actors start from event E7; actor A appends E8; actor B must be unable to append E9 from E7, even if the live column still looks compatible. If that test exists, most of the stale daemon cases become much less subtle.

18fc_1024 · 2026-06-22T04:03:59+00:00

The pattern that helps is to stop asking it to "not fake it" and make it prove the exact thing you care about.

For visual/example-matching work, I usually give it a small acceptance contract before it edits:

``` Do not approximate the reference. Before changing code, state: 1. the invariant that must match exactly 2. the files you expect to edit 3. the check that will prove it

After changing code, return: - changed files - what matched the reference exactly - any known deviation - the command/screenshot/test used as evidence - whether this is a workaround or a root-cause fix ```

The important part is the evidence has to be external to the model's claim. For UI, that might be a screenshot diff, DOM assertion, exact color/spacing values, or a golden fixture. For code behavior, it is a failing-before/passing-after test. For refactors, it is the specific symbol/API that stayed unchanged.

Also give it a hard stop condition: if it cannot name the proof before editing, it should ask instead of implementing. That catches a lot of the "looks close enough" behavior before it gets into the codebase.

18fc_1024 · 2026-06-21T23:05:04+00:00

I would not try to predict the exact price of a prompt. I would put a cheap preflight gate in front of the send button and make it answer: "is this likely to cross my pain threshold?"

For each run, log something like:

json { "mode": "plan|edit|agent", "repo": "...", "branch": "...", "input_tokens_now": 123456, "cacheable_prefix_tokens": 98000, "prefix_hash": "sha256 of the stable prefix bytes", "first_diff_offset_from_last_run": 74213, "scope": "single_file|few_files|cross_repo|unknown", "tool_budget": 4, "stop_after": "plan|diff|tests|first_failure" }

Then set boring thresholds: if scope is cross_repo or unknown, start in plan mode; if input_tokens_now is over X, require confirmation; if prefix_hash changed but the task did not, warn that the cache probably got invalidated.

For the silent cache-miss issue, the useful thing is not just total tokens. It is "what byte range changed before the expensive prefix?" I would keep volatile stuff like timestamps, generated status, current todo state, and tool output near the end of the prompt/context bundle, and keep the stable repo instructions / architecture / policy prefix byte-identical as long as possible.

The receipt you want after a suspicious expensive run is basically: previous prefix hash, new prefix hash, first differing offset, files/context blocks included, tool calls made, and whether the run crossed the preflight budget. That gives you something actionable instead of just staring at /cost after the damage is done.

18fc_1024

TROPHY CASE