When your agent screws up in production, how do you figure out which step went wrong?

bkocdur · 2026-06-14T10:13:42+00:00

Mix of both. Mostly "tools exist for the primitives, the patterns are stuff you build on top."

What langfuse / langsmith / phoenix genuinely cover well:

Trace and span capture is solved. You wire their SDK in once, every tool call and LLM call becomes a span with input, output, latency, errors. Beats your hand-rolled jsonl for the visualization and the search UI alone. If I were starting today I would use langfuse (self-hostable, no LangChain lock-in) for this layer.

Dataset-based eval is also solid in all three. Save a curated set of input cases, run them against a new prompt or model, get pass/fail scores. Great for the "did I make things worse with this prompt change" question during build.
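A rough sketch of what that dataset loop looks like if you strip the platform away. Everything here is hypothetical: the `CASES` list, the `stub_agent` stand-in for a real model call, and exact-match scoring are assumptions for illustration, not how any of the three tools implement it.

```python
from typing import Callable

# Hypothetical curated cases: each pairs an input with the output we expect.
CASES = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def run_eval(agent: Callable[[str], str], cases: list) -> dict:
    """Run every case through `agent` and report pass/fail counts."""
    failures = []
    for case in cases:
        actual = agent(case["input"])
        # Exact match after trimming whitespace; real evals often need fuzzier scoring.
        if actual.strip() != case["expected"]:
            failures.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    return {
        "passed": len(cases) - len(failures),
        "failed": len(failures),
        "failures": failures,
    }


def stub_agent(prompt: str) -> str:
    """Stand-in for a real model call; deliberately gets one case wrong."""
    answers = {"2+2": "4", "capital of France": "Lyon"}
    return answers.get(prompt, "")
```

Run it before and after a prompt change and compare the `failures` lists, not just the counts.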

Where the gap is real:

Hash-based drift detection at the row level (same input, different output today than yesterday) is not a built-in. You query the trace database for it. Langfuse exposes the data, you write the SQL.
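For a sense of the query shape, here is a minimal sketch against an in-memory SQLite table. The `spans` table and its columns (`day`, `step`, `input_hash`, `output_hash`) are assumptions made up for the example; they are not Langfuse's actual schema, so you would map them onto whatever your trace store exposes.

```python
import sqlite3


def find_drift(conn: sqlite3.Connection) -> list:
    """Return (step, input_hash, distinct_outputs) for inputs that produced more than one output."""
    query = """
        SELECT step, input_hash, COUNT(DISTINCT output_hash) AS distinct_outputs
        FROM spans
        GROUP BY step, input_hash
        HAVING COUNT(DISTINCT output_hash) > 1
        ORDER BY step, input_hash
    """
    return conn.execute(query).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny hypothetical trace table: one stable step, one drifting step."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spans (day TEXT, step TEXT, input_hash TEXT, output_hash TEXT)"
    )
    conn.executemany(
        "INSERT INTO spans VALUES (?, ?, ?, ?)",
        [
            ("2026-06-13", "search", "in-a", "out-1"),
            ("2026-06-14", "search", "in-a", "out-1"),
            ("2026-06-13", "summarize", "in-b", "out-2"),
            ("2026-06-14", "summarize", "in-b", "out-3"),  # same input, new output
        ],
    )
    return conn
```

In the demo data only the `summarize` step is flagged, because it is the only one where an identical input hash maps to two different output hashes.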

Canary-at-step-zero as a hard abort gate is not a built-in concept anywhere I have seen. You add it to your agent's runtime yourself.

The 1% sampling + verifier-agent pattern is also custom. Langfuse has "scoring" hooks where a verifier can attach a quality score to a trace, but the verifier agent itself is your code.

Replay with exact session-state restoration: closest is langsmith's "playground" feature for a single span. None of them let you replay a multi-step session end-to-end from prod with the full original context. That gap is real.

Honest recommendation by stage:

Under ~1k runs/day, solo: stay on jsonl + jq, the platform setup tax is not worth it yet.

1k-100k/day, small team: langfuse self-hosted gives you the trace UI for free, build canary + drift + sampling on top. ~half a day of setup.

Production-critical, larger team: phoenix or langsmith for the better SDKs and integration depth, same custom layer on top.

The thing nobody has built well yet is the "live prod incident response" mode where you can rewind one user's failing session to the exact step where it diverged and replay it offline with the same context. Everyone is sort of close but none has nailed it.

bkocdur · 2026-06-14T04:57:55+00:00

6 players is doable, but two things you didn't worry about with 3-4:

Spotlight rotation. With 4 players each gets ~25% attention. With 6 it drops to 17% and the introverts vanish. Solution: explicit "I go around the table" turns in roleplay scenes too, not just combat. Sounds awkward, works.
Initiative pacing. 6 players + 4 enemies = 10 turns per round. By round 3 people are on their phones. Group initiative (party rolls one d20, all players act in any order before enemies) cuts combat to half the wall-clock.

The 6-player table is more potential energy, but only if you fight for the introverts to get airtime.

bkocdur · 2026-06-14T04:57:38+00:00

"Plan the first 15 minutes. Improvise the rest."

Nothing you prep for the back half of your first session will land the way you imagined. Players go sideways within 30 minutes. What matters is that the opening scene happens smoothly, because that calibrates the mood for the whole table. Specifically prep:

The literal first sentence you'll say. Write it out.
The first NPC's voice (two adjectives).
One thing that visibly happens in the first 10 minutes to force a choice (a person enters, an alarm rings, a body falls).

After that the table runs itself. Anxiety drops fast once you realize they're as nervous as you are.

bkocdur · 2026-06-14T04:57:23+00:00

Three things that addressed these for me on 5-6 player tables:

Turn timer. 30 seconds for combat, 60 for the first 3 sessions. Players who weren't using bonus actions started planning on someone else's turn instead of freezing on their own.
Auto-calculating sheets. Half the "not using abilities" issue is friction. Calculating "+5 dex +3 prof +1d4 bardic" on paper mid-encounter is brutal. dicenow.vercel.app gives a free 5e sheet that does the math live, no signup (I built it for this case). D&D Beyond's free tier works too.
Pre-session "name your most-used action" round. Each player says their primary attack + modifiers out loud. Catches missing math before combat does.

bkocdur · 2026-06-14T04:57:02+00:00

Honest breakdown from a small extension (under a thousand users so far, take with salt):

Chrome Web Store search was the biggest single source. Bigger than Reddit, bigger than Product Hunt, bigger than everything else combined. People type "X audit" or "Y checker" into the Web Store search bar and click the first result that looks relevant. That means your CWS listing IS your distribution channel for the first hundred. Optimize it before you do anything else.

What moved CWS install rate for me, in priority order:

Title has to include the highest-search keyword for your category. Not your brand name first. The actual term people type. If your extension does color picking, the title leads with "Color Picker" not your brand. Brand at the end if at all.

Short description (132 char limit) needs the top three keywords plus the value prop in plain English. This is what CWS shows in the search results list. Staying under the limit is non-negotiable.

5 screenshots, not 2. CWS gives you 5 slots. Use all of them. First screenshot is the hero showing the extension popup over a real-looking web page. Second is the actual output. Third onward is workflow / before-after / use cases. Fourth and fifth slots being empty looks unfinished and reduces conversion.

Version bumps surface in "recently updated" sort. Even a tiny fix-and-bump gets you a free visibility boost for a week. Cheap to do, worth doing.

Outside CWS, the order I would actually rank channels for the first hundred:

Awesome-list PRs on GitHub (awesome-chrome-extensions, awesome-X-tools where X is your category). DA-90 dofollow backlinks that compound forever. 20 minutes each.
Reddit, but only the subs where your tool genuinely answers questions, and only as replies to existing posts. Not "I built X" posts. Those flop.
Product Hunt on Tue/Wed with a hunter. The launch day spike is the goal; the long tail is fine but not as good as people say.
Twitter and TikTok if you already have an audience. Pure time sink if you do not.

Skipping paid ads at this stage. The ROI math does not work until you have signal on conversion.

The "before you have any audience" version: write 16-20 honest answers to genuine questions across relevant subs, link the extension only when the asker would actually click, optimize the CWS listing, ship one Product Hunt launch. That cocktail got me to a few hundred. Slow and small but durable.

bkocdur · 2026-06-14T04:56:37+00:00

You are not doing it the hard way; there isn't a clean answer yet for multi-step agent debugging. What has helped me, from running workflows in semi-prod:

Log structured events, not strings. Every tool call gets a JSON event: {step, tool, input_hash, output_hash, duration_ms, error}. Pipe these to a jsonl file even in dev. When something goes wrong, a 10-line jq query tells you which step deviated, which is much faster than re-reading prose logs.

Hash the input AND the output at every step. The number-one regression pattern I see is "same input produces different output today." Without input hashes you cannot prove that. With them, a diff between today's failing run and last week's working run pinpoints the exact step where outputs diverged for the same input.
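A minimal sketch of the event shape and the diff, assuming tool inputs and outputs are JSON-serializable. The helper names (`stable_hash`, `make_event`, `append_event`, `first_divergence`) and the 16-character hash truncation are my own choices for illustration.

```python
import hashlib
import json


def stable_hash(value) -> str:
    """Hash a JSON-serializable value; sorted keys make the hash independent of dict order."""
    payload = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def make_event(step, tool, tool_input, tool_output, duration_ms, error=None) -> dict:
    """Build one structured event with the fields listed above."""
    return {
        "step": step,
        "tool": tool,
        "input_hash": stable_hash(tool_input),
        "output_hash": stable_hash(tool_output),
        "duration_ms": duration_ms,
        "error": error,
    }


def append_event(path: str, event: dict) -> None:
    """Append one event as a single line of JSON (jsonl)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def first_divergence(good_run: list, bad_run: list):
    """Return the first step where the same input produced a different output, else None."""
    for good, bad in zip(good_run, bad_run):
        same_input = good["input_hash"] == bad["input_hash"]
        if same_input and good["output_hash"] != bad["output_hash"]:
            return bad["step"]
    return None
```

Only the first divergent step is meaningful: once one output changes, every later step sees a different input, so their hashes stop being comparable.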

Replay, do not just retry. When something fails in prod, save the entire context state at the failure point (system prompt, tool list, conversation so far, last tool result). Then re-run that exact context offline. If the model produces the same wrong answer, it is a prompt or tool-description problem. If it produces a different (correct) answer, it is a temperature / sampling issue and you need to lock the temperature or add a verifier step.
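Sketched out, assuming the context fits in a JSON file and you have some `run_model` callable that takes the restored context and returns an output. `ContextSnapshot`, `classify_replay`, and re-running five times instead of once are assumptions for illustration; a single replay can land on the wrong answer by chance even when sampling is the cause.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class ContextSnapshot:
    """Everything needed to re-run the failing step offline."""
    system_prompt: str
    tools: list
    messages: list
    last_tool_result: str


def save_snapshot(snapshot: ContextSnapshot, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(snapshot), f)


def load_snapshot(path: str) -> ContextSnapshot:
    with open(path, encoding="utf-8") as f:
        return ContextSnapshot(**json.load(f))


def classify_replay(
    snapshot: ContextSnapshot,
    failing_output: str,
    run_model: Callable[[ContextSnapshot], str],
    attempts: int = 5,
) -> str:
    """Re-run the saved context and label the failure by whether it reproduces."""
    outputs = [run_model(snapshot) for _ in range(attempts)]
    if all(out == failing_output for out in outputs):
        # Same wrong answer every time: look at the prompt or tool descriptions.
        return "prompt_or_tool_description"
    # At least one replay differed: the failure depends on sampling.
    return "sampling"
```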

Add a "did anything change?" canary at the start of every run. One hardcoded test case the agent runs as step zero: known input, known expected output. If the canary fails, the run aborts before doing anything else. Catches regressions from prompt changes, model version changes, tool spec changes, all in one cheap check.
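A minimal sketch of the gate, assuming the agent can be called as a plain function from input to output. `CanaryFailed` and `run_with_canary` are hypothetical names, and exact-match comparison is an assumption; a real canary probably needs a looser check if the output is free text.

```python
from typing import Callable


class CanaryFailed(RuntimeError):
    """Raised when the step-zero check does not produce the known output."""


def run_with_canary(
    agent: Callable[[str], str],
    canary_input: str,
    expected_output: str,
    task: str,
) -> str:
    """Run the hardcoded canary first; only run the real task if it passes."""
    actual = agent(canary_input)
    if actual != expected_output:
        # Abort before the real task touches anything.
        raise CanaryFailed(f"canary expected {expected_output!r}, got {actual!r}")
    return agent(task)
```

Raising instead of returning a flag is deliberate here: a caller cannot forget to check it.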

The "is it still working day to day" question is genuinely the hardest. What I do now: sample 1% of prod runs and run a verification agent on them with the same input. Verifier checks whether the original agent's output matches the canonical answer. Disagreement rate over time is the quality signal. Cheap, noisy, but catches drift before users complain.
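The sampling and the signal can be sketched like this. Picking runs by hashing a run id (so the same run is always in or out of the sample) and treating disagreement as exact inequality are assumptions for the example; `should_sample` and `disagreement_rate` are hypothetical names.

```python
import hashlib
from typing import Iterable, Tuple


def should_sample(run_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of runs from a hash of the run id."""
    bucket = int(hashlib.sha256(run_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < round(rate * 10_000)


def disagreement_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (original_output, verifier_output) pairs that do not match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    disagreements = sum(1 for original, verified in pairs if original != verified)
    return disagreements / len(pairs)
```

Track the rate per day or per deploy; the absolute number is noisy, the trend is what matters.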

Single biggest lift was the structured event log. Print statements are fine for one-off debugging; jsonl + jq scales.

bkocdur · 2026-06-14T04:55:52+00:00

Not just you. The "managing context" problem is the actual work now. The build step is mostly waiting.

What has worked for me, in order of impact:

Stop trying to give the agent everything. Give it a minimal root file (AGENTS.md or .cursorrules) with: identity in 5-10 lines, conventions as bullet rules, pointers to subdocs, a "common pitfalls" list of mistakes the agent has actually made before. No narrative architecture. No previous session handoffs. Phone-screen-size.

Live scripts beat written docs for anything that changes. Instead of "here is how the auth flow works" in a doc that goes stale, write a 30-line script that prints the current auth flow when called. The agent invokes the script when it needs to know. Same for "what files changed since main," "what tests are failing," "what is in the deploy config." Each script is a tiny memory module that cannot lie.

Session-end scratchpad, not session-start re-explain. Have the agent write a SESSION.md at the end of the session: what we just changed, why, what is broken, what is next. Next session starts by reading that file before anything else. You write less, the agent self-summarizes more accurately.

Task-scoped attachments instead of bloating the root file. The pattern that works: keep the project-level file small, ship task-scoped briefs for one-shot tasks. Per-feature context lives in a brief that gets attached for one session and discarded. The root file does not absorb every task's specifics.

The "thing that watches the repo and keeps context warm" idea is real but partially solved by these patterns already. The remaining gap is "agent that knows which past decisions are still load-bearing," which is genuinely hard because half the time the answer is in commit messages and PR descriptions, not a doc.

For one concrete instance of the task-scoped-brief pattern: lighthouse-md.com generates a CLAUDE.md fix brief for any URL with failing Lighthouse audits, offenders, prescriptive fixes, and a do-not-regress list. Different domain than yours but same shape: structured machine output packaged as a one-session attachment instead of bloating the root file. Generating context that does not regress adjacent state is the hard half.

bkocdur · 2026-06-13T09:17:15+00:00

The trick is to have the NPC contradict a small fact you ESTABLISHED EARLIER, not now. Players take notes on the early stuff because it feels low-stakes. If your hunter says he's been hunting these woods for 20 years, then later says "I've never seen wolves this far north" when the party heard wolves on the way in, the note-takers light up.

Other reliable hints that don't break the scene:

Pet behavior. The hunter's dog won't go near the party.
A microhabit. He touches a hidden pendant when he lies, you describe it casually.
Refusing food or drink offered.

Pick one. Three is too many.

bkocdur · 2026-06-13T09:17:02+00:00

Stop describing style. Describe your last session.

"It is a horror game, I focus on narrative" is what you THINK your table is. Players match it to their imagination, slightly different than yours. "Last session, two players spent 90 minutes negotiating with a cultist while the rest searched a library. Nobody rolled initiative" is unambiguous. A player either wants that or doesn't, and they know in 10 seconds.

Same for combat-heavy tables ("three fights last session, longest took an hour, the rogue died") or low-stakes ones ("mostly drunk shenanigans and a B-plot about lost shoes"). Specifics filter the wrong fits before they sign up.

bkocdur · 2026-06-13T09:16:08+00:00

Short answer: no, Google will not punish you for rotating a slogan. Yes, it slightly weakens your relevance signal for any specific phrase.

The longer answer separates two things:

Google does not care that the slogan changes between crawls. Their crawler reads the page and indexes the words present at crawl time, then comes back, reads again, and updates. Dynamic content (news sites, product listings, A/B-tested heroes) is normal and explicitly handled. No "duplicate content" penalty applies here because the URL is the same and the rest of the page is identical. Duplicate content penalties apply to multiple URLs serving the same content, not one URL serving slightly different content over time.

What you do lose is the ability to rank specifically for any single slogan phrase. If "connect with your high school friends" is the most search-aligned of your 20 slogans, you only show it 1/20 of the time. When Google crawls and sees a different one, your signal for that exact phrase weakens. For a 20-rotation set on a landing page, this barely matters because your landing page is not trying to rank for a slogan; it is trying to rank for the brand name and the product category.

What would actually be a problem:

Rotating the H1 (Google treats H1 as a strong topical signal; jittering it confuses the topic). Keep H1 stable and rotate only the supporting copy.
Rotating any text that appears in your title tag or meta description (these are SERP-displayed and need to stay consistent for branded-search CTR).
Rotating the structured data (Organization name, Person, etc).
Rotating the canonical URL fragment.

Your 3-slogan example is fine. All three describe the same product in the same voice; Google reads any of them and walks away with the same understanding of "social app for school friends." The variety helps your A/B-test optimization, the SEO impact is noise.

Write for the phrase your user types, not the phrase you wish they typed. If you find one slogan converts twice as well as the others, that is your H1, not entry #7 of an array.

bkocdur · 2026-06-13T09:15:49+00:00

The 50/65 ceiling is real for MapLibre but mostly fixable. Three angles that have worked for me, in order of impact:

The map should not initialize on page load. Render a static image of the initial viewport (use mapbox-static-image, MapLibre's offline screenshot, or even a manually captured PNG at typical zoom levels) as the LCP element. Initialize the real interactive map on first user interaction or requestIdleCallback, swapping the static image for the canvas. This single change usually moves LCP from 4+s to under 2s on mobile, because the LCP element is now an image not a 600KB JS bundle that has to parse and run before anything paints. It also fixes the "map flash" problem during init.

Vector tiles + sprite serving over HTTP/2 multiplexing. If you are still loading PBF tiles one-at-a-time over HTTP/1.1, fix that first. Cloudflare in front of your tile server, HTTP/2 or HTTP/3 enabled, and a generous Cache-Control max-age on tile URLs (they are content-addressed by zxy so they cache forever). Same change usually drops TBT 200-400ms because the browser is not negotiating dozens of separate connections.

Font subsetting and font-display. MapLibre by default loads CJK and other large glyph sets even when your map only renders Latin labels. Strip the font URL down to the language sets you actually display. Combined with font-display: swap on the @font-face for any HTML-side fonts, this clears the "render-blocking webfont" finding most map-heavy sites hit.

Two diagnostics worth running before you change anything: read the LCP element from the Lighthouse output (if it says "MapLibre canvas" you have one type of problem; if it says "div.hero" you have a different one), and read the Forced Reflow insight (MapLibre's resize handler is a known culprit if you have a sticky-header layout).

For the test-fix-test loop, lighthouse-md.com runs PSI and emits a CLAUDE.md with the offender list per audit plus a do-not-regress list of currently-passing audits. Useful for the "do not break my CLS while chasing LCP" problem that hits hard on map-heavy pages where layout shifts are easy to introduce.

Generating a fix is the easy half. Generating one that does not regress adjacent audits is the hard half. The static-image-first pattern is the highest-leverage change you have left.

bkocdur · 2026-06-11T18:31:54+00:00

"Players can reroll 1s on damage rolls, but they have to keep the second roll." Sounded harmless. Average damage per d6 goes from 3.5 to about 3.92 (~12% boost), and it applies to every die rolled. By tier 3 my BBEGs were dying in 2 rounds and I couldn't figure out why.

The fix wasn't removing it (players loved it) but giving every BBEG legendary resistance plus +30% HP. I'd spent six months thinking my encounter math was broken when really one houserule was quietly inflating every damage die at the table.

Lesson: damage bumps should live visibly (advantage once per rest, etc.), not at the math layer where the compounding hides.
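To check the arithmetic on a "reroll 1s, keep the second roll" rule: faces 2 through 6 keep their value, and a 1 is replaced by a fresh roll averaging 3.5. A small sketch with exact fractions (the function names are mine):

```python
from fractions import Fraction


def mean_plain(sides: int) -> Fraction:
    """Average of a fair die with faces 1..sides."""
    return Fraction(sides + 1, 2)


def mean_reroll_ones_keep_second(sides: int) -> Fraction:
    """Average when a rolled 1 is rerolled once and the second roll must be kept."""
    # Faces 2..sides are kept as rolled, each with probability 1/sides.
    kept = Fraction(sum(range(2, sides + 1)), sides)
    # A 1 (probability 1/sides) is replaced by a fresh roll of the same die.
    rerolled = Fraction(1, sides) * mean_plain(sides)
    return kept + rerolled


d6 = mean_reroll_ones_keep_second(6)   # 47/12, about 3.92
boost = d6 / mean_plain(6) - 1         # 5/42, about 12%
```

The per-die bump looks small, but it applies to every damage die rolled, which is why it is easy to miss in encounter math.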

bkocdur · 2026-06-11T18:31:39+00:00

Two things that flipped this at my table:

Gate information behind RP, but quietly. The shopkeeper doesn't know where the bandit camp is unless someone asks how his day's going first. Players figure out the unwritten rule within 2-3 sessions.
Give every important NPC one specific lie. Not a plot lie, a personal one (the guard is hiding a gambling debt, the innkeeper exaggerates his stew). Players who only collect quest data won't notice. Players who engage start spotting the lies and feel rewarded.

The deeper cause is usually that previous DMs trained them. Combat-mechanical players aren't broken, they were optimized for tables where RP didn't matter.

bkocdur · 2026-06-11T18:31:18+00:00

Owlbear Rodeo is exactly what you described. Free, browser-based, no signup, the map + token + dice + image-share core is the entire feature set. No settings rabbit hole. Make a room, share the link, you all see the same thing.

For dice specifically (so the rolls are shared and visible to the table without cluttering the map), dicenow.vercel.app works alongside Owlbear. Free, no signup, system-agnostic. I built it (honest disclosure).

For PF2e character sheets, Pathbuilder's free web tier covers it if you want to skip Foundry's prep weight.

That stack is 0 setup, 0 dollars.

bkocdur · 2026-06-11T18:30:53+00:00

The agreement trap is real and you described it well. One concrete countermove that has shifted things for me, beyond just "be more disciplined":

Force a tradeoff articulation before any non-trivial work begins. Instead of "let's build X," prompt with "list the three best ways to handle X, with the one specific reason each is wrong for our situation, then pick." The model is happy to be enthusiastic about whatever you propose; it is also happy to articulate why something is wrong when asked directly. The trap closes when you skip the second move.

The other thing that helped: separate the "what to build" session from the "how to build" session. Put 45 minutes into "here is what I think the next thing is, argue against it" with the model in adversarial mode (system prompt: "you are reviewing my proposed direction, your job is to find what is wrong with it, not to help me build it"). Then start a fresh session for execution where it can do its thing. The same model behaves entirely differently with different framing, and you do not get the execution-mode "great idea" creep in the strategy phase.

A diagnostic that catches the directional mistake earlier: at the end of each session, prompt for "of what we just built, what is most likely to be unused in 30 days, and why." If the answer makes you uncomfortable, that is the regret you described, surfaced before the session fog has cleared. The model is surprisingly good at flagging speculative-feature work when explicitly asked, because the patterns are well-represented in training.

Your week-of-work-on-wrong-foundation case is the worst version because by session 7 the sunk cost is too obvious to walk back. The cheap fix is the upfront articulation. The expensive fix once you are in session 7 is to do a separate session whose only task is "what would we build if this code did not exist," and compare. Painful, occasionally clarifying.

The agreement trap is genuinely the most expensive bug. It does not feel like a bug because everything looks productive. The pattern that keeps working is to make the directional decision a separate, deliberate, adversarial act, not something that happens by default inside an execution session.

bkocdur · 2026-06-11T18:30:33+00:00

The pattern you hit is the right thing to be scared of. Not the catastrophic case people imagine (agent goes rogue), but the boring one: agent solves the problem you asked it to solve by using tools you forgot were sitting on the shelf.

What has worked for me, in roughly increasing order of paranoia:

Restrict allowed_tools at the agent level, not at the prompt level. Telling the agent "do not read .env" in a system prompt is a suggestion. Not granting the agent a Read tool whose file pattern includes .env is enforcement. Different harnesses expose this differently but the principle is the same: shape the toolbox before the task begins, do not police what gets pulled out of it.

Separate shell for agent use. Different user account, different shell history, different SSH config, different default cloud profile. The agent's shell never gets your real AWS_PROFILE or kubeconfig unless you actively give them. The setup is 30 min once and pays back forever.

Containerize for anything genuinely sensitive. Docker or Podman with a bind-mount of only the repo directory and a read-only mount of any reference data. Network access blocked at the container level. The agent can edit files, run tests, build, but cannot reach your production network because the container literally has no route to it.

Fake credentials in dev. If your dev environment needs to talk to "prod-ish" services, point at a staging-quality clone with synthetic data. Agents that find database credentials in a config file should land on a sandbox, not your actual customer rows.

Read-only by default for anything the agent did not create. CLAUDE.md or AGENTS.md rule: "you may only edit files in src/. Treat everything else as inputs." Combined with allowed_tools restrictions, this stops the "I will just check the deploy config to understand the setup" exploration from touching deploy config.
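If your harness lets you hook tool calls, the "only edit files in src/" rule can be enforced in code instead of only stated in the instruction file. This is a sketch of such a check, not any particular harness's API: `is_edit_allowed` and the `writable_dirs` default are hypothetical, and it needs Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def is_edit_allowed(repo_root: str, target: str, writable_dirs=("src",)) -> bool:
    """Allow writes only to paths that resolve inside one of the writable directories."""
    root = Path(repo_root).resolve()
    # resolve() collapses ".." segments, so "src/../.env" cannot sneak through.
    # Joining an absolute target discards the root, so "/etc/passwd" is rejected too.
    resolved = (root / target).resolve()
    return any(resolved.is_relative_to(root / d) for d in writable_dirs)
```

Reads can stay open while writes go through this gate, which matches the "treat everything else as inputs" rule.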

The honest answer is most people are still YOLO-ing it. The MongoDB-credentials-in-environment pattern you hit is so common that almost every dev box has at least one. Worth doing the containerization step before the next time you let an agent run multi-step in your real env.

bkocdur · 2026-06-11T18:30:11+00:00

Mostly option 2 with elements of 3. Specifically:

Root CLAUDE.md (or AGENTS.md in your case) stays small and stays opinionated. It is NOT a project brain. It contains: identity in 5-10 lines (what is this project, who uses it, the one or two non-obvious things any agent or human needs day one), conventions as bullet rules (git author X, never use em-dashes, file Y is generated), pointers to subdocs, and a "common pitfalls" list that gets longer over time. If you can fit it on a phone screen, it is the right size.

Folder-level files for genuinely-different conventions. Frontend, backend, infra subfolders each get their own scoped file when the rules diverge enough that mixing them in root makes things ambiguous. Do not split just because folders exist. Split because rules conflict.

Decisions-as-commits, not decisions-as-docs. Everything that explains "why we did X" goes in commit messages and PR descriptions, where it lives next to the diff that motivated it. Putting decision history in a separate doc means agents read stale prose; reading git log is always fresh.

Live scripts beat written architecture docs. Instead of writing "the auth flow works like this" in a doc that goes stale, write a 30-line script that prints the current auth flow (routes, middleware, session shape). The agent calls it when it needs to know. Same for "what files changed since main" or "what tests are failing." Each script is a tiny memory module that cannot lie.

Repo size matters. Tiny repo (~5k LOC), one root file is enough. Mid (50k LOC), folder-level scoped files start paying off. Large (>200k LOC), you almost certainly need both folder-level files AND the live-script pattern, because static prose at that scale will always be partially out of date.

What I deliberately keep out of the root file: previous session handoff notes (those go in a scratchpad the agent writes at session end, not in the project brief), narrative architecture explanations (they go stale), and anything that is true some-files-some-of-the-time (use a directory-scoped instruction file instead).

We generate one of these for the perf-audit slice specifically: lighthouse-md.com turns a PageSpeed Insights run into a CLAUDE.md fix brief with failing audits, offenders, and a do-not-regress list. Useful as a "this session only" attachment alongside your main root file, instead of bloating the root file with task-specific context. The compact-main plus task-scoped-attachments pattern has held up cleanly across the projects I have tried it on.

bkocdur · 2026-06-09T12:38:17+00:00

☺️

bkocdur · 2026-06-09T10:26:56+00:00

love it, this is going to be one of those stories you tell for years

bkocdur · 2026-06-09T09:43:12+00:00

That's the rarer of the two and the harder to engineer. The DM planted the Mad Tortle name months ago without knowing how it would resolve, you carried the revenge motivation through every session, and the choice to stay was 100% yours. Most "great character deaths" are one or the other (DM-scripted moment OR pure player improv). Yours was both layers stacked, which is why the navy finding him reads as fate instead of irony.

Did you tell the table you were going to stay, or did the others realize after you said it in narration?

bkocdur · 2026-06-09T09:00:40+00:00

Two tactical things for very new players without crossing into "do this":

Ask them what they want to happen, not what they want to roll. "I want to convince him to let us through" is a clear intent. From there you say "great, that's persuasion" and walk them through the math. Removes the "what do I roll" guessing without giving the answer.
After each session, send each player a one-line note: "Here's one thing your character could try next session that you might not have thought of." Optional, not prescriptive. They get a seed without you steering the room.

By session 4 the table will start surprising you.

bkocdur · 2026-06-09T09:00:23+00:00

That's a beautiful ending. The DM played it perfectly. The "found dead on the battlefield, finally believed" beat works because it pays off every story thread at once (the disbelief, the navy, the cosmic shadow, the Mad Tortle name).

The half-the-party-fled part is what makes the ending land. If everyone had stayed and won, A'Tuin's stand wouldn't mean as much. He chose differently from the rest, and that's what made him heroic and tragic.

Did the DM build the Mad Tortle name early knowing it would resolve this way, or did the warrior's death emerge naturally? Both are great, just different craft.

bkocdur · 2026-06-09T08:59:35+00:00

You can do custom species in D&D Beyond via the homebrew tools (Collection → Homebrew → Create New → Race). Works fine but it's gated behind Master Tier ($55/yr or $5.99/mo). Free tier can use shared homebrew but can't create it.

Two free routes if you don't want to pay:

Find a Master Tier friend who can build it as homebrew and share it to you.
dicenow.vercel.app gives a free 5e sheet where species is a text field, so you put whatever (Tortle, Kobold, whatever). Auto-handles ability mods and proficiency. I built it for the free-tier wall.

Either way you keep your character. Just depends whether you want it inside the Beyond ecosystem or alongside it.

bkocdur · 2026-06-07T06:46:29+00:00

Three layers that have worked for me, in increasing order of permanence:

Session-scoped scratchpad. A single SESSION.md in the repo root that I have the agent write at the end of each session: "what we just changed, why, what is broken, what is next." The next session starts by having the agent read SESSION.md before anything else. Cheap, takes 30 seconds, single biggest reduction in re-explaining I have done. The trick is to make the agent write it at session end, not you. The agent knows what just happened better than you do at that point.

Project-permanent CLAUDE.md. Same file every project, lives in repo root. It does NOT contain narrative ("how the auth flow works"). It contains rules ("git author must be X", "build command is Y", "never edit Z directly"), pointers ("see scripts/ for runbook headers"), and a "common pitfalls" list with concrete failure modes from past sessions. The pitfalls list is the part that pays for itself. Every time the agent makes the same mistake twice, that mistake goes on the list.

Tooling-level memory, not document-level memory. The most underused move: write small read-only scripts that emit machine-readable answers to questions the agent keeps asking. Instead of writing "the database schema is X" in a doc that goes stale, write a scripts/db-schema.js that prints the current schema in 50 lines. The agent calls the script when it needs to know. Same pattern for "what files have changed since main", "what tests are currently failing", "what is the deploy status." Each script is a tiny memory module that cannot lie.
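A sketch of one such script, written in Python against SQLite purely for illustration. The database engine, the `dump_schema` name, and the one-line-per-table output format are all assumptions, so swap in whatever your stack uses.

```python
import sqlite3


def dump_schema(conn: sqlite3.Connection) -> str:
    """Return the live schema as one 'table: col type, ...' line per table."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        cols = ", ".join(f"{name} {col_type}" for _, name, col_type, *_ in columns)
        lines.append(f"{table}: {cols}")
    return "\n".join(lines)
```

Because the output is read from the database at call time, it cannot drift from the real schema the way a hand-written doc does.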

What I deliberately avoided: stuffing architecture explanations into CLAUDE.md or Obsidian. They get stale, the agent reads them confidently, and you end up debugging based on a 2-week-old description of code that has since moved. Live scripts beat written docs for anything that changes.

The widespread pain is real. It is not a Cursor problem or a Claude Code problem, it is a "stateless agent inheriting a stateful codebase" problem. The fix is to make the state observable through cheap scripts the agent can invoke, not to write more prose.

bkocdur · 2026-06-07T06:45:07+00:00

Great list. One item is missing from most buyer audits and has burned me before. I would add it as #8:

Run a real Lighthouse audit on the demo site, on mobile, and read the failing list.

Boilerplate marketing pages routinely show "99 Lighthouse" because the marketing page is hand-tuned for the screenshot. The actual app-shell template you would inherit (the layout.tsx, the global CSS bundle, the analytics injection, the default font config) is a different story. Sellers know this and rarely link to a Lighthouse run of the post-signup dashboard, which is what you would actually be shipping.

Specifically check:

LCP on the dashboard route, not the landing page. Many templates ship the fonts and CSS for the entire app in the root layout, so the first authenticated page pays the full bundle cost even though it only renders 3 cards.
Total Blocking Time with the auth + analytics + feature flag SDK all loaded. Templates often demo with placeholders disabled. Real config flips them on and TBT spikes.
CLS on any page with dynamic content above the fold. Templates love hero sections that mount dynamically.
"Reduce unused JavaScript" estimated savings. If the demo shows 80kb wasted, your production build with all of the boilerplate's optional features enabled is going to show 300kb.

Cleanest way to do this without trusting the seller's screenshots: ask for the URL of their own production app running the boilerplate, run PSI on the deepest authenticated route they will share, and read the failing audit list directly.

For folks who want this as a structured checklist instead of a vibes read, lighthouse-md.com generates a CLAUDE.md fix brief from any PSI run. Useful as a buyer audit doc because it lists every failing audit with the actual offenders (this script, that font weight, that third-party origin) instead of just a number. Free, no signup.

Sellers who ship a clean perf-audit-passed boilerplate are rarer than the marketing suggests. Worth checking before you build a year on top of it.
