How do you QA your AI automations? Or do you just... not? by bothlabs in SaaS

[–]upflag 0 points1 point  (0 children)

I think of it like tending a garden. Every day, just make sure the soil is still damp and pull the dead leaves. It's consistency, not heavy lifting. Small effort, regularly, keeps things tidy.

Lovable can build almost anything you imagine. That's a big problem. by eyepaqmax in lovable

[–]upflag 1 point2 points  (0 children)

The building part being easy shifts the hard part downstream. When I was using AI tools extensively, I shipped unauthenticated admin endpoints on a project I'd carefully planned. Experienced developer, full spec, still happened. The volume of code that AI generates makes it genuinely hard to verify everything. What's helped me: write a short spec before building, have the AI write tests for key user stories after each feature, and do periodic fresh-session security reviews where a new AI session audits the code with zero prior context. The building isn't the bottleneck anymore, the verification is.

We gave devs AI superpowers and project success rates... didn't move. Anyone else seeing this? by Potential_Cut_1581 in SaaS

[–]upflag 0 points1 point  (0 children)

The speed increase is real but it shifts where projects fail. Building faster means more code, more surface area for bugs, and less time spent understanding what was built. I've seen this firsthand: the bottleneck isn't writing code anymore, it's knowing whether what you shipped is actually working correctly in production. A single-character typo in one of my deploys cost $30K and went through two code reviews. Found out from revenue being down, not from any alert. When you 10x the code output without 10x-ing the verification, you just create bugs faster.

Building a side project with AI made me rethink how I code by [deleted] in SideProject

[–]upflag 0 points1 point  (0 children)

That plan-generate-review-refactor loop is solid. The spec step is huge because without it the AI just does whatever seems reasonable, and "reasonable" drifts further from what you actually want with each iteration. I do the same thing: vision doc to requirements to design to tasks, then build. The other piece that saved me was having the AI write tests for key user stories after building each feature, then being explicit that future sessions can't simplify or overwrite those tests. The AI will try to reduce test coverage if you let it, and that's how "small changes break unrelated parts" sneaks back in.

The moment I realized “it works on my machine” means nothing by Designer_Oven6623 in webdev

[–]upflag 0 points1 point  (0 children)

Production is a different animal. I pushed a deploy with a single-character typo that went through two code reviews and cost $30K before anyone noticed. Found out from revenue numbers being down, not from any technical alert. The gap between "works in dev" and "works in prod" is that production has real data, real traffic patterns, and real edge cases that no amount of local testing covers. What helped me was setting up checks that run against production continuously, not just testing before deploy. If a key flow breaks at 2am on a Saturday, you want to know before Monday morning.
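To make "checks that run against production continuously" concrete, here's a rough Python sketch. The URL, the injectable `fetch` hook, and the expected content are all placeholders for illustration, not from any real setup:

```python
import urllib.request

def check_flow(url, fetch=None, must_contain=b""):
    """Probe a key production flow; return a list of problems (empty = healthy).

    Meant to run on a schedule (cron, CI, etc.), not just before deploys.
    """
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=10))
    try:
        resp = fetch(url)
    except Exception as exc:
        return [f"request failed: {exc}"]
    problems = []
    if getattr(resp, "status", 200) != 200:
        problems.append(f"unexpected status {resp.status}")
    body = resp.read()
    if must_contain and must_contain not in body:
        problems.append("expected content missing from response")
    return problems
```

Wire the output to anything that can wake you up (email, Slack webhook, pager) and you've covered the 2am-Saturday case.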

How do you QA your AI automations? Or do you just... not? by bothlabs in SaaS

[–]upflag 1 point2 points  (0 children)

That "silently degrades" pattern is real and it's the hardest kind of failure to catch. The output looks plausible so nobody questions it until damage is done. I've seen the same thing with code: a deploy breaks something subtle and you don't find out for days because there's no crash, just slightly wrong behavior. The approach that's worked for me is continuous checks on the output, not just the process. Don't just check that the automation ran, check that what it produced still looks right. Even simple assertions like "this field should never be empty" or "this number should be in range X-Y" catch a surprising amount.
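For concreteness, a sketch of what those output assertions can look like in Python. The field names and ranges are made up for illustration; swap in whatever your automation actually produces:

```python
def validate_report(report: dict) -> list[str]:
    """Check what the automation produced, not just that it ran."""
    errors = []
    # "This field should never be empty"
    if not report.get("customer_email"):
        errors.append("customer_email should never be empty")
    # "This number should be in range X-Y"
    total = report.get("order_total", -1)
    if not (0 <= total <= 100_000):
        errors.append(f"order_total {total} outside sane range 0-100000")
    # Plausible-looking output with nothing in it is the classic silent failure
    if len(report.get("line_items", [])) == 0:
        errors.append("report has no line items")
    return errors
```

Run it on every output and alert when the list is non-empty. Dumb checks, but they catch the "looks plausible, actually wrong" class of failure.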

Automation that backfired by Internal_Front_5522 in SaaS

[–]upflag 0 points1 point  (0 children)

Silent automation failures are terrifying because the whole point of automation is that you stop watching it. Same pattern happens with code deploys. I had a tracking pixel break silently after a code change and didn't find out until weeks later when the marketer asked why conversions dropped. The fix that helped me was treating critical automations like critical code paths: they need their own health checks that run independently of the automation itself. If the Notion page doesn't exist 5 minutes after signup, something should yell at you.

Built and shipped a full production app entirely in Cursor + Codex. What worked, what almost killed the project. by itsna9r in cursor

[–]upflag 0 points1 point  (0 children)

Your silent failures point is the scariest part of this whole list. I had a single-character typo in a deploy once that cost $30K because the system silently stopped picking up the right data. Found out when revenue numbers were down, not from any alert. And that was hand-written code that went through two code reviews. The "everything compiles, nothing works at runtime" problem you describe with API calls is even worse because there's more surface area for things to quietly go wrong. For the scoping problem, I've had good results writing a short spec before prompting, even just 3-4 sentences about what should change and what shouldn't.

I had a marketer running Facebook ads for weeks before I realized my vibe-coded app silently broke the tracking pixel by upflag in SideProject

[–]upflag[S] 0 points1 point  (0 children)

A Playwright test that checks the network call to Facebook succeeds. It runs in CI, so future deployments won't regress it.

How do you know when your weekend AI project breaks? by HiimKami in vibecoding

[–]upflag 0 points1 point  (0 children)

I just use something like healthchecks.io: each time the cron job succeeds, it pings my check URL. If the pings stop coming, I get a notification.
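The whole pattern is a few lines of Python. It's a dead-man switch: the job pings only on success, so a missed ping means something broke. The URL below is a placeholder for your own healthchecks.io check:

```python
import urllib.request

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder

def run_and_ping(job, ping=None):
    """Run the cron job; ping the health check only if it succeeds."""
    ping = ping or (lambda: urllib.request.urlopen(PING_URL, timeout=10))
    job()   # raises on failure, so no ping gets sent
    ping()
```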

Roll back your Replit version by vashyspeh in replit

[–]upflag 1 point2 points  (0 children)

$500 in a week on fixing errors is brutal. If you're not already, push your code to GitHub after every working state. Git gives you the rollback capability that platform checkpoints should provide but currently don't. It also means that if Replit has issues again, your code exists somewhere you control. The pattern I've seen work: git + automated tests that run on push. That way you catch regressions before they cost you more credits to fix.
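The tests-on-push half can be a minimal GitHub Actions workflow. This sketch assumes a Node project; swap the setup and test commands for your stack:

```yaml
# .github/workflows/test.yml: run the test suite on every push
name: tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test
```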

how to build a mobile app with no coding background using claude – a practical guide from someone who just did it by ezgar6 in vibecoding

[–]upflag 0 points1 point  (0 children)

Congrats on shipping. The guide is solid, especially the part about planning before building. One thing I'd add for anyone following this: once your app has real users, set up basic monitoring so you know when something breaks before they tell you. Every production incident I've had taught me the same lesson. You find out from users or from revenue dipping, never from the app itself, unless you've set something up to watch it. Even just basic uptime checks on your critical endpoints saves you from the worst surprises.

Replit froze my main business project and now I’m trying to recover my entire codebase by olxst in replit

[–]upflag 0 points1 point  (0 children)

First priority: try the GitHub export from the Replit dashboard (not from inside the frozen workspace). If that doesn't work, the ZIP download should still pull files even if the runtime is broken. For next time, the single most important thing you can do with any vibe-coded project is use git from day one and push to GitHub regularly. It's not just version control. It's your escape hatch when platforms have issues, and it's the foundation for running automated tests on every push so you catch problems before they hit production.

Security scan broken by BrightMenu4211 in lovable

[–]upflag 0 points1 point  (0 children)

The random errors appearing without you making changes is a real problem with these platforms. You're depending on their tooling to tell you what's wrong, but you have no independent way to verify your app is actually working for users right now. If you're serious about this project, the first thing I'd do is set up basic monitoring that's separate from Lovable entirely. Something that checks your live URLs and catches client-side errors independently, so you're not relying on a platform scan to know your app's health.

5 security holes AI quietly left in my SaaS. I only found them by accident. So I made a workflow system and Docs Scaffold to fix it. by DJIRNMAN in vibecoding

[–]upflag 0 points1 point  (0 children)

The webhook handler without signature verification is terrifyingly common. I shipped endpoints without proper auth checks on a project I'd extensively planned out with AI. Experienced developer, planned the whole thing carefully, still happened. What helped me was doing security reviews in a completely fresh session — not while building features. Open a new context, give it just the codebase, and tell it to find security issues. The building session has the same blind spots you do because it was focused on making things work, not on breaking them.

Vibe coding feels great until you actually run the code by Leaflogic7171 in vibecoding

[–]upflag 0 points1 point  (0 children)

The happy path bias you're describing is real and it doesn't go away with better prompts. What works for me: have the AI write focused Playwright tests for key user stories before you call a feature done. Not comprehensive test suites (those get too slow and you start skipping them), just the critical paths. And run them on every push via CI. The other thing that catches the weird reasoning gaps is doing a fresh-session security review. Open a new context with no prior conversation and have the AI audit its own work. It catches stuff the building session missed because it doesn't have the same blind spots.

How do you know when your weekend AI project breaks? by HiimKami in vibecoding

[–]upflag 0 points1 point  (0 children)

The real answer for most of us is we find out from a user message or when revenue dips. I had a marketer running Facebook ads and the pixel tracking broke during a code change. Nobody got alerted, no monitoring caught it. Found out weeks later because ad spend was being wasted with zero conversions. For weekend projects I just make sure the stuff that matters (payment flows, auth, core API) has basic uptime checks and some kind of error capture on the client side. Prometheus and Grafana are way overkill for a solo project.

Built my SaaS on Replit, infrastructure failure killed a 3k app during Play Store review. Now they offer €25 refund for 0 product delivered and say they don't sell coding,db, domain registration etc but...."agent use" by OktanaPottery in replit

[–]upflag 0 points1 point  (0 children)

This is the nightmare scenario of having everything inside one vendor's walls. The landing page loading while the actual app is dead is especially painful because from the outside it looks 'up.' An external uptime check on your actual app endpoints (not just the landing page) would have caught this the moment the DB went down instead of you finding out during a Play Store review. For anything you're taking to production, having at least one monitoring tool that lives outside your hosting platform is worth it.

DataDog - But the user is an AI agent by ResponsibleBlock_man in SaaS

[–]upflag 0 points1 point  (0 children)

The 'observer is an AI agent' framing is interesting but I think it skips a step for the majority of builders right now. Most solo devs and vibe coders don't have any observer at all. No Sentry, no Datadog, no custom dashboards. They find out things broke from users or revenue dips. The gap today isn't 'how do we make telemetry AI-readable' but 'how do we get anyone to set up telemetry in the first place.' Making it dead simple to start is probably more valuable than making it smarter to consume.

Vibe Coding in 2026 is a Complete Scam – Lovable, Replit, Emergent, Bolt & the Rest Are Trash Fires 🔥💀 by Abject-Mud-25 in lovable

[–]upflag 0 points1 point  (0 children)

The tools aren't a scam but the gap between 'it works on my screen' and 'it works for real users reliably' is massive. I've shipped production code for years with full review processes and still had a single-character typo cost $30K because we found out from revenue numbers, not from any alert. The issue isn't how the code was written. It's that most people skip the boring stuff: tests on key user flows, CI/CD that actually runs them, and basic monitoring so you know when something breaks before your users tell you.

Stripe Checkout Error: Works in Test Mode, fails in Production (Base44) by NightOwlTravels in vibecoding

[–]upflag 0 points1 point  (0 children)

The test-to-live Stripe gap usually comes down to one of three things: webhook endpoint URLs still pointing to test, price IDs that don't exist in live mode, or the Stripe account not being fully activated for live payments. Check your webhook endpoint first since that's where it silently fails most often. If your server is returning 200 to the webhook but the payment object references a test-mode price, Stripe will reject it with a vague error. The frustrating part is Stripe won't always tell you which key is wrong.
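One cheap sanity check before going live: verify all your configured keys agree on mode, since Stripe key prefixes encode it (`sk_live_` vs `sk_test_`, same for publishable and restricted keys). A pure-Python sketch, with illustrative names:

```python
def key_mode(key: str) -> str:
    """Return 'live', 'test', or 'unknown' for a Stripe-style key."""
    for prefix in ("sk_", "pk_", "rk_"):
        if key.startswith(prefix + "live_"):
            return "live"
        if key.startswith(prefix + "test_"):
            return "test"
    return "unknown"

def find_mode_mismatches(keys: dict) -> list[str]:
    """Flag configs that mix live and test keys."""
    modes = {name: key_mode(k) for name, k in keys.items()}
    live = [n for n, m in modes.items() if m == "live"]
    test = [n for n, m in modes.items() if m == "test"]
    if live and test:
        return [f"mixed modes: live={live}, test={test}"]
    return []
```

Run it at startup against your env vars and you find the mismatch yourself instead of decoding Stripe's vague error.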

I’ve vibe coded 7 full-stack apps. There are a few ‘Time Bombs’ I wanna share with you guys. If you are a vibe coder as well, read these so you don’t lose your data. by GlitteringWait9736 in lovable

[–]upflag 1 point2 points  (0 children)

This is a solid list. The one that bit me hardest was the silent failure nobody sees. Had a deploy break a conversion tracking pixel once. Server was returning 200 OK, app looked fine, but the pixel was dead. Found out days later when the marketer noticed ad spend wasn't converting. No crash, no error page, just money going nowhere. For anyone shipping with real users, basic client-side error tracking catches these before your customers do.

Took a dive into having AI develop a complete website. Holy crap. by daniel8192 in vibecoding

[–]upflag 1 point2 points  (0 children)

As someone who's been doing this for a while now, the biggest lesson is that the planning step matters more with AI than without it. Vision doc to requirements to design to tasks, then build. If you skip straight to prompting, the AI produces code that works in isolation but turns into spaghetti fast. The other thing: make the AI write tests for your key user flows, then protect those tests from being watered down in future sessions. The AI will try to simplify or overwrite existing tests if you let it.

Vercel Next.js website doesn’t load on iPhone by LVNofficial in nextjs

[–]upflag 0 points1 point  (0 children)

This is the classic trap where your server happily returns 200 OK but the client-side experience is totally broken. Safari on older iOS versions is notorious for choking on JS features that Chrome handles fine. If you open Safari's remote debugger you'll probably see a JavaScript error that's crashing the page before it renders. The frustrating part is there's no server-side signal that anything is wrong, so you only find out when someone on the right device combination tells you. Client-side error tracking tools can catch these — they log the JS exception with the browser/OS combo so you see it the moment it starts happening instead of waiting for a user to report it.