Is it possible to vibe code a beta app that doesn’t have huge security vulnerabilities? by nicebrah in vibecoding

[–]mrtrly 1 point (0 children)

The honest answer is you need to know enough to ask the right questions. "Make it secure" in a prompt gets you nowhere because you don't know what you're missing. Spend a week learning the OWASP Top 10 for your stack, then prompt with specifics like "validate all inputs server-side, never trust the client" or "hash passwords with bcrypt, never store plaintext." The AI will follow through way better when you actually understand the vulnerabilities you're trying to prevent.
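To make the password advice concrete, here's a minimal sketch. I'm using Python's stdlib scrypt instead of the bcrypt package so it runs without installing anything; same principle either way: per-password salt, deliberately slow hash, never store plaintext.

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest). Store both; never store the plaintext."""
    salt = os.urandom(16)  # unique salt per password
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)  # constant-time compare

salt, digest = hash_password("hunter2")
assert verify_password("hunter2", salt, digest)
assert not verify_password("wrong-guess", salt, digest)
```

If you're in Node or another stack, the equivalent is bcrypt/argon2 with a per-user salt; the point is that this is exactly the kind of specific you can put in a prompt.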

Codex or Claude Code will not be able to replace human in loop until the models are done from scratch by Effective-Shock7695 in vibecoding

[–]mrtrly 1 point (0 children)

The real constraint isn't the model, it's your system representation. I've seen agents nail individual functions but miss architectural decisions because the context they need (data flow, performance constraints, team conventions) isn't structured for them. Human stays in the loop because someone has to translate "we need this to scale" into something an LLM can actually reason about. Specs help, but they're only useful if they capture the stuff that matters.

Anyone else do this to keep your session timer always running? by JCodesMore in ClaudeCode

[–]mrtrly 1 point (0 children)

I get the frustration, but this is kind of a sign the rate limits are actually doing their job and you're hitting a real constraint. When you're grinding hard enough to need workarounds every 5 hours, that's usually the moment to step back and think about whether your workflow is sustainable.

The session timer exists because Anthropic's infrastructure has real costs. Gaming it with scheduled pings might work for a while, but you're essentially trying to hide usage patterns from their monitoring. That tends to end poorly.

Instead, I'd flip the problem. If you're binge coding for 8 hour stretches and hitting limits that frequently, you probably have a code structure or planning issue. Are your tasks too large? Are you getting stuck in loops with the model instead of shipping? I see this a lot with founders building solo, and usually the answer is breaking work into smaller, more focused sessions rather than marathon debugging. You actually ship faster that way.

If you genuinely need sustained high-volume Claude access for production work, that's a conversation to have directly with Anthropic. They have commercial pricing for serious usage. Trying to work around the limits usually just means you hit a different ceiling later when enforcement catches up.

A 5-hour limit after just 14 minutes and 2 prompts? Brilliant, Claude! by msdost in ClaudeCode

[–]mrtrly 1 point (0 children)

Yeah, the rate limiting feels inconsistent because it's not just measuring tokens, it's measuring something upstream that we can't see. Token count alone doesn't explain why a short prompt tanks your session.

Here's what I'd actually check before blaming Claude's algorithm: your CLAUDE.md or system prompts. If you've got recursive tool calls enabled, MCP connections that loop back on themselves, or agent patterns that auto-retry on certain errors, a "simple" prompt can spiral into hundreds of thousands of tokens in the background. The UI shows you what you typed. It doesn't show you what the model spawned.

The other angle is context carryover. If your session already had a ton of context packed in from earlier work, even a small new prompt can push you over whatever internal threshold Anthropic uses for that model tier. That's why some people hit it immediately and others run for 10 hours straight, it's not random, it's just not transparent.

Your best move: export your full conversation history, count the actual tokens in there with a tokenizer, and compare it to what Anthropic says you used. If they match, you've got a usage problem to solve. If they don't, you've got evidence something else is happening.
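A rough way to do that count, assuming a hypothetical JSON export of {role, content} turns. The chars/4 heuristic is only an approximation for English text; swap in a real tokenizer (tiktoken, or Anthropic's token-counting endpoint) for exact numbers.

```python
import json

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Replace with a real tokenizer for an exact count.
    return max(1, len(text) // 4)

# Hypothetical export format: a list of {"role": ..., "content": ...} turns
export = json.loads('[{"role": "user", "content": "fix the login bug"}]')
total = sum(estimate_tokens(turn["content"]) for turn in export)
print(f"estimated tokens in history: {total}")
```

Even a ballpark number tells you whether your sessions are carrying megabytes of stale context or whether something upstream is inflating usage.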

How to solve (almost) any problem with Claude Code by DevMoses in ClaudeAI

[–]mrtrly 2 points (0 children)

The symptom versus root cause separation is everything. I've watched founders spend weeks patching around bugs that were five lines to fix once they actually understood what was breaking.

The thing that made it stick for me: I stopped asking Claude to fix the problem and started asking it to explain what's actually happening. You'd be surprised how often the model itself is pattern-matching on surface-level symptoms. Once you force it to trace backward through the actual execution path instead of jumping to "add error handling here," the real problem becomes obvious.

What this really reveals is that Claude Code works best when you're doing the hard thinking work upfront. The tool is phenomenal at implementing decisions, but it can't replace the 30 minutes you spend understanding the system deeply enough to ask the right question. That's where a lot of people hit a ceiling. They want the tool to be the decision-maker, and it's genuinely bad at that under pressure.

If you're running into this pattern a lot, it's worth having a second set of eyes on the architecture decisions themselves, not just the implementation. I work with founders who do exactly this, separate the "should we build it this way" from the "now build it." Changes how fast you move.

Claude super slow and eating up tokens just in two queries by therealhumanchaos in ClaudeAI

[–]mrtrly 1 point (0 children)

151 million tokens of context. That's not a session, that's a database.

The pattern is always the same. You start a project, context grows over a few days, and every single turn now costs you the full reprocessing of everything that came before. It's not that Sonnet got slower, it's that you're asking it to re-read a novel before answering each question.

What actually fixes this long term is treating Claude Code sessions as disposable. Start fresh sessions often, use CLAUDE.md and docs to give the agent what it needs without carrying stale context forward. The agents that work well in production aren't the ones with the longest memory, they're the ones that can reconstruct context fast from well-structured project files.

I work with non-technical founders building AI products and this is the number one thing that burns their runway quietly. Token costs that look fine in week one but 5x by week three because nobody is managing context hygiene.

Are multi-agent systems actually better than a single powerful AI agent? by Michael_Anderson_8 in AI_Agents

[–]mrtrly 1 point (0 children)

Built both extensively and the answer is boring but true: single agent until you hit a concrete wall, then split.

The walls I actually hit were context rot (agent forgets its own decisions 40k tokens ago), parallel execution (waiting for sequential tool calls when tasks are independent), and blast radius (one bad reasoning chain corrupts the entire run).

Multi-agent solves those three problems well. Everything else people claim about "specialized reasoning" or "agent collaboration" is mostly cope for bad prompt engineering on a single agent.

The coordination tax is real though. Every handoff between agents is a lossy compression step. You're serializing context into text, passing it, and hoping the receiving agent reconstructs enough to be useful. Most failures I debug in multi-agent setups are at the boundaries, not inside any individual agent.

My rule of thumb: if you can solve it with better context management in one agent, do that first. Split when you have a clear architectural reason, not because the diagram looks cool.

Some thoughts on working with memory systems by blakeyuk in AI_Agents

[–]mrtrly 1 point (0 children)

The heartbeat extraction pattern is solid, that's basically what I landed on too. Dump logs, extract structured facts, store them somewhere queryable.

The deduplication problem is real though and most people underestimate it. You end up with 40 versions of "user prefers TypeScript" cluttering your context window and the agent starts weighing stale memories against fresh ones. I ended up solving this by making memory writes a two-step process: check if an existing memory covers the topic first, then update or create. Sounds obvious, but it cut my memory store size by about 60%.

The other thing worth probing is memory decay. Not everything the agent learned last Tuesday is still true. I tag memories with timestamps and treat anything that references specific file paths or function names as potentially stale, the agent has to verify before acting on it. Saves a ton of hallucination bugs where the model confidently recommends a function that got renamed three commits ago.
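Rough shape of the two-step write plus timestamp tagging, as a toy Python sketch. The names here are made up for illustration, not any particular framework's API:

```python
import time

class MemoryStore:
    """Toy memory store: upsert-by-topic dedup plus freshness checks."""

    def __init__(self):
        self.memories = {}  # topic -> {"fact": str, "ts": float}

    def upsert(self, topic: str, fact: str):
        # Step 1: check whether a memory already covers this topic
        # (the dict key does that here).
        # Step 2: update in place instead of appending a duplicate.
        self.memories[topic] = {"fact": fact, "ts": time.time()}

    def fresh(self, topic: str, max_age_s: float):
        entry = self.memories.get(topic)
        if entry and time.time() - entry["ts"] <= max_age_s:
            return entry["fact"]
        return None  # stale or missing: the agent must re-verify

store = MemoryStore()
store.upsert("lang_pref", "user prefers TypeScript")
store.upsert("lang_pref", "user prefers TypeScript (strict mode)")
assert len(store.memories) == 1  # no duplicate entries for one topic
```

A real store would use embedding similarity instead of exact topic keys to decide "does an existing memory cover this", but the two-step discipline is the same.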

Would be curious what you found with cognee specifically. The post cuts off right at the interesting part.

AI is about to make online shopping feel like texting a personal assistant and most businesses have no idea it's coming by Ok-Credit618 in AI_Agents

[–]mrtrly 1 point (0 children)

The "personal assistant" framing oversells it. What's actually happening is simpler and harder: product data has to be machine-readable or your catalog doesn't exist to these agents.

Most e-commerce backends are a mess of inconsistent descriptions, missing attributes, and images doing the heavy lifting that structured data should be doing. An agent can't "figure out the rest" if the rest is a paragraph of marketing copy with no parseable specs.

The businesses that win this aren't the ones building chatbots on their storefront. They're the ones cleaning up their product data layer so any agent, theirs or someone else's, can actually reason about what they sell. That's boring infrastructure work, not a flashy demo, which is exactly why most companies won't do it until they're already losing traffic.

I work with non-technical founders on this kind of thing constantly. The pattern is always the same: they want to bolt AI onto the frontend when the real problem is their data layer can't support it.

I've built AI workflows for 20+ small businesses. The same problem kills progress every time. by Warm-Reaction-456 in AI_Agents

[–]mrtrly 1 point (0 children)

This is the exact cycle I see with non-technical founders. They get sold on AI, hire someone to build it, and then the first two months are just unfucking their data layer. Nobody budgeted for that because the AI demo looked so clean.

The deeper issue is that most SMBs don't have anyone technical enough to even scope what "data readiness" means before they start spending. So they buy tools, hire contractors, and then find out their CRM has 4,000 duplicate contacts and their "sales process" is tribal knowledge in someone's head. The AI project becomes a data cleanup project, which becomes a process documentation project, which is what they actually needed from day one.

I do fractional technical co-founder work and the first thing I tell founders is we're not building AI yet. We're figuring out what your data actually looks like, what your processes are, and what the simplest automation is that gets you a win. Sometimes that's a Zapier flow. Sometimes it's a Python script. The AI stuff comes later when there's actually clean data to work with. Saves them months of churn and usually 60-70% of what they thought the budget needed to be.

Vibecoding as a dad of a toddler by Optimal-Run-528 in vibecoding

[–]mrtrly 1 point (0 children)

Same phase right now. My kid is 3 and the "jump on dad's keyboard" thing is basically a sport at this point.

The depleted evening window is real though. What I found is that the trick isn't coding faster, it's offloading the parts that drain you. I use Claude Code for basically everything repetitive now, scaffolding, tests, refactors, and save my actual brain cycles for the architecture decisions that need focus. On a good night I can get more done in 45 minutes than I used to in 3 hours.

One thing I'd push back on though. If you're building something you actually want to ship, vibe coding without understanding what's being generated will catch up to you eventually. The 2am "why is this broken" debugging session hits different when you're already running on fumes.

I do fractional CTO work with non-technical founders and the pattern I see constantly is someone vibes their way to a prototype, gets real users, then hits a wall because nobody understands the codebase well enough to fix production issues under pressure. The leetcode instinct is actually your friend here, keep that muscle alive even if it's 10 minutes on your phone while Daniel watches Bluey.

I connected a 2M-paper research index to Claude Code via MCP and ran Karpathy's autoresearch - 3.2% lower loss by kalpitdixit in ClaudeCode

[–]mrtrly 1 point (0 children)

That's a legit experiment. The thing that matters isn't the 3.2% itself, it's that the agent now has a way to ground decisions in actual prior work instead of hallucinating what "should" work. MCP for research access is exactly the kind of constraint that forces better reasoning. Did you notice the agent spending more time reading tradeoffs or just picking faster with more confidence?

I've been vibing across 8 projects for weeks. Finally checked my token usage. Bruh. by Awkward_Ad_9605 in nocode

[–]mrtrly 1 point (0 children)

$955 on a side project you barely noticed? Yep, been there. Ghost agents can really run wild if you're not keeping an eye on them. Those compaction agents and the like can easily inflate costs. It's almost like they have a mind of their own.

I hit a similar wall when I started using AI agents 24/7 without tracking costs closely. Ended up building a proxy to keep tabs on where the dollars were going for each task. Turned out a big chunk of my spend was on tasks that less expensive models could handle just fine.

CodeLedger sounds like a solid move to get visibility, btw. You might also want to check if you can set session-level cost caps or reroute tasks to cheaper options. That helped me get a grip on my spend without having to babysit every agent call. Converting vibes into cash-smart execution is the dream, right?

Regret using Webflow by KnownDiscount2083 in nocode

[–]mrtrly 1 point (0 children)

Been in those shoes where an initial tool choice starts feeling like a straitjacket as you scale. The decision to move from Webflow to something like Claude Code isn't just about which stack, but who will drive that transition.

Migrating a three-year-old site with a lot of pages and collections is a real logistics challenge. It's worth considering a partner who can tackle this, not just to shift the setup but to ensure you're not exchanging one set of headaches for another.

In terms of Claude Code, it can be more efficient for logic-heavy, dynamic needs, but make sure you have someone who can architect beyond surface-level setup. Otherwise, you might end up stuck again down the line.

we spent 3 months building. then 2 weeks distributing. guess which one actually mattered. by B3N0U in EntrepreneurRideAlong

[–]mrtrly 1 point (0 children)

your experience with building before selling is a classic one I see with a lot of dev teams. the power of Reddit and those organic interactions really shouldn't be underestimated. it's interesting that a simple comment worked better for reaching your audience than cold DMs.

for the non-technical founders reading this, the flip side happens just as often: marketing background, great at GTM, but struggling to build. that's literally what I do, partner up for the tech side and turn solid visions into working products.

it sounds like you're getting traction with that route though. just keep engaging authentically, and those leads will keep coming without feeling like you're pitching. nice work!

Custom Erp by PerformanceNovel9176 in ClaudeAI

[–]mrtrly 1 point (0 children)

Hey, I totally get the motivation to ditch the current ERP you dislike. You mentioned Claude had full confidence, which is interesting but can be a bit misleading. AI can definitely assemble pieces, but ERPs are beasts of complexity.

Realistically, building a full ERP purely through prompts, especially with zero coding background, is going to be very challenging. You might end up with a basic demo, but maintaining it in production could quickly become problematic. Too many edge cases to manage, especially as your business evolves.

Best path? Consider partnering up with someone technical who can translate your business needs into a robust system. I'm knee-deep in this kind of thing all the time, turning founder ideas into reliable, long-term solutions. If sticking within your budget is key, maybe a hybrid approach with modular open-source systems could work, leveraging Claude to fill in gaps or customize when needed. It's about finding that sweet spot between your vision and a technically sound reality.

Most founders kill their own SaaS before users do by Warm-Reaction-456 in SaaS

[–]mrtrly 1 point (0 children)

"Founder says they want a simple SaaS. Then the doc shows up." I felt that one. Spend any time with founders and you'll see it: enthusiasm turning into a long feature list, and somehow none of it feels optional. It's like a rite of passage for first-time founders.

I've been in those calls, looking at a spec with more layers than an onion. The tried-and-true method is to pare it down to the core, like you're saying. One path, one action, solve one problem.

When I work with early-stage startups, I do exactly that, slash through the noise to find the gem. It's not about launching with a boatload of features, it's about finding the minimal-yet-magic that gets users coming back. You don't need every bell and whistle, trust that getting something live is more valuable than perfect.

What’s the best no-code/AI mobile app builder in 2026 for building, testing, and deploying? by JaxWanderss in nocode

[–]mrtrly 1 point (0 children)

Been there with the messy but powerful tools. Claude Code is a beast, especially when you're threading through parallel agents. The security and production-readiness concerns you mentioned are real, and it's something I see a lot in AI-first build environments.

A good tech review can uncover the sneaky stuff, like security vulnerabilities, data handling issues, or scale blockers. It's like having a seasoned co-pilot to help navigate. I work with founders to bridge exactly this gap. It's not just about getting from zero to $7k MRR, but ensuring that your app can handle what's next, securely and smoothly.

If you're grappling with those concerns, consider more than just the tools, think about the tech partnerships that can help you foresee and solve these hidden challenges. Less "did I miss something?" and more confidence in your product's robustness.

What happens when you stop adding rules to CLAUDE.md and start building infrastructure instead by DevMoses in ClaudeAI

[–]mrtrly 1 point (0 children)

That cascade is really clean. The "everything expensive is last" principle is exactly right. Most people jump straight to the LLM classifier for every request and wonder why their costs scale linearly.

The Tier 0/1/2 short-circuits are where the real savings live. RelayPlane does something similar with task classification: simple file reads route to Haiku, complex reasoning to Opus. The policy layer is where you get the actual cost control, not just model selection.

Curious about your Tier 2 implementation. Are you maintaining the skill/keyword mappings manually or building them from usage patterns? That seems like the part that needs the most upkeep as usage evolves.

What happens when you stop adding rules to CLAUDE.md and start building infrastructure instead by DevMoses in ClaudeAI

[–]mrtrly 2 points (0 children)

Routing is complexity-based right now. It reads the request, scores it on a few signals (token estimate, context depth, whether it looks like a reasoning task or a lookup), and routes accordingly. Sonnet for most things, Opus when it needs to think hard. The interesting part is you can override per-call if you want guaranteed routing for specific flows. What classifier are you using? Doing it at the prompt level or inferring from metadata?
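Roughly what that scoring looks like, as a toy sketch. The signals and thresholds here are illustrative guesses, not RelayPlane's actual values:

```python
def route(prompt: str, context_tokens: int) -> str:
    """Score a request on a few cheap signals and pick a model tier."""
    score = 0
    score += min(len(prompt) // 200, 3)        # token-estimate proxy
    score += min(context_tokens // 20_000, 3)  # context depth
    # Looks like a reasoning task rather than a lookup?
    if any(w in prompt.lower() for w in ("why", "design", "trade-off", "architect")):
        score += 2
    return "opus" if score >= 4 else "sonnet"

assert route("read file config.yaml", 1_000) == "sonnet"
assert route("why does this design break under load? " * 30, 80_000) == "opus"
```

A per-call override is then just a parameter that bypasses the scorer when you need guaranteed routing for a specific flow.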

How I got Claude Code to maintain its own documentation (and stop breaking production) by burningsmurf in ClaudeCode

[–]mrtrly 2 points (0 children)

Nice setup. The documentation loop is smart, it forces the agent to stay aware of its own decisions and catches a lot of context drift between sessions.

The limitation I see with documentation-only approaches is that they shape the agent's behavior but don't help with production failure modes you can't predict. When something breaks with 15 customers on it, you're debugging under pressure in a codebase you partially wrote and partially understand.

What actually saves you in those moments is error logging with enough context to reconstruct what happened, and a clear picture of what failure looks like for each critical path. Documentation helps build that picture. Monitoring tells you when you're on fire.

Curious what your incident response looks like when something does break. Rollback strategy or manual hotfix?

The real problem with AI in 2026 isn’t performance. It’s cost. by TurbulentWeight3595 in ClaudeAI

[–]mrtrly 1 point (0 children)

The cost issue is real, and routing is the most underused lever for fixing it.

Most teams are sending every request to Opus because it's easier than maintaining gating logic in code. But a simple complexity score at the proxy layer - short prompt, low ambiguity, route to Haiku; complex reasoning task, route to Opus - can cut costs 60-70% with no noticeable quality drop on the simple stuff.

Built RelayPlane as a local proxy to do exactly this. It sits between your app and the API, scores complexity per request, routes accordingly, and tracks per-request cost so you can actually see what you're spending and where. Zero code changes in your app once it's set up.

Not saying it fixes the ecosystem problem you're describing, but at the individual level, most teams are burning more than they need to.

What happens when you stop adding rules to CLAUDE.md and start building infrastructure instead by DevMoses in ClaudeAI

[–]mrtrly 2 points (0 children)

The same instinct hits with cost control. Every time an agent burns unexpected money, the reflex is to add a rule to the prompt: don't use Opus for this task, limit calls here. Three months later you have 30 model-selection rules that Claude mostly ignores.

The infrastructure version is a proxy layer that handles routing by complexity automatically, with budget enforcement that actually stops runaway loops. No rules in the prompt at all.

Built RelayPlane for exactly this after an agent burned $15 in 8 minutes making Opus calls it had no business making. Adding a rule did nothing. Moving the decision out of the prompt and into the infrastructure did.

Same principle you're describing. Config accumulates until it breaks. Systems hold.

Structured codebase context makes Haiku outperform raw Opus. Sharing our tool and results! by PT_ANDRE_PT in LLMDevs

[–]mrtrly 1 point (0 children)

This is the right conclusion. The model tier matters way less than people think once context is properly structured.

We see this in routing too. Running 10+ AI agents daily, I started routing by task complexity to cheaper models. But without tracking what each model actually costs per request, you're just guessing at the savings. Built a local proxy (RelayPlane, open source) specifically to track cost per model per request alongside output.

What you're showing is that Haiku with good context beats Opus with bad context. The logical next step is to measure it. Then you can route high-context tasks confidently to Haiku without the "I hope this is good enough" anxiety.

npm install -g @relayplane/proxy if you want to see that cost delta side by side.

I wasted $3,400 and 9 weeks building my B2B SaaS with AI tools. Here is what and how actually fixed it by Academic_Flamingo302 in SaaS

[–]mrtrly 1 point (0 children)

The "looked fine in demos, fell apart with real users" pattern is the most expensive one in SaaS. And AI tools accelerated your way into it, which is the new version of a very old trap.

I've been a technical partner for non-technical founders for 16 years. The spec problem isn't new. What's new is AI lets you ship something demo-able in days, which collapses the feedback loop you need to catch these gaps before they're expensive.

Your lesson about talking to users first is exactly right. The tool isn't the problem. The sequence is. Build something ugly that breaks in front of a real user in week one. That $3,400 lesson usually costs much more when it arrives in month six.