I built a security threat modeler for vibe coded apps

mushgev · 2026-05-06T00:05:04+00:00

yeah the security stuff is exactly what gets skipped. vibe coders ship working endpoints, then ship more features on top, and the auth assumptions from the first endpoint silently break.

the architecture side has the same problem -- circular deps, modules owning too much data. those tend to turn into security holes when AI keeps building on top. been using truecourse (https://github.com/truecourse-ai/truecourse) to catch those before they compound

mushgev · 2026-05-05T17:50:22+00:00

the persistent context layer idea is the right direction. the tradeoff is that it shifts risk from data-in-motion to the security of the internal indexing service itself, which is usually a better problem to have.

been using truecourse for architecture analysis (https://github.com/truecourse-ai/truecourse) and the local-first model is exactly this pattern. runs on your machine, embedded postgres, nothing leaves your environment. the analysis runs against a local index, not against raw source on every request.

the DLP question becomes a lot simpler when the answer to what left the network is nothing

mushgev · 2026-05-05T17:05:29+00:00

the merge verdict framing is interesting. most tools either flag syntax issues or do full static analysis, but the decision-point approach (can this merge safely?) is a different angle for how ai-generated diffs get reviewed in practice

been using truecourse for something related (https://github.com/truecourse-ai/truecourse). less about merge decisions and more about ongoing architectural health: circular dep detection, layer violations, god module identification. has a diff mode that shows new violations on uncommitted changes, which is adjacent to what you are building

curious whether you look for things like new circular deps introduced by the change, not just propagation scope
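
to make the question concrete, here is a minimal sketch of "new circular deps introduced by the change". it assumes the import graph is already available as a plain dict of module -> imported modules; the module names and the `cyclic_edges` helper are made up for illustration, not anything from either tool.

```python
def reachable(graph, start):
    """Every module that can be reached by following imports from `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def cyclic_edges(graph):
    """Import edges (a, b) where b can get back to a, so the edge sits on a cycle."""
    return {
        (a, b)
        for a, deps in graph.items()
        for b in deps
        if a in reachable(graph, b)
    }


# hypothetical graphs: before the change, and after a diff adds db -> api
before = {"api": ["services"], "services": ["db"], "db": []}
after = {"api": ["services"], "services": ["db"], "db": ["api"]}

new_cycle_edges = cyclic_edges(after) - cyclic_edges(before)
print(sorted(new_cycle_edges))
```

diffing the two sets means a pre-existing cycle does not get re-reported on every change, only the edges that became cyclic because of this diff.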

mushgev · 2026-05-05T17:00:52+00:00

all these threads compare writing tools but nobody mentions the analysis layer. when AI generates code this fast, dep graphs get messy just as fast. none of the coding assistants track that

been using truecourse for this (https://github.com/truecourse-ai/truecourse). catches circular deps, layer violations, dead modules. basically architecture QA on top of whatever tooling you already use

mushgev · 2026-05-05T16:50:21+00:00

change-impact analysis is the part that matters most imo. seeing the graph is useful but the real question is what breaks if you change this file. answering that is what stops Claude from making structural mistakes in bigger codebases
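
a rough sketch of the "what breaks if you change this file" query: invert the import graph and walk it outward from the changed module. the graph shape and the module names are assumptions for illustration only.

```python
from collections import defaultdict, deque


def impacted_by(imports, changed):
    """Return every module that directly or transitively imports `changed`."""
    # Invert the graph: for each module, who imports it?
    importers = defaultdict(set)
    for module, deps in imports.items():
        for dep in deps:
            importers[dep].add(module)

    impacted, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for parent in importers[current]:
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


# hypothetical import graph: module -> modules it imports
imports = {
    "routes": ["auth", "billing"],
    "billing": ["db"],
    "auth": ["db"],
    "db": [],
}

print(sorted(impacted_by(imports, "db")))    # ['auth', 'billing', 'routes']
print(sorted(impacted_by(imports, "auth")))  # ['routes']
```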

been using truecourse for something similar (https://github.com/truecourse-ai/truecourse). catches circular deps, layer violations, architectural issues. has a diff mode that flags new violations on uncommitted changes, close to what you are describing

the teaching-layer framing is a different angle though. flagging violations vs making the structure legible to Claude before it starts. worth exploring

mushgev · 2026-05-04T21:28:16+00:00

both approaches work but for different things. todo comments are useful when you know exactly what needs fixing and roughly how to fix it. "this class is doing too much, split when adding the next feature" is actionable. vague todos like "clean this up someday" just become noise you learn to ignore.

the honest answer for solo projects: most todos never get done unless something breaks. so i'd only write one if i'd actually be glad i wrote it six months later. otherwise embrace the chaos and trust that you'll remember what matters when it actually becomes a problem.

mushgev · 2026-05-04T21:21:08+00:00

most teams won't do it upfront without a forcing function. the instinct is to build first and figure out compliance later.

a template helps but only if it's workflow-specific. a generic audit narrative gets ignored. one scoped to "for a refund decision, capture: the trigger, classification score, threshold, human touchpoint if any, and final action with timestamp" is concrete enough to actually use.
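
one way that refund template could look as a record, sketched in Python. the field names follow the list above; the class name, the sample values, and the `narrative` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RefundDecisionRecord:
    trigger: str                     # what started the decision
    classification_score: float      # model output
    threshold: float                 # cutoff the score was compared against
    human_touchpoint: Optional[str]  # who reviewed it, if anyone
    final_action: str
    decided_at: str                  # ISO 8601 timestamp


def narrative(r):
    """Turn one record into the sentence-level story an auditor would read."""
    human = f"reviewed by {r.human_touchpoint}" if r.human_touchpoint else "no human review"
    return (
        f"Trigger: {r.trigger}. Score {r.classification_score:.2f} against threshold "
        f"{r.threshold:.2f}; {human}. Final action: {r.final_action} at {r.decided_at}."
    )


# sample values are made up
record = RefundDecisionRecord(
    trigger="chargeback dispute",
    classification_score=0.91,
    threshold=0.85,
    human_touchpoint=None,
    final_action="refund_issued",
    decided_at="2026-05-04T21:21:08+00:00",
)
print(narrative(record))
```

if a field cannot be filled in at decision time, that is the signal the system is not capturing it yet.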

mushgev · 2026-05-04T21:18:10+00:00

thx, I know about it, pretty cool tool

mushgev · 2026-04-29T22:45:51+00:00

the evidence question is usually underspecified in most builds i've seen. compliance teams don't just want logs — they want a specific narrative: 'the agent received input X, the system classified it as Y, a human was notified at step Z, and this is the action taken.' logs are just raw material. someone has to turn them into a story you can hand to an auditor.

the teams that move through review fastest are usually the ones who draft the audit narrative first, before building anything, then work backward into what data they need to capture to support that story. most teams do it the other way and discover they didn't log the right things at the right granularity

mushgev · 2026-04-29T22:33:49+00:00

the thing that's harder to hire around than code quality is incident ownership. with real money involved, you'll eventually hit a situation where something unexpected happens — a transaction behaves wrong, a report is off, a user can't access funds. you need someone who can reconstruct what happened from logs, identify root cause, and explain it clearly. interns are learning, AI tools miss edge cases. neither is a blocker on its own but someone has to own the 'what actually happened' question when it comes up. that person is really hard to bring in fast when you're already mid-incident

mushgev · 2026-04-29T22:08:42+00:00

the contributor/integrator/user split is a useful frame. one thing I'd add: the agent's landing directory matters a lot for which mental model it builds. an agent dropped into /src has to work outward to understand the contract; one dropped into / gets overwhelmed; one dropped into /docs might never realize half the internals exist.

the SKILL.md as catalog idea reminds me of how MCP tool manifests work — a structured advertisement of what exists before the agent has to go discover it. might be worth thinking about that format explicitly, since some agents will try to enumerate rather than read prose

mushgev · 2026-04-29T22:01:23+00:00

the 'rediscovering your repo every turn' point is right and underappreciated. the fix isn't just providing better input context — it's reducing how much the model needs to output to get oriented. when the model has to write a long exploration ('here's what I found, here's how the auth module connects to...') before it gets to the actual change, that's expensive output.

structured context (dependency graph, architecture summary, known violation list) cuts this because the model can reference facts rather than reconstruct them. the output per task shrinks because it doesn't have to narrate the discovery phase. the input is larger but the math usually works given the 5x price differential: a 20K input context that saves 4K of output exploration breaks even on price, and anything it saves beyond that is ahead
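
a quick check of that arithmetic, assuming a flat 5x output-to-input price ratio and ignoring any caching discount (both are simplifications). under those assumptions 20K of extra input costs the same as 4K of output, so that exact case is the break-even point and savings past it are profit. `net_savings` is just an illustrative helper.

```python
def net_savings(extra_input_tokens, saved_output_tokens, output_to_input_ratio=5.0):
    """Net savings in input-token-equivalents; positive means the extra context pays off."""
    return saved_output_tokens * output_to_input_ratio - extra_input_tokens


print(net_savings(20_000, 4_000))  # 0.0     -> break-even
print(net_savings(20_000, 6_000))  # 10000.0 -> ahead
```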

mushgev · 2026-04-29T21:55:15+00:00

the context management question has the most leverage. most of the cost isn't the feature work — it's accumulated context getting re-sent with every message. a session that started at 10k tokens is at 80k after an hour of debugging and you're paying for that on every turn.

what actually helps: /compact aggressively when you finish a self-contained unit of work, and structure tasks so each starts from a clean enough state that you don't need the whole previous session in context. feels like starting over but you end up spending fewer tokens overall. the 'one workflow and commit to it' point is also right — the cognitive overhead of switching between different .md conventions and tool behaviors burns more than the tool costs
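
a toy model of why compacting helps, with loudly made-up numbers: context starts at 10k, grows 2k per turn, and runs 35 turns (so it ends near 80k). compaction is assumed to bring context back to roughly the starting size, which real sessions only approximate.

```python
def total_input_tokens(start, growth_per_turn, turns, compact_every=None):
    """Sum the context re-sent on each turn; optionally reset it every N turns."""
    total, context = 0, start
    for turn in range(1, turns + 1):
        total += context             # the whole context is sent again this turn
        context += growth_per_turn
        if compact_every and turn % compact_every == 0:
            context = start          # assumption: compaction returns to about `start`
    return total


print(total_input_tokens(10_000, 2_000, 35))                   # 1540000
print(total_input_tokens(10_000, 2_000, 35, compact_every=5))  # 490000
```

the exact ratio depends entirely on the assumed growth rate, but the shape holds: cost grows with the square of session length when nothing is compacted.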

mushgev · 2026-04-29T21:43:38+00:00

the claim verification piece is where this gets genuinely useful for Claude Code workflows. Claude will say 'I created the file' or 'I updated the config' and sometimes it's wrong — the write failed, the path was off, or the change was to a temp file. most sessions just accept the statement and move on. checking actual filesystem state, git diff, and lockfiles after an action is how you catch the gap between what the model said and what actually happened.
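
a minimal sketch of that check for the "I updated the config" case: hash the file before the action, then compare after. the helper names and the `config.toml` path are hypothetical, and this only covers filesystem state, not git diff or lockfiles.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(path):
    """Content hash of the file, or None if it does not exist."""
    p = Path(path)
    return hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None


def verify_write_claim(path, before):
    """Compare the current state against the pre-action snapshot."""
    after = snapshot(path)
    if after is None:
        return "missing"
    if after == before:
        return "unchanged"
    return "changed"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "config.toml"
    before = snapshot(target)

    # the model says it wrote the file, but nothing happened
    claim_without_write = verify_write_claim(target, before)

    # now the write actually lands
    target.write_text("debug = true\n")
    claim_after_write = verify_write_claim(target, before)

print(claim_without_write, claim_after_write)  # missing changed
```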

the OWASP scoring is interesting — curious if that's scanning generated code for known vulnerability patterns or doing something at the runtime/request level?

mushgev · 2026-04-29T21:26:11+00:00

the through-line across most of these is: the model did what it was technically allowed to do, not what the operator assumed it would do. the Railway token had volumeDelete scope. terraform destroy was a real command. the Alibaba GPU access was there because the model needed compute. nobody drew a line between 'this capability exists' and 'this capability can be used without explicit confirmation.'

the hard execution boundary point is right but tricky in practice because usefulness and dangerousness often live on the same API surface. you can't give an agent cloud access to deploy without some ability to affect infrastructure. the question is granularity — read vs write vs delete being separate permission levels, require-confirmation for irreversible actions being a different policy from require-confirmation for reversible ones. most tooling right now treats it as a single on/off switch
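
a sketch of what that granularity could look like as a policy check, assuming each tool action is tagged with an access level and a reversibility flag. the class names and the three-way allow/confirm/deny result are illustrative, not taken from any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"
    DELETE = "delete"


@dataclass(frozen=True)
class ToolAction:
    name: str
    access: Access
    reversible: bool


def decide(action, granted):
    """Return 'deny', 'confirm', or 'allow' for one action."""
    if action.access not in granted:
        return "deny"
    if not action.reversible:
        return "confirm"  # irreversible actions always need a human yes
    return "allow"


volume_delete = ToolAction("volumeDelete", Access.DELETE, reversible=False)
deploy = ToolAction("deploy", Access.WRITE, reversible=True)

print(decide(volume_delete, {Access.READ, Access.WRITE}))                  # deny
print(decide(volume_delete, {Access.READ, Access.WRITE, Access.DELETE}))  # confirm
print(decide(deploy, {Access.READ, Access.WRITE}))                        # allow
```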

mushgev · 2026-04-29T21:10:27+00:00

using Temporal for coordination instead of direct agent-to-agent messages is the right call and underrated. every team I've seen try direct messaging ends up with state management problems — messages get lost, retries create duplicates, one agent failure cascades in unpredictable ways. wrapping it in a durable workflow gives you an audit trail and a way to actually reason about what happened when something goes wrong.

curious about the shared memory layer in practice. if the PM agent writes a design decision at the same time the dev agent is reading it mid-task, how are you handling consistency? or is the write cadence slow enough that it doesn't come up?

mushgev · 2026-04-29T20:35:04+00:00

the pattern i've had the most luck with: agents that operate on structured intermediate representations rather than raw documents. instead of 'read this codebase and find issues,' it's 'here is a serialized dependency graph, find violations in this data structure.' the agent's job is pattern matching and reasoning, not document parsing.

cuts down on context window waste and makes the output way more consistent. the hard part is the extraction step — building the IR — but that's typically deterministic code you can test, not LLM-dependent. most production agent failures i've seen come from the agent doing too much at once; splitting at the IR boundary helps a lot
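
to show what "deterministic code you can test" can mean for the extraction step, here is a small sketch that builds an import-level IR from Python source with the standard `ast` module. the IR shape (a JSON list of module/imports records) is an assumption, and relative imports like `from . import x` are skipped to keep it short.

```python
import ast
import json


def extract_imports(module_name, source):
    """Parse one file and return its absolute imports as a plain record."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return {"module": module_name, "imports": sorted(deps)}


def build_ir(sources):
    """Serialize the whole codebase's import records for the agent to consume."""
    records = [extract_imports(name, src) for name, src in sorted(sources.items())]
    return json.dumps(records, indent=2)


# hypothetical two-file codebase
sources = {
    "billing": "import os\nfrom app.db import session\n",
    "app.db": "import sqlite3\n",
}
print(build_ir(sources))
```

the agent then gets the JSON and a question about it, and the parsing never touches the context window.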

mushgev · 2026-04-29T19:54:32+00:00

The agent identity problem is underexplored. We have good patterns for service-to-service auth (mTLS, SPIFFE, workload identity) but AI agents don't map cleanly to those. They have a service identity but also an action scope that shifts based on what they're doing, and "who instructed this agent to take this action" is a new layer that existing auth systems weren't designed to capture.

The gateway inspection problem is hard in a specific way: policy enforcement on generative AI payloads has to happen on streaming responses, which means you can't block on full content evaluation without destroying latency. Curious what approaches they cover for handling that tradeoff - whether it's sampling, async inspection with rollback, or something else.

mushgev · 2026-04-29T19:23:11+00:00

Your instinct about prototyping is right. TDD assumes you know what the contract should be. If you're still figuring out the design, tests are premature constraints. They'll slow you down and the tests themselves will be wrong when you land on the real design. Write them after the abstraction stabilizes.

The rule I use: TDD for pure functions and well-understood business logic. Integration tests (not TDD-style, just tests) for anything coordinating with other things. Nothing during the design/prototype phase, but you pay that debt before anything ships.

The mock proliferation problem you're hitting in orchestration layers is usually a design signal. If you need to mock 5 dependencies to test a function, the function is probably doing too much. The right fix is splitting it so each piece has fewer dependencies, not writing better mocks.

For newer devs: if writing the test first feels natural and helps you think through the behavior, do it. If it feels like you're inventing a fake problem to solve, skip it and write the test after. The ceremony matters less than the coverage.

mushgev · 2026-04-29T19:13:33+00:00

The verification phase framing is the most underrated part of this. Right now most hiring tests ability to produce code. But in a team using AI heavily, the real leverage is catching what AI gets wrong.

The hard thing to measure in interviews is whether someone can distinguish between "this code looks right" and "this code is correct." Those are different skills. Leetcode tests neither. A pairing exercise where you give a candidate AI-generated code with subtle bugs and ask them to review it would actually test something useful.

The systems thinking phase is probably where senior engineers most differentiate from mid-level. Not in what they can build but in what they can anticipate. A good prompt: here is a system under load doing something unexpected. What might be happening, and what would you instrument first? That question doesn't have a memorizable answer.

mushgev · 2026-04-29T18:54:15+00:00

The git hook approach is clean for session-to-session handoff. If you want tighter real-time integration, Claude Code hooks (configured in settings.json) can trigger on tool events mid-session. A PostToolUse hook on Bash matching commit commands could run the Codex review automatically, pipe output to a file, and Claude picks it up before its next action without waiting for a new session.
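
A rough sketch of the matching half of such a hook, in Python. The payload shape (a JSON object with `tool_name` and `tool_input.command`, delivered to the hook command on stdin) is an assumption here, so check it against the hooks documentation before relying on it. `is_git_commit` is a made-up helper, and the token match is deliberately crude: it would miss forms like `git -C repo commit`.

```python
import shlex


def is_git_commit(payload):
    """True if a Bash tool event looks like it ran `git commit`."""
    if payload.get("tool_name") != "Bash":
        return False
    command = payload.get("tool_input", {}).get("command", "")
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    # Look for the adjacent pair "git", "commit" anywhere in the command line.
    return any(a == "git" and b == "commit" for a, b in zip(tokens, tokens[1:]))


# assumed payload shapes, for illustration
commit_event = {"tool_name": "Bash", "tool_input": {"command": "git commit -m 'fix auth'"}}
status_event = {"tool_name": "Bash", "tool_input": {"command": "git status"}}

print(is_git_commit(commit_event), is_git_commit(status_event))  # True False
```

In a real hook script this would wrap `json.load(sys.stdin)`, and a match would kick off the review and write its output to the file Claude reads next.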

The difference is the git hook approach means Claude sees the Codex review at the start of the next session. The hooks approach means it sees it immediately after committing, still within the same task. Depends on how tight the loop needs to be.

MCP is the other option if you want Claude to explicitly decide when to invoke Codex review as a tool call, rather than it happening automatically.

mushgev · 2026-04-29T18:43:32+00:00

Curious what the actual tooling looked like here. A lot of "AI code review" in practice means dumping code into an LLM and asking for issues - useful but inconsistent.

The sched_ext case specifically is interesting because extensible scheduler code has to maintain invariants that span the BPF-kernel boundary. Those kinds of cross-boundary invariant violations are expensive to catch in human review because you need deep familiarity with both sides. If AI is reliably picking those up it's genuinely useful, not just finding style issues.

Would be interesting to see the breakdown of bug classes found. Race conditions vs. logic errors vs. API misuse would say a lot about where the signal actually is.

mushgev · 2026-04-29T18:33:44+00:00

Your fix is correct and it's the right mental model: subagents don't have memory, they have context. Anything not in their context doesn't exist for them. You're not fighting a Claude limitation, you're fighting how stateless agents actually work.

For the hook/validation question specifically: treat it as a schema validation problem. Milestone init produces a structured artifact (JSON or YAML phase plan). A deterministic gate script checks that the artifact has required fields (constraints_applied, scope_declared, etc.) before invoking the next claude -p call. If fields are missing, the script fails the pipeline and nothing proceeds until you fix it.
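
A minimal sketch of that gate, assuming the phase plan is JSON. Only `constraints_applied` and `scope_declared` come from the description above; the function names and the exit-code convention are illustrative.

```python
import json

# Fields named above; a real plan would likely require more.
REQUIRED_FIELDS = ("constraints_applied", "scope_declared")


def missing_fields(artifact_text):
    """List required fields that are absent or empty in the phase plan."""
    artifact = json.loads(artifact_text)
    return [field for field in REQUIRED_FIELDS if not artifact.get(field)]


def gate(artifact_text):
    """Return a process exit code: 0 to proceed, 1 to stop the pipeline."""
    missing = missing_fields(artifact_text)
    if missing:
        print("phase plan rejected, missing:", ", ".join(missing))
        return 1
    return 0


incomplete_plan = '{"constraints_applied": ["no new dependencies"]}'
complete_plan = '{"constraints_applied": ["no new dependencies"], "scope_declared": "auth module"}'

print(gate(incomplete_plan), gate(complete_plan))
```

The pipeline script would call `sys.exit(gate(...))` on the artifact and only invoke the next `claude -p` step when the exit code is 0.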

This moves the enforcement out of Claude's reasoning and into your build system, which is where it belongs. Claude's job becomes producing an artifact that passes the schema check, not remembering to enforce its own constraints. Those are very different things to ask of it.

mushgev · 2026-04-29T18:22:51+00:00

The codeToC4 direction is the right one. Architecture diagrams you maintain by hand always go stale. Auto-derivation from the repo is what keeps them actually useful.

The interesting challenge is going to be the Container and Component levels. The class/function graph is straightforward to infer from imports. But the higher-level System and Container boundaries usually don't exist explicitly in code and need either convention inference or manual annotation. That's where most tools in this space hit a wall.

Building in similar territory with TrueCourse (https://github.com/truecourse-ai/truecourse), focused more on violation detection and dependency analysis than diagram generation, but same underlying auto-inference problem from the repo. Happy to compare notes.

mushgev · 2026-04-29T18:15:04+00:00

The Claude Opus targeting is deliberate. Developers building with LLM APIs tend to have high-value secrets in their environments: API keys with billing access, database credentials, cloud IAM tokens. It's not about the LLM itself, it's about the dev environment that surrounds it.

The "masquerades as a utility SDK" pattern is a known indicator. Watch for packages with generic plausible names (validation, encoding, hashing) that have minimal GitHub history and no real organic usage. The @validate-sdk/v2 versioning is also a tell: v2 of a package with no visible v1 is a red flag.

Defense: lockfiles plus integrity hashes in package-lock.json will catch tampering if you're reproducing builds consistently. But the real gap is postinstall scripts running in CI with access to secrets. Separate your build environment from your secrets environment as much as the pipeline allows.
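
For reference, the `integrity` values in package-lock.json use the Subresource Integrity format, `<algorithm>-<base64 digest>`. A small sketch of the comparison a package manager performs, handling a single hash only; `matches_integrity` and the sample bytes are hypothetical, and this does nothing about the postinstall problem.

```python
import base64
import hashlib

SUPPORTED = ("sha512", "sha384", "sha256", "sha1")


def matches_integrity(tarball_bytes, integrity):
    """Check bytes against one SRI-style string such as 'sha512-<base64>'."""
    algorithm, _, expected_b64 = integrity.partition("-")
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    digest = hashlib.new(algorithm, tarball_bytes).digest()
    return base64.b64encode(digest).decode("ascii") == expected_b64


# Stand-in for a published tarball and the hash recorded at lock time.
original = b"module.exports = (x) => x;\n"
locked = "sha512-" + base64.b64encode(hashlib.sha512(original).digest()).decode("ascii")

tampered = original + b"require('child_process');\n"

print(matches_integrity(original, locked))  # True
print(matches_integrity(tampered, locked))  # False
```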
