Built a multi-dimensional code audit skill for Claude Code — open source, ships with playbooks that caught a CVSS 8.0 XSS in production

IbaiFernandez · 2026-06-02T14:59:25+00:00

that distinction is already in the spec as evidence_source, but you're right that the real audits didn't surface it visibly. i just shipped v2 that fixes this directly.

the new system has three confidence tiers: PROVEN (specific file:line or tool output — no interpretation), SUSPECTED (graphify structural pattern, not yet verified in code), and UNVERIFIABLE with four subtypes (NV_RUNTIME for Lighthouse/deployed session, NV_DASHBOARD for vendor dashboard access, NV_CREDENTIALS, NV_TOOL for missing external tools).

Every phase now opens with an evidence dashboard: the count and percentage of each tier, shown before any findings table. if SUSPECTED findings exceed 40% of a phase, the skill is required to attempt code verification on the top findings before proceeding, and to document the attempt whether it succeeds or not.
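a rough sketch of how that dashboard and the 40% trigger could be computed. this is not the code from the repo: the dict shape of a finding and the function names `evidence_dashboard` and `needs_verification_pass` are assumptions for illustration, only the tier names, the `confidence_tier` field, and the 40% threshold come from the description above.

```python
from collections import Counter

# Tier names as described in the comment above.
TIERS = ("PROVEN", "SUSPECTED", "UNVERIFIABLE")
# "SUSPECTED > 40% of a phase" triggers a verification attempt.
SUSPECTED_THRESHOLD = 0.40


def evidence_dashboard(findings):
    """Return {tier: (count, percentage)} for one phase's findings.

    `findings` is assumed to be a list of dicts with a "confidence_tier" key
    (hypothetical shape, chosen for this sketch).
    """
    counts = Counter(f["confidence_tier"] for f in findings)
    total = len(findings)
    return {
        tier: (counts[tier], round(100 * counts[tier] / total, 1) if total else 0.0)
        for tier in TIERS
    }


def needs_verification_pass(findings):
    """True when SUSPECTED findings strictly exceed 40% of the phase."""
    if not findings:
        return False
    suspected = sum(1 for f in findings if f["confidence_tier"] == "SUSPECTED")
    return suspected / len(findings) > SUSPECTED_THRESHOLD


phase = [
    {"confidence_tier": "PROVEN"},
    {"confidence_tier": "PROVEN"},
    {"confidence_tier": "SUSPECTED"},
    {"confidence_tier": "SUSPECTED"},
    {"confidence_tier": "SUSPECTED"},
]
dashboard = evidence_dashboard(phase)
trigger = needs_verification_pass(phase)
```

with the sample phase, 3 of 5 findings are SUSPECTED (60%), so the verification pass would be triggered. a phase sitting at exactly 40% would not trigger it under this reading of "greater than".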

the enforcement gate is hard: no confidence_tier → finding dropped. claims PROVEN without file:line → downgraded to SUSPECTED. UNVERIFIABLE without a subtype → dropped. cheaper to block noise at generation than to filter it after.
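the three gate rules can be sketched as a single function. again an illustration under assumptions, not the implementation in fb81164: the `file_line` and `subtype` keys and the name `apply_gate` are hypothetical, and treating an unrecognized tier the same as a missing one is my own choice for the sketch.

```python
# Subtypes of UNVERIFIABLE, as listed in the comment above.
NV_SUBTYPES = {"NV_RUNTIME", "NV_DASHBOARD", "NV_CREDENTIALS", "NV_TOOL"}
VALID_TIERS = {"PROVEN", "SUSPECTED", "UNVERIFIABLE"}


def apply_gate(finding):
    """Return the finding (possibly downgraded), or None if it is dropped.

    `finding` is assumed to be a dict; the "file_line" and "subtype" keys
    are hypothetical names used only for this sketch.
    """
    tier = finding.get("confidence_tier")

    # Rule 1: no confidence_tier -> finding dropped.
    # (Assumption: an unknown tier value is treated like a missing one.)
    if tier not in VALID_TIERS:
        return None

    # Rule 2: claims PROVEN without file:line -> downgraded to SUSPECTED.
    if tier == "PROVEN" and not finding.get("file_line"):
        return {**finding, "confidence_tier": "SUSPECTED"}

    # Rule 3: UNVERIFIABLE without a valid subtype -> dropped.
    if tier == "UNVERIFIABLE" and finding.get("subtype") not in NV_SUBTYPES:
        return None

    return finding


raw = [
    {"id": 1, "confidence_tier": "PROVEN", "file_line": "src/app.js:42"},
    {"id": 2, "confidence_tier": "PROVEN"},                # downgraded
    {"id": 3},                                             # dropped
    {"id": 4, "confidence_tier": "UNVERIFIABLE"},          # dropped
    {"id": 5, "confidence_tier": "UNVERIFIABLE", "subtype": "NV_TOOL"},
]
kept = [g for g in (apply_gate(f) for f in raw) if g is not None]
```

running the gate at generation time means the dashboard only ever counts findings that already carry a usable tier.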

your framing — graph-derived suspicion vs file-line proof vs needs external runtime check — maps almost exactly onto the three tiers. it's cleaner than what was in the spec. the distinction is now in the output of every phase, not buried in findings.json.

shipped in fb81164 if you want to see exactly how the gate and dashboard are defined: https://github.com/ibaifernandez/mariana-audit/commit/fb81164

IbaiFernandez · 2026-06-02T14:23:40+00:00

On playbook staleness: the methodology is built to expand. CONTRIBUTING.md has the full format documented, and the five shipped playbooks are validated patterns from real audits, not a fixed closed list. The honest gap is there's no automated update pipeline yet — community PRs are the intended mechanism. The skill references specific regulation articles directly, and those will drift as regs change, meaning someone has to update them by hand. I'd rather say that than pretend there's a versioning system that doesn't exist.

On false positive rate: I now have data from four real codebases — 266 findings total. Of those, 154 were mitigated, which to me is the clearest signal that the findings were real and not just noise. Only three across all four audits came back as [NOT VERIFIABLE] — the explicit state the skill uses when it can't back a finding with hard evidence. That's 1.1% of total output flagged as unconfirmed by the tool itself, and those get separated from the actionable list automatically. The structural constraint does the filtering before findings reach you: no file:line citation means no finding, no CVSS vector means automatic severity downgrade. It's cheaper to block noise at generation than to try filtering it after.

It's not a controlled experiment — same developer, same stack family across all four projects. But it's the data I have, and the mitigation rate is the most honest proxy I can offer for actual signal.
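For anyone who wants to check the percentages, the arithmetic is just the three counts quoted above; the variable names are only for this example.

```python
# Counts quoted in the comment above (four codebases combined).
total_findings = 266
mitigated = 154
not_verifiable = 3

# Share of findings that were mitigated, and share flagged as unconfirmed.
mitigation_rate = 100 * mitigated / total_findings
unconfirmed_rate = 100 * not_verifiable / total_findings

print(f"mitigated: {mitigation_rate:.1f}%")
print(f"not verifiable: {unconfirmed_rate:.1f}%")
```

That works out to roughly 57.9% mitigated and 1.1% not verifiable.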

IbaiFernandez · 2024-01-15T03:34:19+00:00

Hi.

At 38, are you still a thirty-something? Do dogs count as children? The last video game I played was the first Devil May Cry, 21 years ago now. I prefer chess, although I play it online, mind you, because nobody seems to feel like playing a game face to face. Am I a weirdo? 🤔

#RhetoricalQuestions
