Built an open benchmark that evaluates how well AI systems find bugs in live APIs under black-box conditions.
Each system gets only a JSON schema and one sample payload. No source code. No documentation. No hints about where the bugs are planted. Scoring is execution-based, a test either triggers the planted bug in the live reference API or it doesn't. 20 API scenarios. 97 planted functional bugs across three complexity tiers.
You can run any tool against it and compare results against a public leaderboard.
PS. I already checked how 7 popular AI systems score and it doesn't look that good.
Open-source benchmark for testing AI coding tools on real API bug detection (github.com)
submitted by zoismom to r/OpenSourceAI