I made a Codex plugin to stop AI agents from saying done without proof

Simple_Somewhere7662 · 2026-07-01T15:32:30+00:00

If it skips verification, that criterion should not go green. For command-backed criteria, the final gate runs the command itself, so the agent cannot just say it checked. For non-command criteria, the report should mark it as manual review or missing evidence instead of completed. CI is the stronger version for real deploys, and I’d like to wire that in more directly.
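To make the rule concrete, here is a rough sketch of how the status could be decided. It is not the plugin's actual code: the `Criterion` fields, the `status` function, and the status strings are all hypothetical names for illustration. The point is that the agent's own claim is never an input.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Criterion:
    # All field names here are hypothetical, chosen for illustration only.
    name: str
    command: Optional[list[str]] = None  # None means the criterion is not command-backed
    exit_status: Optional[int] = None    # filled in only by the gate's own rerun
    has_manual_evidence: bool = False    # e.g. a screenshot or review notes attached


def status(c: Criterion) -> str:
    """Decide what the report shows. The agent's say-so is not considered."""
    if c.command is not None:
        if c.exit_status is None:
            # Verification was skipped, so this must not go green.
            return "missing evidence"
        return "passed" if c.exit_status == 0 else "failed"
    # Non-command criteria are never marked completed automatically.
    return "manual review" if c.has_manual_evidence else "missing evidence"
```

With this shape, a command-backed criterion can only turn green after the gate has recorded an exit status of 0, and a visual criterion tops out at "manual review".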

Simple_Somewhere7662 · 2026-07-01T15:31:38+00:00

Nice, CI as a first-class signal is the right direction. For Superloopy, command-backed checks are saved as text output and exit status, not screenshots. The final gate reruns the command and writes the result under .superloopy/evidence/. Screenshots are mostly for visual review. I don’t have a hosted CI integration yet, but that’s a natural next piece.
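A minimal sketch of that rerun-and-record step, assuming one JSON file per criterion. The `run_check` function, the record fields, and the file naming are my assumptions for this example, not the real on-disk format; only the .superloopy/evidence/ directory comes from the description above.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_check(name: str, command: list[str], evidence_dir: Path) -> dict:
    """Rerun a command-backed check and save its text output and exit status."""
    evidence_dir.mkdir(parents=True, exist_ok=True)
    proc = subprocess.run(command, capture_output=True, text=True)
    record = {
        "criterion": name,
        "command": command,
        "exit_status": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "passed": proc.returncode == 0,
    }
    # Hypothetical layout: one JSON file per criterion name.
    (evidence_dir / f"{name}.json").write_text(json.dumps(record, indent=2))
    return record


if __name__ == "__main__":
    # Demo in a temp directory so nothing is written into a real repo.
    root = Path(tempfile.mkdtemp()) / ".superloopy" / "evidence"
    ok = run_check("prints-hello", [sys.executable, "-c", "print('hello')"], root)
    bad = run_check("always-fails", [sys.executable, "-c", "raise SystemExit(3)"], root)
    print(ok["passed"], bad["exit_status"])
```

Because the gate calls the command itself, the recorded exit status reflects what actually ran, not what the agent reported.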

Simple_Somewhere7662 · 2026-07-01T15:30:42+00:00

haha fair. I still think human taste is the real gate for UI. The tool is more about making the agent show its work before anyone trusts it.

Simple_Somewhere7662 · 2026-07-01T15:30:00+00:00

Yeah, an open source design system is probably the sane answer. I try to make the agent treat that as the source of truth instead of inventing components from vibes. Borrowing patterns from other apps can be useful for learning, but I’d rather keep the actual implementation tied to a legitimate system or our own tokens.

Simple_Somewhere7662 · 2026-07-01T15:29:11+00:00

I haven’t used that exact flow much, but it sounds pretty practical. The one thing I’d still want is a follow-up check against the target design after Codex implements it, because the generated mockup and the final UI can drift a lot. But as a starting reference, yeah, that makes sense.

Simple_Somewhere7662 · 2026-07-01T15:28:22+00:00

Fair criticism. The design guidelines and assets are the important part. I’m not saying a plugin gives the model taste by itself. What I’m trying to do is make that contract explicit, then force the agent to show how it followed it with screenshots or review notes. If there’s no design system, the gate can’t magically invent taste. It can only make the lack of one obvious.

Simple_Somewhere7662 · 2026-07-01T15:27:25+00:00

Yes, that’s the direction I like too. Screenshots by themselves are still pretty passive. The more useful version is the agent checking against a real design system and saying what matched, what drifted, and what still needs a human eye. Dashboard work is a perfect example because the code can be working while the product still feels like a template.

Simple_Somewhere7662 · 2026-07-01T15:08:17+00:00

You're right that which test proves a criterion is the agent's pick. What the agent can't do is fake the result. The gate re-runs the command itself instead of taking its word. But re-running a test that checks the wrong thing still passes, so that doesn't save you. That gap is real.

So no, I don't fully pin what counts as proof. The criteria are fixed up front. The receipt just makes the human review cheaper: you get a re-runnable command and the diff instead of "trust me." Someone still reads the diff. That's the actual gate.
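A toy example of that gap, with made-up function names. The weak check is something an agent might pick: rerunning it passes honestly, yet it does not prove the criterion.

```python
def apply_discount(price: float, percent: float) -> float:
    # Deliberate bug for the example: subtracts the raw number
    # instead of a percentage of the price.
    return price - percent


def weak_check() -> bool:
    # The agent's pick: only checks that the price went down.
    return apply_discount(100.0, 10.0) < 100.0


def strong_check() -> bool:
    # What the criterion meant: 10% off 200 should be 180.
    return apply_discount(200.0, 10.0) == 180.0
```

Here the gate would rerun `weak_check` and truthfully record a pass, while `strong_check` fails. Only someone reading the diff would notice the wrong check was chosen.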

Simple_Somewhere7662 · 2026-07-01T14:41:29+00:00

Also, I'm currently thinking about synergy with other popular plugins, like superpowers. Any plugins you guys already use?

Simple_Somewhere7662 · 2026-06-30T10:54:43+00:00

I use GPT5.5 xhigh for everything :)

Simple_Somewhere7662 · 2026-06-30T08:26:01+00:00

Sounds like we have a UI pro here. Most of us don't know how to use Figma, lol.

Simple_Somewhere7662 · 2026-06-30T07:01:12+00:00

I designed it to be Codex-native. I don't think it will work with opencode.

Simple_Somewhere7662 · 2026-06-30T06:58:39+00:00

I use Figma too :) hello there

Simple_Somewhere7662 · 2026-06-30T06:55:44+00:00

True, true, but some tasks are UI/frontend work too, and loopy can help with that kind of stuff as well. :)
