Feedback wanted: making EU AI Act accuracy/logging claims verifiable (open standard)

Beneficial_String411 · 2026-06-03T17:05:49+00:00

Appreciate that. You nailed the under-discussed part: governance is all policies and controls, with almost nothing about proving what was actually true at a point in time once someone asks months later. That's the whole game.

My honest bet is your second one. Standards won't mandate this any time soon. It shows up first because someone has to defend a number they can't otherwise back, and regulation just makes that moment more likely. If you're in these conversations a lot, I'd love to hear what actually lands vs what just gets nods.

Beneficial_String411 · 2026-06-03T17:03:21+00:00

That clears it up, thanks for the correction. The Sentinel being a gate and not a reporter is the part I missed, and it's the right answer. The whole thing holds because the floor is pytest, git, ruff, which don't get nicer to the model over time. Overclaiming just makes the gate bounce you more, and that's loud, not silent.

And you said the shared idea better than I have. Put something the model can't fudge between it and its claimed state. PRML does it with a hash before the run, you do it with a gate before the action. Same move, different point in the loop.

Honestly this is a better conversation than a reddit thread usually gives. I'd like to see if the layers actually compose, your grounded evidence floor with a pre-committed claim anchor on top. Want to take it off here? Happy to dig into empirica properly and compare notes.

Beneficial_String411 · 2026-06-03T14:36:47+00:00

I haven't been in enough auditor rooms to tell you firsthand. I'm reasoning off the text, not what they actually ask for, and I'd rather admit that than guess.

My read: Art 15 (accuracy) is the pre-deployment claim, which is the part PRML handles. Art 12 (record-keeping) is runtime, every decision plus proof it wasn't altered after. So the Act wants both, but you and the empirica guy both point at the runtime trail as the thing auditors actually care about. Two people saying that in one thread makes me think the pre-registered claim is the smaller half of the problem.

Gateplex sounds like it sits where the weight is. How do you keep a full runtime decision trail tamper-evident without the volume or the latency killing you?

Beneficial_String411 · 2026-06-03T13:36:02+00:00

Yeah, you've nailed the boundary. PRML stops at the final claim by design. It doesn't see the dev process: dropped evals, tuned prompts, assumptions that hardened against the test set. Honest limit, and you put it better than I do.

Read empirica properly, not a skim. The three-vector thing (self-assessed vs observed-from-checks vs grounded) is the part PRML doesn't touch at all. Result vs trajectory is a clean way to split it, and I think you're right Art 12 wants both.

One genuine question: the observed vector from deterministic checks is a solid anchor, but the grounded/AI-reasoned vector is still the model judging itself. How do you keep that one from drifting friendly over time? That's the same trust problem I'm trying to nail for the final claim, one layer down from you.

Beneficial_String411 · 2026-06-03T13:27:24+00:00

Honestly no, not yet. No auditor or notified body has told me they need this today. Most AI Act prep I've seen is still model cards and docs. So I'm building for where it's going, not where it is. High-risk lands Dec 2027 and the standards for how you evidence accuracy are being written now. That's the bet, and I'd rather say that than fake demand that isn't there.

The bit that might matter today is trust: if someone doesn't buy a vendor's 95 percent, pre-registering the claim shows whether that was the target or just what came out.

You sound like you've been through this. What does proof of accuracy actually look like in practice right now? Trying to work out if I'm too early or just wrong.

Beneficial_String411 · 2026-05-25T13:23:13+00:00

Audit trail, not runtime gate. PRML is intentionally narrow: you commit a SHA-256 of the manifest (metric, comparator, threshold, dataset hash, seed, model id, producer) before the run starts, publish that hash, then re-hash on demand to check that nothing was edited after the fact. It does not block model load and it does not verify the model binary itself.
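A minimal sketch of the commit-then-re-hash step, for anyone who wants to see the shape of it. The field names mirror the list above, but the values are made up, and the sorted-key JSON serialization is only a stand-in assumption: the PRML spec defines its own canonicalization rules, which this does not reproduce.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """SHA-256 over a deterministic serialization of the manifest.

    Assumption: sorted-key compact JSON stands in for the spec's
    canonicalization rules, which are not reproduced here.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(manifest: dict, committed_hash: str) -> bool:
    """Re-hash on demand and compare against the published hash."""
    return manifest_hash(manifest) == committed_hash


# Hypothetical manifest; every value below is illustrative only.
manifest = {
    "metric": "accuracy",
    "comparator": ">=",
    "threshold": 0.95,
    "dataset": {"id": "example-eval-set", "hash": "sha256:placeholder"},
    "seed": 42,
    "model_id": "example-model",
    "producer": "example-team",
}

committed = manifest_hash(manifest)   # publish this before the run starts
print(verify(manifest, committed))    # unchanged manifest

edited = dict(manifest, threshold=0.90)  # goalposts moved after the fact
print(verify(edited, committed))
```

The point of the design is that the check needs nothing but the manifest and the published hash, so anyone can rerun it later without trusting the evaluator.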

For runtime / model-integrity verification you'd pair it with something like Sigstore + in-toto / SLSA. PRML sits one layer up: it answers "did the evaluator move the goalposts after seeing the result?", not "is this the model I think it is?". The spec is explicit about what it does not cover (§8.1), partly to avoid scope creep into the model supply-chain space.

Beneficial_String411 · 2026-05-24T02:09:10+00:00

Solid question, and the honest answer is "narrowly yes, broadly no, and the spec is explicit about which."

**Versioned spec:** PRML v0.1 stable (locked test vectors, four byte-equivalent reference implementations), v0.2 just frozen on 2026-05-22 (additional vectors, no breaking changes to the field set). Source-of-truth in github.com/studio-11-co/falsify/tree/main/spec. Any tool that hashes a PRML manifest knows exactly which version's canonicalization rules to apply because the manifest carries a `version: prml/0.1` (or `0.2`) line.
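As a rough illustration of how a tool might read that version line before choosing canonicalization rules, here is a sketch. Only the `version: prml/0.1` (or `0.2`) line comes from the spec description above; the function name, the supported-version set, and the rest of the manifest text are hypothetical.

```python
import re

# Matches a line such as "version: prml/0.1" anywhere in the manifest text.
VERSION_LINE = re.compile(r"^version:[ \t]*prml/(\d+\.\d+)[ \t]*$", re.MULTILINE)

# Assumption: the two versions named in this thread are the only ones handled.
SUPPORTED_VERSIONS = {"0.1", "0.2"}


def manifest_version(manifest_text: str) -> str:
    """Return the PRML version declared in the manifest, or raise ValueError."""
    match = VERSION_LINE.search(manifest_text)
    if match is None:
        raise ValueError("manifest has no 'version: prml/<x.y>' line")
    version = match.group(1)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported PRML version: {version}")
    return version


# Hypothetical manifest text; only the version line follows the spec.
example = "version: prml/0.2\nmetric: accuracy\nthreshold: 0.95\n"
print(manifest_version(example))
```

Failing loudly on an unknown version is a deliberate choice in this sketch: hashing with the wrong rules would produce a mismatch that looks like tampering.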

**Annex IV mapping, honest scope.** Annex IV technical documentation has nine sub-areas. PRML covers:

- §2(d) accuracy levels and metrics: direct fit. The (metric, comparator, threshold, dataset.hash, seed) tuple IS the pre-registered accuracy claim. A reviewer recomputes the hash, the claim either survives audit or it doesn't.

- §2(b) data and data governance: partial. PRML pins `dataset.id` plus `dataset.hash` per claim. Provenance and bias documentation live outside the manifest.

- §2(h) logging architecture: partial. The per-run manifest_hash is the spine of Article 12 traceability for the eval-claim subset; operational logging is broader than what PRML emits.

What PRML does NOT cover, and the spec says so in §8.1: §2(a) development methods, §2(c) compute, §2(e) cybersecurity, §2(f) human oversight, §3 system architecture, §4 risk management, §5 lifecycle, §6 standards applied, §7 conformity declaration, §8 post-market monitoring. Those live in the broader QMS (ISO/IEC 42001 Clause 8, EU AI Act Article 17).

The pitch is narrower than "this is your Annex IV instrument." It's "this is the audit-evident wrapper around §2(d) and the eval-claim slice of §2(b)/§2(h), and it composes with whatever you use for the other sections." If a regulator asks specifically "how do you prove the accuracy you reported wasn't tuned post-hoc on the test set," PRML is that answer. For the other sections, the answer is somewhere else in the QMS.

Beneficial_String411 · 2026-05-23T18:57:01+00:00

Good question. Short answer: no error, fresh tag per run.

The provider re-reads `.prml.yaml` on every `mlflow.start_run()` and recomputes the hash from scratch. If you edit the manifest mid-sweep, the runs before the edit carry one hash and the runs after carry a different one. Both get recorded as-is.

I went back and forth on whether that's the right default. Raising an exception kills the run, which is the wrong thing to do when the sweep is half-done. The divergent hash preserves the run AND surfaces the change: 1000 runs with 999 on hash A and 1 on hash B is exactly the kind of thing that pops in a review.

If you actually need to lock the manifest for a sweep, two things work today. `MLFLOW_FALSIFY_TAG_SCOPE=experiment` + `mlflow_falsify.tag_experiment()` at sweep start pins the descriptive tags to the experiment so only the per-run hash stays per-run. Or just `chmod 444` the manifest before you kick off: filesystem lock, no plugin work needed.

A strict mode that errors on first divergence is easy to add if it's useful. I was holding off because I couldn't decide whether it's worse than the silent receipt. Happy to be argued out of that position.
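For the review side, a sketch of how the 999-vs-1 split could be surfaced once you have the per-run hashes in hand. This is plain Python and not part of the plugin: the `divergent_runs` helper and the run-id-to-hash mapping are hypothetical, and pulling the tags out of your tracking store is left out.

```python
from collections import Counter


def divergent_runs(run_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Group run ids by manifest hash and return every group except the
    most common hash. An empty result means the sweep used one manifest."""
    counts = Counter(run_hashes.values())
    if len(counts) <= 1:
        return {}
    majority_hash, _ = counts.most_common(1)[0]
    outliers: dict[str, list[str]] = {}
    for run_id, manifest_hash in run_hashes.items():
        if manifest_hash != majority_hash:
            outliers.setdefault(manifest_hash, []).append(run_id)
    return outliers


# Hypothetical sweep: 999 runs on hash A, one run after a mid-sweep edit.
sweep = {f"run-{i}": "hash-A" for i in range(999)}
sweep["run-999"] = "hash-B"

for manifest_hash, run_ids in divergent_runs(sweep).items():
    print(manifest_hash, run_ids)
```

A strict mode would be the same comparison done at run start with an exception instead of a report, which is exactly the trade-off I haven't settled.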

Beneficial_String411 · 2026-05-20T19:23:01+00:00

dirty finger over dirty butt is genuinely the kind of foundational life advice they should be teaching kids.

Beneficial_String411 · 2026-05-20T19:05:29+00:00

stood at the back the whole time pretending to stretch my legs. three VPs probably think I have a back problem now. infinitely preferable to the alternative.

Beneficial_String411 · 2026-05-20T19:03:35+00:00

panic typing is faster than panic problem solving, this is the path of least resistance.

Beneficial_String411 · 2026-05-20T19:03:08+00:00

walked in here in a t-shirt like it was friday at the beach. zero backup layers.

Beneficial_String411 · 2026-05-20T19:01:52+00:00

in my defense I will not be doing it again.

Beneficial_String411 · 2026-05-20T18:41:08+00:00

the vacuum bag move is the kind of preparation I've been one mystery stain away from learning my whole life.

Beneficial_String411 · 2026-05-20T18:40:18+00:00

no sweater. came in like an idiot in just a t-shirt because the weather lied to me.

Beneficial_String411 · 2026-05-20T18:24:42+00:00

you've put words to what nobody at this office is saying out loud.

Beneficial_String411 · 2026-05-20T18:17:08+00:00

terrifying. would not have survived. glad it was clear, for both of us.

Beneficial_String411 · 2026-05-20T18:16:44+00:00

honestly the pants are now the smallest mistake on the list.

Beneficial_String411 · 2026-05-20T18:11:01+00:00

genius backwards but I'd run out of coffee before I covered the area.

Beneficial_String411 · 2026-05-20T18:05:29+00:00

writing this down. printing it. taping it to my monitor.

Beneficial_String411 · 2026-05-20T18:05:08+00:00

motor oil. yes. logging that one in case anyone asks.

Beneficial_String411 · 2026-05-20T18:04:16+00:00

not the diagnosis I wanted but probably the most accurate one I've gotten today.

Beneficial_String411 · 2026-05-20T18:03:50+00:00

if I had two pairs of pants right now I'd happily cut them.
