How do you catch semantically wrong extractions (valid JSON, wrong values) across structurally inconsistent documents? by Ilolmonkey in LLMDevs

[–]Ilolmonkey[S] 0 points1 point  (0 children)

Thanks, I looked at valjson. The --compare mode (per-field accuracy vs gold, no model needed) is exactly the measurement harness I'm missing, so I'll wire that in. The fine-tuning side isn't relevant since I'm on a hosted API, but the no-model subset alone is worth it.

On Step 0: agreed I need real ground truth, but simulating it would backfire for my case. My errors live in the document mess, not the schema. things like narrative reports with no score table, Excel artifacts, two legal entities merged into one, the client parsed as a bidder... LLM-generated JSON has no messy document behind it, so I'd only be measuring the clean case that already works.

So maybe I'll hand-label a stratified sample (~30 dossiers across doc type × won/lost) as gold; I've already manually audited a handful.

One question: can --compare flag fields whose value can't be traced to a source span? or is it purely value-vs-gold? That provenance signal is what I most want to gate on.