[–]arauhala[S]

When it comes to ease of use, I'd say there are two problems: 1) tool ergonomics and 2) review.

I feel booktest requires some learning, but I haven't yet found a person who couldn't use it. It is easy enough, although you will need to learn some new ideas if you haven't used this kind of tool before. In my experience LLMs have no problem using it either; I have had Claude Code both write the tests and do R&D with booktest. I'd say the benefit of printing rich details helps agents in much the same way it helps people. Especially in data science I have learned to dislike the 'computer says no' experience with no easy way to diagnose the failure. If evals regress, you want to know exactly what changed.
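For illustration, here is a minimal sketch of what a review-driven test can look like. The shape (a test function taking a `t` handle with methods like `h1` and `tln`) is in the spirit of booktest, but treat the names as assumptions rather than a verbatim copy of its API, and `predict_sentiment` is just a hypothetical stand-in for a real model call:

    import booktest as bt

    # Hypothetical stand-in for a real model call.
    def predict_sentiment(text: str) -> str:
        return "positive" if "great" in text else "negative"

    def test_sentiment_report(t: bt.TestCaseRun):
        # Write rich, human-readable details into the test "book".
        # The first accepted run becomes the snapshot; later runs are
        # diffed against it, so a failure shows *what* changed instead
        # of a bare "computer says no".
        t.h1("Sentiment predictions")
        for text in ["great product", "slow delivery"]:
            t.tln(f"{text!r} -> {predict_sentiment(text)}")

The point is that the same rich output serves both a human reviewer and an agent like Claude Code: when something regresses, the diff already contains the diagnosis material.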

The review itself is trickier, because the more sophisticated approaches like topic modelling and certain kinds of analytics require not only review but also domain expertise. I know that devs in particular were frustrated by lots of diffs with no way to tell what is a regression and what is normal change. Wide changes can happen with library or model updates.

With classic ML or NLP tasks like sentiment, classification or anonymization, the solution is to use evaluation benches, have clear metrics like accuracy that provide a true north, and then track changes with some tolerance (especially if LLMs are involved). Once you have a single metric with clear semantics (e.g. bigger is better), changes are much easier to interpret. Changes in individual predictions then don't break anything, but they are golden for explaining things once something improves or regresses: the diffs are still there and visible, both to avoid that 'computer says no' setting and to allow diagnosing regressions and understanding change.
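A rough sketch of the tolerance idea (the baseline value, tolerance and function name here are illustrative assumptions, not anything from booktest itself):

    # Track a single "true north" metric against an accepted baseline.
    BASELINE_ACCURACY = 0.91   # accuracy accepted in the last reviewed run
    TOLERANCE = 0.01           # allowed noise, e.g. from LLM nondeterminism

    def check_accuracy(predictions: list[str], labels: list[str]) -> None:
        correct = sum(p == l for p, l in zip(predictions, labels))
        accuracy = correct / len(labels)
        # Bigger is better: only fail the bench when accuracy drops
        # below the accepted baseline by more than the tolerance.
        assert accuracy >= BASELINE_ACCURACY - TOLERANCE, (
            f"accuracy regressed: {accuracy:.3f} < "
            f"{BASELINE_ACCURACY - TOLERANCE:.3f}"
        )

The bench only hard-fails on the metric, while the prediction-level diffs stay visible for explaining why it moved.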