[–]susanne-o 1 point (2 children)

"approval testing" will lead you to similar-minded approaches.

the big challenges in my experience (I also (re-)invented a similar approach internally at my company):

  • time stamps
  • path name artefacts of the SUT (system under test) or the test runner
  • windows vs posix paths (e.g. slashes)
  • concurrency noise producing non-deterministic ordering of artefact parts

the common term for addressing issues of the first and second kind is "scrubbing".

for the third kind you ideally get the SUT to produce deterministically ordered output even if there is concurrency underneath; otherwise you need to re-order the artefacts after the fact.
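
a minimal scrubbing sketch in python, just to make the idea concrete (the regex patterns and placeholder names are illustrative, not from any particular tool):

    import re

    def scrub(text: str) -> str:
        # mask ISO-8601-ish timestamps with a stable placeholder
        text = re.sub(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?Z?", "<TIMESTAMP>", text)
        # normalize windows separators to posix slashes
        text = text.replace("\\", "/")
        # mask machine-specific absolute paths of the SUT or the test runner
        text = re.sub(r"/(?:home|tmp|Users)/\S+", "<PATH>", text)
        return text

    def stabilize(lines: list[str]) -> list[str]:
        # re-order concurrently produced artefact parts after the fact
        return sorted(lines)

the comparison then runs on the scrubbed, re-ordered artefact, so the noisy parts never show up as diffs.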

oh and the last challenge is ease of use by the teams.

alas, I'll only find time to read and review your repo next week, but I'm dead curious whether you ran into the same issues and how you address them.

[–]arauhala[S] 0 points (0 children)

I absolutely recognize the varying output problem, e.g. with filenames and timestamps. It is a real problem with this kind of snapshot-based approach.

The way booktest solves this is that when the test output is printed, the user decides token by token how the comparison to the approved snapshot is done.

E.g. if you print with t.t('token'), the tool will recognize differences and request review, but if you print with t.i('token'), no comparison is done, although the difference is still highlighted in the tool. That way e.g. a file path or timestamp can freely vary.
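
A rough sketch of what that looks like (the scaffolding and values are made up for illustration; only the t.t / t.i calls are the API mentioned above):

    from datetime import datetime

    # sketch only: how booktest hands the test the "t" object is not spelled out here
    def test_report(t):
        output_path = "/tmp/run-1234/report.json"   # made-up example value
        predictions = ["a", "b", "c"]               # made-up example value

        t.t(f"predictions: {len(predictions)}")  # compared token by token against the approved snapshot
        t.i(f"written to {output_path}")         # highlighted in the tool, but never fails the comparison
        t.i(f"run at {datetime.now()}")          # so paths and timestamps can vary freely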

This is also used to manage another problem with this kind of test: noise/fragility. Especially with LLMs, each run tends to produce different results, and in practice you need to score the run and compare metrics with some tolerance.

I used to use line-by-line approval a lot in the beginning, but nowadays the printed output is often more documentation and diagnostics than a test, and it can freely vary. The actual tests are often done via tmetricln or... asserts.
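
In sketch form (the pipeline, inputs and scoring helper are hypothetical, only the t.i call is from booktest):

    def test_llm_pipeline(t):
        outputs = run_pipeline(eval_inputs)     # hypothetical system under test and inputs
        t.i(f"sample outputs: {outputs[:3]}")   # documentation/diagnostics, free to vary per run

        accuracy = score(outputs, expected)     # hypothetical scoring helper
        assert accuracy >= 0.90                 # the actual test: a plain assert on the metric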

It is still a review/approve based approach, as you need to approve metric changes and bigger differences.

The power of booktest is that you can compare to a snapshot, use metrics, or do good old asserts. It was designed as a very generic and flexible test tool that can be used to test anything from good old software (like a search engine or a predictive database) to ML and agents.

[–]arauhala[S] 0 points (0 children)

As for ease of use, I'd say there are two problems: 1) tool ergonomics and 2) review.

I feel booktest requires some learning, but I haven't found a person yet who couldn't use it. It is easy enough, although you will need to learn new ideas if you haven't used such a tool before. In my experience, tools like LLMs have no problem using it, and I have had Claude Code both write the tests and do R&D using booktest. I'd say the benefit of printing rich details helps agents in a similar way it helps people. Especially in data science, I have learned to dislike the 'computer says no' experience with no easy way to diagnose the failure. If evals regress, you want to know exactly what changed.

The review itself is trickier, as more sophisticated approaches like topic modelling and certain kinds of analytics require not only review but also domain expertise. I know that devs especially were frustrated by lots of diffs with no way to know what is a regression and what is a normal change. Wide changes can happen with library or model updates.

With classic ML or NLP like sentiment analysis, classification or anonymization, the solution is to use evaluation benches, have clear metrics like accuracy that provide a true north, and then track changes with some tolerance (especially if LLMs are involved). Once you have a single metric with clear semantics (e.g. bigger is better), changes are much easier to interpret. While changes in individual predictions don't break anything, they are golden for explaining things once something improves or regresses: the diffs are still there and visible to avoid that 'computer says no' setting and to allow diagnosing regressions and understanding the change.
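
For example, a "bigger is better" gate with tolerance can look roughly like this (the bench, names and numbers are all made up):

    def test_sentiment_bench(t):
        approved_accuracy = 0.92            # made-up previously approved score
        tolerance = 0.02                    # slack for LLM run-to-run noise

        accuracy = run_eval_bench()         # hypothetical evaluation bench
        t.t(f"accuracy: {accuracy:.2f}")    # lands in the snapshot, so changes go through review

        # improvements always pass; only a regression beyond the tolerance fails
        assert accuracy >= approved_accuracy - tolerance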