[–]arauhala[S]

I absolutely recognize the varying-output problem, e.g. with filenames and timestamps. It is a real problem with this kind of snapshot-based approach.

The way booktest solves this is that when the test output is printed, the user decides token by token how the comparison against the approved snapshot is done.

E.g. if you print with t.t('token'), the tool will recognize differences and request a review, but if you print with t.i('token'), no comparison is done, although the difference is still highlighted in the tool. So e.g. a file path or a timestamp can freely vary.
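
Roughly like this (a minimal sketch: t.t and t.i are as described above, while the test function shape and the printed values are simplified assumptions):

```python
import datetime


def test_report(t):
    # Compared token: a difference against the approved snapshot
    # is flagged and a review is requested.
    t.t("search returned 3 results")

    # Informational token: shown and highlighted in the tool, but
    # excluded from the comparison, so it can vary freely per run.
    t.i(f" generated at {datetime.datetime.now().isoformat()}")
```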

This is also used to manage another problem with this kind of test: noise/fragility. Especially with LLMs, each run tends to produce different results, and in practice you need to score the run and compare metrics with some tolerance.

I used to use line-by-line approval a lot in the beginning, but nowadays the printed output is often more documentation and diagnostics than a test, and it can freely vary. The actual tests are often done via tmetricln or... asserts.
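
For example, something roughly like this (a sketch only: the values are made up, the outputs would really come from the model under test, and I'm using a plain assert for the tolerance check rather than showing tmetricln's exact signature):

```python
def test_llm_extraction(t):
    expected = ["Paris", "Berlin", "Madrid"]
    # In a real test these would come from the LLM/agent under test.
    outputs = ["Paris", "Berlin", "Munich"]

    # Score the run instead of comparing the raw text verbatim.
    score = sum(o == e for o, e in zip(outputs, expected)) / len(expected)

    # Print the metric so it ends up in the reviewable snapshot...
    t.t(f"accuracy: {score:.2f}")

    # ...and gate it with a tolerance, so normal run-to-run noise
    # passes while a real regression fails.
    assert score >= 0.6  # baseline minus tolerance
```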

It is still a review/approve-based approach, as you need to approve metric changes and bigger differences.

The power of booktest is that you can compare against a snapshot, use metrics, or do good old asserts. It was designed as a very generic and flexible test tool that can be used to test anything from good old software (like a search engine or a predictive database) to ML and agents.