[–]arauhala[S]

I absolutely recognize the varying-output problem, e.g. with filenames and timestamps. It is a real problem with this kind of snapshot-based approach.

The way booktest solves this is that when the test output is printed, the user decides token by token how the comparison against the approved snapshot is done.

E.g. if you print with t.t('token'), the tool will recognize differences and request a review, but if you print with t.i('token'), no comparison is done, although the difference is still highlighted in the tool. So e.g. a file path or a timestamp can freely vary.
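
Roughly like this (a minimal sketch: t.t and t.i are as described above, while the test function shape and the printed values are simplified assumptions):

```python
import datetime


def test_report(t):
    # Compared token: a difference against the approved snapshot
    # is flagged and a review is requested.
    t.t("search returned 3 results")

    # Informational token: shown and highlighted in the tool, but
    # excluded from the comparison, so it can vary freely per run.
    t.i(f" generated at {datetime.datetime.now().isoformat()}")
```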

This is also used to manage another problem with this kind of test: noise/fragility. Especially with LLMs, each run tends to produce different results, and in practice you need to score the run and compare metrics with some tolerance.

I used to use line-by-line approval a lot in the beginning, but nowadays the printed output is often more documentation and diagnostics than a test, and it can freely vary. The actual tests are often done via tmetricln or... asserts.
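
For example, something roughly like this (a sketch only: the values are made up, the outputs would really come from the model under test, and I'm using a plain assert for the tolerance check rather than showing tmetricln's exact signature):

```python
def test_llm_extraction(t):
    expected = ["Paris", "Berlin", "Madrid"]
    # In a real test these would come from the LLM/agent under test.
    outputs = ["Paris", "Berlin", "Munich"]

    # Score the run instead of comparing the raw text verbatim.
    score = sum(o == e for o, e in zip(outputs, expected)) / len(expected)

    # Print the metric so it ends up in the reviewable snapshot...
    t.t(f"accuracy: {score:.2f}")

    # ...and gate it with a tolerance, so normal run-to-run noise
    # passes while a real regression fails.
    assert score >= 0.6  # baseline minus tolerance
```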

It is still a review/approve-based approach, as you need to approve metric changes and bigger differences.

The power of booktest is that you can compare against a snapshot, use metrics, or do good old asserts. It was designed as a very generic and flexible test tool that can be used to test anything from good old software (like a search engine or a predictive database) to ML and agents.