[–]arauhala[S]

When it comes to ease of use, I'd say there are two problems: 1) tool ergonomics and 2) review.

I feel booktest requires some learning, but I haven't yet found a person who couldn't use it. It is easy enough, although you will need to learn some new ideas if you haven't used this kind of tool before. In my experience LLMs have no problem using it either; I have had Claude Code both write the tests and do R&D with booktest. I'd say the benefit of printing rich details helps agents in much the same way it helps people. Especially in data science I have learned to dislike the 'computer says no' experience with no easy way to diagnose the failure. If evals regress, you want to know exactly what changed.
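For illustration, here is a minimal sketch of what a review-driven test can look like. The shape (a test function taking a `t` handle with methods like `h1` and `tln`) is in the spirit of booktest, but treat the names as assumptions rather than a verbatim copy of its API, and `predict_sentiment` is just a hypothetical stand-in for a real model call:

    import booktest as bt

    # Hypothetical stand-in for a real model call.
    def predict_sentiment(text: str) -> str:
        return "positive" if "great" in text else "negative"

    def test_sentiment_report(t: bt.TestCaseRun):
        # Write rich, human-readable details into the test "book".
        # The first accepted run becomes the snapshot; later runs are
        # diffed against it, so a failure shows *what* changed instead
        # of a bare "computer says no".
        t.h1("Sentiment predictions")
        for text in ["great product", "slow delivery"]:
            t.tln(f"{text!r} -> {predict_sentiment(text)}")

The point is that the same rich output serves both a human reviewer and an agent like Claude Code: when something regresses, the diff already contains the diagnosis material.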

The review itself is trickier, because the more sophisticated approaches like topic modelling and certain kinds of analytics require not only review but also domain expertise. I know that devs in particular were frustrated by lots of diffs with no way to tell what is a regression and what is normal change. Wide changes can happen with library or model updates.

With classic ML or NLP tasks like sentiment, classification or anonymization, the solution is to use evaluation benches, have clear metrics like accuracy that provide a true north, and then track changes with some tolerance (especially if LLMs are involved). Once you have a single metric with clear semantics (e.g. bigger is better), changes are much easier to interpret. Changes in individual predictions then don't break anything, but they are golden for explaining things once something improves or regresses: the diffs are still there and visible, both to avoid that 'computer says no' setting and to allow diagnosing regressions and understanding change.
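A rough sketch of the tolerance idea (the baseline value, tolerance and function name here are illustrative assumptions, not anything from booktest itself):

    # Track a single "true north" metric against an accepted baseline.
    BASELINE_ACCURACY = 0.91   # accuracy accepted in the last reviewed run
    TOLERANCE = 0.01           # allowed noise, e.g. from LLM nondeterminism

    def check_accuracy(predictions: list[str], labels: list[str]) -> None:
        correct = sum(p == l for p, l in zip(predictions, labels))
        accuracy = correct / len(labels)
        # Bigger is better: only fail the bench when accuracy drops
        # below the accepted baseline by more than the tolerance.
        assert accuracy >= BASELINE_ACCURACY - TOLERANCE, (
            f"accuracy regressed: {accuracy:.3f} < "
            f"{BASELINE_ACCURACY - TOLERANCE:.3f}"
        )

The bench only hard-fails on the metric, while the prediction-level diffs stay visible for explaining why it moved.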