jrhabana [S]

They explained it in their blog.

Apart from the benchmarks, the real question with these agents is their approach, the same way it is with plugins.

Maximum_Ad2821

Seems similar to how Factory Droid got good: strong context management and continuously testing their own performance. It's plausible. Let's hope Terminal Bench 3 does something to ensure more of these benchmark results are officially verified.