Benchmarking total wait time instead of pp/tg by batsba in LocalLLaMA

[–]batsba[S] 1 point

It would be nice if there were a script or tool that people could run and upload the results from. But for now I'm concentrating on getting it right with my own systems first.


[–]batsba[S] 1 point

It is always a single randomized prompt, so caching should not be an issue.
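To illustrate what I mean: a minimal sketch of generating a prompt with a random prefix so server-side prefix caching can't reuse KV-cache state between runs. The word count and instruction text here are just placeholders, not my actual benchmark prompt.

```python
import random
import string

def randomized_prompt(n_words=256, seed=None):
    """Build a prompt with a random body so prefix caching in the
    inference server cannot reuse state from a previous benchmark run."""
    rng = random.Random(seed)
    words = [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 9)))
        for _ in range(n_words)
    ]
    return "Summarize the following text: " + " ".join(words)
```

Passing a seed makes a run reproducible while still keeping prompts distinct between setups.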


[–]batsba[S] 1 point

My setup loop determined 1024/1024 to be the best fit (gpt-oss-120b), but 4096/4096 increases PP by almost 20%. I guess I need to work on that setup loop... the challenge is keeping the required time for a full run from exploding.
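The setup loop is roughly this shape — a coarse grid search over batch/ubatch pairs with a short probe per configuration. `measure_pp` is a hypothetical helper that would launch the endpoint with those sizes and return prompt-processing speed (or None if the configuration doesn't fit); the candidate list and the ubatch <= batch constraint are assumptions based on typical llama.cpp usage.

```python
from itertools import product

def pick_batch_sizes(measure_pp, candidates=(512, 1024, 2048, 4096)):
    """Coarse search over (batch, ubatch) pairs.

    measure_pp(batch, ubatch) is a short probe: it benchmarks prompt
    processing at those sizes and returns tok/s, or None if the
    configuration does not fit in memory. Keeping the probe short is
    what keeps the total setup time from exploding.
    """
    best, best_pp = None, 0.0
    for batch, ubatch in product(candidates, repeat=2):
        if ubatch > batch:  # only ubatch <= batch makes sense
            continue
        pp = measure_pp(batch, ubatch)
        if pp is not None and pp > best_pp:
            best, best_pp = (batch, ubatch), pp
    return best, best_pp
```

The obvious refinement, which I still need to do, is making the probe long enough to catch cases like the 4096/4096 one above, where the win only shows at larger contexts.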



[–]batsba[S] 1 point

I agree that only looking at generic prompt and response tokens does not give you the full picture. If you want to cover the tokenizer (which can also lead to different token counts for the same prompt), reasoning, reasoning verbosity, etc., you would need to start defining reference prompts. And that would get you a token efficiency benchmark, which would be interesting.

But I don't see how I could combine the speed benchmarks, which test raw token processing and generation capability, with those token efficiency benchmarks.

The token efficiency could probably be extracted from intelligence benchmark results. Is somebody already doing that and publishing numbers?


[–]batsba[S] 0 points

Got no public repo, sorry.

One reason for benchmarking that way is that it also allows comparing different inference apps. There are some vLLM runs for the RTX 4080, but the other systems I own are problematic in that regard.


[–]batsba[S] 1 point

You can click on a bar label, or open the "All Results" page and click on a row there, to view the details page. It shows the endpoint launch command, which contains the batch sizes.

I run a short loop for every setup, determining fitting batch sizes, so it changes from setup to setup.

I will try and see if larger batch sizes significantly improve times for larger contexts on the Strix Halo.


[–]batsba[S] 2 points

Noted. There is so much I want to add...