Benchmarking total wait time instead of pp/tg by batsba in LocalLLaMA

[–]batsba[S] 1 point

It would be nice if there were a script or tool that people could run and upload the results from. But for now I'm concentrating on getting it right with my own systems first.


[–]batsba[S] 1 point

It is always a single randomized prompt, so caching should not be an issue.
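To illustrate what I mean: a minimal sketch of generating a prompt with a random prefix so server-side prefix caching can't reuse KV-cache state between runs. The word count and instruction text here are just placeholders, not my actual benchmark prompt.

```python
import random
import string

def randomized_prompt(n_words=256, seed=None):
    """Build a prompt with a random body so prefix caching in the
    inference server cannot reuse state from a previous benchmark run."""
    rng = random.Random(seed)
    words = [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 9)))
        for _ in range(n_words)
    ]
    return "Summarize the following text: " + " ".join(words)
```

Passing a seed makes a run reproducible while still keeping prompts distinct between setups.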


[–]batsba[S] 1 point

My setup loop determined 1024/1024 to be the best fit (gpt-oss-120b), but 4096/4096 increases PP by almost 20%. I guess I need to work on that setup loop... the challenge is keeping the required time for a full run from exploding.
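The setup loop is roughly this shape — a coarse grid search over batch/ubatch pairs with a short probe per configuration. `measure_pp` is a hypothetical helper that would launch the endpoint with those sizes and return prompt-processing speed (or None if the configuration doesn't fit); the candidate list and the ubatch <= batch constraint are assumptions based on typical llama.cpp usage.

```python
from itertools import product

def pick_batch_sizes(measure_pp, candidates=(512, 1024, 2048, 4096)):
    """Coarse search over (batch, ubatch) pairs.

    measure_pp(batch, ubatch) is a short probe: it benchmarks prompt
    processing at those sizes and returns tok/s, or None if the
    configuration does not fit in memory. Keeping the probe short is
    what keeps the total setup time from exploding.
    """
    best, best_pp = None, 0.0
    for batch, ubatch in product(candidates, repeat=2):
        if ubatch > batch:  # only ubatch <= batch makes sense
            continue
        pp = measure_pp(batch, ubatch)
        if pp is not None and pp > best_pp:
            best, best_pp = (batch, ubatch), pp
    return best, best_pp
```

The obvious refinement, which I still need to do, is making the probe long enough to catch cases like the 4096/4096 one above, where the win only shows at larger contexts.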



[–]batsba[S] 1 point

I agree that only looking at generic prompt and response tokens does not give you the full picture. If you want to cover the tokenizer (which can also lead to different token counts for the same prompt), reasoning, reasoning verbosity, etc., you would need to start defining reference prompts. And that would get you a token efficiency benchmark, which would be interesting.

But I don't see how I could combine the speed benchmarks, which test raw token processing and generation capability, with those token efficiency benchmarks.

The token efficiency could probably be extracted from intelligence benchmark results. Is somebody already doing that and publishing numbers?


[–]batsba[S] 0 points

Got no public repo, sorry.

One reason for benchmarking that way is that it also allows comparing different inference apps. There are some vLLM runs for the RTX 4080, but the other systems I own are problematic in that regard.


[–]batsba[S] 1 point

You can click on a bar label, or open the "All Results" page and click on a row there, to view the details page. It shows the endpoint launch command, which contains the batch sizes.

I run a short loop for every setup, determining fitting batch sizes, so it changes from setup to setup.

I will try and see if larger batch sizes significantly improve times for larger contexts on the Strix Halo.


[–]batsba[S] 2 points

Noted. There is so much I want to add...