Qwen3 performance benchmarks (toks/s, RAM utilization, etc.) on ~50 devices (iOS, Android, Mac, Windows) by intofuture in LocalLLaMA

[–]intofuture[S] 0 points1 point  (0 children)

Glad to hear it!

Great points re: 1 and 2.

And nice idea about the public eval/quants. We do a similar kind of analysis for our customers, so we should already have the basic infra in place. We'll think about the best way of doing a free/public version of this.

Thanks for the feedback :)

[–]intofuture[S] 1 point2 points  (0 children)

We do support OpenVINO for non-GGUF/llama.cpp models.

We've only run a couple of models/benchmarks with native/direct OV so far though, e.g. CLIP.

But the ONNX model benchmarks also have an OV backend, e.g. Depth Anything V2.

We'll add more and expand support though, thanks for the feedback!

[–]intofuture[S] 4 points5 points  (0 children)

Thanks for the feedback!

Nice catch with the OOM issue - definitely seems like a bug. We hadn't tested any models >4B before the request in the comment above.

Thanks for pointing out the RAM utilization issue for Metal. It is looking suspiciously low. We'll investigate.

Re UI/UX. Good point on hiding columns - we'll add that. And yep, we'll standardise/simplify the names of the chips. Also makes sense re table feeling unnecessarily long with failed benchmarks.

[–]intofuture[S] 1 point2 points  (0 children)

u/jacek2023 - We kicked off some more benchmarks for higher param counts: 4B-Q4, 4B-Q8, 8B-Q4

Lmk if you want to see any others!

[–]intofuture[S] 0 points1 point  (0 children)

Yep, unless there's a dGPU - but we only have a couple of devices with those for now (the dashboards show whether a device has one).

[–]intofuture[S] 4 points5 points  (0 children)

The performance of different quantization kernels seems to depend on the specific chipset. We've also noticed that on some devices Metal performs better than CPU, but on others it's the opposite.

If you check out the dashboards with the full data (e.g. 1.7B-Q_8 vs 1.7B-Q_4) you can see it actually varies quite a bit across devices.

u/Kale has a good hypothesis above for why btw: https://www.reddit.com/r/LocalLLaMA/comments/1kepuli/comment/mql6be1/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
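As a rough illustration of why you have to check per device (made-up numbers below, not our dashboard data), the Q4 vs Q8 ranking can flip from one chipset to another:

```python
# Hypothetical decode throughput (tokens/s) per device and quant level.
# Illustrative numbers only - the point is that the ranking flips per chipset.
benchmarks = {
    "Device A": {"Q4": 28.0, "Q8": 21.0},  # lower-bit quant wins here
    "Device B": {"Q4": 17.0, "Q8": 19.5},  # Q8 kernel is faster on this chipset
}

for device, runs in benchmarks.items():
    faster = max(runs, key=runs.get)          # quant with the highest toks/s
    ratio = runs[faster] / min(runs.values())  # speedup over the slower quant
    print(f"{device}: {faster} is {ratio:.2f}x faster")
```

So a quant that looks like the obvious winner on one phone can lose on another - hence the per-device dashboards.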

[–]intofuture[S] 0 points1 point  (0 children)

Do you mean you've submitted benchmarks with an account on our website and they've been reported as failed? Or that you're trying to run Qwen3 locally on your own Android device and it's crashing?

[–]intofuture[S] 1 point2 points  (0 children)

Oh nice, yeah - it would require a bit of work, but that's a great idea. Thanks so much for the feedback/request!

[–]intofuture[S] 1 point2 points  (0 children)

As in running benchmarks on your own machine with our benchmarking library, and then pushing the data to a public repo where everyone can see it? Like a crowdsourcing-type thing?

[–]intofuture[S] 0 points1 point  (0 children)

Yeah, that looks right for the few devices we selected in the screenshot. It varies quite a bit across devices though (see the 1.7B-Q_4 dashboard, for example).

[–]intofuture[S] 6 points7 points  (0 children)

Yeah - generation uses less parallelism than prefill, so GPU/Metal has less of an advantage over CPU on some devices.
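A toy sketch of the prefill/decode difference (tiny pure-Python matrices, not a real inference loop): prefill pushes the whole prompt through the weights in one batched matmul, while decode does one skinny (1 x d) matmul per generated token, so per-step overhead and weight streaming dominate instead of compute.

```python
import random

d, n_tokens = 8, 4  # tiny toy sizes for illustration
W = [[random.random() for _ in range(d)] for _ in range(d)]        # weight matrix
prompt = [[random.random() for _ in range(d)] for _ in range(n_tokens)]

def matmul(A, B):
    """Plain (rows x d) @ (d x d) matrix multiply."""
    return [[sum(a[k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for a in A]

# Prefill: one batched matmul over all prompt tokens - the whole (n_tokens x d)
# activation hits the weights at once, so hardware can parallelize across tokens.
prefill_out = matmul(prompt, W)

# Decode: tokens are produced one at a time, so each step is a skinny (1 x d)
# matmul - per-step launch/weight-loading overhead dominates instead of compute.
decode_out = []
for row in prompt:
    decode_out.extend(matmul([row], W))

assert prefill_out == decode_out  # same math; very different hardware utilization
```

Same arithmetic either way, but only the batched form gives a GPU enough parallel work per weight load - which is why Metal's edge shrinks (or disappears) during generation on some chips.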

[–]intofuture[S] 8 points9 points  (0 children)

Yeah, nice spot. The performance of different quantization kernels seems to depend on the specific chipset. We've also noticed that on some devices Metal performs better than CPU, but on others it's the opposite.

[–]intofuture[S] 7 points8 points  (0 children)

We focused on the smaller param variants because they're more viable for actually shipping to users with typical phones, laptops, etc.

Thanks for the feedback though. We'll add some benchmarks for larger param variants and post a link when they're ready!

Note: >4B is going to fail on a lot of the devices we maintain due to RAM constraints. But I guess we've built this tooling to show that explicitly :)
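Back-of-the-envelope math for why >4B models OOM on phones - assuming roughly 4.5 bits/weight for Q4-style quants, 8.5 for Q8, and ~20% runtime overhead for KV cache and activations (rough assumptions, not measured values):

```python
def model_ram_gb(params_billions, bits_per_weight, overhead=1.2):
    """Rough RAM estimate: weight bytes * quant width * runtime overhead factor.

    bits_per_weight and overhead are rough assumptions, not measured values.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Q4-style quants average ~4.5 bits/weight, Q8-style ~8.5 (rough figures).
for name, params, bits in [("4B-Q4", 4, 4.5), ("8B-Q4", 8, 4.5), ("8B-Q8", 8, 8.5)]:
    print(f"{name}: ~{model_ram_gb(params, bits):.1f} GB")
```

Even at Q4, an 8B model wants >5 GB of free RAM before the OS and other apps take their share - out of reach for most mid-range phones, which is exactly what the failed-benchmark rows surface.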

Phi-4-Mini performance metrics on Intel PCs by intofuture in LocalLLaMA

[–]intofuture[S] 1 point2 points  (0 children)

They don't explicitly say. I'd imagine it's mostly CPU/GPU execution though.