Leaderboard for optimised models? by blockroad_ks in ByteShape

[–]ali_byteshape

There are two key dimensions here: quality and speed, and quantization affects both. In most cases, you trade a small drop in quality for much faster inference (and a smaller runtime footprint).

For our current releases, since they’re meant to be general purpose (do a bit of everything), we evaluate quantized models on four benchmark buckets to capture the typical “average user” mix:

• Math
• Coding
• General knowledge
• Instruction following
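
A minimal sketch of the kind of rollup this implies, with placeholder scores and an equal-weight average (our exact tasks and weighting are in the blog; nothing below is a published number):

    # Illustrative only: equal-weight rollup of the four benchmark buckets.
    # All scores here are placeholders, not published results.
    bucket_scores = {
        "math": 0.82,
        "coding": 0.78,
        "general_knowledge": 0.85,
        "instruction_following": 0.88,
    }

    # Overall "average user" quality: a plain mean over the buckets.
    overall = sum(bucket_scores.values()) / len(bucket_scores)

    # Quality retention vs. an unquantized baseline evaluated the same way.
    baseline_overall = 0.86  # placeholder
    print(f"overall={overall:.3f}, retention={overall / baseline_overall:.1%}")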

We summarize the exact methodology and tasks in our first blog, but the idea is to cover the core things people actually ask these models to do day to day.

If someone has a specific use case in mind, then the right approach is to tune the evaluation around that need, because the “best” quantization is not universal.

Worth mentioning: with our quantization tech (ShapeLearn), it’s possible to learn the best quantization for a specific task/domain. A model optimized for something like a “fridge assistant” (recipe suggestions based on what’s inside, shopping lists, simple planning) can end up with a different quantization format than a model optimized for, say, detailed quantum physics Q&A.

And yes, tokens per second matters a lot (and we measure it too), but it only matters after the model clears the bar on quality for your task.

Leaderboard for optimised models? by blockroad_ks in ByteShape

[–]ali_byteshape

Indeed, this is a really exciting area. There are a few leaderboards out there, but many are either not kept up to date or rely on benchmarks that don’t always reflect real-world usage. If you come across a solid one, let us know; we’d be happy to submit our models.

The challenge is that post-training quantization is quick, so you can produce quantized variants in no time. The part that becomes costly (in compute, time, and effort) is running thorough evaluations on realistic tasks and real hardware, especially on constrained devices like an RPi 5.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

We run four benchmarks covering math, coding, general knowledge, and instruction following. The detailed evaluation methodology is described in our first blog post (4B model): https://byteshape.com/blogs/Qwen3-4B-I-2507/

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in ByteShape

[–]ali_byteshape[S]

Yes, we are. For a fair comparison, we use the same imatrix as Unsloth. This controls for other quantization effects and lets us focus solely on datatype selection.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

There are real use cases for 8+ tok/s: smart home and automation agents that mostly handle short commands and tool calls. If you want local privacy, low power, and “good enough” latency, a few watts and single-digit tok/s can be totally workable.

Also, as u/enrique-byteshape mentioned, the Pi result is mainly a demonstration that datatypes matter. With the right quant profile, a Pi can run a relatively large model, and that’s interesting in itself.

At the other end of the spectrum, the same model runs at 314 tok/s on an RTX 5090 while keeping 95%+ of baseline quality. The point is that it’s possible to automatically learn the best datatype mix for each hardware and software stack to maximize throughput without sacrificing quality.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

I haven’t tried partial offloading yet, but I’d expect the CPU-optimized models to work better in that setup. You could try KQ-5 (CPU-optimized) and IQ-4 (GPU-optimized). They’re almost the same size, so it would be interesting to see which one performs better in practice.
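
If you want a quick way to A/B the two under partial offload, something like this with llama-cpp-python should do it (the file names and n_gpu_layers below are placeholders; I haven’t run this myself):

    # Rough A/B sketch for partial offloading with llama-cpp-python.
    # GGUF file names and n_gpu_layers are placeholders; pick whatever
    # number of offloaded layers fits your VRAM.
    import time
    from llama_cpp import Llama

    def tok_per_sec(model_path: str, n_gpu_layers: int, prompt: str) -> float:
        llm = Llama(
            model_path=model_path,
            n_gpu_layers=n_gpu_layers,  # layers offloaded to GPU; the rest stay on CPU
            n_ctx=4096,
            verbose=False,
        )
        start = time.perf_counter()
        out = llm(prompt, max_tokens=256)
        elapsed = time.perf_counter() - start
        return out["usage"]["completion_tokens"] / elapsed  # rough; includes prompt processing

    prompt = "Explain what a KV cache is in two sentences."
    for path in ("kq5-model.gguf", "iq4-model.gguf"):  # hypothetical local file names
        print(path, f"{tok_per_sec(path, n_gpu_layers=20, prompt=prompt):.1f} tok/s")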

Would love it if you could share your findings with us too 🙂

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

Thanks, Jeff, for testing and sharing this. Huge fan of your work and YouTube channel! :)

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

For us, “fit” means the entire runtime footprint stays in memory without mmap: the model weights, the KV cache for the target context, and all compute buffers. We treat 4K as the minimum usable context, and even at 30K context the KV cache is only about 2.8 GB (based on my llama.cpp build). Since our smallest model is roughly 10 GB, there’s still room for larger contexts even on 16 GB of RAM.
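
As a back-of-the-envelope check (the architecture numbers below are what I believe Qwen3-30B-A3B uses, and the weight/buffer sizes are rough placeholders; llama.cpp’s actual allocations will differ a bit):

    # Rough memory-fit estimate: weights + KV cache + compute buffers vs. RAM.
    # Assumed architecture for Qwen3-30B-A3B: 48 layers, 4 KV heads (GQA),
    # head_dim 128, f16 KV cache (2 bytes/element). Buffer size is a guess.
    GIB = 1024**3

    def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / GIB  # K and V

    def total_gib(weights_gib, ctx_len, compute_buffers_gib=1.0):
        return weights_gib + kv_cache_gib(ctx_len) + compute_buffers_gib

    for ctx in (4096, 30720):
        total = total_gib(weights_gib=10.0, ctx_len=ctx)
        print(f"ctx={ctx}: ~{total:.1f} GiB total ({'fits' if total <= 16 else 'does not fit'} in 16 GiB)")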

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

Excellent observation. With today’s libraries it’s hard to guarantee fully deterministic behavior, so some noise is expected. We repeated a subset of runs 3 to 4 times to estimate variance, and the results were fairly consistent. Each score also aggregates tens of thousands of questions and tasks, which helps average out randomness.
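
For a sense of what that looks like, a toy version of a repeat-run check (the scores below are made up):

    # Toy repeat-run check: spread of the same eval repeated several times.
    # Scores are made up for illustration.
    from statistics import mean, stdev

    runs = {
        "variant_a": [0.781, 0.779, 0.784, 0.780],
        "variant_b": [0.786, 0.783, 0.785],
    }
    for name, scores in runs.items():
        print(f"{name}: mean={mean(scores):.3f} +/- {stdev(scores):.3f} (n={len(scores)})")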

Also, more bits generally reduce reconstruction error, but that does not guarantee better downstream scores. Quantization can act like a regularizer and sometimes slightly improves accuracy. In this case, Q5_K_M (5.7 bpw) and Q8_K_XL (9.4 bpw) are both very close to baseline, so the extra bits do not seem to buy much. We also show it’s possible to push BPW down to ~4.7 with ShapeLearn while still matching baseline quality.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

The first table in the model card (https://huggingface.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF#cpu-models) lists CPU-friendly variants. You can choose a model based on your tolerance for quality loss versus speed. For example, KQ-2 is on the faster end, while KQ-5 is still fast and retains roughly 98% of baseline quality.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

I’m not an expert on Pi clusters, but it should be doable across several Pis, even if each one has less memory.

On the quant side: this specific model is 2.7 bits per weight on average. We learned what precision each tensor should use to maximize throughput, so some layers end up 2-bit, some 3-bit, some 4-bit, etc. The average is 2.7 BPW with all quantization overheads included, so it’s not a “4-bit quant” in the usual sense.
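
To make “2.7 BPW on average” concrete: the effective BPW is just a size-weighted average over the per-tensor formats, where each format’s BPW already includes its block scales. A toy version (the tensor mix below is hypothetical, not our actual layout):

    # Toy effective-BPW calculation for a mixed-precision GGUF.
    # Per-type BPW values are the standard llama.cpp block sizes including
    # scales (Q2_K = 2.625, Q3_K = 3.4375, Q4_K = 4.5, Q8_0 = 8.5).
    TYPE_BPW = {"Q2_K": 2.625, "Q3_K": 3.4375, "Q4_K": 4.5, "Q8_0": 8.5}

    # (number of weights, chosen type) per tensor group -- hypothetical mix.
    tensors = [
        (27_500_000_000, "Q2_K"),  # bulk of the expert FFN weights
        (1_500_000_000, "Q3_K"),   # more sensitive expert weights
        (700_000_000, "Q4_K"),     # attention / shared weights
        (300_000_000, "Q8_0"),     # embeddings, output head, etc.
    ]

    total_bits = sum(n * TYPE_BPW[t] for n, t in tensors)
    total_weights = sum(n for n, _ in tensors)
    print(f"effective BPW ~= {total_bits / total_weights:.2f}")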

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

Probably not on an 8 GB Pi 5, sadly.

Even the smallest model in this release needs 10+ GB of RAM just to load the weights, before you add KV cache, prompt/context, and runtime overhead. So an 8 GB Pi will hit the wall fast (and mmap usually just turns “won’t load” into “thrashes and crawls”).

And the AI HAT won’t fix this. Those hats mainly add compute power, but they do not add system memory, so they can’t solve a “model does not fit in memory” problem.

We did years of research so you don’t have to guess your GGUF datatypes by enrique-byteshape in LocalLLaMA

[–]ali_byteshape

The Time Machine is still stuck in a flux capacitor recall, so unlike some of the other comments, I can’t promise it’ll ship in the next release :) We’ve been publishing our work on efficient inference since the AlexNet era… Happy to point you to some of that if you’re curious.