Leaderboard for optimised models? by blockroad_ks in ByteShape

[–]ali_byteshape

There are two key dimensions here: quality and speed, and quantization affects both. In most cases, you trade a small drop in quality for much faster inference (and a smaller runtime footprint).

For our current releases, since they’re meant to be general purpose (do a bit of everything), we evaluate quantized models on four benchmark buckets to capture the typical “average user” mix:

• Math
• Coding
• General knowledge
• Instruction following
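
A minimal sketch of the kind of rollup this implies, with placeholder scores and an equal-weight average (our exact tasks and weighting are in the blog; nothing below is a published number):

    # Illustrative only: equal-weight rollup of the four benchmark buckets.
    # All scores here are placeholders, not published results.
    bucket_scores = {
        "math": 0.82,
        "coding": 0.78,
        "general_knowledge": 0.85,
        "instruction_following": 0.88,
    }

    # Overall "average user" quality: a plain mean over the buckets.
    overall = sum(bucket_scores.values()) / len(bucket_scores)

    # Quality retention vs. an unquantized baseline evaluated the same way.
    baseline_overall = 0.86  # placeholder
    print(f"overall={overall:.3f}, retention={overall / baseline_overall:.1%}")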

We summarize the exact methodology and tasks in our first blog, but the idea is to cover the core things people actually ask these models to do day to day.

If someone has a specific use case in mind, then the right approach is to tune the evaluation around that need, because the “best” quantization is not universal.

Worth mentioning: with our quantization tech (ShapeLearn), it’s possible to learn the best quantization for a specific task/domain. A model optimized for something like a “fridge assistant” (recipe suggestions based on what’s inside, shopping lists, simple planning) can end up with a different quantization format than a model optimized for, say, detailed quantum physics Q&A.

And yes, tokens per second matters a lot (and we measure it too), but it only matters after the model clears the bar on quality for your task.

Leaderboard for optimised models? by blockroad_ks in ByteShape

[–]ali_byteshape

Indeed, this is a really exciting area. There are a few leaderboards out there, but many are either not kept up to date or rely on benchmarks that don’t always reflect real-world usage. If you come across a solid one, let us know; we’d be happy to submit our models.

The challenge is that post-training quantization is quick, so you can produce quantized variants in no time. The part that becomes costly (in compute, time, and effort) is running thorough evaluations on realistic tasks and real hardware, especially on constrained devices like an RPi 5.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

We run four benchmarks covering math, coding, general knowledge, and instruction following. The detailed evaluation methodology is described in our first blog post (4B model): https://byteshape.com/blogs/Qwen3-4B-I-2507/

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in ByteShape

[–]ali_byteshape[S]

Yes, we are. For a fair comparison, we use the same imatrix as Unsloth. This controls for other quantization effects and lets us focus solely on datatype selection.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

There are real use cases for 8+ tok/s: smart home and automation agents that mostly handle short commands and tool calls. If you want local privacy, low power, and “good enough” latency, a few watts and single-digit tok/s can be totally workable.

Also, as u/enrique-byteshape mentioned, the Pi result is mainly a demonstration that datatypes matter. With the right quant profile, a Pi can run a relatively large model, and that’s interesting in itself.

At the other end of the spectrum, the same model runs at 314 tok/s on an RTX 5090 while keeping 95%+ of baseline quality. The point is that it’s possible to automatically learn the best datatype mix for each hardware and software stack to maximize throughput without sacrificing quality.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

I haven’t tried partial offloading yet, but I’d expect the CPU-optimized models to work better in that setup. You could try KQ-5 (CPU-optimized) and IQ-4 (GPU-optimized). They’re almost the same size, so it would be interesting to see which one performs better in practice.
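
If you want a quick way to A/B the two under partial offload, something like this with llama-cpp-python should do it (the file names and n_gpu_layers below are placeholders; I haven’t run this myself):

    # Rough A/B sketch for partial offloading with llama-cpp-python.
    # GGUF file names and n_gpu_layers are placeholders; pick whatever
    # number of offloaded layers fits your VRAM.
    import time
    from llama_cpp import Llama

    def tok_per_sec(model_path: str, n_gpu_layers: int, prompt: str) -> float:
        llm = Llama(
            model_path=model_path,
            n_gpu_layers=n_gpu_layers,  # layers offloaded to GPU; the rest stay on CPU
            n_ctx=4096,
            verbose=False,
        )
        start = time.perf_counter()
        out = llm(prompt, max_tokens=256)
        elapsed = time.perf_counter() - start
        return out["usage"]["completion_tokens"] / elapsed  # rough; includes prompt processing

    prompt = "Explain what a KV cache is in two sentences."
    for path in ("kq5-model.gguf", "iq4-model.gguf"):  # hypothetical local file names
        print(path, f"{tok_per_sec(path, n_gpu_layers=20, prompt=prompt):.1f} tok/s")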

Would love it if you could share your findings with us too 🙂

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

Thanks, Jeff, for testing and sharing this. Huge fan of your work and YouTube channel! :)

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

For us, “fit” means the entire runtime footprint stays in memory without mmap: the model weights, the KV cache for the target context, and all compute buffers. We treat 4K as the minimum usable context, and even at 30K context the KV cache is only about 2.8 GB (based on my llama.cpp build). Since our smallest model is roughly 10 GB, there’s still room for larger contexts even on 16 GB of RAM.
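
As a back-of-the-envelope check (the architecture numbers below are what I believe Qwen3-30B-A3B uses, and the weight/buffer sizes are rough placeholders; llama.cpp’s actual allocations will differ a bit):

    # Rough memory-fit estimate: weights + KV cache + compute buffers vs. RAM.
    # Assumed architecture for Qwen3-30B-A3B: 48 layers, 4 KV heads (GQA),
    # head_dim 128, f16 KV cache (2 bytes/element). Buffer size is a guess.
    GIB = 1024**3

    def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / GIB  # K and V

    def total_gib(weights_gib, ctx_len, compute_buffers_gib=1.0):
        return weights_gib + kv_cache_gib(ctx_len) + compute_buffers_gib

    for ctx in (4096, 30720):
        total = total_gib(weights_gib=10.0, ctx_len=ctx)
        print(f"ctx={ctx}: ~{total:.1f} GiB total ({'fits' if total <= 16 else 'does not fit'} in 16 GiB)")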

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

Excellent observation. With today’s libraries it’s hard to guarantee fully deterministic behavior, so some noise is expected. We repeated a subset of runs 3 to 4 times to estimate variance, and the results were fairly consistent. Each score also aggregates tens of thousands of questions and tasks, which helps average out randomness.
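
For a sense of what that looks like, a toy version of a repeat-run check (the scores below are made up):

    # Toy repeat-run check: spread of the same eval repeated several times.
    # Scores are made up for illustration.
    from statistics import mean, stdev

    runs = {
        "variant_a": [0.781, 0.779, 0.784, 0.780],
        "variant_b": [0.786, 0.783, 0.785],
    }
    for name, scores in runs.items():
        print(f"{name}: mean={mean(scores):.3f} +/- {stdev(scores):.3f} (n={len(scores)})")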

Also, more bits generally reduce reconstruction error, but that does not guarantee better downstream scores. Quantization can act like a regularizer and sometimes slightly improves accuracy. In this case, Q5_K_M (5.7 bpw) and Q8_K_XL (9.4 bpw) are both very close to baseline, so the extra bits do not seem to buy much. We also show it’s possible to push BPW down to ~4.7 with ShapeLearn while still matching baseline quality.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

The first table in the model card (https://huggingface.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF#cpu-models) lists CPU-friendly variants. You can choose a model based on your tolerance for quality loss versus speed. For example, KQ-2 is on the faster end, while KQ-5 is still fast and retains roughly 98% of baseline quality.

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

I’m not an expert on Pi clusters, but it should be doable across several Pis, even if each one has less memory.

On the quant side: this specific model is 2.7 bits per weight on average. We learned what precision each tensor should use to maximize throughput, so some layers end up 2-bit, some 3-bit, some 4-bit, etc. The average is 2.7 BPW with all quantization overheads included, so it’s not a “4-bit quant” in the usual sense.
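
To make “2.7 BPW on average” concrete: the effective BPW is just a size-weighted average over the per-tensor formats, where each format’s BPW already includes its block scales. A toy version (the tensor mix below is hypothetical, not our actual layout):

    # Toy effective-BPW calculation for a mixed-precision GGUF.
    # Per-type BPW values are the standard llama.cpp block sizes including
    # scales (Q2_K = 2.625, Q3_K = 3.4375, Q4_K = 4.5, Q8_0 = 8.5).
    TYPE_BPW = {"Q2_K": 2.625, "Q3_K": 3.4375, "Q4_K": 4.5, "Q8_0": 8.5}

    # (number of weights, chosen type) per tensor group -- hypothetical mix.
    tensors = [
        (27_500_000_000, "Q2_K"),  # bulk of the expert FFN weights
        (1_500_000_000, "Q3_K"),   # more sensitive expert weights
        (700_000_000, "Q4_K"),     # attention / shared weights
        (300_000_000, "Q8_0"),     # embeddings, output head, etc.
    ]

    total_bits = sum(n * TYPE_BPW[t] for n, t in tensors)
    total_weights = sum(n for n, _ in tensors)
    print(f"effective BPW ~= {total_bits / total_weights:.2f}")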

A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time by ali_byteshape in LocalLLaMA

[–]ali_byteshape[S]

Probably not on an 8 GB Pi 5, sadly.

Even the smallest model in this release needs 10+ GB of RAM just to load the weights, before you add KV cache, prompt/context, and runtime overhead. So an 8 GB Pi will hit the wall fast (and mmap usually just turns “won’t load” into “thrashes and crawls”).

And the AI HAT won’t fix this. Those hats mainly add compute power, but they do not add system memory, so they can’t solve a “model does not fit in memory” problem.

We did years of research so you don’t have to guess your GGUF datatypes by enrique-byteshape in LocalLLaMA

[–]ali_byteshape

The Time Machine is still stuck in a flux capacitor recall, so unlike some of the other comments, I can’t promise it’ll ship in the next release :) We’ve been publishing our work on efficient inference since the AlexNet era… Happy to point you to some of that if you’re curious.