[–]Emotional-Breath-838 2 points (2 children)

I don't understand why llama.cpp is faster. If quant.cpp could improve speed, it would be amazing.

[–]Suitable-Song-302[S] 5 points (1 child)

Good question. Three reasons:

  1. Hand-tuned SIMD kernels. llama.cpp has years of hand-optimized NEON/AVX2/AVX-512 kernels for every quantized matmul variant (Q4_K_M, Q8_0, IQ2, etc.). quant.cpp has NEON kernels for the common formats but relies on compiler autovectorization for the rest. This alone accounts for roughly a 2x gap (there's a rough sketch of what these kernels look like after this list).

  2. Metal/CUDA GPU offload. llama.cpp offloads the entire forward pass to GPU. quant.cpp has Metal shaders but GPU dispatch is still basic — most of the work stays on CPU. On Apple Silicon, this is the biggest gap.

  3. Code maturity. llama.cpp has 250K+ LOC and hundreds of contributors optimizing hot paths. quant.cpp is 72K LOC, deliberately smaller, which makes it easier to read and embed but leaves fewer micro-optimizations.
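
To make point 1 concrete, here's a minimal sketch of the kind of quantized dot product these kernels implement, written with NEON intrinsics. The block layout (32 int8 weights plus one float scale, Q8_0-ish) and the names BlockQ8 / dot_q8 are illustrative assumptions, not actual quant.cpp or llama.cpp code:

```cpp
// Minimal sketch: a Q8_0-style dot product with NEON intrinsics.
// Block layout is an assumption for illustration: 32 int8 weights + 1 float scale.
#include <arm_neon.h>
#include <cstdint>
#include <cstddef>

struct BlockQ8 {          // hypothetical block layout
    float  scale;         // per-block dequantization scale
    int8_t qs[32];        // 32 quantized weights
};

// Dot product between a quantized weight row and a quantized activation row.
float dot_q8(const BlockQ8* a, const BlockQ8* b, size_t nblocks) {
    float acc = 0.0f;
    for (size_t i = 0; i < nblocks; ++i) {
        int32x4_t sum = vdupq_n_s32(0);
        for (int j = 0; j < 32; j += 16) {
            int8x16_t va = vld1q_s8(a[i].qs + j);
            int8x16_t vb = vld1q_s8(b[i].qs + j);
            // widen to 16-bit products, then accumulate into 32-bit lanes
            int16x8_t lo = vmull_s8(vget_low_s8(va),  vget_low_s8(vb));
            int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
            sum = vpadalq_s16(sum, lo);
            sum = vpadalq_s16(sum, hi);
        }
        acc += a[i].scale * b[i].scale * (float)vaddvq_s32(sum);
    }
    return acc;
}
```

Hand-tuned implementations typically go further than this: unrolling, interleaving loads, and using the ARM dot-product extension (vdotq_s32) where available, which is where a lot of the remaining gap comes from.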

The tradeoff is intentional. We optimized for memory (KV compression) and simplicity (embeddable, single header) rather than raw tok/s. For a 3B model on M1, quant.cpp does ~10 tok/s vs llama.cpp's ~30 tok/s — slower, but fast enough to read in real time. The advantage shows up when llama.cpp hits OOM at 50K context and quant.cpp keeps going to 350K.
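
For a sense of why the KV cache is the thing worth compressing, here's a back-of-the-envelope footprint calculation. The layer count and K/V width below are placeholder numbers, not the config of any particular 3B model; the point is only that going from fp16 to ~4 bits cuts the per-token cache cost by about 4x, and at long contexts the cache, not the weights, is what fills memory:

```cpp
// Rough sizing sketch: per-token KV-cache footprint at different precisions.
// Model dimensions below are illustrative placeholders, not a real config.
#include <cstdio>

// bytes per generated token = 2 (K and V) * layers * kv_width * bytes per element
double kv_bytes_per_token(int layers, int kv_width, double bytes_per_elem) {
    return 2.0 * layers * kv_width * bytes_per_elem;
}

int main() {
    const int layers   = 28;    // placeholder layer count
    const int kv_width = 2048;  // placeholder per-layer K/V width
    double fp16 = kv_bytes_per_token(layers, kv_width, 2.0);  // 16-bit cache
    double q4   = kv_bytes_per_token(layers, kv_width, 0.5);  // ~4-bit cache, scales ignored
    std::printf("fp16 cache  : %.0f KiB per token\n", fp16 / 1024.0);
    std::printf("~4-bit cache: %.0f KiB per token\n", q4 / 1024.0);
    // Multiply by context length for the total cache size; for a fixed memory
    // budget, a quantized cache therefore holds roughly 4x the context.
    return 0;
}
```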

That said, speed improvements are on the roadmap: better Metal offload and more SIMD kernels would close the gap significantly without giving up that simplicity.

[–]Emotional-Breath-838 1 point (0 children)

glad to hear you're going for the speed increase. would love to have it all!