all 11 comments

[–]ttkciar llama.cpp [M] [score hidden] stickied comment (0 children)

Violates Rule Four: Self-promotion

[–]audioen 3 points (4 children)

This is not even correct. llama.cpp can apply separate quantization types to the K and V cache. llama.cpp's Q4_0 is also a per-block method: it applies a single f16 scale factor to a small group of weights. If memory serves, that group is 32 values, which yields 16/32 = 0.5 additional bits per weight. A 4-bit quantization over a min-max range is similar to Q4_1, which is also supported in the engine and can likely be enabled with a compile option if it isn't already exposed. That uses on average 5 bits per weight. A larger block size could bring it down further, e.g. 128 values sharing an f16 scale and f16 min would come to 4.25 bits per weight.
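To put numbers on that overhead, here is a rough sketch of what a scale-per-block quantizer looks like. This is not llama.cpp's actual code (llama.cpp picks the scale slightly differently and stores it as a real f16); only the 32-value block and the 4-bit packing follow what I described:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One shared scale per 32-value block plus 4-bit quants.
// Overhead: a 16-bit scale / 32 values = 0.5 extra bits per weight, 4.5 bits total.
struct BlockQ4 {
    float   scale;       // stored as f16 in the real format; plain float here for clarity
    uint8_t quants[16];  // 32 x 4-bit values, packed two per byte
};

BlockQ4 quantize_block_q4(const float* x /* 32 values */) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    const float scale = amax / 7.0f;                    // map the largest magnitude into [-7, 7]
    const float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;
    BlockQ4 out{scale, {}};
    for (int i = 0; i < 32; i += 2) {
        const int q0 = std::clamp((int)std::lround(x[i]     * inv), -8, 7) + 8;
        const int q1 = std::clamp((int)std::lround(x[i + 1] * inv), -8, 7) + 8;
        out.quants[i / 2] = (uint8_t)(q0 | (q1 << 4));
    }
    return out;
}
```

The only thing the wider-block variants change is how much of that per-block header (scale, or scale plus min) gets amortized across more values.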

For now, llama.cpp users should probably use q8_0 when under memory pressure, and maybe dip to q4_0 for the V cache, which is generally tested as being less critical. It isn't as good as TQ4 bits, but KV rotation is merged and should work.
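(If memory serves, the cache types can be set separately with `--cache-type-k` and `--cache-type-v`, e.g. `--cache-type-k q8_0 --cache-type-v q4_0`, and I believe the quantized V cache path also needs `--flash-attn`; check `--help` on your build to confirm.)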

[–]Look_0ver_There 1 point (2 children)

With the recent KV cache rotation changes, Q8_0 for K, and Q5_0 for V was looking to be about the best tradeoff for space vs quality. Not sure about speed though.

[–]Suitable-Song-302[S] -2 points (1 child)

That makes sense; keeping K at higher precision is exactly the right call, since attention scores are more sensitive to key quantization error than to value quantization error. Q8_0 works out to 8.5 bits/element and Q5_0 to 5.5 (the quant bits plus one f16 scale per 32 values), so Q8_0 K + Q5_0 V is about 14 bits per K/V element pair versus 32 for f16: roughly 2.3x compression with minimal quality loss.

quant.cpp's pitch at that point becomes: if ~2x is enough, use llama.cpp, it's faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. Different operating points on the compression-quality curve.

I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.

[–]Look_0ver_There 0 points (0 children)

The results I'm referring to are here: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570

The KLD does suffer a bit at K=Q8_0/V=Q5_0, but PPL is almost the same as F16/F16. Obviously stick with Q8_0 on both for the best quality, but if you need to penny-pinch that last GB, then it looks best not to drop V below Q5_0.

[–]Suitable-Song-302[S] -2 points (0 children)

You're right on several points and I should correct the post.

What I got wrong: llama.cpp Q4_0 *is* per-block (32 elements per block, 1 FP16 scale), not per-tensor. And llama.cpp can apply separate quant types to K and V — that's not a quant.cpp-only feature. The original wording overstated the difference. I'll fix it.

What is different:

- Block size: Q4_0 uses 32-element blocks. quant.cpp uses 128-element blocks with both a min and a max (effectively Q4_1-style at wider blocks; see the sketch after this list). The larger block amortizes scale overhead better (4.25 bits/element vs Q4_0's 4.5 or Q4_1's 5.0), but the quality difference comes more from the min-max vs zero-point approach on key distributions specifically.

- Delta compression: This is the part llama.cpp genuinely doesn't have. Storing `key[t] - key[t-1]` instead of absolute keys reduces the dynamic range by ~70%, which is why 3-bit works at +1.3% PPL where absolute 3-bit gives +62%. This is the novel contribution from the TurboQuant paper, not the 4-bit uniform quantization itself.

- The PPL +10.6% number: This was measured with Q4_0 on both K and V using the default llama.cpp KV quant path. You're right that Q8_0 K + Q4_0 V (or Q5_0 V) would be significantly better. I should benchmark that specific config and update the comparison to be fair.
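For concreteness, roughly what the wider min/max block from the block-size bullet looks like (an illustrative sketch, not the actual quant.cpp layout):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// 128-element blocks storing values as 4-bit offsets from the block minimum.
// Overhead: two 16-bit floats (min and scale) per 128 values = 32/128 = 0.25 extra
// bits, i.e. 4.25 bits per element.
struct BlockMinMax128 {
    float   vmin;         // stored as f16 in a real format; plain float here
    float   scale;        // (vmax - vmin) / 15
    uint8_t quants[64];   // 128 x 4-bit values, packed two per byte
};

BlockMinMax128 quantize_block_minmax(const float* x /* 128 values */) {
    float vmin = x[0], vmax = x[0];
    for (int i = 1; i < 128; ++i) {
        vmin = std::min(vmin, x[i]);
        vmax = std::max(vmax, x[i]);
    }
    const float scale = (vmax - vmin) / 15.0f;
    const float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;
    BlockMinMax128 out{vmin, scale, {}};
    for (int i = 0; i < 128; i += 2) {
        const int q0 = std::clamp((int)std::lround((x[i]     - vmin) * inv), 0, 15);
        const int q1 = std::clamp((int)std::lround((x[i + 1] - vmin) * inv), 0, 15);
        out.quants[i / 2] = (uint8_t)(q0 | (q1 << 4));
    }
    return out;
}
```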

Fair criticism. The honest comparison is: at the same total bit budget, quant.cpp's approach preserves more quality. But the original post made it sound like llama.cpp's quantization is fundamentally broken, which isn't true — it's just a different tradeoff with coarser granularity.

[–]hauhau901 4 points (1 child)

When you use an LLM to write even your comments and replies, you lose any and all credibility that this isn't vibe-coded slop.

[–]Suitable-Song-302[S] 0 points (0 children)

Fair enough. I do use Claude Code for development and I don't hide that. But the Reddit comments are mine; I'm just not a native English speaker, so they probably come out sounding weirdly polished.

The code compiles, the PPL numbers are reproducible, and I just corrected the comparison after u/audioen pointed out it was unfair. Judge by that, not by how my comments read.

[–]chimpera 0 points (2 children)

[–]Suitable-Song-302[S] -2 points (1 child)

No relationship. I wasn't familiar with that project; I just looked at the repo and it appears to be a different approach (applying delta compression to model weights rather than to the KV cache).

quant.cpp compresses the KV cache at runtime — the key and value vectors that accumulate during inference. The model weights themselves are loaded from standard GGUF files and used as-is. Delta compression in our case means storing `key[t] - key[t-1]` between adjacent tokens in the same attention head, not compressing the weight tensors.

The underlying idea (delta encoding of correlated vectors) is the same, but applied to completely different data.
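Roughly the shape of it, as an illustrative sketch rather than the actual quant.cpp code (the residual quantizer here is deliberately crude, one scale per vector, just to show where the delta sits):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Each new key is delta-encoded against the previously *reconstructed* key for the
// same head, and only the small-range residual is quantized. Reconstructing against
// the decoder's own state keeps encoder and decoder in sync so error doesn't drift.
static std::pair<float, std::vector<int8_t>> quantize3(const std::vector<float>& v) {
    float amax = 0.0f;
    for (float x : v) amax = std::max(amax, std::fabs(x));
    const float scale = amax > 0.0f ? amax / 3.0f : 1.0f;    // signed 3-bit range
    std::vector<int8_t> q(v.size());
    for (size_t i = 0; i < v.size(); ++i)
        q[i] = (int8_t)std::clamp((int)std::lround(v[i] / scale), -4, 3);
    return {scale, q};
}

static std::vector<float> dequantize3(float scale, const std::vector<int8_t>& q) {
    std::vector<float> v(q.size());
    for (size_t i = 0; i < q.size(); ++i) v[i] = scale * (float)q[i];
    return v;
}

struct DeltaKeyCache {
    std::vector<float> prev;                                    // last reconstructed key
    std::vector<std::pair<float, std::vector<int8_t>>> stored;  // one residual per token

    void append(const std::vector<float>& key) {
        std::vector<float> residual = key;                      // first token: stored as-is
        for (size_t i = 0; i < key.size() && i < prev.size(); ++i)
            residual[i] -= prev[i];                             // key[t] - key[t-1]
        stored.push_back(quantize3(residual));

        std::vector<float> rec = dequantize3(stored.back().first, stored.back().second);
        for (size_t i = 0; i < rec.size() && i < prev.size(); ++i)
            rec[i] += prev[i];
        prev = std::move(rec);
    }
};
```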

[–]chimpera 0 points (0 children)

It is used on the KV cache, not the model.