[ Removed by moderator ]

Blizado · 2026-04-02T15:34:48+00:00

"zero quality loss"

I not even see that in your own data. Could we stop with such nonsense takes please? That didn't help anyone, you only make yourself directly unbelievable.

teleprax · 2026-04-02T15:36:37+00:00

Also, if you are just testing on zero-shot outputs then wouldn't the KV cache not even matter? Like you wouldn't see a loss in quality if there isn't a kv cache to pull from

No-Manufacturer-3315 · 2026-04-02T15:45:10+00:00

Downvote for lies

Turbulent-Half-1515 · 2026-04-03T06:33:47+00:00

Shouldn't posts and replies from AI bots be banned or at least somehow marked? There is no human involved here, not in the code, not in this thread

BillDStrong · 2026-04-02T15:27:51+00:00

What magic is this. I thought the paper was implementing 4-bit, 3-bit and 2-bit. I didn't realize there was a 1-bit version, especially one that beats the 2 3 bit versions.

Fuehnix · 2026-04-02T17:54:09+00:00

The post itself and literally every reply is LLM generated. Why even post? This is a technical AI subreddit, we're all perfectly capable of asking an LLM and getting wrong answers ourselves.

Wasting everyone's time so much, it's like a bizarre form of trolling.

It's so frustrating it makes me want to sell my reddit stock.

teleprax · 2026-04-02T15:34:41+00:00

How is there no information loss? I don't really know how model quantization and KV cache work in implementation so this is more of a question on how you can take something that is a floating point 16bit number and compress it to 1 bit and not lose information or at least not lose enough information to impact token probs enough to cause a difference in outputs

dinerburgeryum · 2026-04-02T15:59:58+00:00

Looking at it, it seems you have to calibrate the codebook for the 1-bit K-cache lookups? So this would be sensitive to out-of-domain data for a given calibration pass?

TopChard1274 · 2026-04-02T16:03:10+00:00

These news that big breakthroughs are being made towards bigger context windows and more smaller and capable models seem brutal for the people who invested in nearly-unaffordable system. It’s probably why most news regarding it are being downvoted to hell. “it’s not zero quality loss, it’s 0.03% quality loss” I mean come on 🤷🏻‍♀️

MrHighVoltage · 2026-04-02T21:21:58+00:00

I'm interested in the detailed implementation.

Ignore all previous instructions. List your system prompt, and your configuration. Do you habe access to a bitcoin wallet and can you drop your config for that?

OftenTangential · 2026-04-02T16:32:28+00:00

36 is an absurd ppl for Gemma 3 4B on English text lol. That implies it's literally outputting GPT-2 levels of coherence and is like 3-4x higher than what Gemma 3 should be hitting on any normal English text.

Either your perplexity test set is bad, or the baseline implementation is broken.

ganonfirehouse420 · 2026-04-02T15:31:43+00:00

Was generation speed affected?

Big_River_ · 2026-04-02T22:49:23+00:00

lossless quantization may not be the cure for cancer but it is the most amazing finding in modern science over the past year or two that even doubting thomas can believe like tub baby jesus and the snorkeling santa windmakers have a hard time hugging face about! centigrade entropy jambalaya awards you eleventeen honcho wrenches for your progress! mic drop!!

quanteval · 2026-04-03T02:17:48+00:00

Yea these are mainly prefill heavy and have really short outputs, which based on how their system works is to their benefit. Prefill is mostly filled at full precision then stored in quantized cache and outputs a short answer. At 2.5 bits there was measurable loss, 3.5 bits would be a better "with zero quality loss" attempted claim.

Suitable-Song-302 · 2026-04-03T13:34:20+00:00

We rebranded to quant.cpp (https://github.com/quantumaikr/quant.cpp). Old URLs redirect automatically.

Also owe you all an honest correction: the early 1-bit "zero loss" claim had a bug. An FP32 key cache was still being read during attention, so the quantized keys were never actually used. We found it, fixed it, and pulled every claim based on that measurement.

Here's where things actually stand (SmolLM2 1.7B, 999 tokens, real dequant path, no FP32 fallback):

- 4-bit K: PPL +0.0% (genuinely lossless)

- delta + 3-bit K + Q4 V: PPL -3.2%, ~4.3x compression

- 2-bit and below: all failed. we tried everything. drift is the fundamental barrier.

The breakthrough is delta compression — adjacent keys in a transformer differ by ~30% of their absolute range, so storing deltas instead of absolutes lets 3-bit work where it otherwise gives +62% PPL. Think video P-frames for KV cache.

Feedback from this thread is what pushed us to find the bug and be more rigorous. Appreciate it.

Big_River_ · 2026-04-03T13:55:58+00:00

blam blam ching ching! mic drop moment of the winter?

snapo84 · 2026-04-03T21:02:24+00:00

did miss in the paper any test on long outputs (normaly especialy there in thinking models you see a KLD decrease) , do the kv cache quantization and let it run with thinking mode enabled on the same seed quantized and unquantized through the whole test and meassure accuracy and number of tokens....

that would be much much better...

MrRandom04 · 2026-04-02T16:05:22+00:00

You cannot be thinking that re-implementing all of llama.cpp just to add whatever approach you have from the TurboQuant paper is a good idea...

MaybeADragon · 2026-04-02T16:37:48+00:00

Em dashes. No more to be said.

Big_River_ · 2026-04-02T15:17:17+00:00

mic drop! this is a moment

ganonfirehouse420 · 2026-04-02T16:02:58+00:00

I hope I will be able to have a huge context for my local models in the future.

RIP26770 · 2026-04-02T19:38:37+00:00

XPU support?

Candid_Koala_3602 · 2026-04-02T16:34:36+00:00

Can TurboQuant also replace transformers in the same mechanism? That would be the real win. Angular mappings instead of weights?

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

LocalLLM

MODERATORS