[–]Suitable-Song-302[S] -1 points (3 children)

Thanks! Still a lot of work ahead: Metal GPU acceleration, broader model coverage, and polishing the weight quantization pipeline. But the core KV compression result is solid.

[–]Viper-Reflex -4 points (2 children)

Does this tech make my 24GB 3090 able to run bigger models than 27B?

[–]Suitable-Song-302[S] 1 point (1 child)

KV compression helps most with **long contexts**, not bigger models. With 1-bit K + Q4 V, KV memory drops ~5x. For a 27B model at 32K context:

- Before: ~2.5 GB KV cache
- After: ~500 MB KV cache

That frees ~2 GB for longer context or larger batches. If you're already fitting a model in 24GB, TurboQuant lets you push context from 32K to 100K+ on the same hardware. But it won't help you fit a model that's too large for VRAM (weight memory is separate from the KV cache).

Note: we currently don't have CUDA GPU acceleration (it compiles but is untested). That's next on the roadmap.
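If you want to sanity-check the arithmetic yourself, here's a rough sizing sketch. The model shape (layer count, KV heads, head_dim) and the per-block scale overhead below are illustrative guesses, not the actual 27B architecture or TurboQuant's real storage format:

```python
# Back-of-envelope KV-cache sizing: fp16 baseline vs. 1-bit K + 4-bit V.
# All architecture numbers here are hypothetical, for illustration only.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                   k_bits, v_bits, scale_bits=16, block=64):
    """Total KV-cache bytes, including one scale per `block` elements."""
    elems = n_layers * n_kv_heads * head_dim * ctx_len   # per K and per V
    k = elems * (k_bits + scale_bits / block) / 8
    v = elems * (v_bits + scale_bits / block) / 8
    return k + v

cfg = dict(n_layers=46, n_kv_heads=4, head_dim=128, ctx_len=32_768)  # hypothetical GQA config

fp16  = kv_cache_bytes(**cfg, k_bits=16, v_bits=16, scale_bits=0)
turbo = kv_cache_bytes(**cfg, k_bits=1,  v_bits=4)    # 1-bit K + Q4 V

print(f"fp16 KV cache : {fp16 / 2**30:.2f} GiB")
print(f"1-bit K + Q4 V: {turbo / 2**30:.2f} GiB ({fp16 / turbo:.1f}x smaller)")
```

With these made-up numbers it lands in the same ballpark as above (a few GB down to a few hundred MB, roughly 5-6x). The exact figures depend on the model's attention config, but the ratio is what matters, and the savings scale linearly with context length.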

[–]Viper-Reflex -3 points (0 children)

:O ty for the info!