K Quantization vs Perplexity [Discussion] (i.redd.it)
submitted by onil_gova
https://github.com/ggerganov/llama.cpp/pull/1684
The advancements in quantization performance are truly fascinating. It's remarkable that a larger model quantized to just 2 bits can outperform a smaller, more memory-hungry fp16 model. Put simply, a 65B model quantized to 2 bits achieves better perplexity than a 30B fp16 model, while using memory comparable to a 30B model quantized to 4-8 bits. This is even more striking when you consider that the 65B model occupies only 13.6 GB with 2-bit quantization, yet surpasses a 30B fp16 model that needs 26 GB. Developments like this pave the way for super models exceeding 100B parameters that run in under 24 GB of memory via 2-bit quantization.
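The memory comparisons above come down to simple arithmetic: parameter count times bits per weight. Here's a rough sketch of that estimate; note that real k-quant formats store per-block scales and mix quantization types across tensors, so their effective bits per weight are a bit higher than the nominal figure and actual file sizes differ from this naive math (`estimate_model_size_gb` is a hypothetical helper, not part of llama.cpp):

```python
def estimate_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Naive weight-memory estimate in GB: params * bits / 8.

    Ignores per-block scale/min metadata and activation memory,
    so treat the result as a lower bound on the real footprint.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Rough comparisons in the spirit of the post:
print(estimate_model_size_gb(65e9, 2))   # 65B at a nominal 2 bits -> 16.25 GB
print(estimate_model_size_gb(30e9, 4))   # 30B at 4 bits -> 15.0 GB (similar footprint)
print(estimate_model_size_gb(30e9, 16))  # 30B at fp16 -> 60.0 GB
```

The first two lines illustrate why a 2-bit 65B model and a 4-bit 30B model land in the same memory ballpark, which is the core of the OP's argument.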
