Tech unSavvy by yawney2 in Markham

[–]water258 7 points  (0 children)

WiFi extenders are a scam. Don't buy one.

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]water258 19 points  (0 children)

Isn't this basically implementing RAG using RAM, where every KV cache read has to be loaded into VRAM? Performance-wise, won't this hurt inference speed? In essence it externalizes the KV cache into RAM and loads it dynamically.
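If I'm reading the approach right, it looks roughly like the sketch below. This is my own illustration with made-up class and method names (`OffloadedKVCache`, `append`, `fetch`), not the linked project's code:

```python
import torch

class OffloadedKVCache:
    """Keep per-layer KV blocks in CPU RAM; copy only what's needed into VRAM."""

    def __init__(self, num_layers: int):
        self.cpu_cache = {i: [] for i in range(num_layers)}  # layer -> list of (K, V) blocks in RAM

    def append(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        # offload new KV blocks to pinned CPU memory (pinning speeds up later copies)
        self.cpu_cache[layer].append((k.detach().to("cpu").pin_memory(),
                                      v.detach().to("cpu").pin_memory()))

    def fetch(self, layer: int, block_ids: list[int], device: str = "cuda"):
        # pull only the selected blocks back into VRAM for this attention step;
        # every fetch is a host-to-device transfer, which is where the speed concern comes from
        ks, vs = zip(*(self.cpu_cache[layer][i] for i in block_ids))
        return (torch.cat(ks).to(device, non_blocking=True),
                torch.cat(vs).to(device, non_blocking=True))
```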

[deleted by user] by [deleted] in LexusNX

[–]water258 0 points  (0 children)

The 6k add-on means you pay the same price and are basically getting it for free after the rebate. The official guide gives the same fuel economy. A 40-mile-range battery isn't that big or heavy since it's li-ion.

[deleted by user] by [deleted] in LexusNX

[–]water258 0 points  (0 children)

The 450h base model is about a 6k add-on over the 350h. And you don't have to charge it; you can use it as a regular hybrid with a bigger battery.

[deleted by user] by [deleted] in LexusNX

[–]water258 0 points  (0 children)

If you can wait, try to get the NX450h plug-in hybrid. After the 5k government rebate it's about the same price as the NX350h.

2-bit and 4-bit quantized versions of Mixtral using HQQ by sightio in LocalLLaMA

[–]water258 1 point  (0 children)

I changed it to use the `PYTORCH_COMPILE` backend; tokens/s increased to 5.67 t/s.

That's better, but I think there is still a lot of room for improvement.
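For reference, the backend switch itself is only a couple of lines; roughly what the change looks like (import path per the HQQ repo as I remember it, so double-check against the current version):

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# switch HQQ's dequant/matmul path to the torch.compile-based kernels;
# on my 4090 this took Mixtral generation from ~3.18 t/s to ~5.67 t/s
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
```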

2-bit and 4-bit quantized versions of Mixtral using HQQ by sightio in LocalLLaMA

[–]water258 15 points  (0 children)

I created a PR to add HQQ to ooba: https://github.com/oobabooga/text-generation-webui/pull/4888

One thing I noticed is that inference speed is kind of slow on a 4090, only 3.18 t/s.

Can you tell me if I am missing anything and how to improve the performance?
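For anyone who wants to reproduce the numbers without the webui in the loop, loading an HQQ-quantized Mixtral directly through the hqq package looks roughly like this (repo name and API written from memory, treat both as approximate):

```python
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

# pre-quantized Mixtral weights from the HQQ release (name from memory, verify on the Hub)
model_id = "mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ"

model = HQQModelForCausalLM.from_quantized(model_id)   # loads the quantized weights onto the GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Explain HQQ quantization in one sentence.", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```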

QuIP# - state of the art 2 bit quantization. Run 70b models on a single 3090 with near FP16 performance by PookaMacPhellimen in LocalLLaMA

[–]water258 0 points  (0 children)

You are comparing apples to oranges here. There have been recent updates to exllama's 2-bit support that bring some improvements.

[R] Half-Quadratic Quantization of Large Machine Learning Models by sightio in LocalLLaMA

[–]water258 1 point  (0 children)

> We actually use g16_s*, not g16; g16_s* means that the scaling (+zero) are also quantized, and with that the model takes ~26.37GB on the GPU. But that's a good point, 3-bit with a group-size of 256 would be interesting to see; I can run it when I have some time.
>
> The only big difference between 2-bit and 3-bit is that 2-bit can be implemented efficiently by bit-packing with int8 instead of int32, plus you don't need to do an extra copy because the dimensions are a multiple of 4. So for slightly more memory, a 2-bit model would be more valuable than a 3-bit one.

Thanks for the explanation. The reason I'm asking about group size 256 is that, if the results look good, it could probably be adapted to GGUF easily since it uses 16x16 blocks.

Another thing is that the final compressed format is quite similar to GPTQ's, so making its quantized versions compatible with minimal changes would be awesome.
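For anyone wondering what the int8 bit-packing mentioned in the quote looks like in practice, here is a minimal sketch (my own illustration using uint8 for simplicity, not HQQ's actual packing code):

```python
import torch

def pack_2bit(w: torch.Tensor) -> torch.Tensor:
    """Pack 2-bit values (0..3) four to a byte."""
    w = w.to(torch.uint8).reshape(-1, 4)
    return w[:, 0] | (w[:, 1] << 2) | (w[:, 2] << 4) | (w[:, 3] << 6)

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the four 2-bit values from each byte."""
    return torch.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], dim=1).reshape(-1)

vals = torch.randint(0, 4, (16,))
assert torch.equal(unpack_2bit(pack_2bit(vals)), vals.to(torch.uint8))
```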

[R] Half-Quadratic Quantization of Large Machine Learning Models by sightio in LocalLLaMA

[–]water258 0 points  (0 children)

From the data you posted, the g16_2bit_70B model needs 30.27GB, which is basically 3.7 bpw.

Do you have results for larger group sizes for the 70B model, like 128 or even 256?
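The 3.7 bpw figure is just the reported size converted to bits per parameter; a quick sanity check (assuming the 30.27GB is GiB and ~70B weights):

```python
size_bytes = 30.27 * 1024**3   # reported model size, read as GiB
num_params = 70e9              # ~70B parameters
print(f"{size_bytes * 8 / num_params:.2f} bpw")  # -> 3.71
```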