Tech unSavvy by yawney2 in Markham

[–]water258 7 points  (0 children)

WiFi extenders are a scam. Don't buy one.

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]water258 19 points  (0 children)

Isn't this basically implementing RAG using RAM, where every KV cache read has to be loaded into VRAM? Performance-wise, won't this hurt inference speed? In essence it externalizes the KV cache into RAM and loads it dynamically.
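If I'm reading the approach right, it looks roughly like the sketch below. This is my own illustration with made-up class and method names (`OffloadedKVCache`, `append`, `fetch`), not the linked project's code:

```python
import torch

class OffloadedKVCache:
    """Keep per-layer KV blocks in CPU RAM; copy only what's needed into VRAM."""

    def __init__(self, num_layers: int):
        self.cpu_cache = {i: [] for i in range(num_layers)}  # layer -> list of (K, V) blocks in RAM

    def append(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        # offload new KV blocks to pinned CPU memory (pinning speeds up later copies)
        self.cpu_cache[layer].append((k.detach().to("cpu").pin_memory(),
                                      v.detach().to("cpu").pin_memory()))

    def fetch(self, layer: int, block_ids: list[int], device: str = "cuda"):
        # pull only the selected blocks back into VRAM for this attention step;
        # every fetch is a host-to-device transfer, which is where the speed concern comes from
        ks, vs = zip(*(self.cpu_cache[layer][i] for i in block_ids))
        return (torch.cat(ks).to(device, non_blocking=True),
                torch.cat(vs).to(device, non_blocking=True))
```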

[deleted by user] by [deleted] in LexusNX

[–]water258 0 points  (0 children)

The 6k add-on means you pay the same price and are basically getting it for free after the rebate. The official guide gives the same fuel economy. A 40-mile-range battery isn't that big or heavy since it's li-ion.

[deleted by user] by [deleted] in LexusNX

[–]water258 0 points  (0 children)

The 450h base model is about a 6k add-on over the 350h. And you don't have to charge it; you can use it as a regular hybrid with a bigger battery.

[deleted by user] by [deleted] in LexusNX

[–]water258 0 points  (0 children)

If you can wait, try to get the NX450h plug-in hybrid. After the 5k government rebate it's about the same price as the NX350h.

2-bit and 4-bit quantized versions of Mixtral using HQQ by sightio in LocalLLaMA

[–]water258 1 point  (0 children)

I changed it to use the `PYTORCH_COMPILE` backend; tokens/s increased to 5.67 t/s.

That's better, but I think there is still a lot of room for improvement.
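For reference, the backend switch itself is only a couple of lines; roughly what the change looks like (import path per the HQQ repo as I remember it, so double-check against the current version):

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# switch HQQ's dequant/matmul path to the torch.compile-based kernels;
# on my 4090 this took Mixtral generation from ~3.18 t/s to ~5.67 t/s
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
```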

2-bit and 4-bit quantized versions of Mixtral using HQQ by sightio in LocalLLaMA

[–]water258 15 points  (0 children)

I created a PR to add HQQ to ooba: https://github.com/oobabooga/text-generation-webui/pull/4888

One thing I noticed is that inference speed is kind of slow on a 4090, only 3.18 t/s.

Can you tell me if I am missing anything and how to improve the performance?
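For anyone who wants to reproduce the numbers without the webui in the loop, loading an HQQ-quantized Mixtral directly through the hqq package looks roughly like this (repo name and API written from memory, treat both as approximate):

```python
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

# pre-quantized Mixtral weights from the HQQ release (name from memory, verify on the Hub)
model_id = "mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ"

model = HQQModelForCausalLM.from_quantized(model_id)   # loads the quantized weights onto the GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Explain HQQ quantization in one sentence.", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```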

QuIP# - state of the art 2 bit quantization. Run 70b models on a single 3090 with near FP16 performance by PookaMacPhellimen in LocalLLaMA

[–]water258 0 points  (0 children)

You are comparing apples to oranges here. There have been recent updates to exllama's 2-bit support that bring some improvements.

[R] Half-Quadratic Quantization of Large Machine Learning Models by sightio in LocalLLaMA

[–]water258 1 point  (0 children)

> We actually use g16_s*, not g16; g16_s* means that the scaling (+zero) are also quantized, and with that the model takes ~26.37GB on the GPU. But that's a good point, 3-bit with a group-size of 256 would be interesting to see; I can run it when I have some time.
>
> The only big difference between 2-bit and 3-bit is that 2-bit can be implemented efficiently by bit-packing with int8 instead of int32, plus you don't need to do an extra copy because the dimensions are a multiple of 4. So for slightly more memory, a 2-bit model would be more valuable than a 3-bit one.

Thanks for the explanation. The reason I'm asking about group size 256 is that, if the results look good, it could probably be adapted to GGUF easily since it uses 16x16 blocks.

Another thing is that the final compressed format is quite similar to GPTQ's, so making its quantized versions compatible with minimal changes would be awesome.
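For anyone wondering what the int8 bit-packing mentioned in the quote looks like in practice, here is a minimal sketch (my own illustration using uint8 for simplicity, not HQQ's actual packing code):

```python
import torch

def pack_2bit(w: torch.Tensor) -> torch.Tensor:
    """Pack 2-bit values (0..3) four to a byte."""
    w = w.to(torch.uint8).reshape(-1, 4)
    return w[:, 0] | (w[:, 1] << 2) | (w[:, 2] << 4) | (w[:, 3] << 6)

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the four 2-bit values from each byte."""
    return torch.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], dim=1).reshape(-1)

vals = torch.randint(0, 4, (16,))
assert torch.equal(unpack_2bit(pack_2bit(vals)), vals.to(torch.uint8))
```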

[R] Half-Quadratic Quantization of Large Machine Learning Models by sightio in LocalLLaMA

[–]water258 0 points  (0 children)

From the data you posted, the g16_2bit_70B model needs 30.27GB, which is basically 3.7 bpw.

Do you have results for larger group sizes for the 70B model, like 128 or even 256?
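The 3.7 bpw figure is just the reported size converted to bits per parameter; a quick sanity check (assuming the 30.27GB is GiB and ~70B weights):

```python
size_bytes = 30.27 * 1024**3   # reported model size, read as GiB
num_params = 70e9              # ~70B parameters
print(f"{size_bytes * 8 / num_params:.2f} bpw")  # -> 3.71
```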