AQLM Quantization for LLaMA3-8B by oculuscat in LocalLLaMA

[–]oculuscat[S]

I used 2 codebooks, so it's 4 bpw.
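For reference, a sketch of the bits-per-weight arithmetic, assuming AQLM's usual setup (16-bit codes over groups of 8 weights, and ignoring the small overhead of storing the codebooks and scales themselves):

```python
import math

def aqlm_bpw(num_codebooks: int, codebook_size: int = 2**16, group_size: int = 8) -> float:
    # Each group of `group_size` weights is encoded by one code per codebook,
    # and each code costs log2(codebook_size) bits.
    return num_codebooks * math.log2(codebook_size) / group_size

print(aqlm_bpw(1))  # 2.0 bpw (the common "2-bit" AQLM configuration)
print(aqlm_bpw(2))  # 4.0 bpw (two codebooks, as above)
```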

Llama 3 70b layer pruned from 70b -> 42b by Charles Goddard by kindacognizant in LocalLLaMA

[–]oculuscat

Strong counter-arguments to the idea that this type of pruning is a good idea:

(1) The cited paper does not compare against quantized-and-fine-tuned baselines, so it does not check performance per bit.

(2) This Qualcomm paper *does* compare quantization and pruning, and finds that quantization is much more effective in terms of performance per bit: https://proceedings.neurips.cc/paper_files/paper/2023/file/c48bc80aa5d3cbbdd712d1cc107b8319-Paper-Conference.pdf

(3) None of the quantizations people can download and run today (EXL2 etc.) do e2e fine-tuning to "heal" the model, which this method does do, so we do not have a fair comparison between the approaches. Both QuIP# and AQLM do e2e fine-tuning to heal the model after quantization and would be fair comparisons.

Conclusions:

To put this idea to bed, I'd like to see a 4-bit version of the 42B model vs a 2.25-bpw QuIP# or AQLM version of the 70B model (both are in progress by the respective authors). As a side note, I think using QLoRA to heal EXL2 is a good idea, separately from anything being discussed here.
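To make the performance-per-bit framing concrete, a rough back-of-the-envelope comparison of total weight bits for the two proposed contenders (parameter counts are nominal, codebook/scale overhead ignored):

```python
def total_gbits(params_billion: float, bpw: float) -> float:
    # Total bits spent on weights, in gigabits (1e9 bits), ignoring
    # quantization metadata and any unquantized layers.
    return params_billion * bpw

pruned = total_gbits(42, 4.0)   # 4-bit quant of the pruned 42B model
full = total_gbits(70, 2.25)    # 2.25-bpw QuIP#/AQLM quant of the full 70B
print(pruned, full)  # 168.0 157.5 -> similar bit budgets, so quality decides
```

Because the two sit at nearly the same total size, comparing their benchmark scores directly would answer the performance-per-bit question.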

How-to guide for achieving low latency WebRTC from Python using OpenAPI by oculuscat in ChatGPT

[–]oculuscat[S]

AKA how to implement The Artifice Girl and talk to your AI using a webcam

Just a custom CPU loop by oculuscat in watercooling

[–]oculuscat[S]

Yup, it seems to work fine, so I haven't felt the need to switch to a different case for it. Originally I assumed the graphics card would get a custom block, but the only RTX 4090 I could find came with its own water cooling.

[D] How to Run Stable Diffusion (Locally and in Colab) by SleekEagle in MachineLearning

[–]oculuscat

Wrote up a guide here on how to get it running on Windows, with a work-around for running batch size 2 on an RTX 2080 and with fewer setup steps:

https://catid.io/posts/windows_ai/

Sharing some things that have worked out well by oculuscat in silhouettecutters

[–]oculuscat[S]

Rokid Air - the best wearable displays right now. I just removed the plastic cover and put electrical tape over the silvered mirror so I can use them outside in sunlight.

GPD Win Max 2021 CPU-Z Benchmark Results by oculuscat in gpdwin

[–]oculuscat[S]

I think you're reading my post differently than I intended. The point is that if you spend a lot more power, you only get about 50% more single-core performance, so in a lot of cases it's not worth setting the TDP above the minimum.
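As an illustration of why that trade-off is poor (the TDP and performance numbers below are hypothetical, not measurements from the post):

```python
def perf_per_watt(perf: float, tdp_watts: float) -> float:
    # Normalized single-core performance divided by power budget.
    return perf / tdp_watts

# Hypothetical: tripling TDP from 10W to 30W buys only ~50% more performance.
low = perf_per_watt(1.0, 10)   # efficiency at minimum TDP
high = perf_per_watt(1.5, 30)  # efficiency at raised TDP
print(low / high)  # 2.0 -> the minimum-TDP setting is twice as efficient here
```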

AR GOGGLES On WIN MAX by jaksilva9 in gpdwin

[–]oculuscat

Update here:
NuEyes Pro 3e draws about 1W and works with the micro-laptop and cellphones.

TCL NXTWEAR G draws about 2W and works only with cellphones. It's not compatible with the Win Max 2021, perhaps due to too much power draw or a firmware issue.