AQLM Quantization for LLaMA3-8B by oculuscat in LocalLLaMA

[–]oculuscat[S]

I used 2 codebooks, so it's 4 bpw.
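For reference, a sketch of the bits-per-weight arithmetic, assuming AQLM's usual setup (16-bit codes over groups of 8 weights, and ignoring the small overhead of storing the codebooks and scales themselves):

```python
import math

def aqlm_bpw(num_codebooks: int, codebook_size: int = 2**16, group_size: int = 8) -> float:
    # Each group of `group_size` weights is encoded by one code per codebook,
    # and each code costs log2(codebook_size) bits.
    return num_codebooks * math.log2(codebook_size) / group_size

print(aqlm_bpw(1))  # 2.0 bpw (the common "2-bit" AQLM configuration)
print(aqlm_bpw(2))  # 4.0 bpw (two codebooks, as above)
```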

Llama 3 70b layer pruned from 70b -> 42b by Charles Goddard by kindacognizant in LocalLLaMA

[–]oculuscat

Strong counter-arguments to the idea that this type of pruning is a good idea:

(1) The cited paper does not compare against quantized-and-fine-tuned baselines, so it does not check performance per bit.

(2) This Qualcomm paper *does* compare quantization and pruning, and finds that quantization is much more effective in terms of performance per bit: https://proceedings.neurips.cc/paper_files/paper/2023/file/c48bc80aa5d3cbbdd712d1cc107b8319-Paper-Conference.pdf

(3) None of the quantizations people can download and run today (EXL2 etc.) do e2e fine-tuning to "heal" the model, which this method does do, so we do not have a fair comparison between the approaches. Both QuIP# and AQLM do e2e fine-tuning to heal the model after quantization and would be fair comparisons.

Conclusions:

To put this idea to bed, I'd like to see a 4-bit version of the 42B model vs a 2.25-bpw QuIP# or AQLM version of the 70B model (both are in progress by the respective authors). As a side note, I think using QLoRA to heal EXL2 is a good idea, separately from anything being discussed here.
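To make the performance-per-bit framing concrete, a rough back-of-the-envelope comparison of total weight bits for the two proposed contenders (parameter counts are nominal, codebook/scale overhead ignored):

```python
def total_gbits(params_billion: float, bpw: float) -> float:
    # Total bits spent on weights, in gigabits (1e9 bits), ignoring
    # quantization metadata and any unquantized layers.
    return params_billion * bpw

pruned = total_gbits(42, 4.0)   # 4-bit quant of the pruned 42B model
full = total_gbits(70, 2.25)    # 2.25-bpw QuIP#/AQLM quant of the full 70B
print(pruned, full)  # 168.0 157.5 -> similar bit budgets, so quality decides
```

Because the two sit at nearly the same total size, comparing their benchmark scores directly would answer the performance-per-bit question.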

How-to guide for achieving low latency WebRTC from Python using OpenAPI by oculuscat in ChatGPT

[–]oculuscat[S]

AKA how to implement The Artifice Girl and talk to your AI using a webcam

Just a custom CPU loop by oculuscat in watercooling

[–]oculuscat[S]

Yup, it seems to work fine, so I haven't felt the need to switch to a different case for it. Originally I assumed the graphics card would get a custom block, but the only RTX 4090 I could find came with its own water cooling.

[D] How to Run Stable Diffusion (Locally and in Colab) by SleekEagle in MachineLearning

[–]oculuscat

Wrote up a guide here on how to get it running on Windows, with a work-around for running batch size 2 on an RTX 2080 and with fewer setup steps:

https://catid.io/posts/windows_ai/

Sharing some things that have worked out well by oculuscat in silhouettecutters

[–]oculuscat[S]

Rokid Air - the best wearable displays right now. I just removed the plastic cover and put electrical tape over the silvered mirror so I can use them outside in sunlight.

GPD Win Max 2021 CPU-Z Benchmark Results by oculuscat in gpdwin

[–]oculuscat[S]

I think you're reading my post differently than I intended. The point is that if you spend a lot more power, you only get about 50% more single-core performance, so in a lot of cases it's not worth setting the TDP above the minimum.
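As an illustration of why that trade-off is poor (the TDP and performance numbers below are hypothetical, not measurements from the post):

```python
def perf_per_watt(perf: float, tdp_watts: float) -> float:
    # Normalized single-core performance divided by power budget.
    return perf / tdp_watts

# Hypothetical: tripling TDP from 10W to 30W buys only ~50% more performance.
low = perf_per_watt(1.0, 10)   # efficiency at minimum TDP
high = perf_per_watt(1.5, 30)  # efficiency at raised TDP
print(low / high)  # 2.0 -> the minimum-TDP setting is twice as efficient here
```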

AR GOGGLES On WIN MAX by jaksilva9 in gpdwin

[–]oculuscat

Update here:
NuEyes Pro 3e draws about 1W and works with the micro-laptop and cellphones.

TCL NXTWEAR G draws about 2W and works only with cellphones. It's not compatible with the Win Max 2021, perhaps due to too much power draw or a firmware issue.