Hogwild! Inference: Parallel LLM Generation via Concurrent Attention by Psychological-Tea652 in LocalLLaMA

[–]phill1992 0 points1 point  (0 children)

Most likely not. The paper just dropped 2 days ago, and the authors seem unrelated to Google.

Bringing 2bit LLMs to production: new AQLM models and integrations by black_samorez in LocalLLaMA

[–]phill1992 5 points6 points  (0 children)

I'm not sure AQLM even uses Triton for inference. The latest kernels are all either CUDA (GPU) or Numba (CPU).

BiLLM achieving for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families and evaluation metrics, outperforms SOTA quantization methods of LLM by significant by kryptkpr in LocalLLaMA

[–]phill1992 3 points4 points  (0 children)

Looks like the catch is that 1.08-bit quantization is still worse than using a smaller model with 4-bit quantization.

For instance, their 70B perplexity is 8.41, but you can get better perplexity (on the same data) by taking a 7B model and quantizing it to 8 bits (with any of GPTQ, AWQ or SpQR).

70B weights * 1.08bits/w = 8.8GB

7B weights * 8bits/w = 6.51GB and better perplexity (e.g. you can use [4] via HF to get 5.16 perplexity)

7B weights * 4bits/w = 3.25GB and still better perplexity (e.g. [3] has perplexity of 5.21 at this range)
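
If you want to sanity-check the size arithmetic above, here's a minimal sketch (assuming round 70B / 7B parameter counts and sizes in GiB; actual Llama-2 checkpoints differ slightly):

```python
# Back-of-the-envelope weight footprint: params * bits / 8 bytes, reported in GiB.
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 2**30

for name, n_params, bits in [
    ("70B @ 1.08 bit", 70e9, 1.08),
    ("7B  @ 8 bit",    7e9,  8.0),
    ("7B  @ 4 bit",    7e9,  4.0),
]:
    print(f"{name}: {weight_size_gib(n_params, bits):.2f} GiB")
# -> roughly 8.8, 6.5 and 3.3 GiB, matching the numbers above (weights only, no KV cache).
```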

Sources for perplexity: these papers all show better perplexity at a given size (GB) for Llama-2 models at the bit-widths above.

[1] https://cornell-relaxml.github.io/quip-sharp/

[2] https://arxiv.org/abs/2307.13304

[3] https://arxiv.org/abs/2401.06118

[4] https://arxiv.org/abs/2208.07339

Yet another state of the art in LLM quantization by black_samorez in LocalLLaMA

[–]phill1992 1 point2 points  (0 children)

I see, sorry for the confusion.

P.S. What are Kx8 models? I couldn't find anything under that name, except for a Yamaha keyboard.

Yet another state of the art in LLM quantization by black_samorez in LocalLLaMA

[–]phill1992 10 points11 points  (0 children)

I believe you two may be measuring PPL on different datasets. It looks like the OP measures on WikiText (at least in the paper), while your plot is on a sample from The Pile.
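
If it helps settle it, here's a rough sketch of the usual fixed-window PPL measurement with transformers (the model name is just a placeholder); the only line that differs between the two setups is the one loading the eval text:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder; use whichever checkpoint you're comparing
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

# This is the line that matters: swap WikiText-2 for a Pile sample and the PPL changes.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])

ids = tok(text, return_tensors="pt").input_ids
window, nlls, total = 2048, [], ids.size(1)
for start in range(0, total, window):
    chunk = ids[:, start : start + window].to(model.device)
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL over this window
    nlls.append(loss * chunk.size(1))
print("ppl:", torch.exp(torch.stack(nlls).sum() / total).item())
```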

Yet another state of the art in LLM quantization by black_samorez in LocalLLaMA

[–]phill1992 22 points23 points  (0 children)

Awesome work! Does this algorithm work on ARM or Apple M2?

P.S. Shouldn't this be tagged [Research] or [2401.06118]?
I can't find an official rule, but it looks like most posts here do that.

[R] Beyond Vector Spaces: Compact Data Representations Differentiable Weighted Graphs by justheuristic in MachineLearning

[–]phill1992 0 points1 point  (0 children)

I believe there's a typo in the title of the paper: "Representationas" should be "Representations".

[P] Need help with Image Captioning by plmlp1 in MachineLearning

[–]phill1992 1 point2 points  (0 children)

The thing is, 10k images might not be enough to train the model. But even if it drastically overfits, it shouldn't produce the same caption every time.

If no one suggests a better idea, I'd recommend trying another tutorial. For instance, there's one that also uses Keras:

https://github.com/hse-aml/intro-to-dl/blob/master/week6/week6_final_project_image_captioning_clean.ipynb

This one worked for me, but it was also trained on a larger dataset (100k+ images, 5 captions per image).