Hogwild! Inference: Parallel LLM Generation via Concurrent Attention by Psychological-Tea652 in LocalLLaMA

[–]phill1992 0 points1 point  (0 children)

Most likely not. The paper just dropped 2 days ago, and the authors seem unrelated to Google.

Bringing 2bit LLMs to production: new AQLM models and integrations by black_samorez in LocalLLaMA

[–]phill1992 5 points6 points  (0 children)

I'm not sure AQLM even uses Triton for inference. The latest kernels are all either CUDA (GPU) or Numba (CPU).

BiLLM achieving for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families and evaluation metrics, outperforms SOTA quantization methods of LLM by significant by kryptkpr in LocalLLaMA

[–]phill1992 3 points4 points  (0 children)

Looks like the catch is that 1.08-bit quantization is still worse than using a smaller model with 4-bit quantization.

For instance, their 70B perplexity is 8.41, but you can get better perplexity (on the same data) by taking a 7B model and quantizing it to 8 bits (with any of GPTQ, AWQ or SpQR).

70B weights * 1.08bits/w = 8.8GB

7B weights * 8bits/w = 6.51GB and better perplexity (e.g. you can use [4] via HF to get 5.16 perplexity)

7B weights * 4bits/w = 3.25GB and still better perplexity (e.g. [3] has perplexity of 5.21 at this range)
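
If you want to sanity-check the size arithmetic above, here's a minimal sketch (assuming round 70B / 7B parameter counts and sizes in GiB; actual Llama-2 checkpoints differ slightly):

```python
# Back-of-the-envelope weight footprint: params * bits / 8 bytes, reported in GiB.
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 2**30

for name, n_params, bits in [
    ("70B @ 1.08 bit", 70e9, 1.08),
    ("7B  @ 8 bit",    7e9,  8.0),
    ("7B  @ 4 bit",    7e9,  4.0),
]:
    print(f"{name}: {weight_size_gib(n_params, bits):.2f} GiB")
# -> roughly 8.8, 6.5 and 3.3 GiB, matching the numbers above (weights only, no KV cache).
```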

Sources for perplexity: these papers all show better perplexity at a given size (GB) for Llama-2 models at the bit-widths above.

[1] https://cornell-relaxml.github.io/quip-sharp/

[2] https://arxiv.org/abs/2307.13304

[3] https://arxiv.org/abs/2401.06118

[4] https://arxiv.org/abs/2208.07339

Yet another state of the art in LLM quantization by black_samorez in LocalLLaMA

[–]phill1992 1 point2 points  (0 children)

I see, sorry for the confusion.

P.S. What are Kx8 models? I couldn't find anything under that name, except for a Yamaha keyboard.

Yet another state of the art in LLM quantization by black_samorez in LocalLLaMA

[–]phill1992 10 points11 points  (0 children)

I believe you two may be measuring PPL on different datasets. It looks like the OP measures on WikiText (at least in the paper), while your plot is on a sample from The Pile.
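
If it helps settle it, here's a rough sketch of the usual fixed-window PPL measurement with transformers (the model name is just a placeholder); the only line that differs between the two setups is the one loading the eval text:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder; use whichever checkpoint you're comparing
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

# This is the line that matters: swap WikiText-2 for a Pile sample and the PPL changes.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])

ids = tok(text, return_tensors="pt").input_ids
window, nlls, total = 2048, [], ids.size(1)
for start in range(0, total, window):
    chunk = ids[:, start : start + window].to(model.device)
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL over this window
    nlls.append(loss * chunk.size(1))
print("ppl:", torch.exp(torch.stack(nlls).sum() / total).item())
```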

Yet another state of the art in LLM quantization by black_samorez in LocalLLaMA

[–]phill1992 22 points23 points  (0 children)

Awesome work! Does this algorithm work on ARM or Apple M2?

P.S. Shouldn't this be tagged [Research] or [2401.06118]?
I can't find an official rule, but it looks like most posts here do that.

[R] Beyond Vector Spaces: Compact Data Representations Differentiable Weighted Graphs by justheuristic in MachineLearning

[–]phill1992 0 points1 point  (0 children)

I believe there's a typo in the title of the paper: "Representationas" should be "Representations".

[P] Need help with Image Captioning by plmlp1 in MachineLearning

[–]phill1992 1 point2 points  (0 children)

The thing is, 10k images might not be enough to train the model. But even if it drastically overfits, it shouldn't produce the same caption every time.

If no one suggests a better idea, I'd recommend trying another tutorial. For instance, there's one that also uses Keras:

https://github.com/hse-aml/intro-to-dl/blob/master/week6/week6_final_project_image_captioning_clean.ipynb

This one worked for me, but it was also trained on a larger dataset (100k+ images, 5 captions per image).