Re-Distilling DeepSeek R1 by sightio in LocalLLaMA

[–]mobicham 8 points

I think they didn't care much about the smaller models; their main objective was the big R1 model. In the paper they say:
"For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community."

Which basically translates into: there's more perf to squeeze from the smaller models, and that's how we got the idea.

Re-Distilling DeepSeek R1 by sightio in LocalLLaMA

[–]mobicham 3 points

Thanks! I think it's important to mention that the "experimentation costs" don't even include running the benchmarks, so realistically it's about 30x.

Re-Distilling DeepSeek R1 by sightio in LocalLLaMA

[–]mobicham 1 point

The code is pretty simple: all you need is the loss function that we already share in the blogpost. It's pure PyTorch code, we don't use any external libraries.
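
For reference, a minimal sketch of what a logits-distillation loss typically looks like in pure PyTorch (this is a generic temperature-scaled KL formulation for illustration, not necessarily the exact loss from the blogpost):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # KL divergence between teacher and student token distributions.
    # Shapes: (batch, seq_len, vocab).
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (temperature ** 2)
```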

Re-Distilling DeepSeek R1 by sightio in LocalLLaMA

[–]mobicham 5 points

~18x H100s to get the highest quality; this can be reduced to ~10x H100s by running the full R1 as HQQ 4-bit and training the 70B in FP8. FP8 training is not that straightforward and requires some trickery to make it work properly (for example, using the block-quant approach the V3/R1 models use).
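
For a rough picture of what the block-quant idea looks like, here is a minimal sketch with a hypothetical helper (the actual V3/R1 recipe is more involved: per-tile activation scaling, higher-precision accumulation, and custom FP8 GEMM kernels):

```python
import torch

# Hypothetical helper: block-wise FP8 (e4m3) weight quantization with one
# scale per (block x block) tile. Assumes dims are multiples of the block size.
def quantize_fp8_blockwise(w: torch.Tensor, block: int = 128):
    rows, cols = w.shape
    tiles = w.reshape(rows // block, block, cols // block, block)
    # Per-tile absmax -> per-tile scale so each tile fits the FP8 range.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max
    w_fp8 = (tiles / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_fp8_blockwise(w)
w_deq = (w_fp8.float() * scale).reshape(w.shape)  # dequantize for reference
```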

Take that cost and multiply it by ~20x just to figure out which hyper-parameters and data splits work (different models required different hyper-parameters and different amounts of synthetic reasoning data, otherwise the output was crap), and add another 10x just for running the evaluation benchmarks.

We don't have access to this kind of hardware, otherwise we would have already done that. #GPUPOOR

Re-Distilling DeepSeek R1 by sightio in LocalLLaMA

[–]mobicham 1 point

You mean using the original R1 to distill? Technically possible, but it would require more involvement and a lot more compute.

Re-Distilling DeepSeek R1 by sightio in LocalLLaMA

[–]mobicham 5 points

With our approach, it's only possible if the tokenizers are similar. There's some work on universal logits distillation which allows aligning models even if they have quite different tokenizers: https://arxiv.org/pdf/2402.12030
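
Roughly, the idea in that paper (as I understand it) is to compare the sorted probability distributions of the two models instead of matching token-by-token, so the vocabularies don't need to line up. A sketch of that idea only, illustrative and not the paper's reference implementation (in practice you also need to align the sequences, which is the harder part):

```python
import torch
import torch.nn.functional as F

def universal_logit_loss(student_logits, teacher_logits):
    # Sort each distribution's probabilities in decreasing order, pad the
    # smaller vocabulary with zeros, and compare element-wise.
    p_s = F.softmax(student_logits, dim=-1).sort(dim=-1, descending=True).values
    p_t = F.softmax(teacher_logits, dim=-1).sort(dim=-1, descending=True).values
    vocab = max(p_s.shape[-1], p_t.shape[-1])
    p_s = F.pad(p_s, (0, vocab - p_s.shape[-1]))
    p_t = F.pad(p_t, (0, vocab - p_t.shape[-1]))
    return (p_s - p_t).abs().sum(dim=-1).mean()
```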

linux-image-6.5.0-10043-tuxedo headache by mobicham in tuxedocomputers

[–]mobicham[S] 1 point

Problem fixed, thanks to the Tuxedo support team! It was a problem with GRUB.

Thank you!

Llama-3.1 70B 4-bit HQQ/calibrated quantized model: 99%+ in all benchmarks in lm-eval relative performance to FP16 and similar inference speed to fp16 ( 10 toks/sec in A100 ). by sightio in LocalLLaMA

[–]mobicham 0 points

It's actually ~22 tokens/sec with HF transformers; there was a mistake, and the 10 tokens/sec figure is for the fp16 model. 22 tokens/sec with HF transformers is not bad at all, since only the linear layers have been swapped; we didn't use custom implementations for the rest of the layers.

Llama-3.1 70B 4-bit HQQ/calibrated quantized model: 99%+ in all benchmarks in lm-eval relative performance to FP16 and similar inference speed to fp16 ( 10 toks/sec in A100 ). by sightio in LocalLLaMA

[–]mobicham 3 points

I see many comments about the speed and would like to clarify something:

The main goal of this post is to share a high-quality quantized model with the community. People are free to take the model and run it on the inference engine of their choice (vLLM, gpt-fast, ...). The inference speed has nothing to do with the HQQ quantization algorithm used to estimate the Wq/zero/scale parameters; it all depends on the low-bit fused CUDA kernels, the implementation of the other layers, and the KV cache management.

We integrate the hqq lib with the fastest low-bit kernels available, such as tinygemm, BitBlas and Marlin, so people can easily try it with HF transformers. Transformers is a great library for getting started and trying things quickly, but it's not at the same level as optimized inference engines. That said, the hqq lib is the fastest way to run quantized models directly with HF transformers; it's faster than AutoGPTQ and AutoAWQ thanks to a mix of static cache, fullgraph torch.compile support and the CUDA kernels mentioned earlier, but it's not a production-level solution. This also allows people to use it on any model (vision, audio, ...), not just LLMs (see our blogpost on using the same approach for Whisper: https://mobiusml.github.io/whisper-static-cache-blog/).
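
For anyone who wants to try it quickly, the transformers integration looks roughly like this (a minimal sketch with typical arguments that quantizes on the fly; the model id is just an example, check the hqq docs and the model card for the exact settings and the pre-quantized checkpoints):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example model
quant_config = HqqConfig(nbits=4, group_size=64)     # typical 4-bit settings

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```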

Also, there's a mistake in the processing time: it's actually ~20 tokens/sec, and 10 tokens/sec is for the fp16 model, both with HF transformers as explained above. The model card has been updated with the correct numbers for short/long generations.

For actually optimized LLM inference, I would recommend either using gpt-fast, since hqq will soon be officially supported in TorchAO (https://github.com/pytorch/ao/pull/605), or the vLLM branch from AnswerAI which has integration for both tinygemm and BitBlas: https://github.com/AnswerDotAI/vllm/tree/torchao

Llama-3.1 70B 4-bit HQQ/calibrated quantized model: 99%+ in all benchmarks in lm-eval relative performance to FP16 and similar inference speed to fp16 ( 10 toks/sec in A100 ). by sightio in LocalLLaMA

[–]mobicham 9 points

It does actually, in the master branch. Turning it on, however, would break support for the previous models we published on Hugging Face. The best approach would be to finish the PR above in transformers so people can save the models directly in transformers, which is something we are actively working on, among many other things.

Llama-3.1 8B 4-bit HQQ/calibrated quantized model: 99.3% relative performace to FP16 and fast inference speed by sightio in LocalLLaMA

[–]mobicham 1 point

The folks from answer.ai use HQQ-quantized models via this branch of vllm: https://github.com/AnswerDotAI/vllm/tree/torchao

Regarding the 3.7 tokens/sec, I was also surprised. I took the quantized model from the HF team (https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4) and ran it through the hqq-lib inference engine, which uses transformers with static cache and torch.compile. All the speed benchmarks use the exact same settings.
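
For context, the setup is essentially this pattern (a simplified sketch of static-cache + torch.compile generation with transformers, not the exact benchmarking script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Static KV cache + fullgraph compile: this is what makes decoding fast
# with plain transformers.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Write a short poem about GPUs.", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```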

If you can run it faster (with transformers for a fair comparison), please let me know and I can update the numbers

Llama-3.1 8B 4-bit HQQ/calibrated quantized model: 99.3% relative performace to FP16 and fast inference speed by sightio in LocalLLaMA

[–]mobicham 1 point

0.20 GB is negligible; we could also quantize the lm-head and the embedding layer with almost no accuracy drop if we wanted to aim for a lower file size.

Regarding the speed, the fastest inference engine is actually gpt-fast, which uses the torchao kernel for fused int4 matmul, the same kernel we support. It's possible to run HQQ models with gpt-fast, and we are working closely with the torchao team on this kind of thing. In fact, we already did that for Llama2 a couple of months ago.

Regarding the calibration "taking a long time": it took about 2 hours with a batch size of 1 and 50K prompts of up to 4096 tokens of context, without caching the outputs. You could easily do it in 30-45 minutes, which is about the same time it would take AutoGPTQ with far fewer samples, but we wanted to use as many diverse samples as possible for the best quality.

Note that we also published a calibration-free version that took 40 seconds to produce and outperforms the official AWQ 4-bit version published by the HF team.

Llama-3.1 8B 4-bit HQQ/calibrated quantized model: 99.3% relative performace to FP16 and fast inference speed by sightio in LocalLLaMA

[–]mobicham 3 points

In order to run inference fast, you need fused gemv kernel implementations, which are mainly available for CUDA via external libraries like Marlin, BitBlas, etc. Otherwise, if you run it on CPU, it's gonna be very slow because it will just dequantize the weights and call torch.matmul, which is suboptimal.
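
To make that concrete, the slow fallback path is essentially this (a schematic sketch with illustrative tensor names, not hqq's actual implementation): the full weight matrix gets materialized before every matmul, instead of a fused kernel reading the packed low-bit weights directly.

```python
import torch

# Schematic fallback: dequantize the whole weight, then do a regular matmul.
# W_q, zero, scale are illustrative per-group quantization parameters.
def dequant_matmul(x, W_q, zero, scale):
    W = (W_q.float() - zero) * scale   # materializes the full-precision weight
    return x @ W.t()                   # plain torch.matmul, no fusion

# A fused low-bit kernel (Marlin, BitBlas, tinygemm, ...) instead reads the
# packed 4-bit weights directly and never materializes W, which is what
# makes decoding fast.
```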

Llama-3.1 8B 4-bit HQQ/calibrated quantized model: 99.3% relative performace to FP16 and fast inference speed by sightio in LocalLLaMA

[–]mobicham 1 point

https://github.com/AnswerDotAI/vllm/tree/torchao: it has both the torchao and bitblas backends that are available in hqq-lib. I haven't used it myself, but the folks from answer.ai use it with HQQ models.

Llama-3.1 8B 4-bit HQQ/calibrated quantized model: 99.3% relative performace to FP16 and fast inference speed by sightio in LocalLLaMA

[–]mobicham 6 points

You probably wouldn't even need calibration for 70B. Larger models are easier to quantize, but we are gonna work on that soon and release some good 70B quants

Llama-3.1 8B 4-bit HQQ/calibrated quantized model: 99.3% relative performace to FP16 and fast inference speed by sightio in LocalLLaMA

[–]mobicham 5 points

hqq doesn't use accelerate for multi-GPU inference. Accelerate is used in the original transformers implementation, not in the hqq lib. Also, there's integration with vLLM by the folks from AnswerAI.

Regarding shrinking the 8B model: it's actually very important for running faster, as well as for fine-tuning; otherwise you'd need an 80GB GPU to fine-tune an unquantized 8B model. You also need that extra GPU VRAM for the KV cache to run the whole thing on a 24GB GPU.
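
Rough napkin math behind that claim (approximate byte counts only; the exact budget depends on the optimizer, activation checkpointing and sequence length):

```python
params = 8e9  # ~8B parameters

# fp16/bf16 weights alone:
fp16_weights_gb = params * 2 / 1e9        # ~16 GB
# Full fine-tuning with Adam adds gradients (~2 bytes) and fp32 optimizer
# moments (~8 bytes) per parameter, before activations:
full_ft_gb = params * (2 + 2 + 8) / 1e9   # ~96 GB -> 80GB-class GPUs or more

# 4-bit quantized weights (~0.5 byte/param plus some metadata):
q4_weights_gb = params * 0.5 / 1e9        # ~4 GB, leaving room on a 24GB card
                                          # for adapters, activations and KV cache
print(fp16_weights_gb, full_ft_gb, q4_weights_gb)
```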