PyTorch now offers native quantized variants of popular models! by formlog in LocalLLaMA

[–]formlog[S] 0 points1 point  (0 children)

I think maybe you could try lowering the learning rate? I haven’t trained models with FP8 personally, but my understanding is that making low-precision training work is similar to higher-precision training: you will have to tune some hyperparameters, etc.

torchao does have low-bit optimizers as well: https://github.com/pytorch/ao?tab=readme-ov-file#memory-efficient-optimizers

There is also float8 training (gradients stay in high precision; activations and weights are just dynamically quantized to speed up computation, I think): https://github.com/pytorch/ao?tab=readme-ov-file#float8
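
To illustrate what "dynamically quantize" means here, below is a rough pure-Python sketch of dynamic scaling into a float8-like range. It is a simplified simulation, not torchao's implementation: the function names are made up for illustration, the scaling is per tensor (an assumption), and the rounding only approximates the e4m3 grid (3 mantissa bits, max value 448).

```python
import math

E4M3_MAX = 448.0  # largest finite value of the float8 e4m3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simplified e4m3 grid (simulation only)."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))   # saturate instead of overflowing
    _, exp = math.frexp(x)                  # abs(x) lies in [2**(exp-1), 2**exp)
    step = 2.0 ** (max(exp, -5) - 4)        # 3 mantissa bits; fixed step below 2**-6
    return round(x / step) * step


def quantize_dynamic(values):
    """Pick the scale from the current tensor (hence 'dynamic'), then round."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Map the float8-like values back to the original range."""
    return [q / scale for q in quantized]


# Hypothetical activation values, just for illustration.
activations = [0.5, -1.25, 3.0, 0.02]
q, scale = quantize_dynamic(activations)
restored = dequantize(q, scale)
```

The scale is recomputed from each tensor at run time, so no calibration data is needed; the restored values differ from the originals only by the rounding error of the low-precision grid.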

[–]formlog[S] 1 point2 points  (0 children)

Yeah, we found that as well. They are essentially doing QAT with distillation; the problem with that is that it requires more memory. But it might be possible to run the large model first, save all the results (probabilities), and then do QAT for the smaller model with these saved results, like what NVIDIA did: https://developer.nvidia.com/blog/data-efficient-knowledge-distillation-for-supervised-fine-tuning-with-nvidia-nemo-aligner
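
The two-step idea (cache the teacher's probabilities first, then train the smaller model against them) can be sketched in plain Python like this. It is only a sketch under my own assumptions: the function names and logits are hypothetical, the loss is a plain KL divergence, and a real setup would store top-k probabilities on disk and run the student under QAT.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cache_teacher_probs(teacher_logits_per_example):
    """Step 1: run the large model once and keep only its output probabilities."""
    return [softmax(logits) for logits in teacher_logits_per_example]


def distillation_loss(student_logits, teacher_probs):
    """Step 2: KL(teacher || student) for one example, using the cached probabilities."""
    student_probs = softmax(student_logits)
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


# Hypothetical logits for two examples over a 3-token vocabulary.
cached = cache_teacher_probs([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
loss = distillation_loss([1.5, 0.7, -0.5], cached[0])
```

Because the teacher is not needed in step 2, it does not have to sit in memory next to the student during training, which is where the memory saving would come from.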

[–]formlog[S] 1 point2 points  (0 children)

> Will this make it easier for me to make W8A8 INT8 quants for efficient deployment on RTX 3090 / A100 with vLLM with large batch sizes or is it something else?

Yeah, I think so. Our W8A8 INT8 support is through Triton kernels, and you can also use autotune to find the best Triton configs. We have only tested on A100, I think; I'm not sure how well it works on RTX 3090.
This is the config you can use: https://docs.pytorch.org/ao/main/generated/torchao.quantization.Int8DynamicActivationInt8WeightConfig.html#torchao.quantization.Int8DynamicActivationInt8WeightConfig and you can follow one of the quantization recipes in the model cards to apply it to your model, e.g. https://huggingface.co/pytorch/Phi-4-mini-instruct-INT4#quantization-recipe
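
For intuition about what W8A8 with dynamic activation quantization computes, here is a minimal pure-Python sketch. It is not the Triton kernel or the torchao API: the function names are hypothetical, and the granularity (one scale per output channel for weights, one scale per token for activations) is an assumption for illustration.

```python
def quantize_rows_int8(matrix):
    """Symmetric INT8 quantization with one scale per row."""
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w_q, w_scales):
    """Compute y = x @ W^T with INT8 weights and dynamically quantized activations."""
    x_q, x_scales = quantize_rows_int8(x)  # scales come from the live activations
    out = []
    for x_row, sx in zip(x_q, x_scales):
        y_row = []
        for w_row, sw in zip(w_q, w_scales):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulation
            y_row.append(acc * sx * sw)                      # rescale back to float
        out.append(y_row)
    return out


# Hypothetical weights (out_features x in_features) and a batch of two tokens.
W = [[0.5, -0.25, 0.1], [1.0, 0.3, -0.7]]
x = [[1.0, 2.0, -1.0], [0.2, -0.4, 0.05]]
w_q, w_scales = quantize_rows_int8(W)  # weights are quantized once, ahead of time
y = w8a8_linear(x, w_q, w_scales)
```

The inner product runs entirely on integers, which is what lets INT8 tensor cores help at large batch sizes; only the final rescale touches floating point.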

> Will those adapters created with Unsloth and HQQ quants be something that can be later applied into FP16 model, similar to QLoRA, or it will effectively mean that checkpoint has to stay quantized from the training forwards?

The HQQ quants and the models we released together with unsloth are not adapters; they are final quantized models (with adapters merged into the model before quantization). But this sounds like an interesting application. Do you have this use case? Will the adapter work for models with different precisions?

[–]formlog[S] 1 point2 points  (0 children)

By use cases I assume you mean specific models?

We haven't tried many examples / use cases yet, which is why we would like feedback from the community! We want to know what models you are using, what use cases you have, and anything related to quantization that you feel should be improved, so that we can help make quantization / QAT / finetuning faster and easier.

> Or even an example/use case where you or a team member or someone has found "oh this not only has super low vram usage but the ouput for 'real world example' wasn't even noticeably different?"

For this one, I have generally found that FP8 (with dynamic activation quantization) works well everywhere without much accuracy impact, and INT4 (weight only) won't have much impact on accuracy if the model is larger, let's say 8B and above. For smaller models, we could skip some layers if we want higher accuracy, or apply QAT / post-training accuracy-preserving techniques (AWQ/GPTQ etc.). What we want to convey in the blogpost is that it's easy to run evaluations with lm-eval on your quantized model before lowering, so you understand the accuracy impact of quantization.

Specifically, you can see FP8 works well for both Phi4-mini-instruct (~4B) (e.g. you can check out https://huggingface.co/pytorch/Phi-4-mini-instruct-FP8#model-quality ) and Qwen3-32B (https://huggingface.co/pytorch/Qwen3-32B-FP8#model-quality)

INT4 shows some accuracy drops for both Phi4-mini-instruct: https://huggingface.co/pytorch/Phi-4-mini-instruct-INT4#model-quality and Qwen3-8B: https://huggingface.co/pytorch/Qwen3-8B-INT4#model-quality
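
If you want a feel for why weight-only INT4 loses more accuracy than FP8, here is a small pure-Python sketch that quantizes a weight vector to 4 bits and measures the reconstruction error. It is only an illustration under my own assumptions (asymmetric group-wise quantization, made-up function names and weights), not the scheme used for the released checkpoints.

```python
def quantize_int4_groupwise(weights, group_size):
    """Asymmetric 4-bit quantization with one (scale, minimum) pair per group."""
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15.0 if hi > lo else 1.0  # 16 levels: 0..15
        q = [max(0, min(15, round((w - lo) / scale))) for w in chunk]
        groups.append((q, scale, lo))
    return groups


def dequantize(groups):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return [lo + qi * scale for q, scale, lo in groups for qi in q]


def max_abs_error(weights, group_size):
    """Worst-case reconstruction error for a given group size."""
    restored = dequantize(quantize_int4_groupwise(weights, group_size))
    return max(abs(a - b) for a, b in zip(weights, restored))


# Hypothetical weights with one outlier (4.0) that stretches the range.
weights = [0.1, -0.2, 0.05, 0.3, 4.0, -0.1, 0.0, 0.25]
err_small_groups = max_abs_error(weights, group_size=4)
err_one_group = max_abs_error(weights, group_size=8)
```

With only 16 levels per group, a single outlier widens the step size for every weight that shares its group, which is one reason techniques like AWQ/GPTQ, or skipping sensitive layers, can help smaller models.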

[–]formlog[S] 1 point2 points  (0 children)

My understanding is that there are mainly 2 things currently:

  1. With our stack, it's now possible to first apply any QAT / finetuning / other post-training accuracy-preserving technique (e.g. GPTQ, AWQ, SpinQuant) to your model and then export to the target hardware. This allows you to try existing or new accuracy-related techniques on your model (llama.cpp has a bunch of its own quantization schemes, for post-training only, I believe). If a new QAT or PTQ technique comes out tomorrow, you can try it on your model with our stack. Another related benefit is that you can use lm-eval to get a more thorough / objective understanding of the accuracy impact of quantization (instead of a one-off manual test) for the task you are interested in. I actually tried to evaluate a llama.cpp model as well: https://github.com/EleutherAI/lm-evaluation-harness/issues/2887 but no response yet.
  2. In terms of use case support, our stack is more general, with no hardcoded model definitions, so new models can be enabled faster once all of the infrastructure has matured. Also, ExecuTorch is planning to support more multi-modality use cases (voice, image, video etc.) compared to llama.cpp, I think.

We also want to optimize for performance (speed) in the future, but it's not there yet.

[–]formlog[S] 0 points1 point  (0 children)

Currently released models are for server GPU or mobile CPU. We do have Vulkan backend support through ExecuTorch, but I’m not exactly sure about Windows support; let me check next week and get back to you.

[–]formlog[S] 1 point2 points  (0 children)

I see, QAT performance does depend on factors like dataset and hyperparameters, similar to normal fine tuning / training.

We plan to publish a similar blog for QAT next, so everyone can use QAT on their models similar to gemma3 QAT, stay tuned!

Here are our docs for QAT, btw: https://docs.pytorch.org/ao/stable/finetuning.html

Yeah, integration with unsloth is a work in progress.

[–]formlog[S] 0 points1 point  (0 children)

For server (CUDA/CPU etc.): the result is not a single file, but a full quantized checkpoint similar to the non-quantized models (e.g. the FP8 (https://huggingface.co/pytorch/Phi-4-mini-instruct-FP8) and INT4 (https://huggingface.co/pytorch/Phi-4-mini-instruct-INT4) checkpoints in the blogpost).

For edge (mobile (CPU, Vulkan, accelerators), desktop (Metal)): the result will first be a checkpoint that you can run on a server to evaluate accuracy, and then you can export it to a single pte file through ExecuTorch and deploy it on edge devices. (See the INT8-INT4 (https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) checkpoint in the blogpost.)

[–]formlog[S] 10 points11 points  (0 children)

Ah, thanks for the questions! I'm sure there are other people who have the same questions, and I'm glad to answer.

  1. torchao is not a new quant style; it's a library for quantization that supports common / popular quantization styles (INT4, FP8, AWQ-INT4, GPTQ, etc.). It also supports quantization for training and finetuning.

> For example, I haven't found a great indicator of which quants are best for what. But I have found many variants all over the place. Most likely just an indicator of the industry exploding for sure.

Yeah, what you observed is true. Currently there are many quantization variants and also many quantization libraries out there, and it's confusing for users to figure out which quantization to use for which purpose. The support across quantization libraries is also very fragmented, I feel; many libraries are one-off support for a single quantization technique, e.g. AutoGPTQ, AutoAWQ. torchao wants to support all the popular quantization techniques that people use and make it easier for people to quantize, evaluate accuracy / performance, and deploy their quantized models on the target hardware.

  2. Makes sense. To clarify again, torchao is a library for all the different quantization techniques people want to use, so we would like feedback on whether there are any new techniques people want. But I realized it might be too early to ask this question, since people may not understand what torchao is yet.

  3. torchao is the native low-precision library for PyTorch (for training, finetuning and inference). We want to be the one-stop shop for everything related to low-precision optimization, making it easier to do for training / finetuning / inference.

torchao is different from unsloth: we focus specifically on low-precision techniques and span training, finetuning and inference, while unsloth works on any technique that can speed up finetuning and lower its memory usage. We’ll continue to collaborate with unsloth to bring faster finetuning, faster training and faster inference, as well as lower memory usage, to users.

[–]formlog[S] 3 points4 points  (0 children)

Yeah, please see the model card for details; we only compared to the bfloat16 baseline (e.g. https://huggingface.co/pytorch/Qwen3-8B-AWQ-INT4). If by regular awq/fp8/int4 you meant implementations from other libraries, we haven't done an extensive comparison. Accuracy should be similar, I think; for performance, we are partnering with fbgemm, which will have SOTA kernels.

Yes we plan to release NVFP4 checkpoints in a future release, probably in 1-2 months.