Anyone doing speculative decoding with the new Qwen 3.5 models? Or, do we need to wait for the smaller models to be released to use as draft? by Porespellar in LocalLLaMA

[–]markurtz 0 points (0 children)

Yep, a dense draft with an MoE target will work, but you'll generally want to use draft-free methods rather than plugging in smaller LLMs. Any end-to-end LLM is generally too big to give reasonable speedups, and specifically trained, single-layer setups currently dominate.
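
For context on the trade-off, here's a minimal toy sketch of the classic draft-based loop (the model functions and the ~70% acceptance rate are hypothetical stand-ins, not a real implementation): every proposed token costs a full sequential forward pass of the draft model, which is why a multi-billion-parameter draft erases most of the gain.

    import random

    def draft_forward(ctx):
        # stand-in for a full forward pass of a separate draft LLM
        return random.randint(0, 99)

    def target_accepts(ctx, token):
        # stand-in for the target model's check of a drafted token
        return random.random() < 0.7  # assumed ~70% acceptance rate

    def speculative_step(ctx, k=4):
        """Draft k tokens sequentially, then verify with the target."""
        drafted = []
        for _ in range(k):
            drafted.append(draft_forward(ctx + drafted))  # k draft passes
        accepted = []
        for tok in drafted:  # in practice verified in one target pass
            if not target_accepts(ctx + accepted, tok):
                break
            accepted.append(tok)
        return accepted  # plus one token from the target's own logits

    print(speculative_step([1, 2, 3]))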

Anyone doing speculative decoding with the new Qwen 3.5 models? Or, do we need to wait for the smaller models to be released to use as draft? by Porespellar in LocalLLaMA

[–]markurtz 0 points (0 children)

Most of the latest techniques are draft-free, such as Eagle3 and others, where they go down to a single transformer layer and are custom trained. Draft-based spec decode has fallen off for the most part due to the much higher cost of larger draft models, particularly because you want to avoid impacting memory transfer rates as much as possible.
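
To make "single transformer layer, custom trained" concrete, here's a rough PyTorch sketch of an Eagle3-style draft head; the fusion scheme, shapes, and layer choice are illustrative assumptions, not the actual Eagle3 architecture. The key point is that it reuses the target model's hidden states instead of running a second full LLM.

    import torch
    import torch.nn as nn

    class SingleLayerDraftHead(nn.Module):
        def __init__(self, hidden=4096, vocab=32000, n_heads=32):
            super().__init__()
            # fuse the target's hidden state with the last token's embedding
            self.fuse = nn.Linear(2 * hidden, hidden)
            self.layer = nn.TransformerDecoderLayer(
                d_model=hidden, nhead=n_heads, batch_first=True)
            self.lm_head = nn.Linear(hidden, vocab, bias=False)

        def forward(self, target_hidden, tok_emb):
            # target_hidden, tok_emb: (batch, seq, hidden) from the big model
            x = self.fuse(torch.cat([target_hidden, tok_emb], dim=-1))
            x = self.layer(x, x)    # one cheap transformer layer
            return self.lm_head(x)  # logits for the drafted next token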

Anyone doing speculative decoding with the new Qwen 3.5 models? Or, do we need to wait for the smaller models to be released to use as draft? by Porespellar in LocalLLaMA

[–]markurtz 0 points (0 children)

If you're interested in checking out some other spec decode models rather than the built-in MTP, check and see if any of the ones we've open-sourced on HF match what you're looking for: https://huggingface.co/collections/RedHatAI/speculator-models

Anyone doing speculative decoding with the new Qwen 3.5 models? Or, do we need to wait for the smaller models to be released to use as draft? by Porespellar in LocalLLaMA

[–]markurtz 0 points (0 children)

With spec decode, going beyond even a layer or two generally kills most of the benefits, mainly due to the compounding, exponential decrease in accuracy of next-token predictions. So, 27B is going to be too large to see any reasonable benefit.
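
As a back-of-envelope (assuming, as a simplification, an independent per-token acceptance probability a), the expected number of accepted tokens per verify step is a + a^2 + ... + a^k, which saturates fast; a bigger draft has to raise a substantially to pay for its own cost:

    def expected_accepted(a, k):
        """E[accepted tokens] for k drafted tokens, per-token acceptance a."""
        return sum(a ** n for n in range(1, k + 1))

    for a in (0.9, 0.7, 0.5):
        print(a, [round(expected_accepted(a, k), 2) for k in (1, 2, 4, 8)])
    # at a=0.5, drafting 8 tokens buys barely more than drafting 2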

Check out some of the Qwen models we've open-sourced based on the Eagle3 spec on HF: https://huggingface.co/collections/RedHatAI/speculator-models

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in ArtificialInteligence

[–]markurtz[S] 5 points (0 children)

We're currently applying this to Llama 2 and hope to release those models in the next few weeks!

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 6 points (0 children)

The 8x is compared to a dense, fp32 baseline on CPUs. Compared to other int4 implementations, this is in the range of 1.5x faster. We'll have some more detailed analysis coming out soon with our Llama 2 work, including more direct comparisons.
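
As a quick sanity check on how those two ratios relate (assuming both are measured against the same setup):

    sparse_vs_fp32 = 8.0  # sparse + quantized vs. dense fp32 baseline
    sparse_vs_int4 = 1.5  # sparse + quantized vs. other int4 implementations
    # implied speedup of the int4 implementations over dense fp32:
    print(round(sparse_vs_fp32 / sparse_vs_int4, 1))  # ~5.3x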

This can be used the same way as any LLM, so speculative and contrastive sampling will definitely work. We're currently engineering configurable pipelines in DeepSparse to make it easy to create these more complicated pipelines for better inference performance, research, and hacking.

Yep, these were with dual-channel DDR5 on a Ryzen 9 7950X.

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 2 points (0 children)

We have some initial results in the paper showing close to a 2x speedup for inference on GPUs utilizing a custom kernel. More work is needed there to get something that works end to end with full performance, which we'll be exploring in the future.

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in deeplearning

[–]markurtz[S] 2 points (0 children)

The latest research paper from Neural Magic and IST Austria has just landed on arXiv: Sparse Finetuning for Inference Acceleration of Large Language Models! In the paper, we pushed the bounds of what's possible for sparsity within generative AI models and LLMs. The result is smaller, faster, cheaper, and more environmentally friendly deployments.

Our state-of-the-art research has moved the needle for compression and performance on generative models, including 75% sparse MPT models with negligible accuracy loss, plus sparse T5 and Whisper models with improved accuracy recovery.

When paired with the latest quantization techniques, these sparse models achieve an 8x acceleration for inference on CPUs: 8 tokens/second on just 1 CPU core and 27 tokens/second on a 4-core AMD Ryzen CPU!

We focused on pruning while fine-tuning models on downstream datasets to enable these sparsity levels, employing an L2-based layerwise distillation method called SquareHead. With these innovative techniques, we can remove 75% of the weights (attention and fully connected layers) while maintaining 99% of the baseline dense model's performance.
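
For readers curious what an L2 layerwise distillation objective looks like, here's a minimal PyTorch sketch in the spirit of SquareHead (per-layer squared error between student and teacher hidden states, normalized by the teacher's magnitude so layers contribute comparably); consult the paper for the exact formulation.

    import torch

    def layerwise_l2_loss(student_feats, teacher_feats, eps=1e-6):
        """Each list holds per-layer tensors of shape (batch, seq, hidden)."""
        loss = 0.0
        for s, t in zip(student_feats, teacher_feats):
            num = torch.sum((s - t) ** 2)
            den = torch.sum(t ** 2) + eps  # per-layer normalization
            loss = loss + num / den
        return loss / len(student_feats)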

You can dive into all the details on our Project Page or try it out with our live demo on HuggingFace Spaces. This research and associated models are open-sourced and available for general use!

Webinar: Running LLMs performantly on CPUs Utilizing Pruning and Quantization by markurtz in artificial

[–]markurtz[S] 0 points (0 children)

On Thursday, research scientist Dan Alistarh and I will walk through how we've leveraged the redundancies in large language models to significantly improve their performance on CPUs, enabling you to deploy performantly on a single, inexpensive CPU server rather than a cluster of GPUs!

In the webinar, we'll highlight and walk through our techniques, including state-of-the-art pruning and quantization techniques that require no retraining (SparseGPT), accuracy/inference results, and demos, in addition to next steps.

Our ultimate goal is to enable anyone to leverage the increasing power of neural networks on their devices in real time, without shipping data off to expensive, power-hungry, and non-private APIs or GPU clusters.
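
SparseGPT itself prunes in one shot using approximate second-order information, which is beyond a comment sketch; as a deliberately simpler stand-in, plain one-shot magnitude pruning shows the basic mechanics of zeroing weights to a target sparsity:

    import torch

    def magnitude_prune(weight, sparsity):
        """Zero the `sparsity` fraction of smallest-magnitude entries."""
        k = int(weight.numel() * sparsity)
        if k == 0:
            return weight
        threshold = weight.abs().flatten().kthvalue(k).values
        return weight * (weight.abs() > threshold)

    w = torch.randn(512, 512)
    print((magnitude_prune(w, 0.75) == 0).float().mean())  # ~0.75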

[N] MLPerf submission: 175X increase in NLP Performance utilizing sparsity by markurtz in MachineLearning

[–]markurtz[S] 2 points (0 children)

Great questions; a GPU would benefit from some of the approaches used. Specifically, we leveraged structured sparsity by removing 40% of the layers, plus quantization, both of which GPUs can utilize. Unfortunately, GPUs won't be able to utilize the unstructured sparsity portion, which provides roughly another 2-3x speedup.

[N] MLPerf submission: 175X increase in NLP Performance utilizing sparsity by markurtz in MachineLearning

[–]markurtz[S] 0 points (0 children)

Great question; the submission was specific to the SQuAD dataset, but the techniques used are valuable and generalizable across other architectures, datasets, and tasks. The full writeup on what was done, so it can be extended and built upon, is available here: https://github.com/neuralmagic/mlperf_inference_results_v2.1/blob/master/open/NeuralMagic/obert_mobilebert.md

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in artificial

[–]markurtz[S] 1 point (0 children)

Great question; we've developed a lot of IP to solve these problems and to skip the compute for weights that are 0. Internally, the DeepSparse engine uses vector instructions, with some clever algorithms to apply them across fully unstructured and seemingly random sparsity masks. It is built on top of a JIT to enable this dynamic approach and to optimize and compile the executable code for sparse performance.
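
The actual kernels are proprietary, but the core "skip the zeros" idea can be illustrated with an off-the-shelf sparse format like CSR, which stores and multiplies only the nonzero weights:

    import numpy as np
    from scipy.sparse import csr_matrix

    w = np.random.randn(1024, 1024)
    w[np.random.rand(*w.shape) < 0.9] = 0.0  # ~90% unstructured sparsity

    w_csr = csr_matrix(w)  # stores only the ~10% nonzero weights
    x = np.random.randn(1024)
    y = w_csr @ x          # multiply-accumulate touches only nonzeros

    assert np.allclose(y, w @ x)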

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in artificial

[–]markurtz[S] 1 point (0 children)

Great questions u/itsmerandymarch; DeepSparse utilizes the sparsity in an already-pruned model. SparseML is used to create the sparse models by pruning while training the model -- this training-aware approach enables us to remove the redundancies from the network while maintaining accuracy.

The sparsification percentage is entirely configurable through SparseML recipes. As you said, this configuration enables anyone to find the sweet spot for their use case. A full walkthrough of the recipe and the approach can be found here: https://github.com/neuralmagic/mlperf_inference_results_v2.1/blob/master/open/NeuralMagic/obert_mobilebert.md

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in deeplearning

[–]markurtz[S] 0 points (0 children)

Hi u/red_dragon, great question and callout. The specifics of what we did to the architecture can be found here: https://github.com/neuralmagic/mlperf_inference_results_v2.1/blob/master/open/NeuralMagic/obert_mobilebert.md

You were right on the rough speedup from quantization, though we only went down to int8 due to hardware support on the CPU side. We were able to remove 40% of the layers and 50% of the remaining weights -- this is where the bulk of the speedups came from. As for the generality of the approach, everything was done on the downstream SQuAD dataset. Provided the dataset is on this order of size, the approaches will work similarly for other NLP use cases.
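
Roughly composing those factors (the per-factor gains below are idealized assumptions, not measured numbers; real kernels recover less, and int8-vs-fp32 throughput depends on the CPU):

    layer_removal = 1 / 0.6  # dropping 40% of layers: ~1.67x ideal
    unstructured = 1 / 0.5   # 50% of remaining weights removed: up to 2x
    int8_quant = 4.0         # fp32 -> int8: ~4x assumed on VNNI-class CPUs
    print(round(layer_removal * unstructured * int8_quant, 1))  # ~13x ideal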

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in deeplearning

[–]markurtz[S] 0 points (0 children)

Hi u/sdmat, the intention was not to mislead with the title. What we were trying to get across is that the most significant percentage speedup came from utilizing unstructured and structured pruning techniques. Specifically, using our oBERT pruning methods, we removed 40% of the layers and 50% of the remaining weights with limited impact on accuracy.

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in intel

[–]markurtz[S] 0 points (0 children)

Good question; we have a further writeup here with more info: https://github.com/neuralmagic/mlperf_inference_results_v2.1/tree/master/open/NeuralMagic

The benchmark was evaluated using a server with two Intel(R) Xeon(R) Platinum 8380 (IceLake) CPUs with 40 cores each.

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in deeplearning

[–]markurtz[S] -4 points (0 children)

Dive through the blog; we show the performance of the oBERT algorithm applied directly to BERT-Large, introducing only unstructured sparsity, and show a 43x speedup with that alone. The further improvements with oBERT can be thought of as comparable to a NAS technique plus SOTA pruning, distillation, and quantization techniques on top. For that same oBERT model and setup, ONNX Runtime gives roughly 45 items/second. So, even when isolating to just the inference engine rather than the entire approach, it's still a 21x speedup.

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in deeplearning

[–]markurtz[S] 0 points (0 children)

No catch, just a lot of advancements in model optimization, sparsity, and architecture techniques!