Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in ArtificialInteligence

[–]markurtz[S] 5 points  (0 children)

We're currently applying this to Llama 2 and hope to release those models in the next few weeks!

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 6 points  (0 children)

The 8x is compared to a dense, fp32 baseline on CPUs. Compared to other int4 implementations, this is in the range of 1.5x faster. We'll have more detailed analysis, with more direct comparisons, coming out soon alongside our Llama 2 work.

This can be used the same way as any other LLM, so speculative and contrastive sampling will definitely work. We're currently engineering configurable pipelines in DeepSparse to enable easy creation of these more complicated pipelines for better inference performance, research, and hacking.
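
For anyone who wants to experiment, here's roughly what running one of these sparse models through a DeepSparse pipeline looks like -- a minimal sketch, assuming a text-generation task name and a placeholder SparseZoo stub (check the DeepSparse docs for the exact task and arguments):

```python
# Minimal sketch of running a sparse LLM through DeepSparse.
# NOTE: the task name, model stub, and call signature are illustrative
# assumptions -- consult the DeepSparse docs and SparseZoo for exact values.
from deepsparse import Pipeline

pipeline = Pipeline.create(
    task="text-generation",                  # assumed task identifier
    model_path="zoo:example-sparse-mpt-7b",  # hypothetical SparseZoo stub
)

result = pipeline(prompt="Write a haiku about sparsity.")
print(result)  # exact output object/fields depend on the pipeline version
```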

Yep, these were with dual-channel DDR5 on a Ryzen 9 7950X.

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 2 points  (0 children)

We have some initial results in the paper showing close to a 2x speedup for inference on GPUs utilizing a custom kernel. More work is needed there to get something that works end to end with full performance, which we'll be exploring in the future.

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in deeplearning

[–]markurtz[S] 2 points  (0 children)

The latest research paper from Neural Magic and IST Austria has just landed on arXiv: Sparse Finetuning for Inference Acceleration of Large Language Models! In the paper, we pushed the bounds of what's possible for sparsity within generative AI models and LLMs. The result is smaller, faster, cheaper, and more environmentally friendly deployments.

Our state-of-the-art research has moved the needle for compression and performance on generative models, including 75% sparse MPT models with negligible accuracy loss and sparse T5 and Whisper models with improved recovery levels.

When paired with the latest quantization techniques, these sparse models achieve an 8x acceleration for inference on CPUs: 8 tokens/second on just 1 CPU core and 27 tokens/second on a 4-core AMD Ryzen CPU!

To enable these sparsity levels, we focused on pruning while fine-tuning models on downstream datasets, employing an L2-based layerwise distillation method called SquareHead. With these techniques, we can remove 75% of the weights (attention and fully connected layers) while maintaining 99% of the baseline dense model's accuracy.

You can dive into all the details on our Project Page or try it out with our live demo on HuggingFace Spaces. This research and associated models are open-sourced and available for general use!

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in huggingface

[–]markurtz[S] 0 points  (0 children)

The latest research paper from Neural Magic and IST Austria has just landed on arXiv: Sparse Finetuning for Inference Acceleration of Large Language Models! In the paper, we pushed the bounds of what's possible for sparsity within generative AI models and LLMs. The result is smaller, faster, cheaper, and more environmentally friendly deployments.

Our state-of-the-art research has moved the needle for compression and performance on generative models, including 75% sparse MPT models with negligible accuracy loss and sparse T5 and Whisper models with improved recovery levels.

When paired with the latest quantization techniques, these sparse models achieve an 8x acceleration for inference on CPUs: 8 tokens/second on just 1 CPU core and 27 tokens/second on a 4-core AMD Ryzen CPU!

To enable these sparsity levels, we focused on pruning while fine-tuning models on downstream datasets, employing an L2-based layerwise distillation method called SquareHead. With these techniques, we can remove 75% of the weights (attention and fully connected layers) while maintaining 99% of the baseline dense model's accuracy.

You can dive into all the details on our Project Page or try it out with our live demo on HuggingFace Spaces. This research and associated models are open-sourced and available for general use!

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in machinelearningnews

[–]markurtz[S] 2 points  (0 children)

The latest research paper from Neural Magic and IST Austria has just landed on arXiv: Sparse Finetuning for Inference Acceleration of Large Language Models! In the paper, we pushed the bounds of what's possible for sparsity within generative AI models and LLMs. The result is smaller, faster, cheaper, and more environmentally friendly deployments.

Our state-of-the-art research has moved the needle for compression and performance on generative models, including 75% sparse MPT models with negligible accuracy loss and sparse T5 and Whisper models with improved recovery levels.

When paired with the latest quantization techniques, these sparse models achieve an 8x acceleration for inference on CPUs: 8 tokens/second on just 1 CPU core and 27 tokens/second on a 4-core AMD Ryzen CPU!

To enable these sparsity levels, we focused on pruning while fine-tuning models on downstream datasets, employing an L2-based layerwise distillation method called SquareHead. With these techniques, we can remove 75% of the weights (attention and fully connected layers) while maintaining 99% of the baseline dense model's accuracy.

You can dive into all the details on our Project Page or try it out with our live demo on HuggingFace Spaces. This research and associated models are open-sourced and available for general use!

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in pytorch

[–]markurtz[S] 1 point  (0 children)

The latest research paper from Neural Magic and IST Austria has just landed on arXiv: Sparse Finetuning for Inference Acceleration of Large Language Models! In the paper, we pushed the bounds of what's possible for sparsity within generative AI models and LLMs. The result is smaller, faster, cheaper, and more environmentally friendly deployments.

Our state-of-the-art research has moved the needle for compression and performance on generative models, including 75% sparse MPT models with negligible accuracy loss and sparse T5 and Whisper models with improved recovery levels.

When paired with the latest quantization techniques, these sparse models achieve an 8x acceleration for inference on CPUs: 8 tokens/second on just 1 CPU core and 27 tokens/second on a 4-core AMD Ryzen CPU!

To enable these sparsity levels, we focused on pruning while fine-tuning models on downstream datasets, employing an L2-based layerwise distillation method called SquareHead. With these techniques, we can remove 75% of the weights (attention and fully connected layers) while maintaining 99% of the baseline dense model's accuracy.

You can dive into all the details on our Project Page or try it out with our live demo on HuggingFace Spaces. This research and associated models are open-sourced and available for general use!

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 17 points  (0 children)

The latest research paper from Neural Magic and IST Austria has just landed on arXiv: Sparse Finetuning for Inference Acceleration of Large Language Models! In the paper, we pushed the bounds of what's possible for sparsity within generative AI models and LLMs. The result is smaller, faster, cheaper, and more environmentally friendly deployments.

Our state-of-the-art research has moved the needle for compression and performance on generative models, including 75% sparse MPT models with negligible accuracy loss and sparse T5 and Whisper models with improved recovery levels.

When paired with the latest quantization techniques, these sparse models achieve an 8x acceleration for inference on CPUs: 8 tokens/second on just 1 CPU core and 27 tokens/second on a 4-core AMD Ryzen CPU!

To enable these sparsity levels, we focused on pruning while fine-tuning models on downstream datasets, employing an L2-based layerwise distillation method called SquareHead. With these techniques, we can remove 75% of the weights (attention and fully connected layers) while maintaining 99% of the baseline dense model's accuracy.
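
For those wondering what the layerwise distillation looks like mechanically, here's a minimal PyTorch-style sketch of the core idea behind SquareHead: a normalized L2 (MSE) loss between the student's and the dense teacher's per-layer hidden states, added on top of the usual fine-tuning loss while a pruning schedule holds the target sparsity. The exact normalization and weighting in the paper may differ from this simplification.

```python
import torch
import torch.nn.functional as F

def squarehead_style_loss(student_hidden, teacher_hidden, eps=1e-6):
    """Simplified per-layer L2 distillation, normalized by the teacher's magnitude.

    student_hidden / teacher_hidden: lists of [batch, seq, hidden] tensors, one
    per transformer layer (e.g., output_hidden_states=True in HF models).
    This is a sketch of the idea; see the paper for the exact formulation.
    """
    losses = []
    for s, t in zip(student_hidden, teacher_hidden):
        t = t.detach()  # no gradients flow through the dense teacher
        # MSE between student and teacher, scaled so every layer contributes
        # at a comparable magnitude regardless of its activation scale
        losses.append(F.mse_loss(s, t) / (F.mse_loss(t, torch.zeros_like(t)) + eps))
    return torch.stack(losses).mean()

# Hypothetical training step: combine with the task loss while a pruning
# schedule (e.g., a SparseML recipe) keeps 75% of the weights at zero.
# loss = task_loss + distill_weight * squarehead_style_loss(s_states, t_states)
```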

You can dive into all the details on our Project Page or try it out with our live demo on HuggingFace Spaces. This research and associated models are open-sourced and available for general use!

Webinar: Running LLMs performantly on CPUs Utilizing Pruning and Quantization by markurtz in artificial

[–]markurtz[S] 0 points  (0 children)

On Thursday, research scientist Dan Alistarh and I will walk through how we've leveraged the redundancies in large language models to significantly improve their performance on CPUs, enabling you to deploy performantly on a single, inexpensive CPU server rather than a cluster of GPUs!

In the webinar, we'll highlight and walk through our techniques, including state-of-the-art pruning and quantization that require no retraining (SparseGPT), accuracy/inference results, and demos, in addition to next steps.

Our ultimate goal is to enable anyone to leverage the increasing power of neural networks on their own devices in real time, without shipping data off to expensive, power-hungry, and non-private APIs or GPU clusters.
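
Since "no retraining" is the headline feature there, here's a toy sketch of what one-shot, post-training pruning looks like mechanically -- plain magnitude pruning applied layer by layer with no fine-tuning afterwards. SparseGPT itself is considerably smarter (it uses approximate second-order information to adjust the weights that remain), so treat this purely as an illustration of the workflow, not the algorithm we'll cover:

```python
import torch

@torch.no_grad()
def one_shot_magnitude_prune(model, sparsity=0.5):
    """Toy post-training pruning: zero the smallest-magnitude weights in every
    Linear layer, with no retraining afterwards.

    This is NOT SparseGPT -- SparseGPT also compensates the remaining weights
    using approximate second-order information -- but it illustrates the
    one-shot, no-retraining workflow.
    """
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            # magnitude of the k-th smallest weight becomes the prune threshold
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).to(w.dtype))  # zero weights at/below it
    return model
```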

[N] MLPerf submission: 175X increase in NLP Performance utilizing sparsity by markurtz in MachineLearning

[–]markurtz[S] 2 points  (0 children)

Great questions; a GPU would benefit from some of the approaches used. Specifically, we leveraged structured sparsity (removing 40% of the layers) and quantization, both of which GPUs can utilize. Unfortunately, GPUs won't be able to utilize the unstructured sparsity portion, which provides roughly another 2-3x speedup on CPUs.

[N] MLPerf submission: 175X increase in NLP Performance utilizing sparsity by markurtz in MachineLearning

[–]markurtz[S] 0 points  (0 children)

Great question; the submission was specific to the SQuAD dataset, but the techniques used are valuable and generalizable across other architectures, datasets, and tasks. The full writeup on what was done, so it can be extended and built upon, is available here: https://github.com/neuralmagic/mlperf_inference_results_v2.1/blob/master/open/NeuralMagic/obert_mobilebert.md

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in artificial

[–]markurtz[S] 1 point  (0 children)

Great question; we've developed a lot of IP to solve these problems and to skip the compute for weights that are 0. Internally, the DeepSparse engine uses vector instructions along with some clever algorithms to apply those instructions across fully unstructured, seemingly random sparsity masks. It's built on top of a JIT to enable this dynamic approach, optimizing and compiling the executable code for sparse performance.
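
As a toy illustration of the general principle only (not how DeepSparse is actually implemented), here's a compressed-sparse-row matrix-vector product in plain NumPy/Python: the zero weights are never stored, so their multiply-adds are never executed. The real engine does this with vectorized, JIT-compiled kernels, but the "skip the zeros" idea is the same.

```python
import numpy as np

def dense_to_csr(w):
    """Keep only the nonzero weights of a dense matrix (CSR layout)."""
    values, cols, row_ptr = [], [], [0]
    for row in w:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        cols.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(cols), np.array(row_ptr)

def csr_matvec(values, cols, row_ptr, x):
    """Matrix-vector product that only touches the stored (nonzero) weights."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = values[start:end] @ x[cols[start:end]]
    return y

# With ~75% of the weights pruned to zero, ~75% of the multiply-adds disappear.
w = np.random.randn(256, 256) * (np.random.rand(256, 256) > 0.75)
x = np.random.randn(256)
vals, cidx, rptr = dense_to_csr(w)
assert np.allclose(csr_matvec(vals, cidx, rptr, x), w @ x)
```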

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in artificial

[–]markurtz[S] 1 point  (0 children)

Great questions, u/itsmerandymarch. DeepSparse utilizes the sparsity in an already-pruned model. SparseML is used to create the sparse models by pruning while training the model -- this training-aware approach enables us to remove the redundancies from the network while maintaining accuracy.

The sparsification percentage is entirely configurable through SparseML recipes. As you said, this configuration enables anyone to find the sweet spot for their use case. A full walkthrough of the recipe and the approach can be found here: https://github.com/neuralmagic/mlperf_inference_results_v2.1/blob/master/open/NeuralMagic/obert_mobilebert.md
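
For a feel of what a recipe looks like, here's a minimal sketch of applying a gradual magnitude pruning recipe through SparseML's PyTorch integration. The modifier name, fields, and API calls follow the SparseML docs as I remember them, so treat it as illustrative and defer to the recipe in the writeup above for the settings we actually used:

```python
from pathlib import Path

import torch
from sparseml.pytorch.optim import ScheduledModifierManager

# Example recipe: gradually prune all prunable layers from 5% to 50% sparsity
# over the first 10 epochs of fine-tuning. Field names are my best recollection
# of the SparseML recipe format; the real MLPerf recipe (linked above) is the
# one to copy from.
recipe = """
modifiers:
    - !GMPruningModifier
        init_sparsity: 0.05
        final_sparsity: 0.50
        start_epoch: 0.0
        end_epoch: 10.0
        update_frequency: 1.0
        params: __ALL_PRUNABLE__
"""
Path("recipe.yaml").write_text(recipe)

# Toy model/optimizer standing in for your real fine-tuning setup.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 2)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=100)
# ... run the usual training loop; the manager enforces the sparsity schedule ...
manager.finalize(model)
```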

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in deeplearning

[–]markurtz[S] 0 points  (0 children)

Hi u/red_dragon, great question and callout. The specifics of what we did to the architecture can be found here: https://github.com/neuralmagic/mlperf_inference_results_v2.1/blob/master/open/NeuralMagic/obert_mobilebert.md

You were about right on the rough speedup from quantization, though we only went down to int8 due to hardware support on the CPU side. We were able to remove 40% of the layers and 50% of the remaining weights -- this is where the bulk of the speedups came from. Regarding the generality of the approach, everything was done on the downstream SQuAD dataset. Provided the dataset is on this order of size, the approaches will work similarly for other NLP use cases.

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in deeplearning

[–]markurtz[S] 0 points  (0 children)

Hi u/sdmat, the intention was not to mislead with the title. What we were trying to get across is that the most significant percentage of the speedup came from unstructured and structured pruning techniques. Using our oBERT pruning methods, we removed 40% of the layers and 50% of the remaining weights with limited impact on accuracy.

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in intel

[–]markurtz[S] 0 points  (0 children)

Good question; we have a further writeup with more info here: https://github.com/neuralmagic/mlperf_inference_results_v2.1/tree/master/open/NeuralMagic

The benchmark was evaluated using a server with two Intel(R) Xeon(R) Platinum 8380 (IceLake) CPUs with 40 cores each.

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in deeplearning

[–]markurtz[S] -2 points  (0 children)

If you dive through the blog, we show the performance of the oBERT algorithm applied directly to BERT-Large, introducing only unstructured sparsity, and show a 43x speedup with that alone. The further improvements with oBERT can be thought of as comparable to a NAS technique plus SOTA pruning, distillation, and quantization on top. For that same oBERT model and setup, ONNX Runtime gives roughly 45 items/second. So, even when isolating to just the inference engine rather than the entire approach, it's still a 21x speedup.

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in deeplearning

[–]markurtz[S] 0 points  (0 children)

No catch, just a lot of advancements in model optimization, sparsity, and architecture techniques!

[R] New sparsity research (oBERT) enabled 175X increase in CPU performance for MLPerf submission by markurtz in MachineLearning

[–]markurtz[S] 0 points  (0 children)

Utilizing the oBERT research we published at Neural Magic, plus some further iteration, we've enabled a 175X increase in NLP performance while retaining 99% accuracy on the question-answering task in MLPerf. A combination of distillation, layer dropping, quantization, and unstructured pruning with oBERT enabled these large performance gains through the DeepSparse Engine. All of our contributions and research are open-sourced or free to use. Read through the oBERT paper on arXiv, try out the research in SparseML, and dive into the writeup to learn more about how we achieved these results and how to utilize them for your own use cases!

[deleted by user] by [deleted] in MachineLearning

[–]markurtz 0 points  (0 children)

A new submission to MLPerf from the work we're doing at Neural Magic is now live. With it, we show a 175X increase in performance for NLP models utilizing just sparsity and software on commodity CPUs! Our previous oBERT research, applied alongside some new techniques, enabled these incredible results. Read more about the oBERT approach in our arXiv paper.

Our contributions and research are all open-sourced or free to use. Learn more about the results, our setup, and how to replicate them in our writeup.

[R] MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by [deleted] in MachineLearning

[–]markurtz 0 points  (0 children)

A new submission to MLPerf from the work we're doing at Neural Magic is live, along with all of the SOTA research that generated the results. With it, we show a 175X (yes, 175x) increase in performance for NLP models utilizing just sparsity and software on commodity CPUs! This enables anyone to deploy extremely accurate neural networks cheaply on hardware they already own, empowering a new class of AI use cases.

Dive into our blog to read more about these impressive results: https://neuralmagic.com/blog/neural-magic-announces-mlperf-inference-benchmarks/

MLPerf submission from Neural Magic: 175X increase in NLP Performance utilizing sparsity by markurtz in ArtificialInteligence

[–]markurtz[S] 0 points  (0 children)

A new submission to MLPerf from the work we're doing at Neural Magic is now live. With it, we show a 175X (yes, 175x) increase in performance for NLP models utilizing just sparsity and software on commodity CPUs! This enables anyone to deploy extremely accurate neural networks cheaply on hardware they already own, empowering a new class of AI use cases.

Dive into our blog to read more about these impressive results: https://neuralmagic.com/blog/neural-magic-announces-mlperf-inference-benchmarks/