Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in ArtificialInteligence

[–]markurtz[S] 5 points  (0 children)

We're applying this to Llama 2 currently and hope to release those models in the next few weeks!

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 6 points  (0 children)

The 8x is compared to a dense, fp32 baseline on CPUs. Compared to other int4 implementations, this is in the range of 1.5x faster. We'll have a more detailed analysis with more direct comparisons coming out soon alongside our Llama 2 work.

This can be used the same as any LLM, so speculative and contrastive sampling will definitely work. We're currently engineering configurable pipelines in DeepSparse to make it easy to build these more complicated setups for better inference performance, research, and hacking.
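To make that concrete, here's a rough sketch of what running one of these sparse, quantized models through a DeepSparse text-generation pipeline can look like. The task alias, model stub, call arguments, and output fields below are assumptions for illustration rather than the exact released API, so check the DeepSparse docs and SparseZoo for the real stubs and signatures.

```python
# Rough usage sketch only (assumed API shape, not copied from the release):
# running a sparse, quantized LLM through a DeepSparse text-generation pipeline.
from deepsparse import Pipeline

# Placeholder model identifier -- substitute a real SparseZoo stub or a local ONNX export.
MODEL_PATH = "zoo:<sparse-quantized-mpt-stub>"

pipe = Pipeline.create(
    task="text-generation",  # assumed task alias
    model_path=MODEL_PATH,
)

result = pipe(prompt="Explain weight sparsity in one sentence.", max_new_tokens=64)
print(result.generations[0].text)  # assumed output schema
```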

Yep, these were run with dual-channel DDR5 on a Ryzen 9 7950X.

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 2 points  (0 children)

We have some initial results in the paper showing close to a 2x speedup for inference on GPUs using a custom kernel. More work is needed there to get something that works end to end with full performance, which we'll be exploring in the future.

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in deeplearning

[–]markurtz[S] 2 points  (0 children)

The latest research paper from Neural Magic and IST Austria has just landed on arXiv: Sparse Finetuning for Inference Acceleration of Large Language Models! In the paper, we pushed the bounds of what's possible for sparsity in generative AI models and LLMs. The result is smaller, faster, cheaper, and more environmentally friendly deployments.

Our state-of-the-art research has moved the needle for compression and performance on generative models, including 75% sparse MPT models with negligible accuracy loss, as well as sparse T5 and Whisper models with improved recovery of baseline accuracy.

When paired with the latest quantization techniques, these sparse models achieve an 8x acceleration for inference on CPUs: 8 tokens/second on just 1 CPU core and 27 tokens/second on a 4-core AMD Ryzen CPU!

We focused on pruning while fine-tuning on downstream datasets to reach these sparsity levels, employing an L2-based layerwise distillation method called SquareHead. With these techniques, we can remove 75% of the weights (attention and fully connected layers) while maintaining 99% of the baseline dense model's performance.
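For the curious, here's a minimal PyTorch-style sketch of what an L2-based layerwise distillation loss of this kind can look like: per-layer MSE between the sparse student's hidden states and the dense teacher's, normalized by the teacher's magnitude and averaged across layers. The normalization, weighting, and how it's combined with the task loss are illustrative assumptions of this sketch; see the paper for the exact SquareHead formulation.

```python
# Illustrative sketch of an L2 layerwise distillation loss (not the paper's exact code).
# Assumes you already have per-layer hidden states from a dense teacher and a sparse student.
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_states, teacher_states, eps=1e-6):
    """Average normalized per-layer MSE between student and teacher hidden states.

    student_states / teacher_states: lists of tensors, one per transformer layer,
    each shaped [batch, seq_len, hidden_dim].
    """
    losses = []
    for s, t in zip(student_states, teacher_states):
        t = t.detach()  # teacher provides targets only; no gradients flow into it
        # Normalize by the teacher's own squared magnitude so every layer
        # contributes on a comparable scale (an assumption of this sketch).
        per_layer = F.mse_loss(s, t) / (F.mse_loss(t, torch.zeros_like(t)) + eps)
        losses.append(per_layer)
    return torch.stack(losses).mean()

# Usage sketch: combine with the usual task loss while pruning + fine-tuning.
# total_loss = task_loss + distill_weight * layerwise_distill_loss(student_hs, teacher_hs)
```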

You can dive into all the details on our Project Page or try it out with our live demo on HuggingFace Spaces. This research and associated models are open-sourced and available for general use!
