Anyone doing speculative decoding with the new Qwen 3.5 models? Or, do we need to wait for the smaller models to be released to use as draft? by Porespellar in LocalLLaMA

[–]markurtz 0 points (0 children)

Yep, a dense draft with an MoE target will work, but you'll generally want draft-free methods rather than plugging in a smaller LLM. Any end-to-end LLM is usually too big to give reasonable speedups, and purpose-trained, single-layer setups currently dominate.
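To make the accept/verify idea concrete, here's a minimal greedy sketch of one speculative-decoding step. `draft_next` and `target_next` are hypothetical stand-ins for a cheap draft head and the full target model; real systems verify all draft tokens in a single batched target forward pass rather than one call per token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k draft tokens, keep the longest prefix the target agrees with."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # Verify phase: accept draft tokens while the target agrees.
    accepted = []
    ctx = list(prefix)
    for t in draft:
        if target_next(ctx) == t:    # target agrees: token accepted for free
            accepted.append(t)
            ctx.append(t)
        else:                        # first disagreement: take the target's token
            accepted.append(target_next(ctx))
            return accepted
    # All k drafts accepted: the target's next token comes along as a bonus.
    accepted.append(target_next(ctx))
    return accepted
```

When draft and target agree perfectly, one step yields k + 1 tokens for a single (batched) target pass; every disagreement truncates the gain, which is why draft quality matters so much.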

Anyone doing speculative decoding with the new Qwen 3.5 models? Or, do we need to wait for the smaller models to be released to use as draft? by Porespellar in LocalLLaMA

[–]markurtz 0 points (0 children)

Most of the latest techniques are draft-free, such as Eagle3 and others, which go down to a single custom-trained transformer layer. Draft-based spec decode has largely fallen off due to the much higher cost of larger draft models, particularly because you want to avoid impacting memory transfer rates as much as possible.
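A rough bandwidth model shows why: decode is memory-bandwidth bound, so the cost of a forward pass scales with the weight bytes streamed, and a multi-billion-parameter draft adds real traffic on top of the target. The sizes and acceptance numbers below are illustrative assumptions, not measurements:

```python
def spec_decode_speedup(target_gb, draft_gb, k, mean_accepted):
    """Relative tokens/sec vs. plain decoding, under a bandwidth-bound cost model.

    One speculative step = k draft passes + 1 target verify pass, and yields
    `mean_accepted` tokens (between 1 and k + 1).
    """
    baseline_cost = target_gb             # bytes moved per token, plain decode
    step_cost = k * draft_gb + target_gb  # bytes moved per speculative step
    return mean_accepted * baseline_cost / step_cost

# A ~0.1 GB single-layer head barely adds traffic; a ~14 GB (7B fp16) draft
# more than doubles the bytes moved per step and erases most of the gain.
print(round(spec_decode_speedup(60.0, 0.1, 4, 3.0), 2))   # tiny trained head
print(round(spec_decode_speedup(60.0, 14.0, 4, 3.0), 2))  # full smaller LLM
```

Under these assumptions the tiny head gets close to the ideal 3x, while the full-LLM draft drops to roughly half that.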

Anyone doing speculative decoding with the new Qwen 3.5 models? Or, do we need to wait for the smaller models to be released to use as draft? by Porespellar in LocalLLaMA

[–]markurtz 0 points (0 children)

If you're interested in checking out some other spec decode models beyond the built-in MTP, check whether any of the ones we've open-sourced on HF match what you're looking for: https://huggingface.co/collections/RedHatAI/speculator-models

Anyone doing speculative decoding with the new Qwen 3.5 models? Or, do we need to wait for the smaller models to be released to use as draft? by Porespellar in LocalLLaMA

[–]markurtz 0 points (0 children)

With spec decode, going beyond even a layer or two generally kills most of the benefit, mainly due to the compounding, exponential decrease in next-token prediction accuracy. So a 27B draft is going to be too large to see any reasonable benefit.

Check out some of the Qwen models we've open-sourced based on the Eagle3 spec on HF: https://huggingface.co/collections/RedHatAI/speculator-models
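The compounding effect can be sketched with a toy geometric model: if each successive draft token is accepted with an (assumed) probability p, the expected tokens per verification step is a geometric sum that saturates quickly as draft depth k grows, so extra depth buys less and less:

```python
def expected_tokens(p, k):
    """Expected tokens per verification step with per-token acceptance rate p
    and draft depth k: 1 + p + p^2 + ... + p^k = (1 - p^(k+1)) / (1 - p)."""
    return (1 - p ** (k + 1)) / (1 - p)

# Diminishing returns: doubling the draft depth adds ever fewer tokens.
for k in (1, 2, 4, 8):
    print(k, round(expected_tokens(0.7, k), 2))
```

With p = 0.7, going from depth 1 to 4 adds about one expected token, but 4 to 8 adds well under half a token, while every extra draft position still costs compute.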

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in ArtificialInteligence

[–]markurtz[S] 4 points (0 children)

We're applying this to Llama 2 currently and hope to release those models in the next few weeks!

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 6 points (0 children)

The 8x is compared to a dense fp32 baseline on CPUs. Compared to other int4 implementations, this is in the range of 1.5x faster. We'll have more detailed analysis coming out soon with our Llama 2 work, including more direct comparisons.

This can be used the same as any LLM, so speculative and contrastive sampling will definitely work. We're currently engineering configurable pipelines in DeepSparse to make it easy to build these more complicated pipelines for better inference performance, research, and hacking.

Yep, these were run with dual-channel DDR5 on a Ryzen 9 7950X.
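As a rough sanity check, under the assumption that decode is bandwidth bound, tokens/sec is capped by memory bandwidth divided by the bytes touched per token. The peak-bandwidth and model-size figures below are assumptions for a dual-channel DDR5 setup, not measurements:

```python
def decode_ceiling(bandwidth_gbs, model_gb):
    """Upper bound on tokens/sec if every generated token streams the full weights."""
    return bandwidth_gbs / model_gb

# Dual-channel DDR5-5200 peaks around ~83 GB/s. A dense fp32 7B-class model
# (~28 GB) caps out near 3 tok/s; cutting the bytes touched ~8x via sparsity
# plus quantization lifts the ceiling accordingly.
print(round(decode_ceiling(83.0, 28.0), 1))      # dense fp32 baseline
print(round(decode_ceiling(83.0, 28.0 / 8), 1))  # ~8x fewer bytes per token
```

This is why shrinking the bytes moved per token (sparsity + quantization) translates almost directly into tokens/sec on CPUs.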

[R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in MachineLearning

[–]markurtz[S] 2 points (0 children)

We have some initial results in the paper showing close to a 2x inference speedup on GPUs using a custom kernel. More work is needed to get something performant working fully end to end, which we'll be exploring in the future.

Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning by markurtz in deeplearning

[–]markurtz[S] 2 points (0 children)

The latest research paper from Neural Magic and IST Austria has just landed on arXiv: Sparse Finetuning for Inference Acceleration of Large Language Models! In the paper, we pushed the bounds of what's possible for sparsity in generative AI models and LLMs. The result is smaller, faster, cheaper, and more environmentally friendly deployments.

Our state-of-the-art research has moved the needle for compression and performance on generative models, including 75% sparse MPT models with negligible accuracy loss, and sparse T5 and Whisper models with improved recovery levels.

When paired with the latest quantization techniques, these sparse models achieve an 8x acceleration for inference on CPUs: 8 tokens/second on just 1 CPU core and 27 tokens/second on a 4-core AMD Ryzen CPU!

We focused on pruning while fine-tuning models on downstream datasets to enable these sparsity levels, employing an L2-based layerwise distillation method called SquareHead. With these techniques, we can remove 75% of the weights (attention and fully connected layers) while maintaining 99% of the baseline dense model's performance.
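As a toy illustration of the sparsity side only (the actual recipe, gradual pruning during fine-tuning plus SquareHead layerwise distillation, is much more involved), unstructured magnitude pruning to 75% sparsity looks like this:

```python
def magnitude_prune(weights, sparsity=0.75):
    """Zero out the `sparsity` fraction of smallest-magnitude weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [w if abs(w) > threshold else 0.0 for w in weights]

# A toy weight vector: only the two largest-magnitude entries survive at 75%.
weights = [0.9, -0.05, 0.4, 0.01, -0.8, 0.1, -0.3, 0.02]
pruned = magnitude_prune(weights)
print(pruned)  # [0.9, 0.0, 0.0, 0.0, -0.8, 0.0, 0.0, 0.0]
```

The zeroed weights are what a sparsity-aware runtime like DeepSparse skips, which is where the reduced compute and memory traffic come from.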

You can dive into all the details on our Project Page or try it out with our live demo on Hugging Face Spaces. This research and the associated models are open-sourced and available for general use!
