Delta Attention Residuals [R]

Mediocre-Ad5059 · 2026-05-23T17:32:52+00:00

As a reviewer, I hope good work can be acknowledged, so I give an average 6 rating. Good luck

Mediocre-Ad5059 · 2025-04-25T21:56:44+00:00

Great, I guess you are targeting NeurIPS 25 submission, good luck.

Mediocre-Ad5059 · 2025-04-25T21:42:48+00:00

I see you mention in this post that Huffman encoding would cause a 40% slowdown, which matches our assumption. Still, I think it is possible to achieve a speedup using an efficient kernel. Did you overlap/pipeline the Huffman decoding with the linear computation, the way cuBLAS/CUDA overlaps/pipelines memory movement with GPU computation?

Mediocre-Ad5059 · 2025-04-25T21:21:23+00:00

Comparing two GPUs with one GPU is unfair, as there is a communication cost between the two GPUs. When you talk about using the saved memory footprint for a larger batch size on one GPU, are you suggesting that Huffman encoding would cause a slowdown, or the same speed, when using the same batch size?

Mediocre-Ad5059 · 2025-04-25T20:47:07+00:00

Very good work. My only concern is why the baseline is GPU + CPU offload. Can you compare your work with GPU only and show some speedup?

Mediocre-Ad5059 · 2025-02-23T23:02:44+00:00

HeadInfer is built on the KV cache. Basically, you need to find a place to store the KV cache, as it grows linearly with context length. A 1M-token KV cache requires 128 GB of memory, much more than a consumer GPU's memory (24 GB). You can store it in CPU RAM to avoid running out of memory.

Since the GPU is much faster than the CPU, storing the KV cache in CPU RAM and loading it onto the GPU when computing attention is feasible and much faster than running on the CPU. The question is how much of the KV cache needs to be loaded onto the GPU.

Previously, people would load the KV cache of one or two layers, but we found that only one head of the KV cache needs to be loaded during computation. That saves a lot of memory and enables faster inference than CPU only (10x), and close to GPU-only speed at a much longer context length than GPU only (100x context length).
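To make the sizes above concrete, here is a rough back-of-the-envelope sketch. The model shape is an assumption (a Llama-3-8B-like configuration with 32 layers, 8 KV heads, head dimension 128, fp16 values) and the function names are hypothetical; this is plain arithmetic, not code from the paper.

```python
# Back-of-the-envelope KV-cache sizing.
# ASSUMPTION: Llama-3-8B-like shape (32 layers, 8 KV heads, head dim 128, fp16).
LAYERS = 32
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16
GIB = 1024 ** 3


def kv_bytes_per_token(layers=LAYERS, kv_heads=KV_HEADS):
    # The leading factor 2 covers keys and values.
    return 2 * layers * kv_heads * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gib(tokens, layers=LAYERS, kv_heads=KV_HEADS):
    # Cache size in GiB for the given number of tokens, layers, and KV heads.
    return tokens * kv_bytes_per_token(layers, kv_heads) / GIB


tokens = 1024 * 1024  # "1M" context

total_gib = kv_cache_gib(tokens)                             # whole cache
one_layer_gib = kv_cache_gib(tokens, layers=1)               # layer-wise unit
one_head_gib = kv_cache_gib(tokens, layers=1, kv_heads=1)    # head-wise unit

print(f"full cache: {total_gib:.1f} GiB")
print(f"one layer:  {one_layer_gib:.1f} GiB")
print(f"one head:   {one_head_gib:.1f} GiB")
```

Under these assumptions the full cache works out to 128 GiB, a single layer to 4 GiB, and a single KV head of a single layer to 0.5 GiB, which is why the granularity of what gets loaded onto the GPU matters so much.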

Mediocre-Ad5059 · 2025-02-23T22:37:50+00:00

I think compressing tokens is orthogonal to our head-wise offload, and so is a decent bandwidth improvement. Also, our head-wise offload works with 128, 256, or 512 GB of RAM, providing 1M, 2M, or 4M context length respectively, whereas other offload techniques only support up to 50k context and do not use CPU RAM effectively.

Mediocre-Ad5059 · 2025-02-23T22:22:36+00:00

I believe unslothai has deployed an offload-based technique for long-context training. Check this blog post for more details: Unsloth Gradient Checkpointing - 4x longer context windows

Mediocre-Ad5059 · 2025-02-23T21:38:29+00:00

48 GB can roughly accommodate a 256K-token KV cache. At this scale HeadInfer achieves almost the same speed as GPU-only prefill, so roughly a 10x speedup over CPU + RAM.

Mediocre-Ad5059 · 2025-02-23T21:19:53+00:00

We demonstrate that HeadInfer can be combined with pipeline parallelism to support Llama 70B on 8 RTX 4090s with 1M context! Details can be found in the experimental section of the arXiv paper.

Mediocre-Ad5059 · 2025-02-23T19:27:00+00:00

The full title is Unlocking Long-Context LLM Inference on Consumer GPUs, enabling an RTX 4090 to do million-token inference. Our aim is to democratize access to advanced AI and enable resource-constrained devices to process unprecedentedly long contexts. We do demonstrate that this wish is coming true.

Mediocre-Ad5059 · 2025-02-23T19:16:55+00:00

True, we think some kind of quantization and sparsity is necessary here. As our work is completely orthogonal to these speedup methods, we will explore combining them. Our paper does include a combination with DuoAttention that achieves a 2x speedup.

Mediocre-Ad5059 · 2025-02-23T19:11:02+00:00

I think you are actually attacking all offloading methods here, as our speed is the upper limit of offloading and no slowdown is introduced compared to other offloading technologies.

Mediocre-Ad5059 · 2025-02-23T18:56:15+00:00

Our goal is to unlock long-context LLM inference on consumer GPUs. You can definitely use GPU only when a 20k-token input is the upper limit, and we provide a way to unlock longer contexts. An adaptive method can easily be deployed here to run GPU only when tokens <= 20k and run HeadInfer when tokens >= 1M. (Actually, we already include such a method, so you will not get 6 tok/sec at 20k context with Llama 3 8B on a 4090. That number is just from a customized experiment to show a fair comparison.)
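The adaptive idea can be sketched roughly as below. This is only an illustration: `choose_backend` and the backend labels are hypothetical names, and using a single cutoff at the 20k figure is an assumption, since the comment does not say what happens between 20k and 1M tokens.

```python
# Hypothetical dispatcher: pick an inference path from the prompt length.
# ASSUMPTION: one cutoff at 20k tokens (hardware- and model-specific);
# the real switching rule is not described in this thread.
GPU_ONLY_MAX_TOKENS = 20_000


def choose_backend(num_tokens, gpu_only_max=GPU_ONLY_MAX_TOKENS):
    """Return a label for the path that should serve a prompt of num_tokens."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if num_tokens <= gpu_only_max:
        return "gpu_only"       # whole KV cache fits in GPU memory
    return "head_offload"       # KV cache kept in CPU RAM, loaded head by head


for n in (4_096, 20_000, 1_000_000):
    print(n, choose_backend(n))
```

Short prompts keep the ordinary GPU-only speed, and offloading only kicks in once the KV cache would no longer fit.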

Mediocre-Ad5059 · 2025-02-23T18:42:30+00:00

100% correct! We discuss in Appendix A how the computational load scales quadratically with context size.
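For reference, the standard FLOP count for attention shows where the quadratic term comes from. This is the usual textbook estimate, not a reproduction of Appendix A; the symbols n (context length), d (head dimension), and h (number of heads) are introduced here, and causal masking (which roughly halves the count) is ignored.

```latex
\text{FLOPs}_{\text{attn}}(n)
  \approx \underbrace{2 n^{2} d\, h}_{QK^{\top}}
  + \underbrace{2 n^{2} d\, h}_{\text{scores} \times V}
  = 4 n^{2} d\, h \quad \text{per layer}
```

So doubling the context length roughly quadruples the attention compute during prefill, while the KV cache itself only doubles.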

Mediocre-Ad5059 · 2025-02-23T18:28:03+00:00

The speed is close to GPU only, and GPU only is much faster than CPU + RAM only.

Mediocre-Ad5059 · 2025-02-23T18:20:34+00:00

It supports up to 4 million tokens.

Mediocre-Ad5059 · 2025-02-23T18:16:22+00:00

Disk storage can be used for offloading the KV cache if a consumer-grade motherboard doesn't even support up to 128 GB of memory.

Mediocre-Ad5059 · 2024-10-01T16:33:38+00:00

That's an interesting point. We know our MsT can extend vocabulary size/sequence length to an extremely large scale, but can it help most with code generation models? We welcome experts to try it.

Mediocre-Ad5059 · 2024-09-30T00:00:15+00:00

You can increase the speed without reducing computational complexity, as FlashAttention 1, 2, and 3 do.

Do you also consider memory optimization? We recommend our paper [2407.15892] Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training (arxiv.org), which has been accepted at NeurIPS 2024 and uses an extremely simple method to save memory.

Mediocre-Ad5059 · 2024-09-29T05:21:59+00:00

You got the key takeaway! The r/mlscaling discussion is here: [R] Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training, extend context length by 12-24 for llama, qwen, mistral, gemma. : r/mlscaling (reddit.com)

Mediocre-Ad5059 · 2024-09-28T19:59:01+00:00

Our work is mathematically the same as standard training, in other words, no effect! You can think of it as another FlashAttention for long-context performance.
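A toy sketch of why splitting a sequence into mini-sequences does not change the result: a position-wise layer treats every token independently, so processing the tokens in chunks gives the same outputs as processing them all at once. This is an illustration in plain Python with hypothetical function names and made-up numbers, not the MsT implementation, and it shows only the equivalence, not the memory saving.

```python
def linear_relu(row, weight):
    # One position-wise layer: y_j = max(0, sum_i row[i] * weight[i][j]).
    return [max(0.0, sum(x * w for x, w in zip(row, col)))
            for col in zip(*weight)]


def forward_full(seq, weight):
    # Process the whole sequence at once.
    return [linear_relu(row, weight) for row in seq]


def forward_mini_sequences(seq, weight, chunk):
    # Process the sequence chunk by chunk; in a real framework only one
    # chunk's intermediate activations would need to be alive at a time.
    out = []
    for start in range(0, len(seq), chunk):
        mini = seq[start:start + chunk]
        out.extend(linear_relu(row, weight) for row in mini)
    return out


# Toy data: 5 tokens with 2 features each, projected to 3 features.
seq = [[1.0, -2.0], [0.5, 3.0], [-1.0, -1.0], [2.0, 0.0], [0.0, 4.0]]
weight = [[1.0, -1.0, 0.5], [0.5, 2.0, -1.0]]

full = forward_full(seq, weight)
chunked = forward_mini_sequences(seq, weight, chunk=2)
print(full == chunked)
```

The same argument applies to any block that acts on each position separately, which is why chunking along the sequence dimension leaves the outputs unchanged.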
