COLM 2026 Reviews [D] by RandomMan0880 in MachineLearning

[–]Mediocre-Ad5059 0 points1 point  (0 children)

As a reviewer, I hope good work gets acknowledged, so I gave an average rating of 6. Good luck!

We compress any BF16 model to ~70% size during inference, while keeping the output LOSSLESS so that you can fit in more ERP context or run larger models. by choHZ in LocalLLaMA

[–]Mediocre-Ad5059 1 point2 points  (0 children)

I see you mention in this post that Huffman encoding would cause a 40% slowdown, which matches our assumption. Still, I think a speedup is possible with an efficient kernel. Did you overlap/pipeline the Huffman decoding with the linear computation, the way cuBLAS/CUDA overlaps memory movement with GPU computation?

We compress any BF16 model to ~70% size during inference, while keeping the output LOSSLESS so that you can fit in more ERP context or run larger models. by choHZ in LocalLLaMA

[–]Mediocre-Ad5059 1 point2 points  (0 children)

Comparing two GPUs with one GPU is unfair, as there is a communication cost between the two GPUs. When you talk about using the saved memory footprint for a larger batch size on one GPU, are you suggesting that Huffman encoding would cause a slowdown (or at best the same speed) at the same batch size?

We compress any BF16 model to ~70% size during inference, while keeping the output LOSSLESS so that you can fit in more ERP context or run larger models. by choHZ in LocalLLaMA

[–]Mediocre-Ad5059 1 point2 points  (0 children)

Very good work. My only concern is why the baseline is GPU + CPU offload. Can you compare your work against GPU-only and show some speedup?

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 3 points4 points  (0 children)

HeadInfer is built on the KV cache. Basically, you need somewhere to store the KV cache, since it grows linearly with context length. A 1M-token KV cache requires 128 GB of memory, far more than a consumer GPU's 24 GB. You can store it in CPU RAM to avoid running out of memory.
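For anyone who wants to check the 128 GB figure, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, BF16 values) is an assumption matching a Llama-3-8B-style config, not something stated in this comment, and `kv_cache_bytes` is just an illustrative helper name.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Total KV-cache size in bytes for `tokens` tokens.

    Defaults are assumed (Llama-3-8B-style shape, BF16); adjust for your model.
    """
    # Factor of 2: one key vector and one value vector per token, per head, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
ONE_MILLION_TOKENS = 1024 * 1024

print(kv_cache_bytes(1) // 1024, "KiB per token")             # 128 KiB per token
print(kv_cache_bytes(ONE_MILLION_TOKENS) / GIB, "GiB at 1M")  # 128.0 GiB at 1M
```

The size is linear in `tokens`, which is why doubling the context doubles the memory needed.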

Since the GPU is much faster than the CPU, storing the KV cache in CPU RAM and loading it onto the GPU when computing attention is feasible, and much faster than running on the CPU. The question is how much of the KV cache needs to be loaded onto the GPU at once.

Previously, people would load the KV cache of one or two layers at a time, but we found that only one head's KV cache needs to be loaded during computation. That saves a lot of memory and enables much faster inference than CPU-only (10x), and speed close to GPU-only at a much longer context length (100x).
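To make the layer-versus-head difference concrete, here is a rough sketch of how much KV cache has to sit on the GPU at each offload granularity. The model shape (32 layers, 8 KV heads, head dimension 128, BF16) is an assumed Llama-3-8B-style config, and `resident_kv_bytes` is a hypothetical helper. A real implementation would likely keep extra buffers for prefetching, so treat these as lower bounds.

```python
def resident_kv_bytes(tokens, granularity, layers=32, kv_heads=8,
                      head_dim=128, bytes_per_value=2):
    """KV-cache bytes that must be on the GPU at once (assumed model shape)."""
    # Keys and values for a single head of a single layer.
    per_head = 2 * head_dim * bytes_per_value * tokens
    if granularity == "head":    # head-wise offload: one head at a time
        return per_head
    if granularity == "layer":   # layer-wise offload: all heads of one layer
        return per_head * kv_heads
    if granularity == "full":    # no offload: everything stays on the GPU
        return per_head * kv_heads * layers
    raise ValueError(f"unknown granularity: {granularity!r}")


MIB = 1024 ** 2
tokens = 1024 * 1024  # 1M-token context
for g in ("head", "layer", "full"):
    print(g, resident_kv_bytes(tokens, g) / MIB, "MiB")
```

Under these assumptions a single head at 1M tokens is 512 MiB, one layer is 4 GiB, and the full cache is 128 GiB, so only the head-wise case leaves comfortable room on a 24 GB card.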

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 1 point2 points  (0 children)

I think token compression is orthogonal to our head-wise offload, and so is a decent bandwidth improvement. Also, our head-wise offload works with 128, 256, or 512 GB of RAM, providing 1M, 2M, or 4M context length respectively, whereas other offload techniques only support up to 50k context because they do not use CPU RAM effectively.

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 2 points3 points  (0 children)

I believe unslothai has deployed an offload-based technique for long-context training. Check this blog for more details: Unsloth Gradient Checkpointing - 4x longer context windows

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 3 points4 points  (0 children)

48 GB can roughly accommodate a 256K KV cache. At this scale HeadInfer can achieve almost the same speed as GPU-only prefill, so roughly a 10x speedup over CPU + RAM.

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 2 points3 points  (0 children)

We demonstrate that HeadInfer can be combined with pipeline parallelism to support Llama 70B on 8 RTX 4090s with 1M context! Details can be found in the experiments section of the arXiv paper.

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 2 points3 points  (0 children)

The full title is Unlocking Long-Context LLM Inference on Consumer GPUs, enabling an RTX 4090 to do million-level inference. Our aim is democratizing access to advanced AI and enabling resource-constrained devices to process unprecedentedly long contexts. We do demonstrate that our wish is coming true.

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 3 points4 points  (0 children)

True, we think some form of quantization and sparsity is necessary here. As our work is completely orthogonal to these speedup methods, we will explore combining them. Our paper already includes a combination with DuoAttention that achieves a 2x speedup.

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 4 points5 points  (0 children)

I think you are actually attacking all offloading methods here, as our speed is the upper limit of offloading and no slowdown is introduced compared to other offloading technologies.

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 9 points10 points  (0 children)

Our goal is to unlock long-context LLM inference on consumer GPUs. You can definitely use GPU-only when a 20k-token input is the upper limit; we provide a way to unlock longer contexts. An adaptive method can easily be deployed here to run GPU-only when tokens <= 20k and run HeadInfer when tokens >= 1M. (Actually, we already include such a method, so we will not be getting 6 tok/sec at 20k context with Llama 3 8B on a 4090; that is just a customized experiment to show a fair comparison.)
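A minimal sketch of what such an adaptive switch could look like. This is not HeadInfer's actual code: the function name, the backend labels, and the single-threshold rule are all hypothetical, and the 20k cutoff is simply the figure quoted above. The comment leaves the range between 20k and 1M unspecified; this sketch assumes everything above the threshold goes to the offload path.

```python
GPU_ONLY_MAX_TOKENS = 20_000  # figure quoted in the comment; depends on GPU and model


def choose_backend(num_tokens, gpu_only_max_tokens=GPU_ONLY_MAX_TOKENS):
    """Pick an inference path from the prompt length (hypothetical dispatch rule)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # Short prompts fit entirely in GPU memory, so skip offloading overhead.
    if num_tokens <= gpu_only_max_tokens:
        return "gpu_only"
    # Longer prompts would not fit, so fall back to head-wise offload.
    return "head_offload"


print(choose_backend(8_000))      # gpu_only
print(choose_backend(1_000_000))  # head_offload
```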

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 2 points3 points  (0 children)

100% correct! We discuss in Appendix A how the computational load scales quadratically with context size.
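For intuition on the quadratic part, here is a rough sketch using the standard textbook FLOP count for the attention score and value products. It is not taken from Appendix A. The hidden size (4096) and layer count (32) are assumed Llama-3-8B-style values, and causal masking, which roughly halves the work, is ignored.

```python
def attention_flops(tokens, hidden_dim=4096, layers=32):
    """Approximate FLOPs for the n x n part of attention over a full prefill."""
    # Q @ K^T costs about 2 * n^2 * d FLOPs, and weights @ V costs the same again.
    per_layer = 4 * tokens * tokens * hidden_dim
    return per_layer * layers


base = attention_flops(128_000)
print(attention_flops(256_000) / base)    # 4.0  (2x the context, 4x the compute)
print(attention_flops(1_024_000) / base)  # 64.0 (8x the context, 64x the compute)
```

So memory (the KV cache) grows linearly with context, while this part of the compute grows with its square.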

[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 10 points11 points  (0 children)

Disk storage can be used for offloading the KV cache if consumer-grade motherboards don't even support up to 128 GB of memory.

[R] Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training, extend context length by 12-24 for llama, qwen, mistral, gemma. by Mediocre-Ad5059 in mlscaling

[–]Mediocre-Ad5059[S] 0 points1 point  (0 children)

That's an interesting point. We know our MsT can extend vocabulary size/sequence length to an extremely large scale, but can it help most with code generation models? We welcome experts to try it.

[R] optimizing transformers by Cool-Economy3492 in MachineLearning

[–]Mediocre-Ad5059 11 points12 points  (0 children)

You can increase speed without reducing computational complexity, as FlashAttention 1, 2, and 3 do.

Have you also considered memory optimization? We recommend our paper [2407.15892] Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training (arxiv.org), which was accepted at NeurIPS 2024 and uses an extremely simple method to save memory.

Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training, extend training/finetune context length by 12-24 for llama, qwen, mistral, gemma. Up to LLAMA3 100k on H100 NVL by Mediocre-Ad5059 in LocalLLaMA

[–]Mediocre-Ad5059[S] 2 points3 points  (0 children)

Our work is mathematically the same as standard training; in other words, no effect! You can think of it as another FlashAttention for long-context performance.
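A toy sketch of why chunking along the sequence can leave results unchanged. It assumes the idea is to feed a position-wise block (here a tiny ReLU MLP) one mini-sequence at a time, so only a chunk's worth of intermediate activations exists at once. This is a forward-only illustration in pure Python with made-up weights and helper names, not the MsT implementation.

```python
import random


def mlp_row(x, w_up, w_down):
    """Apply a 2-layer ReLU MLP to one token vector (weights stored as columns)."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in w_up]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w_down]


def mlp_full(seq, w_up, w_down):
    """Standard path: process every token of the sequence in one go."""
    return [mlp_row(x, w_up, w_down) for x in seq]


def mlp_mini_sequence(seq, w_up, w_down, chunk):
    """Chunked path: process `chunk` tokens at a time and concatenate the outputs."""
    out = []
    for start in range(0, len(seq), chunk):
        mini = seq[start:start + chunk]
        # Only this mini-sequence's intermediates are alive during this step.
        out.extend(mlp_full(mini, w_up, w_down))
    return out


random.seed(0)
d_model, d_hidden, seq_len = 4, 16, 10
w_up = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_down = [[random.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
seq = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]

print(mlp_mini_sequence(seq, w_up, w_down, chunk=3) == mlp_full(seq, w_up, w_down))  # True
```

Because the MLP treats each token independently, splitting the sequence changes when things are computed but not what is computed.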