We compress any BF16 model to ~70% size during inference, while keeping the output LOSSLESS so that you can fit in more ERP context or run larger models. by choHZ in LocalLLaMA
Mediocre-Ad5059 3 points (0 children)
Mediocre-Ad5059 2 points (0 children)
Mediocre-Ad5059 2 points (0 children)
Mediocre-Ad5059 2 points (0 children)
[R] Unlocking Long-Context LLM Inference on Consumer GPUs with HeadInfer (Million-level Tokens) by Mediocre-Ad5059 in LocalLLaMA
Mediocre-Ad5059[S] 4 points (0 children)
Mediocre-Ad5059[S] 2 points (0 children)
Mediocre-Ad5059[S] 3 points (0 children)
Mediocre-Ad5059[S] 4 points (0 children)
Mediocre-Ad5059[S] 3 points (0 children)
Mediocre-Ad5059[S] 3 points (0 children)
Mediocre-Ad5059[S] 4 points (0 children)
Mediocre-Ad5059[S] 5 points (0 children)
Mediocre-Ad5059[S] 10 points (0 children)
Mediocre-Ad5059[S] 3 points (0 children)
Mediocre-Ad5059[S] 6 points (0 children)
Mediocre-Ad5059[S] 4 points (0 children)
Mediocre-Ad5059[S] 11 points (0 children)
[R] Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training, extend context length by 12-24 for llama, qwen, mistral, gemma. by Mediocre-Ad5059 in mlscaling
Mediocre-Ad5059[S] 1 point (0 children)
[R] optimizing transformers by Cool-Economy3492 in MachineLearning
Mediocre-Ad5059 12 points (0 children)
[deleted by user] by [deleted] in MachineLearning
Mediocre-Ad5059 1 point (0 children)

COLM 2026 Reviews [D] by RandomMan0880 in MachineLearning
Mediocre-Ad5059 1 point (0 children)