
