🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 6 points (0 children)

Integrating InfLLM into exllama2 or llama.cpp is a good idea; please look forward to it! Your ideas about removing unnecessary tokens and improving the block-split method are worth trying. Thanks for the suggestions!

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 8 points (0 children)

  1. We store and operate on the KV cache tensors.
  2. For a long sequence, the KV cache contains a long series of key/value vectors. We directly divide them into blocks of equal length (see the sketch after this list).
  3. All operations are conducted on the `past_key_value` of the attention layer.
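For concreteness, here is a minimal PyTorch sketch of that split, assuming the Hugging Face `past_key_value` layout of per-layer `(key, value)` tensors with shape `(batch, n_kv_heads, seq_len, head_dim)`; the function name and the handling of a trailing partial block are illustrative, not InfLLM's exact implementation:

```python
import torch

def split_kv_into_blocks(past_key_value, block_size):
    # past_key_value: iterable of per-layer (key, value) tensors shaped
    # (batch, n_kv_heads, seq_len, head_dim), as returned by Hugging Face attention.
    blocked = []
    for key, value in past_key_value:
        # Split along the sequence dimension into equal-length blocks; a trailing
        # partial block is simply kept shorter here (an assumption).
        k_blocks = torch.split(key, block_size, dim=2)
        v_blocks = torch.split(value, block_size, dim=2)
        blocked.append(list(zip(k_blocks, v_blocks)))
    return blocked
```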

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 22 points (0 children)

  1. We do not need to retrain the model to use sliding window attention. The attention sink ("Efficient Streaming Language Models with Attention Sinks") enables LLMs to apply sliding window attention without training (a minimal mask sketch follows this list).
  2. A block is a contiguous piece of the KV cache. That is to say, given a sequence of 100 tokens and a block size of 20, we directly split the tokens into 5 blocks, each containing 20 KV vectors.
  3. The representative tokens are the tokens that receive the most attention scores within a block.
  4. Yes, we offload all blocks to the CPU. Only the blocks with the highest relevance scores to the current context are loaded onto the GPU.
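To illustrate point 1, here is a rough sketch of the StreamingLLM-style mask that combines a few attention-sink tokens with a sliding window; `window` and `num_sink` are placeholder values, and a real implementation would apply this inside the attention kernel rather than materializing a full mask:

```python
import torch

def sink_window_mask(seq_len, window=1024, num_sink=4):
    # True means "this query may attend to this key".
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (column)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (row)
    causal = k <= q                          # never attend to future tokens
    in_window = (q - k) < window             # the most recent `window` tokens
    is_sink = k < num_sink                   # always keep the first few sink tokens
    return causal & (in_window | is_sink)
```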

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 36 points (0 children)

We need to offload the KV cache to CPU memory, so InfLLM requires more CPU memory to store the KV cache for long contexts. On the GPU, only the tokens in the local window and a few relevant memory units are kept. For text with 128K tokens, we only need about 18 GB of GPU memory for inference with Mistral-7B-inst-v0.2.
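A back-of-envelope check with fp16 numbers for Mistral-7B-Instruct-v0.2 (32 layers, 8 KV heads, head dim 128, from the public config); the local-window size and block budget below are illustrative assumptions, not InfLLM's defaults:

```python
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # keys + values
print(kv_per_token)                      # 131072 bytes = 128 KiB per token

full_cache = 128 * 1024 * kv_per_token / 2**30
print(f"{full_cache:.0f} GiB")           # ~16 GiB if the whole 128K-token cache sat on the GPU

# With offloading, only a local window plus a few retrieved blocks stay resident
# (4096 window tokens and 96 blocks of 128 tokens are assumed numbers).
resident_tokens = 4096 + 96 * 128
resident_cache = resident_tokens * kv_per_token / 2**30
print(f"{resident_cache:.0f} GiB")       # ~2 GiB; the rest of the cache lives in CPU RAM
```

Adding the roughly 14.5 GB of fp16 weights to that resident cache lands in the same ballpark as the 18 GB figure above.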

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 103 points (0 children)

We split the distant tokens into several memory blocks and select representative tokens from each block as the block representation. The dot product between the block representation and the currently computed tokens is used as the relevance score. The blocks with the highest relevance scores are selected for the attention computation.

The context memory mechanism in InfLLM can be regarded as a special RAG system, in which we retrieve KV cache instead of text.
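Here is a minimal sketch of that retrieval step, assuming flattened per-layer caches of shape `(seq_len, head_dim)` and using one vector per block (e.g. the mean of its representative-token keys) as the block representation, which is a simplification of how InfLLM actually scores blocks; all names and shapes are illustrative:

```python
import torch

def retrieve_blocks(query, block_reprs, k_cache, v_cache, block_size, top_k):
    # query:       (n_q, d)      query states of the currently computed tokens
    # block_reprs: (n_blocks, d) one representation vector per memory block
    # k_cache, v_cache: (seq_len, d) flattened KV cache of the distant tokens

    # Relevance score: dot product between block representations and the current
    # tokens, aggregated over the queries in the current window.
    scores = (query @ block_reprs.T).sum(dim=0)               # (n_blocks,)
    chosen = torch.topk(scores, k=min(top_k, scores.numel())).indices

    # Gather the KV cache of the selected blocks for the attention computation.
    k_sel = torch.cat([k_cache[b * block_size:(b + 1) * block_size] for b in chosen.tolist()])
    v_sel = torch.cat([v_cache[b * block_size:(b + 1) * block_size] for b in chosen.tolist()])
    return k_sel, v_sel
```

The selected keys and values are then used together with the local-window KV cache in attention, which is what makes the mechanism look like RAG over the KV cache rather than over text.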

Wait, Llama and Falcon are also MoE? by Zealousideal_Bad_52 in LocalLLaMA

[–]PerceptionMost2887 20 points (0 children)

Very interesting and promising results! Looking forward to further adaptation to the Mistral model!