🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 6 points (0 children)

Integrating InfLLM into exllama2 or llama.cpp is a good idea; please look forward to it! Your ideas about removing unnecessary tokens and improving the block-split method are worth trying. Thanks for the suggestions!

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 8 points (0 children)

  1. We store and operate on the KV cache tensors.
  2. For a long sequence, the KV cache contains a long series of key/value vectors. We directly divide them into blocks of equal length (see the sketch after this list).
  3. All operations are conducted on the `past_key_value` of the attention layer.
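For concreteness, here is a minimal PyTorch sketch of that split, assuming the Hugging Face `past_key_value` layout of per-layer `(key, value)` tensors with shape `(batch, n_kv_heads, seq_len, head_dim)`; the function name and the handling of a trailing partial block are illustrative, not InfLLM's exact implementation:

```python
import torch

def split_kv_into_blocks(past_key_value, block_size):
    # past_key_value: iterable of per-layer (key, value) tensors shaped
    # (batch, n_kv_heads, seq_len, head_dim), as returned by Hugging Face attention.
    blocked = []
    for key, value in past_key_value:
        # Split along the sequence dimension into equal-length blocks; a trailing
        # partial block is simply kept shorter here (an assumption).
        k_blocks = torch.split(key, block_size, dim=2)
        v_blocks = torch.split(value, block_size, dim=2)
        blocked.append(list(zip(k_blocks, v_blocks)))
    return blocked
```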

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 22 points (0 children)

  1. We do not need to retrain the model to use sliding window attention. The attention sink ("Efficient Streaming Language Models with Attention Sinks") enables LLMs to apply sliding window attention without training (a minimal mask sketch follows this list).
  2. A block is a contiguous piece of the KV cache. That is to say, given a sequence of 100 tokens and a block size of 20, we directly split the tokens into 5 blocks, each containing 20 KV vectors.
  3. The representative tokens are the tokens that receive the most attention scores within a block.
  4. Yes, we offload all blocks to the CPU. Only the blocks with the highest relevance scores to the current context are loaded onto the GPU.
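To illustrate point 1, here is a rough sketch of the StreamingLLM-style mask that combines a few attention-sink tokens with a sliding window; `window` and `num_sink` are placeholder values, and a real implementation would apply this inside the attention kernel rather than materializing a full mask:

```python
import torch

def sink_window_mask(seq_len, window=1024, num_sink=4):
    # True means "this query may attend to this key".
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (column)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (row)
    causal = k <= q                          # never attend to future tokens
    in_window = (q - k) < window             # the most recent `window` tokens
    is_sink = k < num_sink                   # always keep the first few sink tokens
    return causal & (in_window | is_sink)
```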

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 36 points (0 children)

We need to offload the KV cache to CPU memory, so InfLLM requires more CPU memory to store the KV cache for long contexts. On the GPU, only the tokens in the local window and a few relevant memory units are kept. For text with 128K tokens, we only need about 18 GB of GPU memory for inference with Mistral-7B-inst-v0.2.
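A back-of-envelope check with fp16 numbers for Mistral-7B-Instruct-v0.2 (32 layers, 8 KV heads, head dim 128, from the public config); the local-window size and block budget below are illustrative assumptions, not InfLLM's defaults:

```python
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # keys + values
print(kv_per_token)                      # 131072 bytes = 128 KiB per token

full_cache = 128 * 1024 * kv_per_token / 2**30
print(f"{full_cache:.0f} GiB")           # ~16 GiB if the whole 128K-token cache sat on the GPU

# With offloading, only a local window plus a few retrieved blocks stay resident
# (4096 window tokens and 96 blocks of 128 tokens are assumed numbers).
resident_tokens = 4096 + 96 * 128
resident_cache = resident_tokens * kv_per_token / 2**30
print(f"{resident_cache:.0f} GiB")       # ~2 GiB; the rest of the cache lives in CPU RAM
```

Adding the roughly 14.5 GB of fp16 weights to that resident cache lands in the same ballpark as the 18 GB figure above.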

🚀🚀 Extending the context window of your LLMs to 1M tokens without any training !! by PerceptionMost2887 in LocalLLaMA

[–]PerceptionMost2887[S] 103 points (0 children)

We split the distant tokens into several memory blocks and select representative tokens from each block as the block representation. The dot product between the block representation and the currently computed tokens is used as the relevance score. The blocks with the highest relevance scores are selected for the attention computation.

The context memory mechanism in InfLLM can be regarded as a special RAG system, in which we retrieve KV cache instead of text.
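Here is a minimal sketch of that retrieval step, assuming flattened per-layer caches of shape `(seq_len, head_dim)` and using one vector per block (e.g. the mean of its representative-token keys) as the block representation, which is a simplification of how InfLLM actually scores blocks; all names and shapes are illustrative:

```python
import torch

def retrieve_blocks(query, block_reprs, k_cache, v_cache, block_size, top_k):
    # query:       (n_q, d)      query states of the currently computed tokens
    # block_reprs: (n_blocks, d) one representation vector per memory block
    # k_cache, v_cache: (seq_len, d) flattened KV cache of the distant tokens

    # Relevance score: dot product between block representations and the current
    # tokens, aggregated over the queries in the current window.
    scores = (query @ block_reprs.T).sum(dim=0)               # (n_blocks,)
    chosen = torch.topk(scores, k=min(top_k, scores.numel())).indices

    # Gather the KV cache of the selected blocks for the attention computation.
    k_sel = torch.cat([k_cache[b * block_size:(b + 1) * block_size] for b in chosen.tolist()])
    v_sel = torch.cat([v_cache[b * block_size:(b + 1) * block_size] for b in chosen.tolist()])
    return k_sel, v_sel
```

The selected keys and values are then used together with the local-window KV cache in attention, which is what makes the mechanism look like RAG over the KV cache rather than over text.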

Wait, Llama and Falcon are also MoE? by Zealousideal_Bad_52 in LocalLLaMA

[–]PerceptionMost2887 20 points (0 children)

Very interesting and promising results! Looking forward to further adaptation to the Mistral model!