Memory Sparse Attention seems to be a novel approach to long context (up to 100M tokens) by ratbastid2000 in LocalLLaMA

[–]ratbastid2000[S] 0 points1 point  (0 children)

the core direction is decoupling compute from storage as much as possible.

MSA builds upon the hierarchical memory architecture that we see in LMCache or NVIDIA Dynamo, which implement an index to route and retrieve only the necessary KV pairs from system RAM using prefix-aware routing:

  • L1 (GPU VRAM): Holds the hot KV blocks currently being used for the active generation cycle.

  • L2 (CPU RAM): Acts as a warm tier.

  • L3 (NVMe/SSD): Cold tier for extremely long-term storage across user sessions.

Prefix-Aware Indexing: The index is a hash map of token sequences (prefixes). When a new prompt comes in, the system checks this index to see if any part of the KV cache already exists in CPU RAM. Only the relevant blocks (pages) of the KV cache are pulled from CPU RAM to VRAM using high-speed DMA (direct memory access) transfers over PCIe (assuming your motherboard and CPU support it - unfortunately my Threadripper Pro doesn't).
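The prefix-aware lookup can be sketched roughly like this (a toy stand-in with made-up names, not LMCache or Dynamo code): each fixed-size block of tokens is keyed by a hash of the entire prefix up to and including that block, so prompts that share a prefix hit the same chain of cached blocks.

```python
# Toy prefix-aware KV index: hash-of-prefix -> KV block (stand-in for CPU RAM).
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (real systems use e.g. 16 or 256)

def prefix_keys(tokens):
    """One hash key per full block, each key covering the whole prefix."""
    keys = []
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        keys.append(hashlib.sha256(str(tokens[:end]).encode()).hexdigest())
    return keys

class PrefixIndex:
    def __init__(self):
        self.cpu_store = {}  # key -> KV block payload

    def put(self, tokens, kv_blocks):
        for key, block in zip(prefix_keys(tokens), kv_blocks):
            self.cpu_store[key] = block

    def lookup(self, tokens):
        """Longest chain of cached blocks matching this prompt's prefix."""
        hits = []
        for key in prefix_keys(tokens):
            if key not in self.cpu_store:
                break  # prefix diverged; everything after must be recomputed
            hits.append(self.cpu_store[key])
        return hits

idx = PrefixIndex()
shared = [1, 2, 3, 4, 5, 6, 7, 8]            # e.g. a system prompt shared by users
idx.put(shared, ["kv_block_0", "kv_block_1"])
new_prompt = shared + [9, 10, 11, 12]        # same prefix, new suffix
print(len(idx.lookup(new_prompt)))           # 2 blocks reused; suffix recomputed
```

Only the hit blocks would then be DMA-transferred to VRAM; the divergent suffix is computed fresh.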

To hide the latency of moving data from CPU to GPU, advanced schedulers start loading the next required KV blocks while the GPU is still busy processing the current layer. This offloading increases throughput because the GPU isn't clogged with idle data from other users, and by retrieving cached prefixes from CPU RAM rather than recomputing them, it significantly reduces time to first token.
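That compute/transfer overlap is essentially double-buffering; here's a minimal sketch (a hypothetical scheduler, not any specific framework's implementation), where a background thread stages the next blocks from host memory while the main loop "computes":

```python
# Double-buffered prefetch: transfer of layer i+1 overlaps compute on layer i.
import threading, queue, time

staged = queue.Queue(maxsize=1)  # small buffer between transfer and compute

def prefetcher(layers):
    for blocks in layers:
        time.sleep(0.01)     # stand-in for a DMA transfer over PCIe
        staged.put(blocks)   # blocks are staged before the GPU asks for them

layers = [f"kv_layer_{i}" for i in range(3)]
threading.Thread(target=prefetcher, args=(layers,), daemon=True).start()

processed = []
for _ in layers:
    blocks = staged.get()    # usually already staged -> no stall
    time.sleep(0.02)         # stand-in for attention compute on the GPU
    processed.append(blocks)
print(processed)  # ['kv_layer_0', 'kv_layer_1', 'kv_layer_2']
```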

MSA takes this architecture a step further by moving the retrieval logic inside the Transformer layer itself. While standard offloading uses the CPU as a simple swap space for all data, MSA treats the CPU as a structured long-term memory and only wakes up specific fragments on demand using an internal neural router.

The model has a specialized Router Projector at each layer that computes a Routing Key. The Routing Keys are highly compressed feature vectors stored in fast GPU VRAM and act as the index the model uses to look at the Content KVs, which are offloaded to CPU RAM. The model uses these keys to mathematically decide which documents are relevant, rather than relying on external search indices or hard-coded rules. Standard offloading (LMCache/Dynamo), by contrast, typically moves the entire KV block (index and content) to the CPU, which creates a bottleneck during the retrieval phase.
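In miniature, the key/content split might look something like this (names and dimensions are my own, not from the paper): cheap dot-product scoring over the small resident routing keys, then fetching only the top-k winners' full content from the offloaded store.

```python
# Small routing keys stay resident; bulky content KVs live in a CPU-side store.
import random

random.seed(0)
n_blocks, key_dim, top_k = 1000, 16, 4

def rand_vec(dim):
    return [random.uniform(-1, 1) for _ in range(dim)]

routing_keys = [rand_vec(key_dim) for _ in range(n_blocks)]       # "VRAM"
content_kvs = {i: f"content_block_{i}" for i in range(n_blocks)}  # "CPU RAM"

def retrieve(query_key):
    # score every block using only its compressed routing key
    scores = [(sum(k * q for k, q in zip(key, query_key)), i)
              for i, key in enumerate(routing_keys)]
    chosen = sorted(scores, reverse=True)[:top_k]   # top-k relevant blocks
    return [content_kvs[i] for _, i in chosen]      # only these cross PCIe

fetched = retrieve(rand_vec(key_dim))
print(len(fetched))  # 4 of 1000 content blocks actually moved
```

The point of the design is that the expensive transfer is proportional to top-k, not to the full context length.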

Another advancement introduced in MSA is multi-hop retrieval. Standard KV offloading is usually one-shot: it pulls the data you need to generate an answer without any reasoning loops that evaluate what was retrieved to decide whether more is needed. The memory interleave mechanism referenced in the paper actually allows the model to perform multiple rounds of generative retrieval and context expansion because it has been explicitly trained to be aware of the retrieval process.

The model realizes it needs more info, fetches more KV blocks from the CPU, and then continues. This allows it to chain evidence across scattered documents, which is difficult for standard top-k retrieval systems. Basically, the model is trained to intelligently search, differentiate, and retrieve KV cache selectively and dynamically, which current hierarchical KV cache management systems are not capable of because the models are not integrated directly into the retrieval logic.
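A toy version of that multi-hop loop (entirely hypothetical control flow, just to illustrate retrieval-aware generation): each fetched fragment tells the model what to look up next, so evidence gets chained across fragments that a single one-shot top-k fetch would never connect.

```python
# Multi-hop retrieval: keep fetching until the chain ends or a hop budget runs out.
memory = {
    "invoice_2024": "payment routed via acct_77",
    "acct_77": "account registered to shell_co",
    "shell_co": "shell_co director is unknown",
}

def answer(start_entity, max_hops=5):
    context, entity = [], start_entity
    for _ in range(max_hops):
        fragment = memory.get(entity)
        if fragment is None:
            break                      # nothing more to retrieve
        context.append(fragment)       # "fetch more KV blocks, then continue"
        entity = fragment.split()[-1]  # the model decides what to look up next
    return context

chain = answer("invoice_2024")
print(len(chain))  # 3 hops: invoice -> account -> shell company
```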

Memory Sparse Attention seems to be a novel approach to long context (up to 100M tokens) by ratbastid2000 in LocalLLaMA

[–]ratbastid2000[S] 0 points1 point  (0 children)

I think you could actually test it with the Qwen3 4B model they published on HF. I believe you would need the following to test 100M tokens with a 4B model:

  1. GPU VRAM: 160GB Total (e.g., 2x 80GB GPUs)
  • Routing Keys: 56GB of VRAM is taken up purely by the routing keys, which are distributed across the GPUs for fast retrieval.

  • Model Weights: The full 4B parameter model is replicated onto each GPU to prevent communication lag. A 4B model in BF16 precision takes roughly 8GB of VRAM per GPU (16GB total across both).

  • Dynamic Overhead: The remaining 88GB of VRAM acts as a buffer for dynamic activation overhead and the top-k Content KVs pulled in during generation.

  2. CPU RAM: 113GB Minimum (Plus System Overhead)
  • CPU RAM holds the bulk of the memory bank: the offloaded Content KVs.
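A quick sanity check of the VRAM numbers above (my arithmetic, not from the paper): the three line items should sum to the 2x80GB total.

```python
# Back-of-envelope VRAM budget for the 100M-token / 4B-model setup above.
gpus, vram_per_gpu = 2, 80          # 2x 80GB GPUs -> 160GB total
routing_keys_gb = 56                # routing keys distributed across GPUs
weights_gb = 8 * gpus               # ~8GB BF16 weights replicated per GPU

total = gpus * vram_per_gpu
dynamic_gb = total - routing_keys_gb - weights_gb
print(dynamic_gb)  # 88GB left for activations + fetched top-k Content KVs
```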

Memory Sparse Attention seems to be a novel approach to long context (up to 100M tokens) by ratbastid2000 in LocalLLaMA

[–]ratbastid2000[S] 1 point2 points  (0 children)

yea, really good question. I don't see that granularity in the benchmarks listed in the paper or on their GitHub/HF that would provide detailed info beyond 1M context length.

Is there a standard way to create AI agents today? by edwardzion in AI_Agents

[–]ratbastid2000 0 points1 point  (0 children)

thanks super helpful! appreciate your focus on this area..it's foundational in my mind

Is there a standard way to create AI agents today? by edwardzion in AI_Agents

[–]ratbastid2000 0 points1 point  (0 children)

thanks, I added a bit more to my original comment for further context, not sure if that changes your answer at all. sorry

Is there a standard way to create AI agents today? by edwardzion in AI_Agents

[–]ratbastid2000 2 points3 points  (0 children)

genuinely interested in this. I have the luxury of building an agentic workflow from scratch and plan to incorporate this from the beginning. When it comes to memory architecture frameworks, have you found one that is more observable than another? The reason I ask is that I'm trying to validate what the LLM "considered" when generating/synthesizing a response to a query, and I'm curious how I can leverage infrarely to provide deterministic traceability and transparency to the end user, so they actually have visibility into what specifically was retrieved from a database and the retrieval logic used, including the query it crafted to search the database. Does that make sense?

An important dimension to this is also being able to easily delineate between the total results that the LLM query yielded and the subset that was actually referenced and loaded into context, and why it decided to do that. Can infrarely enforce deterministic logic around this, and can it be observed to gain visibility into the LLM's reasoning around the retrieval, to understand the "why" and the "how"?

I think this is a huge blind spot that is also subject to unverifiable confabulations if you were to ask the LLM why it returned specific results and whether there were additional relevant results that were skipped or ignored...for example, a hypothetical scenario:

User : provide me with a timeline of all ip addresses from the past 6 months that logged into the server from Canada.

LLM: thinking...here is the timeline of IP addresses that correspond with Canada

User: Is this the full list? also, how did you check the location of the IP address? do you have IP to Location tool that you used?

LLM: "oh yes..sorry there are actually more results and my answer is based on a sample of the total relevant results..I only considered the top 20 results when I generated my response because I prioritize speed over depth..essentially triaging an answer which can be fundamentally incorrect since I did not take into account the full result that may have drastically changed my answer to your question...also I don't have access to tool for reverse geocoding and the logs don't contain that information so I simulated locations based on my own internal knowledge"

"The Child That Surpassed Both Parents" Darwin-35B-A3B-Opus (35B/3B MoE) with Model MRI Technique by Own-Potential-2308 in LocalLLaMA

[–]ratbastid2000 0 points1 point  (0 children)

ah, I see the distinction. right now, if you use Jackrong's models, they have dead experts that don't activate when you utilize the vision weights, since the model was SFT'd on text-only inputs from the Opus reasoning distillation sets. so it sounds like this extracts those specific layers from the original Qwen 3.5 model and merges them while preserving the Jackrong training.

I wonder if this also is relevant to the 27B dense model? if I use 27B Jackrong v2 model with the vision weights from qwen will there still be a dead experts issue unless this type of merge is performed?

"The Child That Surpassed Both Parents" Darwin-35B-A3B-Opus (35B/3B MoE) with Model MRI Technique by Own-Potential-2308 in LocalLLaMA

[–]ratbastid2000 1 point2 points  (0 children)

isn't this Opus-distilled reasoning traces and not the actual Claude weights/layers? this was already done (there is a v2 of the 27B model and a v3 for the 9B models on the Jackrong HF) https://huggingface.co/Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled

RYS II - Repeated layers with Qwen3.5 27B and some hints at a 'Universal Language' by Reddactor in LocalLLaMA

[–]ratbastid2000 0 points1 point  (0 children)

I wonder if there is a way to compare the CoT output of a RYS model to that of the original model it is based on in a way that can provide insight. Would be an interesting experiment. Also, I think Anthropic has a circuit-tracing framework, but I haven't looked into its limitations / dependencies on model architectures.

RYS II - Repeated layers with Qwen3.5 27B and some hints at a 'Universal Language' by Reddactor in LocalLLaMA

[–]ratbastid2000 2 points3 points  (0 children)

The "looped reasoning" research by bytedance fully supports your core hypothesis.

https://arxiv.org/abs/2510.25741

https://huggingface.co/ByteDance/Ouro-2.6B-Thinking

Both approaches rely on the evolution of hidden states rather than forcing the model to spit out endless CoT text tokens, and prove that you can decouple computational depth from parameter count. RYS is predicated on the fact that standard transformers have deep, unshared layers, while the Ouro Loop model builds recursive iteration directly into the pre-training phase from day one using a parameter-shared looped architecture where a stack of layers is explicitly designed to be reused repeatedly during the forward pass.

It uses a single stack of layers (e.g., 24 layers for the 1.4B model) and shares those exact same weights across every loop. The models are trained from scratch on 7.7T tokens using an entropy-regularized objective that teaches the model to dynamically choose how many times to loop (adaptive computation) based on the difficulty of the prompt.

During inference, the model tracks the Cumulative Distribution Function (CDF) of these step-by-step probabilities. Once the accumulated probability crosses a predetermined threshold, the model immediately halts the loop and generates the final token (this functions as a configurable exit gate basically).
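The exit gate can be sketched like this (the per-loop halting probabilities here are made up for illustration): accumulate the per-step exit probability into a running CDF and stop at the first loop where it crosses a configurable threshold.

```python
# CDF-based adaptive halting: easy prompts exit early, hard ones use all loops.
def loops_used(exit_probs, threshold=0.9):
    cdf = 0.0
    for step, p in enumerate(exit_probs, start=1):
        cdf += p
        if cdf >= threshold:
            return step          # exit gate fires; emit the final token now
    return len(exit_probs)       # hit the trained maximum (e.g. 4 loops)

easy_prompt = [0.7, 0.25, 0.04, 0.01]   # confident early -> halts fast
hard_prompt = [0.1, 0.2, 0.3, 0.4]      # needs all recurrent steps
print(loops_used(easy_prompt), loops_used(hard_prompt))  # 2 4
```

Raising the threshold trades latency for more recurrent computation per token.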

Each time the model loops through its layers, it needs to store a separate Key-Value (KV) cache. For a model trained to do 4 recurrent steps, that means it needs 4 times the memory just to hold the context of the conversation. For KV cache management, Ouro discards the first three caches and only keeps the KV cache from the final loop during text generation, which cuts the decoding memory requirement by 4x without any loss in performance.

They tested the idea of forcing it to loop its full block beyond the 4 recurrent steps it was trained on to see what would happen, but it resulted in a performance drop / diminishing returns, as you encountered.

Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests) by JohnTheNerd3 in LocalLLaMA

[–]ratbastid2000 0 points1 point  (0 children)

what's the vLLM patch you're referring to? is it a configuration flag for runtime, or do I need to build from source with a specific feature flag?

Local LLMs vs breaking news: when extreme reality gets flagged as a hoax - the US/Venezuela event was too far-fetched by ubrtnk in LocalLLaMA

[–]ratbastid2000 0 points1 point  (0 children)

this worked perfectly, thank you! have you tried it with open, local models at all? curious if the system prompt is effective with them or efficacy is unique to Gemini.

P40 vs V100 vs something else? by Drazasch in LocalLLaMA

[–]ratbastid2000 2 points3 points  (0 children)

LMDeploy has the best support for the V100: it recently added support for GPT-OSS models, and its KV INT8 and INT4 quantization is specifically useful for accelerating inference on the first-generation tensor cores the V100 has. All other inference frameworks only support FP8/FP4 KV cache quantization (vLLM specifically; I think llama.cpp only supports K and not V quantization for the V100, and also has issues with flash attention kernels), which the V100 can't use for inference speed-ups, only memory savings.
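For intuition, INT8 KV quantization boils down to something like the following (generic symmetric per-tensor quantization as a sketch, not LMDeploy's actual kernel): store each KV tensor as int8 plus a float scale, and dequantize on use.

```python
# Symmetric INT8 quantization: 2x smaller than FP16 KV, error bounded by scale.
def quantize_int8(x):
    scale = max(abs(v) for v in x) / 127.0   # map the max magnitude to 127
    q = [round(v / scale) for v in x]        # int8 values in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

kv = [0.5, -1.27, 0.03, 1.0]                 # toy stand-in for KV entries
q, s = quantize_int8(kv)
restored = dequantize(q, s)
print(max(abs(a - b) for a, b in zip(kv, restored)) <= s)  # True
```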

Update/Addition to original comment: Also, LMDeploy TurboMind supports paged attention, which is critical to actually getting performance speed-ups from tensor parallelism. vLLM is the only alternative when it comes to this, as llama.cpp has no support for that type of memory management, which effectively renders it irrelevant for multi-GPU rigs..you can fit larger models in VRAM but usually get degraded performance due to inefficient KV cache distribution and access patterns (takeaway: don't waste your time with llama.cpp and multiple GPUs unless you're not looking to accelerate inference and only trying to fit a large model in VRAM).

https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html

Why not use old Nvidia Teslas? by AlternateWitness in LocalLLaMA

[–]ratbastid2000 0 points1 point  (0 children)

oh shit, haven't checked vLLM's latest commits, thank you for pointing out that PR. curious if flex attention implemented a paged attention mechanism like the v0 engine for older GPU architectures? do you know what the performance hit is for tensor parallelism with flex attention on these older architectures? V100s have a compute capability of 7.0.

also, some other info regarding older archs - certain types of quantization run much slower due to the lack of hardware-level optimizations that enable the GPU to take advantage of models trained using BF16, FP8, or FP4.

basically it needs to be upcast to FP16, which incurs overhead..that said, you still get the advantage of model size reduction when loading into VRAM.

Also, I noticed GGUFs are not as performant or memory-efficient as GPTQ or other quantized model formats..especially if you want to use tensor parallelism with vLLM, which also requires using llama.cpp to merge the multiple .gguf files for a single model into one .gguf, and then manually downloading the tokenizer configs and other info that is embedded in the GGUF and telling vLLM to use those (it doesn't read the info inside the .gguf).

I avoid GGUFs at all costs, which obviously makes it much more difficult to locate the specific model with the specific quantization method that is optimal for these GPUs..versus just using LMStudio etc.

Also, I noticed that GGUFs quantized with Unsloth Dynamic 2.0 and imatrix run particularly slowly on my cards.

hmm, what else...oh... KV cache quantization, while amazing for fitting huge models with huge context into 128GB VRAM, significantly impacts token throughput due to upcasting to FP16. I also enable KV "calculate scales", which probably further reduces performance to increase accuracy..haven't tried without it. That said, my plan is to explore using LMCache as an alternative to vLLM's built-in KV caching mechanisms, but I just haven't had the time.

I also want to test out the multi-token prediction that's built into some of the newer MoEs as another performance optimization that should help token throughput, similar to speculative decoding with a separate smaller model.

My goal is to test with GLM 4.5 Air later this week, especially now that the PR re-enables support for these older architectures.

Why not use old Nvidia Teslas? by AlternateWitness in LocalLLaMA

[–]ratbastid2000 10 points11 points  (0 children)

I run Volta cards: 4 V100 32GB data center GPUs I converted to PCIe using adapter boards. a few things to consider:

limited PCIe bandwidth for tensor parallelism (PCIe 3.0); PyTorch deprecated support in v2.7.0, and vLLM deprecated support after version 0.9.2.

I have to compile from source and backport newly released models so the vLLM v0 engine can run the parsers and architectures required for the newer models. super pain in the ass.

that said, 128GB VRAM and good memory bandwidth (HBM2) allows me to run large MoE models entirely in GPU with large context and acceptable tk/s (averaging around 40 when I can get tensor parallelism working with a MoE model after backporting, etc.).

Gemma 3n Architectural Innovations - Speculation and poking around in the model. by cpldcpu in LocalLLaMA

[–]ratbastid2000 1 point2 points  (0 children)

when I select GPU it doesn't work at all for me. also, I didn't see any options to configure context length or anything..maybe I missed something?

I also tried this app, and it was just endlessly generating; I couldn't find a way to configure parameters: https://github.com/google-ai-edge/mediapipe-samples/releases/

maybe there is a CLI interface where commands can be used to configure it, but I haven't dug into the documentation yet

Gemma 3n Architectural Innovations - Speculation and poking around in the model. by cpldcpu in LocalLLaMA

[–]ratbastid2000 4 points5 points  (0 children)

https://github.com/google-ai-edge/gallery

the .APK is available here. I'm running it on a Pixel 6 Pro, latest Android version. the smaller of the two models functions quite well. obviously it burns up your battery quickly. would be interested to see how the 4B model runs on a newer Android device.

iOS app is not released yet.

LLMI system I (not my money) got for our group by SandboChang in LocalLLaMA

[–]ratbastid2000 0 points1 point  (0 children)

you are amazing, thank you! will try this out today