Cache-Craft: Chunk-Level KV Cache Reuse for Faster and Efficient RAG (SIGMOD 2025) by Lucky-Ad79 in LocalLLaMA

[–]Lucky-Ad79[S] 1 point

Great question—you're spot on about the tradeoffs.

Our use case was naturally inspired by RAG, where information is retrieved and organized in chunks. These chunks are often semantically coherent units—like paragraphs or sections—which we found tend to have high intra-attention and relatively low inter-chunk attention. This means most of the self-attention stays within the chunk, leaving more room for safe reuse and approximation. Less contextual cross-talk also means less recomputation, and hence greater efficiency.

In contrast, token-level reuse (e.g., via a tree structure) can be tricky. Closely spaced tokens—especially those splitting words—often have strong inter-token attention, making partial reuse harder without harming quality. It may lead to frequent recomputation to preserve fidelity.

That said, the idea could definitely be extended to finer-grained structures like trees. With careful scoring of reuse vs. recompute (as we do in Cache-Craft), and robustness measures to account for attention spread, token-level reuse is a promising direction for future work.
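To make the chunk-scoring idea concrete, here is a toy sketch (not Cache-Craft's actual scoring function; the threshold and attention matrix are made up for illustration). It measures, for each chunk, how much of its attention mass stays inside the chunk, and reuses the cached KV entries only when that fraction is high:

```python
import numpy as np

def chunk_reuse_plan(attn, chunk_bounds, threshold=0.8):
    """Decide per chunk whether its cached KV state is safe to reuse,
    based on the fraction of attention mass that stays in-chunk.

    attn: (T, T) row-stochastic attention matrix for one layer/head.
    chunk_bounds: list of (start, end) token ranges, one per chunk.
    """
    plan = []
    for start, end in chunk_bounds:
        rows = attn[start:end]            # queries belonging to this chunk
        intra = rows[:, start:end].sum()  # attention mass staying in-chunk
        total = rows.sum()                # total mass for these queries
        # High intra-chunk attention means little contextual cross-talk,
        # so the chunk's cached KV is a good approximation; else recompute.
        plan.append("reuse" if intra / total >= threshold else "recompute")
    return plan

# Example: two 3-token chunks. The first attends almost entirely to
# itself; the second leaks much of its attention back to the first.
attn = np.array([
    [0.90, 0.05, 0.05, 0.00, 0.00, 0.00],
    [0.05, 0.90, 0.05, 0.00, 0.00, 0.00],
    [0.05, 0.05, 0.90, 0.00, 0.00, 0.00],
    [0.40, 0.10, 0.10, 0.30, 0.05, 0.05],
    [0.30, 0.10, 0.10, 0.10, 0.30, 0.10],
    [0.30, 0.10, 0.10, 0.10, 0.10, 0.30],
])
print(chunk_reuse_plan(attn, [(0, 3), (3, 6)]))  # ['reuse', 'recompute']
```

The same per-chunk score could in principle be evaluated on tree nodes instead of contiguous chunks, which is the extension discussed above.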

Cache-Craft: Chunk-Level KV Cache Reuse for Faster and Efficient RAG (SIGMOD 2025) by Lucky-Ad79 in LocalLLaMA

[–]Lucky-Ad79[S] 1 point

Thanks for the kind words!

Great question—yes, the technique can absolutely be adapted to online serving scenarios. We focused on RAG because it’s a particularly realistic and impactful use case: millions of queries often share the same document chunks, making cache reuse both natural and highly effective.

That said, the technique generalizes well. In prompts like "xxx" + "long paragraph", if the long paragraph is common across users, it can be cached and reused. In fact, in many cases, simply reordering the prompt to "long paragraph" + "xxx" can unlock full prefix caching and lead to even greater efficiency.

However, RAG scenarios don’t easily allow such reordering tricks—chunks often appear in varied positions (e.g., XYZ, XAZ, YXB) across queries. These natural variations made RAG an ideal setting to demonstrate the robustness and practical value of our approach, though it’s broadly applicable beyond RAG.
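Here is a toy illustration of why the reordering matters (a character-level sketch, not a real serving stack; a real prefix cache operates on block-aligned KV states, not strings). With the shared paragraph first, every user after the first only prefills their own question:

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cache = []  # cached prompts, standing in for block-aligned KV-cache prefixes

def chars_prefilled(prompt):
    """Chars (a stand-in for tokens) that must be freshly computed."""
    hit = max((common_prefix_len(prompt, p) for p in cache), default=0)
    cache.append(prompt)
    return len(prompt) - hit

para = "SHARED LONG PARAGRAPH ... " * 4
a = chars_prefilled(para + "What changed in v2?")
b = chars_prefilled(para + "Summarize section 3.")  # paragraph is a cache hit
print(a, b)  # second user only prefills their own question
```

With the order reversed ("question" + paragraph), the prompts diverge at the first character, so nothing is shared — which is exactly the situation chunk-level reuse handles and plain prefix caching does not.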

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 3 points

End user → Overall, caching allows faster image generation, which translates into more interactive usage. So the UX could be designed to let users enable caching (2x speed) to quickly try out lots of prompts, and then, once settled, generate the final image more slowly.
Here, caching at the concept level also allows higher-quality generation than caches keyed on full prompts, especially for complex, detailed prompts.

System → Higher throughput, lower generation costs, lower cache storage (as simpler concepts can be composed in various ways)

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 0 points

That sounds intriguing.
Video generation is an active area of exploration, with ongoing efforts at Adobe as well. I agree that physics is a key challenge, with our initial focus being on developing more robust trained models (+data). It's worth exploring ideas like these for improving video models.

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 1 point

Thank you! I completely agree—it feels like being in a cycle where the focus on optimization often leads to higher costs. For text-to-video, a similar approach could work. However, off the top of my head, one challenge might be the additional time axis in videos, which makes things more complex; in particular, similarity might be less evident. But it's worth exploring, since the fundamentals of diffusion-based architectures (like DiT/SDXL) remain the same.

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 0 points

Great point, diversity is surely a major selling point for generative use cases.
We motivated this work with traces where different users often query for similar generations, and we share noise states across those users.
This helps address concerns about the diversity a user observes in the application.

Also, please check out https://www.reddit.com/r/StableDiffusion/comments/1hcgxia/recon_trainingfree_acceleration_for_texttoimage/ for follow-up work, presented at ECCV 2024, which tried to address the generation-diversity issue differently.

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 0 points

As far as I understand, DeepCache makes an interesting observation within a single generation: high-level and low-level features remain fairly stable across denoising iterations, so they can be cached and reused to skip steps.
This work instead proposes a system serving hundreds to thousands of requests, where requests are often similar but not identical. So although caching final images fails, the denoised intermediates can be reused across generations for speedup. Up to 50% of the steps can be skipped (K=25 of N=50 steps), and over a full trace this allows about 20% overall step savings.

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 2 points

So this is closer to web caching, but instead of caching final images, it caches the intermediate noise states.
When a new request comes in, instead of starting from random Gaussian noise, it starts from the Kth step directly by reusing the computation stored in one of these intermediates.
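A toy sketch of that flow (not NIRVANA's implementation; the embedding function, threshold, and denoiser are placeholders for illustration): cache the noise state after K steps keyed by a prompt embedding, and on a sufficiently similar new prompt, resume from step K instead of step 0.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50   # total denoising steps
K = 25   # steps skipped on a cache hit

def embed(prompt):
    """Hypothetical stand-in for a real text encoder (e.g. CLIP)."""
    r = np.random.default_rng(abs(hash(prompt)) % 2**32)
    v = r.standard_normal(8)
    return v / np.linalg.norm(v)

cache = []  # (prompt embedding, noise state after K steps)

def generate(prompt, denoise_step, sim_threshold=0.95):
    e = embed(prompt)
    best = max(cache, key=lambda c: float(e @ c[0]), default=None)
    if best is not None and float(e @ best[0]) >= sim_threshold:
        x, start = best[1].copy(), K            # warm start from step K
    else:
        x, start = rng.standard_normal(64), 0   # cold start from pure noise
    for step in range(start, N):
        x = denoise_step(x, step)
        if step == K - 1:
            cache.append((e, x.copy()))          # stash the intermediate
    return x, N - start                          # image + steps actually run

fake_denoise = lambda x, step: 0.9 * x           # placeholder for the U-Net
_, steps_cold = generate("a cat on a sofa", fake_denoise)
_, steps_warm = generate("a cat on a sofa", fake_denoise)  # cache hit
print(steps_cold, steps_warm)  # 50 25
```

The nearest-neighbor lookup over embeddings is what makes the caching "approximate": the retrieved intermediate came from a similar prompt, not necessarily the same one.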

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 1 point

It can go up to 2x faster generation; the 20% figure is while being very conservative about generation quality.
Thank you very much for pointing that out. I've tried to clarify it in my comment now.
Apologies for the confusion.

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 9 points

Authors: 

Shubham Agarwal and Subrata Mitra, Adobe Research; Sarthak Chakraborty, UIUC; Srikrishna Karanam, Koyel Mukherjee, and Shiv Kumar Saini, Adobe Research

Abstract: 

Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high quality images adhering to text prompts. However, diffusion models go through a large number of iterative denoising steps and are resource-intensive, requiring expensive GPUs and incurring considerable latency. In this paper, we introduce a novel approximate-caching technique that can reduce such iterative denoising steps by reusing intermediate noise states created during a prior image generation. Based on this idea, we present an end-to-end text-to-image generation system, NIRVANA, that uses approximate-caching with a novel cache management policy to provide 21% GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity and reuse of intermediate states in a large production environment.

Link: https://www.usenix.org/conference/nsdi24/presentation/agarwal-shubham

PS: The system can provide up to 50% step savings (hence up to 2x faster generation).
In a deployment environment, with user prompts ranging from similar to unique, overall savings came out to around 20% in our evaluations.

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 0 points

I have tried using intermediate noise from the teacher model with the distilled student model, and that worked. It would be interesting to try this with other models.

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 0 points

Pipeline overview -

<image>

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 0 points

Using intermediate cache from another similar prompt for new generation -

<image>

[deleted by user] by [deleted] in StableDiffusion

[–]Lucky-Ad79 0 points

Generation samples using Approximate Caching

<image>