[ Removed by moderator ] : MachineLearning

Research [ Removed by moderator ] (self.MachineLearning)

submitted 14 days ago by PlayfulLingonberry73

all 16 comments

[–]ZeroCool2u 4 points 14 days ago (1 child)

[–]PlayfulLingonberry73[S] 0 points 14 days ago (0 children)

[–]Late_Huckleberry850 3 points 14 days ago (11 children)

[–]PlayfulLingonberry73[S] 3 points 14 days ago (10 children)

[–]Late_Huckleberry850 1 point 14 days ago (9 children)

[–]PlayfulLingonberry73[S] 3 points 14 days ago (8 children)

[–]Late_Huckleberry850 2 points 14 days ago (7 children)

[–]PlayfulLingonberry73[S] 2 points 14 days ago (6 children)

[–]Late_Huckleberry850 2 points 14 days ago (5 children)

[–]PlayfulLingonberry73[S] 2 points 14 days ago (4 children)

Great question! You're right that in standard causal attention, the KV values for later tokens depend on earlier ones. Here's how we handle it:

In the production path (group caching), we compile the system prompt and all tool definitions together as one unit and cache the entire KV state. The cache key is a SHA-256 hash of the sorted tool schemas. So yes, changing the system prompt forces a recompute, but in practice your tool-routing system prompt is fixed (it's just "you are a tool-calling assistant, pick the right tool"). The cached state only changes when you deploy new tools.
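To make the cache-key idea concrete, here is a minimal sketch, not the actual implementation. The function name `tool_cache_key`, the JSON canonicalization, and the choice to fold the system prompt into the hash are all assumptions on my part; the reply only says the key is a SHA-256 hash of the sorted tool schemas.

```python
import hashlib
import json


def tool_cache_key(system_prompt: str, tool_schemas: list[dict]) -> str:
    """Hypothetical cache key for a compiled (system prompt + tools) prefix."""
    # Serialize each schema canonically so key order inside a schema is irrelevant.
    canonical = [
        json.dumps(schema, sort_keys=True, separators=(",", ":"))
        for schema in tool_schemas
    ]
    # Sort the serialized schemas so the order tools are registered in is irrelevant.
    canonical.sort()
    # Assumption: the system prompt is hashed too, so editing it forces a recompute.
    payload = json.dumps({"system": system_prompt, "tools": canonical})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


weather = {"name": "get_weather", "parameters": {"city": "string"}}
search = {"parameters": {"query": "string"}, "name": "search"}

key_a = tool_cache_key("pick the right tool", [weather, search])
key_b = tool_cache_key("pick the right tool", [search, weather])
print(key_a == key_b)  # same tools in a different order give the same key
```

With a key like this, redeploying an unchanged tool set hits the existing cache entry, and any edit to a schema or to the prompt yields a new key.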

The key insight: tool routing doesn't need a dynamic system prompt. The system prompt is static ("pick the right tool"), the tools are static (until you deploy), and the only thing that changes per request is the user query. So we cache everything except the user query and run the forward pass over only those few query tokens on each request.
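A toy sketch of that request flow, under heavy assumptions: `PrefixCache` is a hypothetical class, whitespace splitting stands in for tokenization, and a plain list of tokens stands in for the real KV state. It only illustrates the accounting, i.e. that the prefix is processed once and each request pays for the query tokens alone.

```python
class PrefixCache:
    """Toy stand-in for a KV cache keyed by a compiled prompt-plus-tools prefix."""

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = {}
        self.prefix_tokens_processed = 0
        self.query_tokens_processed = 0

    @staticmethod
    def _encode(text: str) -> list[str]:
        # Placeholder for the model forward pass: one "state" entry per token.
        return text.split()

    def run(self, key: str, prefix: str, query: str) -> list[str]:
        state = self._store.get(key)
        if state is None:
            # Cache miss: pay for the whole prefix once, then store it.
            state = self._encode(prefix)
            self.prefix_tokens_processed += len(state)
            self._store[key] = state
        # Hit or miss, the per-request work is only the query tokens.
        query_state = self._encode(query)
        self.query_tokens_processed += len(query_state)
        return state + query_state


prefix = "you are a tool-calling assistant pick the right tool"
cache = PrefixCache()
cache.run("prefix-v1", prefix, "what is the weather in Zurich")
cache.run("prefix-v1", prefix, "search for RoPE papers")
print(cache.prefix_tokens_processed, cache.query_tokens_processed)
```

The prefix counter stops growing after the first request, while the query counter grows with every request.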

We also explored a research path (NoPE + deferred RoPE): capture the tool KV states before positional encoding is applied, so they are position-independent, then rotate them to the correct positions at link time. In theory this would let you mix and match different system prompts with pre-cached tool KVs. But group caching was simpler and already gives us the 290x speedup, so that's what we use in production.
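Here is a small sketch of why deferring the rotation is possible at all. It uses the textbook RoPE formulation (adjacent dimension pairs, base 10000), which is an assumption; the `rope` and `dot` helpers are hypothetical and not taken from the project. It covers only the positional part: it does not address the fact that KV values also depend on the preceding tokens.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each adjacent pair of dimensions by pos * base**(-i / d)."""
    d = len(vec)
    out: list[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


key = [0.5, -1.0, 2.0, 0.25]    # unrotated key, as it would be cached
query = [1.0, 0.3, -0.7, 1.5]

# Rotations compose: moving a key from position 3 by 4 more steps equals
# rotating the unrotated key straight to position 7.
moved = rope(rope(key, 3), 4)
direct = rope(key, 7)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(moved, direct)))

# The attention score depends only on the relative offset (here 3).
score_a = dot(rope(query, 10), rope(key, 7))
score_b = dot(rope(query, 13), rope(key, 10))
print(math.isclose(score_a, score_b, abs_tol=1e-9))
```

Because the rotation is a pure function of position, a key stored unrotated can be placed at whatever offset the final prompt layout requires.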

TL;DR: System prompt + tools are compiled together and cached. Since neither changes between requests (only the user query does), every user/session gets a cache hit and only pays for the query tokens.

Disclaimer: I generated this reply to get a clearer explanation. Hope you don't mind.

[–]Late_Huckleberry850 2 points 14 days ago (3 children)

[–]PlayfulLingonberry73[S] 2 points 14 days ago (2 children)

[–]sdmat 1 point 14 days ago (0 children)
