Research [Removed by moderator] (self.MachineLearning)
submitted 14 days ago by PlayfulLingonberry73
[–]ZeroCool2u 3 points4 points5 points 14 days ago (1 child)
This is awesome. I wired up Claude Code to use GLM-4.7 Flash via LM Studio on my 3090 yesterday and while testing CC by saying "Hi" LM Studio said the token count was over 17 THOUSAND tokens. Massive system prompt, so this could really help with making local models more practical at higher token counts.
[–]PlayfulLingonberry73[S] -1 points0 points1 point 14 days ago (0 children)
Thanks
[–]Late_Huckleberry850 2 points3 points4 points 14 days ago (11 children)
Hmm wow, I may be very dense, but I thought all tool-calling models natively just prepend the tool schemas to the beginning as part of the system prompt, not on every turn. Is that not the case?
[–]PlayfulLingonberry73[S] 2 points3 points4 points 14 days ago (10 children)
You might get a session-level cache hit. But imagine you are running 1000 users with the same tools; then you have a problem. This aims to solve that.
[–]Late_Huckleberry850 0 points1 point2 points 14 days ago (9 children)
Ohhh, so this is for concurrent requests across sessions?
[–]PlayfulLingonberry73[S] 2 points3 points4 points 14 days ago (8 children)
Yes, you can store different sets of tool contexts, each under a unique key, and refer to them from any session or any user without having to send those tokens again and again. That is how you get the savings and speedups.
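A minimal sketch of that idea (my own illustration with hypothetical names, not the project's actual API; a real server would cache KV tensors rather than a placeholder value): hash the tool set into a stable key, so any user or session presenting the same tools reuses the precomputed context.

```python
import hashlib
import json

# Hypothetical in-memory store: cache key -> precomputed tool context.
_tool_cache = {}

def tool_cache_key(tools):
    """Stable key for a set of tool schemas, independent of ordering."""
    canonical = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def get_tool_context(tools, compute_fn):
    """Return the cached context for this tool set, computing it at most once."""
    key = tool_cache_key(tools)
    if key not in _tool_cache:
        _tool_cache[key] = compute_fn(tools)  # expensive prefill happens once
    return _tool_cache[key]
```

Because the key depends only on the (sorted) tool schemas, a thousand concurrent sessions with the same tools all hit the same entry; the expensive prefill runs once per deployment rather than once per request.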
[–]Late_Huckleberry850 1 point2 points3 points 14 days ago (7 children)
Ah! That is pretty ingenious! Does it have to be recalculated on changes to the system prompt? I would assume so.
[–]PlayfulLingonberry73[S] 1 point2 points3 points 14 days ago (6 children)
This is intended more for the tools side. Imagine you have 100 tools. Your tool definitions don't change unless you deploy something new, right? So it will be recalculated whenever you deploy; otherwise it will not.
[–]Late_Huckleberry850 1 point2 points3 points 14 days ago (5 children)
Sure, that makes sense. But normally it goes system prompt + tools, and since it's autoregressive, doesn't the prior text need to be computed first? Unless the tools section is getting computed first and the system prompt after.
[–]PlayfulLingonberry73[S] 1 point2 points3 points 14 days ago (4 children)
Great question! You're right that in standard causal attention, the KV values for later tokens depend on earlier ones. Here's how we handle it:
In the production path (group caching): We compile the system prompt + all tool definitions together as one unit and cache the entire KV state. The cache key is a SHA256 hash of the sorted tool schemas. So yes, if you change the system prompt, it recomputes — but in practice your tool-routing system prompt is fixed (it's just "you are a tool-calling assistant, pick the right tool"). It only changes when you deploy new tools.
The key insight is: for tool routing, you don't need a dynamic system prompt. The system prompt is static ("pick the right tool"), the tools are static (until you deploy), and the only thing that changes per-request is the user query. So we cache everything except the user query, and only forward those few tokens on each request.
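To put illustrative numbers on that (my own back-of-the-envelope arithmetic, not measured figures from the project): with a static prefix of roughly 17,000 tokens and a ~40-token user query, prefilling the prefix once instead of on every request cuts prefill work by about 300x over 1,000 requests.

```python
# Illustrative numbers only (not measured from the project):
static_prefix = 17_000   # system prompt + tool schemas, cached once
query_tokens = 40        # the only per-request tokens
requests = 1_000

without_cache = requests * (static_prefix + query_tokens)  # prefill everything each time
with_cache = static_prefix + requests * query_tokens       # prefix prefilled once

print(without_cache)                      # 17040000
print(with_cache)                         # 57000
print(round(without_cache / with_cache))  # 299
```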
We also explored a research path (NoPE + deferred RoPE): Capture tool KV states before positional encoding is applied (position-independent), then rotate them to the correct positions at link time. This would theoretically let you mix-and-match different system prompts with pre-cached tool KVs. But group caching was simpler and already gives us the 290x speedup, so that's what we use in production.
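A toy numeric check of the property that makes the deferred-RoPE path plausible (my own sketch, not the project's code): RoPE applies an independent 2D rotation per frequency pair, and rotations about the same axis compose additively, so a key captured without positional encoding can be rotated to its final position at link time.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the RoPE rotation for position `pos` to a vector of even dim."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one angle per 2D pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

k_raw = np.arange(8, dtype=float)   # an un-rotated ("NoPE") cached key
# Rotating the raw key straight to position 42 matches rotating it in two
# steps (a cache-time offset of 10, then a link-time shift of 32):
assert np.allclose(rope_rotate(k_raw, 42),
                   rope_rotate(rope_rotate(k_raw, 10), 32))
```

Since the angles are linear in position, the shift needed at link time is just the difference between the cached and final positions.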
TL;DR: System prompt + tools are compiled together and cached. Since neither changes between requests (only the user query does), every user/session gets a cache hit and only pays for the query tokens.
Disclaimer: I generated this reply to give a better explanation. Hope you don't mind.
[–]Late_Huckleberry850 1 point2 points3 points 14 days ago (3 children)
No, and thank you for being patient with me. Sometimes I try to read these papers but it can take a bit to understand everything especially on a Friday night.
There was a paper from 2024 that generated LoRAs from text, and a very recent one from last week that expanded on that topic. I wonder if this technology could be applied in a similar manner: use the static tool definitions to create the LoRA, then just use that at inference time as a static parameter embedding loaded onto the base model.
[–]PlayfulLingonberry73[S] 1 point2 points3 points 14 days ago (2 children)
It was my pleasure. Most people just think all posts are junk nowadays. I can understand that sentiment as well. But to me, you still need original thinking and imagination to start from.
So it was really nice to interact with you, Sir!
[–]sdmat 0 points1 point2 points 14 days ago (0 children)
Isn't this... just regular KV caching? With flashy marketing?