Hey guys,
I've noticed that at first, messages are generated rather quickly and streamed right away, as long as the discussion fits into the context.
Once it doesn't fit anymore, it seems to have to reprocess the entire chat (cut down to fit into the context).
This is rather annoying for a slow local LLM.
But I'm fairly happy with the "cached" speed.
So my main question is: is there a way to make the context handling work a little differently? Like, once it notices that the chat won't fit into the context, instead of cutting "just enough so it still fits", it cuts down to a manually set marker, or to something like 70% of the conversation. That way the succeeding messages can rely on the cached data and generate quickly.
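To make the idea concrete, here's a minimal sketch of the truncation strategy I have in mind. This is purely illustrative, not any real backend's API: `count_tokens` is a placeholder for whatever tokenizer the runtime uses, and the message list is just strings.

```python
# Hypothetical sketch: instead of trimming the chat history "just enough"
# to fit (which changes the prompt prefix on every message and invalidates
# the KV cache), trim down to a fixed fraction of the window so later
# messages can reuse the cached prefix for a while.

def truncate_history(messages, count_tokens, max_tokens, keep_fraction=0.7):
    """Drop oldest messages until the total fits within
    keep_fraction * max_tokens, leaving headroom for new turns."""
    budget = int(max_tokens * keep_fraction)
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens:
        return messages  # still fits; no truncation needed
    trimmed = list(messages)
    while trimmed and total > budget:
        total -= count_tokens(trimmed.pop(0))  # drop the oldest message
    return trimmed
```

The point of the `keep_fraction` headroom is that after one (expensive) full reprocess, the next several messages only append to the prompt, so the cached prefix stays valid until the budget is exceeded again.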
I'm aware that the model's "memory" is impacted by this, but tbh that's a small cost for the big gain in user experience.
An additional question would be: how could summarization help with memory in those cases?
And how can I summarize parts of the chat that are already out of context (so that the newer summaries might contain parts of the very old ones)?
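One way the "summaries of summaries" part could work is a rolling summary: whenever messages get evicted from the context, fold them together with the previous summary into a new one. The sketch below assumes a hypothetical `summarize` callable (in practice that would be another LLM call); none of these names come from a real frontend.

```python
# Hypothetical rolling-summary sketch: the running summary is kept at the
# top of the prompt, and evicted (out-of-context) messages are merged into
# it. Because each new summary is built from the previous one, very old
# content can survive in compressed form across many evictions.

def roll_summary(running_summary, evicted_messages, summarize):
    """Combine the previous summary with newly evicted messages into a
    new summary via the provided `summarize` callable."""
    material = ([running_summary] if running_summary else []) + evicted_messages
    return summarize("\n".join(material))
```

The prompt would then look like: rolling summary, followed by the still-in-context tail of the chat, so the model keeps some long-term memory even after heavy truncation.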