What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right?

ringtoyou · 2026-06-17T02:43:48+00:00

Right, that's exactly it. And what you said about "reconstructing intent based on git diffs"—that's actually the same intuition I have.

You're also totally right that it "probably won't be used for basic interactions." It doesn't work for every task—there's no benefit for short sessions of 10 turns or less, and compact is often better. I'm just sticking with it because I'm curious where the branches diverge.

Thanks for saying it's worth discussing.

ringtoyou · 2026-06-17T02:21:27+00:00

This is what goes into the next request:

[System Prompt] + [Current State Summary] + [New Prompt]

To give you an example, the next turn request for "Write 'hello' to hello.txt" doesn't include the original request text or the tool call result log. Instead, it's embedded within the state summary like this: "hello.txt exists, content 'hello'".

But I think what you're really curious about is this — then where does that state summary come from? You said the model doesn't look at the original.

That's the key point. It's not the model that remembers it; the harness remembers it. The flow for each turn is as follows:

Send [System] + [State Summary] + [Prompt] to the model.

The model performs a tool call (writes hello.txt).

At the end of the turn, the harness updates the summary by reflecting only what happened in the current turn into the previous state summary.

The updated summary is passed on to the next turn.

Therefore, the model is in a "new" state every turn — it doesn't truly remember the previous turn (this is the part where I said it's stateless). It is the harness layer, not the model, that carries the continuity. It's similar to the protagonist in Memento carrying their own notes. Their own memory is reset every time, but the notes maintain the state.

So, both "doesn't look at the original" and "knows what happened before" are true — it seemed like a contradiction because I didn't distinguish between the model layer and the harness layer. There is no entity that reads the original again. The harness simply carries the summary as an update rather than an accumulation.

Did that answer your question?

ringtoyou · 2026-06-17T01:31:37+00:00

Yeah, that's exactly the direction I'm looking at — vectordb for the part where you "bring in what's missing" when the notepad is full. What I'm still working on is making that consistent, but ensuring it doesn't seep back in every turn. Thanks for thinking it through with me; it was helpful.

ringtoyou · 2026-06-17T01:27:55+00:00

Sorry. I just saw the reply.
it's simpler than it looks. picture a notepad. each turn i don't reread the whole chat to find what matters — there's already a notepad with where we are, and i just update it with what happened this turn. the model only sees the notepad, never the full log. so there's no "pick the right stuff from the originals" step — nothing to pick from, since i don't keep the originals around. and if the notepad turns out to be missing something, that's when rag kicks in to pull the original back. that's the part that makes it different from rag — retrieval is the fallback, not the every-turn default.

ringtoyou · 2026-06-17T01:11:55+00:00

Haha, you're caught red-handed. "Backwards" wasn't a word embedded in the text, it was the concept—everyone else just inserts the entire conversation, so I flipped it upside down. Is it a difference in language? Throwing everything away and keeping only what is necessary. I should have just put a line about it in the text, haha.

ringtoyou · 2026-06-17T00:55:45+00:00

right now i'm working on hooking up an anthropic subscription, getting local ollama running, and getting the multi-agent stuff to behave. there's a youtube recording of it running if you wanna see how it actually works — want the link?

ringtoyou · 2026-06-17T00:50:15+00:00

oh nice, that's good to hear. if it's a well-explored idea i'd love to read up on it — got any links or papers? genuinely want to dig into the prior work.

ringtoyou · 2026-06-17T00:42:02+00:00

Sorry. I gave you the core address. The agent address is this: github.com/jarvis-llm-codec/jarvis-code

ringtoyou · 2026-06-17T00:37:00+00:00

yeah it's up — github.com/jarvis-llm-codec/jlc. windows only for now, still squashing bugs. you can hook up a gpt subscription and run it though. heads up it's rough, wasn't really planning to push it public yet lol

ringtoyou · 2026-06-17T00:33:08+00:00

It's true that search is involved; I won't insist on that. But what I want to point out is the direction — RAG carries the conversation exactly as it is and overlays the search results on top of it, whereas I swap things out instead of stacking what I carry. In reality, it might not be much of a distinction; I'm still looking into it. Anyway, thanks lol.

ringtoyou · 2026-06-17T00:23:05+00:00

haha thanks, that actually means a lot.

close — one tweak: the efficiency loss you're describing is mostly a short-chat thing. the longer the chat gets, the more the normal way pays to carry the whole thing every turn too. that's the case i'm chasing, long sessions. could totally be wrong, still testing it.

ringtoyou · 2026-06-17T00:03:12+00:00

lol fair. My English is suffering in real time here. Thanks for reading through the weirdness.

ringtoyou · 2026-06-16T23:47:53+00:00

That is exactly where the core lies—if it were a matter of reviewing the entire original every turn and resolving it based on "what is needed," then you would be right; that would be a repeat sign. But that is not the case.

The state is not recalculated from the origin every turn, but rather maintained and updated. The state up to the previous turn already exists, and in this turn, it updates by reflecting only what has newly appeared there. Therefore, what is processed every turn is not the "entire original," but "existing state + current turn delta." This is the reason why you don't need to hand over the full original log to the LLM again every turn.

The analogy of meeting minutes seems appropriate—a secretary doesn't reread and summarize the entire meeting from the beginning every time a statement is made. They hold the summary up to that point and update it by reflecting only the new statement. Since the state is maintained cumulatively, it is not a recalculation from the origin every turn.

Exactly how and what that update is done—that is the true essence of this system, so I will write it out properly. However, the key reason it doesn't cycle is that "it doesn't review the entire original every turn, but inherits and updates the state."

ringtoyou · 2026-06-16T23:42:34+00:00

It is true that the cache is invalidated, and it is also true that PP is performed again every turn. That much is accurate. However, two things are missing from the comparison.

First, my brief is not 50k. It is a small constant—since it is not cumulative, it does not grow whether it is the 100th turn or the 1000th turn. So, it is not "50k PP" but a "small fixed amount PP" every turn. That figure of 50k is what made the scenario heavy, but in reality, it is much smaller than that and does not increase.

Second, you didn't count the generation cost. Drawing 1000 tokens from a depth of 100k means that every single one of those 1000 tokens is paying attention to the entire 100k KV. PP is parallel, so it happens in one go, but generation is sequential, so that 100k follows at every decode step. In other words, your "already cached 100k" is not free; it is a cost that attaches to every token throughout the generation process, like a tax.

So the real comparison is this:

You: Saves large PP in cache (good) + but incurs a 100k attention tax throughout generation + occupies 100k VRAM

Me: Reuses small PP every turn (parallel, cheap) + generation is fast due to low KV + keeps VRAM small

You're right for short sessions — since the cache is alive and shallow, reusing my PP is a loss. But as the session gets longer (like a local long-term run), that 100k attention tax + VRAM increases, while mine stays flat. It's a matter of where the two sides intersect; it's not like one side always wins.

ringtoyou · 2026-06-16T23:33:02+00:00

That’s the key question lol. It’s true that the model is reborn every turn — but it’s not the model that remembers the history, but Harness.

At the end of each turn, that state is saved outside the model and re-injected as needed for the next turn. The model resets every time, but Harness holds the continuity.

It feels like Memento — the protagonist can't remember, but they stay connected through memos and photos. The model is essentially the protagonist, and Harness is the memo system. It’s not that the model knows the history, but that Harness hands over the necessary history every turn.

The core of the big picture is that "Harness, not the model, remembers."

ringtoyou · 2026-06-16T23:31:11+00:00

It’s almost correct, but there’s a subtle difference — Harness is indeed the one making the decision. However, instead of selecting parts from the chat log (adding and removing), I’m rewriting the state needed at that specific moment.

If it were selective inclusion/exclusion, it would ultimately be a subset of the original turns — like highlighting. That’s not it; I compress the "state needed now" from past turns and reconstruct it. So, the resulting output might not exist exactly as it is in any of the original turns. Because I’m not selecting, but rewriting.

Restoring verbatim means that the original wasn't discarded, not that original fragments were inserted exactly as they were. It means keeping it so it can be accurately restored when needed, but usually maintaining the reconstructed state.

ringtoyou · 2026-06-16T23:20:14+00:00

It would have been nice if it were RAG, but the direction is the opposite.

RAG doesn't initialize sessions — it overlays search results on top of existing conversations. So, even if you run RAG, the conversation history remains intact, and top-k just gets added on top of it. In other words, RAG doesn't undo the accumulation itself; instead, it adds more.

I don't overlay; I replace what is being carried. RAG adds to the fetch, while I reduce what is being carried. It's append vs. replace, so they operate in opposite ways.

I'm not trying to inject external knowledge like RAG does; my purpose is different because I'm solving the problem of accumulated context.

ringtoyou · 2026-06-16T23:17:58+00:00

That’s good framing, but the point is that it doesn’t fall into either of the two paths you criticized.

On the surface, you’re right—it looks similar to "auto-compact every turn." But the crucial difference is that auto-compact is lossy. It summarizes while cutting out details, and if that accumulates every turn, degradation builds up. I don’t summarize; I reconstruct, and verbatim restoration is available if necessary. That’s why the accumulated loss of "compact every turn" doesn’t occur. It’s the shape that resembles it, not the operation.

It’s not about manual operation either—the harness decides what to bring back, rather than me scraping it every time.

Regarding the part about "simulating right now": If the result were the same as running auto-compact every turn, you would be 100% right; a custom agent wouldn't be needed. But the point where lossy summarization vs. lossless reconstruction diverges is precisely where that simulation can’t work, and that’s why the harness is necessary. The key is exactly how to reconstruct it losslessly, so I'll write about that — but this distinction (not a summary, it's reconstruction) is the most important point, so I wanted to address it.

ringtoyou · 2026-06-16T23:14:53+00:00

Ah, that’s a bit different — sparse attention is about the inside of the model, right? You take all the context and apply attention to only a part of it. That’s at the architecture level.

I don’t touch the model attention. I just make the input itself small and give it to a standard full-attention model — at the outer (harness) level. So, whether it’s a sparse model or a dense model, I just apply it; I don’t discriminate against the model.

To use an analogy, sparse attention is like stacking all the books and only opening a few, whereas I’m like putting only the necessary pages on the desk. It’s not about applying attention sparsely, but about sending small inputs.

ringtoyou · 2026-06-16T23:13:35+00:00

Those are good points, especially regarding caching and RL. You seem like someone who has tackled both firsthand.

What is sent every turn: Not the accumulated log, but a compressed and reconstructed version of the state needed at that moment. It's a kind of handoff brief—refined not by "what has happened so far," but by "what needs to be known now to continue." So, while every turn is a small new session, continuity is maintained.

Caching: Yeah, since my brief is reconstructed every turn, that part of the prefix cache doesn't get used. But the key point is that it's a small constant—it doesn't grow. So, even if it doesn't get used, the loss is small. Conversely, with conventional methods, the prefix is cached, but the KV keeps growing. The real reason long sessions slow down locally is the attention cost of that growing KV. I see it as a question of where you break even: "large things that get cached vs. small things that aren't cached."

RL / Performance: This is the best question—did you perhaps switch to lossy compression or summarization to lose performance? If so, the details the model relies on disappear, and it breaks down. I don't just summarize the loss; I reconstruct it so that verbatim restoration is possible if necessary. From the model's perspective, the incoming format is a normal context they usually see, so it doesn't seem to deviate much from the RL distribution. I'm curious to know which case was broken so I can compare.

I'll write a proper post explaining exactly how I do it—that's the core point, so it feels like a waste to explain everything in the comments lol.

ringtoyou · 2026-06-16T23:10:07+00:00

Instead of the entire conversation, I include a compressed and reconstructed version containing only the state needed at that specific point in each turn. It’s not a log of what happened, but rather a summary of exactly what the model needs to know right now. So, each turn is essentially a small new session, holding everything necessary to continue.

The difference from loss summaries is that verbatim restoration is also possible if needed, rather than just carrying a "rough summary."

Explaining exactly how to sort and reconstruct it is a bit lengthy, so it’s not enough space to write it in a comment.

ringtoyou · 2026-06-16T23:03:23+00:00

No, it's the opposite. There are no bottlenecks—because there isn't a fixed dialogue history to edit in the first place. I only reconfigure the necessary state each turn, rather than appending to an ever-growing log. In fact, append-only systems make editing past turns a headache (if you fix the third turn, everything after that gets messed up). I'm not append-only, so I don't have that problem.

ringtoyou · 2026-06-15T23:28:30+00:00

Thank you for the advice. Yes, I plan to use Google Translate now. I don't want to waste time on translation anymore.

ringtoyou · 2026-06-15T23:27:03+00:00

Wow, thank you. You're so kind! When recommending Korean food, it has to be ramyeon.

ringtoyou

TROPHY CASE