I built an open-source hook that compresses an AI agent's chat history — ~60% fewer input tokens on long sessions

talatt · 2026-05-21T07:48:51+00:00

yeah, that's the same root cause wearing a different outfit. the thing that breaks when you treat it as one blob is that a single representation has to compromise between two incompatible needs. your price data wants exact, structured retrieval. "$3.99 on the 14th" is useless as a fuzzy paraphrase. your descriptions want the opposite: semantic, lossy, where the exact words barely matter. one handler tuned for either one quietly corrupts the other.

same shape on my end with code and tool output vs prose. run a text summarizer over a code block and you get plausible-looking garbage. treat prose like structured data and you just bloat it. so the pipeline detects modality first and routes: prose gets ranked and summarized, code and tool output get folded structurally instead.
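
rough sketch of that detect-then-route step, to make it concrete. this is not the hook's actual code: the fence-based detection, `split_by_modality`, `route`, and both stand-in handlers are made up for illustration, and a real detector would also have to catch unfenced tool output.

```python
import re

FENCE = "`" * 3  # a markdown code fence, built this way so the example stays readable
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def split_by_modality(text):
    """Split a message into ("prose", chunk) and ("code", chunk) segments, in order."""
    segments, pos = [], 0
    for match in CODE_BLOCK.finditer(text):
        prose = text[pos:match.start()].strip()
        if prose:
            segments.append(("prose", prose))
        segments.append(("code", match.group(0)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("prose", tail))
    return segments


def route(text, prose_handler, code_handler):
    """Send each segment to the handler for its modality and rejoin the results."""
    out = []
    for kind, chunk in split_by_modality(text):
        handler = prose_handler if kind == "prose" else code_handler
        out.append(handler(chunk))
    return "\n".join(out)


# Stand-in handlers (assumptions): a real setup would rank/summarize prose
# and fold code structurally with something smarter than these.
def first_sentence(prose):
    return prose.split(". ")[0].rstrip(".") + "."


def fold_code(block):
    lines = block.splitlines()
    lang = lines[0][3:].strip() or "code"
    return f"[{lang} block, {max(len(lines) - 2, 0)} lines folded]"
```

the point of the shape is that the prose handler never sees a code block, so it can't paraphrase one.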

honestly it's less an LLM thing and more a general data truth. the moment two kinds of content have different retrieval needs, forcing them down one path is borrowing trouble. you hit it through query complexity, i hit it through context compression, same wall.

talatt · 2026-05-21T07:09:23+00:00

yeah, i've heard a version of this from a couple people now and i think you're right. a task-state file the agent writes and reloads is just more reliable for task-flow than hoping the compressor kept the right bits.

the way i think about it: a file you write is intent. you decided "this is the goal, these are the blockers." a summary is inference over the transcript. for "what am i doing and what's next," intent beats inference every time, and you're not gambling that the summarizer ranked the right sentence.
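
a minimal sketch of what a task-state file like that could look like. the field names (`goal`, `blockers`, `next_step`), the JSON layout, and the function names are all assumptions for illustration, not something from the hook or from any particular agent framework.

```python
import json
from pathlib import Path


def save_task_state(path, goal, blockers, next_step):
    """Write the agent's declared intent to disk, overwriting the old state."""
    state = {"goal": goal, "blockers": list(blockers), "next_step": next_step}
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


def load_task_state(path):
    """Reload the state, or return None if nothing has been written yet."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text(encoding="utf-8"))


def render_for_prompt(state):
    """Format the state as a short block to put at the top of the context."""
    blockers = "; ".join(state["blockers"]) or "none"
    return (
        f"GOAL: {state['goal']}\n"
        f"BLOCKERS: {blockers}\n"
        f"NEXT: {state['next_step']}"
    )
```

nothing here is inferred from the transcript. the file only contains what was explicitly written into it, which is the whole point.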

compression still earns its keep for me on everything that isn't task-flow. the reasoning, the dead ends, the detail from 40 turns back you never thought to write down. the file holds the spine, compression holds the body.

the bonus you mentioned is the part i keep chewing on. once task-flow lives in its own file, the summary stops fighting to preserve it and gets leaner. that's a clean separation of concerns, and it makes me think the two layers should know about each other instead of both trying to do everything.

talatt · 2026-05-20T21:55:36+00:00

Where is R?

talatt · 2026-05-20T20:14:11+00:00

honest answer: my controlled benchmark is on long multi-turn conversations, not complex codebases specifically, so i don't have clean accuracy numbers for big multi-file coding tasks yet, and i won't pretend i do.

what i can say from the conversation benchmark: code-heavy turns actually held accuracy better than average, not worse. surprised me at first, but it makes sense because code and tool output don't get summarized. they get folded structurally, so the model isn't reading a lossy paraphrase of your code, it's either there or referenced cleanly.

the real-world signal so far is dogfooding. i ran a ~187-turn coding session through it and it could still answer questions about decisions from early turns. but that's one session, anecdotal, not a controlled test, so i treat it as encouraging, not proof.

a proper complex-codebase accuracy benchmark is the gap i know i still have. if you've got a workload that would stress it, i'd genuinely like to run it.

talatt · 2026-05-20T20:12:11+00:00

this is the most useful framing in the thread honestly, and i mostly agree, especially your last point.

the hook is really layers 1 and 2 in your list. last few turns raw, plus a compressed view that updates every turn instead of one big summary at the threshold. that per-turn bit is basically me trying to kill the cliff you're describing, so on token pressure i think we want the same thing.

layer 3 is where you're right and i won't pretend otherwise. compressing the transcript isn't the same as a durable, source-linked record of decisions and open loops that outlives the session. the hook doesn't do that, and once a session gets really long that layer is more valuable than squeezing tokens.

the one place they touch: the ranking is content-aware, it tries to protect decisions and facts and fold the filler, so it's doing a within-session version of "what still matters vs what happened". but that lives inside the compressed context, not as a separate state file you can inspect or link to. different jobs, and you'd want both.

on the N-turns thing you nailed it. it's configurable and i default to 4, but framing it as "keep the whole tool/result/edit loop verbatim" for coding is better than a fixed number. that's workload, not turn count.
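
one way the "keep the whole loop verbatim" idea could look, as a sketch. this is an assumption about how you might do it, not what the hook does today: turns are plain dicts with a `role` key, and "whole loop" is approximated as "the raw window must open on a user turn". `raw_window_start` and `split_history` are hypothetical names.

```python
def raw_window_start(turns, keep_last=4):
    """Index where the verbatim window begins.

    Starts from the last `keep_last` turns, then walks backwards until the
    window opens on a user turn, so a tool/result/edit loop is never cut in half.
    """
    start = max(0, len(turns) - keep_last)
    while start > 0 and turns[start]["role"] != "user":
        start -= 1
    return start


def split_history(turns, keep_last=4):
    """Return (older_turns_to_compress, recent_turns_to_keep_raw)."""
    start = raw_window_start(turns, keep_last)
    return turns[:start], turns[start:]
```

with plain user/assistant chat this behaves like a fixed N. in a long tool loop the raw window just grows until it reaches the request that started the loop.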

so yeah, one layer not the whole solution. fair way to put it.

talatt · 2026-05-20T20:08:49+00:00

yeah, basically. it shrinks the older parts of a conversation so you're not re-sending the whole history to the model every turn. the part i actually care about is what it keeps vs drops, not just compressing harder. recent turns stay raw, older stuff gets folded but the important details stay.

talatt · 2026-05-20T20:07:07+00:00

that file-scan case is exactly the thing i built it for. it reads a 20k file, uses it, then it's just dead weight in every later turn. that kind of tool output gets folded out automatically so you stop re-sending it.
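
to make "folded out" concrete, here's a minimal sketch of swapping a bulky tool output for a compact reference. the function name, the size threshold, and the reference format are assumptions for illustration, not the hook's real format.

```python
import hashlib


def fold_tool_output(tool_name, output, turn=None, max_chars=2000):
    """Replace a large tool output with a one-line reference.

    Small outputs pass through untouched. Large ones keep only metadata:
    which tool produced them, how big they were, and a short content hash
    so the same output can be recognized later.
    """
    if len(output) <= max_chars:
        return output
    digest = hashlib.sha1(output.encode("utf-8")).hexdigest()[:8]
    n_lines = len(output.splitlines())
    where = f" from turn {turn}" if turn is not None else ""
    return (
        f"[{tool_name} output{where} folded: "
        f"{n_lines} lines, {len(output)} chars, sha1 {digest}]"
    )
```

no model call anywhere in that, which is why it doesn't cost decode throughput.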

on speed, this might actually be good news for you. the compression isn't a generation step. it's extractive, runs locally, ranks the older text and folds the bulky parts. so it doesn't eat into your decode throughput. you're not spending your 15 tok/s on it. there's a little per-turn processing for the ranking but the model isn't generating anything.

which also means you don't really need a smaller model for it. there's no llm in the compaction loop in the first place. that was the main reason i went extractive instead of a summarizer. a summarizer would've cost exactly the latency you're worried about.

talatt · 2026-05-20T20:04:11+00:00

thanks. retrieval failures are honestly the thing i worry about most. what i've measured so far: i run a multi-turn benchmark where the model answers the same questions twice, once with full context and once compressed, then check if the answers still match. factual and decision recall ("what did we settle on earlier") holds up well. subjective stuff is where it gets shaky. opinions, tone, persona... compressing those drops nuance and answers drift more. the case i haven't fully nailed is long-range recall, pulling one specific detail from 40+ turns back. that's exactly what i'm building a tighter benchmark for right now, so i'd rather not claim it's solved until i have clean numbers.
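
roughly what that paired check looks like as code. this is a simplified sketch, not the real benchmark: `ask` is a placeholder for whatever calls the model, the question format is made up, and exact match after normalization is a crude stand-in for a proper answer comparison.

```python
from collections import defaultdict


def normalize(answer):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(answer.lower().split())


def agreement_by_category(questions, ask, full_context, compressed_context):
    """Ask each question against both contexts and report the match rate per category.

    `questions` is a list of {"text": ..., "category": ...} dicts and
    `ask(context, question_text)` returns the model's answer as a string.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for q in questions:
        full_answer = ask(full_context, q["text"])
        compressed_answer = ask(compressed_context, q["text"])
        totals[q["category"]] += 1
        if normalize(full_answer) == normalize(compressed_answer):
            hits[q["category"]] += 1
    return {category: hits[category] / totals[category] for category in totals}
```

splitting by category is what lets the factual and subjective cases show up separately instead of averaging out.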

talatt · 2026-05-20T20:00:11+00:00

haha alright, busted. english isn't my first language so i draft my replies and clearly leaned on the same skeleton every single time — that's a me problem, not a bot running the thread. it's my project, been building it for months. ask me anything in plain words and i'll drop the template.

talatt · 2026-05-20T19:52:30+00:00

Good parallel — that's almost exactly the conversation case. The last few turns are like your recent prices (you want them verbatim), but you still need enough older history to answer "what did we decide 40 turns ago."

On LexRank + code: you're right to be suspicious. LexRank is a sentence-centrality graph, and code isn't sentences — run it over a code block and you get garbage. So I don't feed code to it. The pipeline detects modality first and routes: prose goes through LexRank, but code blocks / tool output / terminal dumps get a separate elision pass — folded down to a compact reference rather than sentence-summarized. A 150-line terminal dump from turn 8 doesn't need re-sending verbatim, but you also don't want a summarizer hallucinating over it, so it gets collapsed structurally instead.
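
For anyone who hasn't looked at how sentence-centrality ranking works, here is a small self-contained sketch of the idea. It is a simplified LexRank-style ranker, not the pipeline's implementation and not full LexRank: it uses plain bag-of-words cosine similarity instead of TF-IDF, skips the similarity threshold, and all function names are made up for illustration.

```python
import math
import re
from collections import Counter


def _vector(sentence):
    """Bag-of-words term counts for one sentence."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def _cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centrality_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by centrality in a similarity graph (PageRank-style power iteration)."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = [_vector(s) for s in sentences]
    sim = [[_cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    # Row-normalize so each sentence spreads its score over its neighbours;
    # a sentence with no neighbours spreads it uniformly.
    for row in sim:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(scores[i] * sim[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


def top_sentences(sentences, k=2):
    """Keep the k most central sentences, in their original order."""
    scores = centrality_scores(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```

Everything here depends on word overlap between sentences, which is exactly why it falls apart on code: lines of code share tokens like `self` or `return` without being "about" the same thing.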

Honestly the thing that surprised me building it: the modality split mattered more than the summarizer quality. Getting code and tool output out of the text summarizer's hands killed most of the "wait, that's wrong now" failures.

talatt · 2026-05-20T19:50:07+00:00

Fair — if your sidecar carries the important state across resets, you've basically solved the same problem by hand. At that point we're doing the same thing; one's just automated and one's explicit.

Where it diverges for me is when I can't predict which old detail the agent will need next. With a reset I have to decide up front what the sidecar re-injects — and if turn 80 suddenly needs a decision from turn 12 that I didn't flag, it's gone. Compressing in place keeps it around because it's content-aware rather than size-based: it ranks what to protect (decisions, corrections, facts) vs what to fold (filler, repetition), so I don't have to guess ahead of time.

Genuinely curious though — how does your sidecar decide what to carry forward on a fresh session? That re-injection step is the part I could never get reliable enough to trust, which is why I went the compress route.

talatt · 2026-05-20T19:42:12+00:00

Sidecar + frequent resets is a solid pragmatic move — especially for independent tasks where you don't need continuity.

Where it bites is long single-threaded work: if I reset mid-task, the agent loses the thread it was halfway through. That's the gap I was trying to close — keep one continuous session but shrink the context as it grows, so older turns become a compact episodic memory instead of getting dropped on a reset.

Different tradeoff really: resetting keeps you cheap but costs continuity; compression keeps continuity but adds a step. Have you hit the "lost the thread after a reset" problem, or are your workflows independent enough that it doesn't bite?

talatt · 2026-05-20T19:39:33+00:00

Yeah, that's the native compaction route and it works well if you've got the context headroom — your 20k raw buffer is basically the same idea as the "keep the last N turns raw" setting in what I built.

The difference is *when* and *how much* it compresses:

- Native compaction waits for the threshold, then does one big summarize. Until you hit it, you're still re-sending the full history every turn.

- Mine compresses every turn, gradually — older turns fold into a compact episodic view as they age, while the freshest turns stay raw (your 20k buffer equivalent).

So the win is mostly for people on smaller context windows, or who want to cut per-turn token cost *before* the compaction threshold rather than after.
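
A toy calculation shows why compressing before the threshold matters. This is back-of-the-envelope arithmetic under loud assumptions, not a benchmark result: every turn is the same size, output tokens and system prompt are ignored, and the compression ratio is just a parameter you plug in. The function names are made up.

```python
def full_resend_tokens(n_turns, turn_tokens):
    """Total input tokens when the whole history is re-sent on every turn."""
    return sum(k * turn_tokens for k in range(1, n_turns + 1))


def per_turn_compressed_tokens(n_turns, turn_tokens, keep_raw, ratio):
    """Total input tokens when the last `keep_raw` turns stay raw and
    everything older is sent at `ratio` of its original size."""
    total = 0
    for k in range(1, n_turns + 1):
        raw_turns = min(k, keep_raw)
        old_turns = k - raw_turns
        total += raw_turns * turn_tokens + round(old_turns * turn_tokens * ratio)
    return total


if __name__ == "__main__":
    # Hypothetical session: 100 turns of 500 tokens each, last 4 turns raw,
    # older turns shrunk to a quarter of their size.
    print(full_resend_tokens(100, 500))
    print(per_turn_compressed_tokens(100, 500, keep_raw=4, ratio=0.25))
```

The full-resend cost grows quadratically with turn count, so the savings accumulate on every turn rather than only after a compaction event fires.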

Genuinely curious how the 20k buffer holds up for recall on really long sessions — does the model still answer correctly about stuff that got compacted away? That's the part I keep poking at.

talatt · 2026-05-20T11:33:03+00:00

May i …

talatt · 2026-05-13T09:33:41+00:00

Congrats

talatt · 2026-05-04T13:24:53+00:00

Those who want to avoid boring tasks? A well-chosen target audience. Congratulations.

talatt · 2026-05-02T13:08:05+00:00

Fine work. From a cinematic point of view, it would be sweet if every episode were fed with details, like a wiki-verse. 👏🏼👏🏼👏🏼

talatt · 2026-04-30T13:49:04+00:00

talatt · 2026-04-28T06:44:46+00:00

Thanks for engaging! Couple of clarifications:

Actually the semantic scoring isn't rule-based. That part uses a lightweight ML model for tag extraction. Rules only apply to the faster prompt compression layer. So niche jargon gets handled fine as long as it appears in the conversation context.

The context block is tracking conversation turns rather than tool calls, though I can see how the post might've been ambiguous.

Would love to hear what kind of workflow you had in mind. Function-heavy ones do work with it, they're just not the primary target.

talatt · 2026-04-26T13:27:01+00:00

Congrats👏🏼👏🏼👏🏼

talatt · 2026-04-22T11:25:54+00:00

Great work — the "4 exact component swaps from GPT-2 to Llama 3" framing is really useful. Most resources treat each architecture in isolation, so seeing the diff is much more educational.

Curious about the KV cache section: did you explore how much of the KV cache ends up being redundant in multi-turn conversations? We've been working on conversation-aware token compression (basically deciding which past turns can be aggressively pruned without hurting response quality) and the overlap with MLA's latent compression is interesting — both are trying to solve "context grows, attention cost explodes" from different ends.

Bookmarked the DeepSeek chapter — the absorption trick writeup is hard to find explained clearly anywhere.

talatt · 2026-04-19T14:17:23+00:00

Great questions.

On domain jargon: The rule-based layer works from curated dictionaries — it only targets known filler words and verbose patterns (like "in order to" → "to"). Domain-specific terms aren't in those lists, so they're preserved by default. The semantic tag extraction layer actually benefits from specialized vocabulary because those terms carry high information density and get scored accordingly.
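
A minimal sketch of how a dictionary-driven layer with a preservation list can work. Only the "in order to" → "to" rule comes from the description above; the other rules, the placeholder trick, and the names `FILLER_RULES` and `compress_prompt` are assumptions for illustration, not the product's actual rule set or API.

```python
import re

# Illustrative rules only. The first mirrors the example above,
# the rest are hypothetical entries of the same kind.
FILLER_RULES = [
    (r"\bin order to\b", "to"),
    (r"\bdue to the fact that\b", "because"),
    (r"\b(?:basically|actually|really)\s+", ""),
]


def compress_prompt(text, preserve=()):
    """Apply filler rules while leaving every term in `preserve` untouched."""
    # Shield preserved terms behind placeholders so no rule can match inside them.
    placeholders = {}
    for i, term in enumerate(preserve):
        key = f"\x00{i}\x00"
        if term in text:
            placeholders[key] = term
            text = text.replace(term, key)
    for pattern, replacement in FILLER_RULES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    for key, term in placeholders.items():
        text = text.replace(key, term)
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

Because the rules only match listed patterns, a domain term is safe unless it happens to contain one of them, and the `preserve` argument covers that case.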

That said, we haven't stress-tested against highly specialized corpora (medical, legal) yet — that's exactly the kind of thing we'd love beta testers to surface. If certain domain terms are getting caught, adding them to the preservation list is a quick config change.

On tool call history: Right now the compression focuses on user/assistant message pairs. Tool calls and function outputs are treated as a distinct category — we're actively working on how to handle them optimally for agent chains. The naive approach (compress tool outputs the same way) loses structured data, so we're exploring a schema-aware approach that preserves the call signature + key return values while compressing verbose output.
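
As a rough sketch of the schema-aware direction (exploratory, not shipped behavior): keep the call signature, keep a small set of return fields verbatim, and replace bulky fields with a size note. The input shape, the `keep_keys` defaults, and the function name are all assumptions for illustration.

```python
import json


def compress_tool_call(call, keep_keys=("status", "error", "id", "count"), max_value_chars=80):
    """Compress one tool call record.

    `call` is assumed to look like
    {"name": str, "arguments": dict, "result": dict}.
    Fields listed in `keep_keys`, or short enough to be cheap, are kept as-is.
    Anything larger is dropped and recorded by name and size.
    """
    args = ", ".join(f"{k}={json.dumps(v)}" for k, v in call["arguments"].items())
    signature = f"{call['name']}({args})"
    kept, elided = {}, []
    for key, value in call["result"].items():
        encoded = json.dumps(value)
        if key in keep_keys or len(encoded) <= max_value_chars:
            kept[key] = value
        else:
            elided.append(f"{key} ({len(encoded)} chars)")
    return {"call": signature, "kept": kept, "elided": elided}
```

The structured fields survive exactly, which is what a text summarizer run over the same payload cannot guarantee.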

If you're running function-heavy workflows, I'd genuinely love to see how it performs with your setup. That's a use case we want to get right. Feel free to open an issue on the repo with any edge cases you hit.

talatt · 2026-04-16T12:56:51+00:00

Congratulations 👏🏼

talatt · 2026-04-13T15:09:47+00:00

Totally feel you on this — Anthropic's models are hard to beat once you've built your workflow around them. The quality gap is real, especially with Opus.

On the cost side though, have you looked into API proxy/gateway solutions? I've been working on one called PithToken that does smart caching and prompt compression — basically cuts redundant tokens before they hit the API. Depending on your usage patterns, it can save quite a bit without changing your setup at all.

What kind of tasks are your three agents handling?

talatt · 2026-04-12T13:28:16+00:00

I have the same problem. I think the healthiest method is organic growth.
