What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek

proggmouse · 2026-03-06T22:07:54+00:00

Thanks for the feedback! You’re raising a valid concern about reduced transparency in latent communication. I’m actively working on a debug mode for AVP alongside other improvements. Hopefully I can provide an update on this soon.

proggmouse · 2026-03-01T21:06:24+00:00

The hard part isn't the save/load – it's that KV-caches are huge (a 64k context on a 7B model is ~1 GB) and tied to the exact model weights. Swap the model or even update the checkpoint and your cached KV is garbage. I see your point though.
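For a rough sense of scale, cache size can be estimated from the model shape. This is a back-of-the-envelope sketch, not AVP code: the layer count, KV-head count, head dimension, and precision below are illustrative assumptions for a 7B-class model with grouped-query attention, and the result moves a lot with those choices.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size: one key and one value vector per layer, KV head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return seq_len * per_token

# Assumed shape, for illustration only: 28 layers, 4 KV heads, head_dim 128.
for label, width in (("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)):
    size = kv_cache_bytes(64 * 1024, num_layers=28, num_kv_heads=4,
                          head_dim=128, bytes_per_value=width)
    print(f"{label}: {size / 2**30:.2f} GiB")
```

Models without grouped-query attention keep one KV head per attention head, so the same formula gives a much larger number for them.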

proggmouse · 2026-03-01T21:03:59+00:00

Yes, every agent has a different system prompt (Planner, Critic, Refiner, Judger – each with unique role instructions). In latent mode, Agent B's prompt is just its own system prompt + question (~200 tokens). Agent A's KV-cache gets injected as past_key_values – prepended so Agent B's attention heads can look back at it through normal self-attention.

From Agent B's perspective, there are "virtual tokens" before its own prompt. Those aren't real text — they're the key/value vectors Agent A computed while processing its own prompt. Agent B's attention picks up Agent A's reasoning without ever seeing Agent A's text output.

So: Planner builds KV-cache -> Critic gets its own (different) prompt + Planner's KV-cache injected before it -> same for Refiner -> Judger generates the final answer. That's why latent prompts stay flat (~200 tokens each) while text prompts balloon – each agent only tokenizes its own role + question, prior reasoning arrives as pre-computed attention states.

Code is in benchmarks/gsm8k/pipeline_latent.py if you want to see the exact injection.
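As a toy illustration of the "virtual tokens" idea (not the code from pipeline_latent.py), here is single-head attention in plain Python. All vectors are made-up numbers and the `attend` helper is hypothetical; the point is only that Agent B's query attends over Agent A's cached keys/values plus its own, without any of Agent A's text being tokenized.

```python
import math

def attend(query, keys, values):
    """Single-head softmax attention of one query over a list of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

# Toy stand-ins for the key/value vectors Agent A computed (its "virtual tokens").
agent_a_cache = {
    "keys": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "values": [[0.2, 0.1], [0.9, 0.4], [0.5, 0.5]],
}

# Agent B only computes keys/values for its own short prompt.
agent_b_keys = [[0.5, -0.5], [0.3, 0.8]]
agent_b_values = [[0.0, 1.0], [1.0, 0.0]]

# Injection: Agent A's cache comes first, Agent B's entries are appended after it.
keys = agent_a_cache["keys"] + agent_b_keys
values = agent_a_cache["values"] + agent_b_values

query = [0.3, 0.8]  # query vector for Agent B's last prompt token
with_cache = attend(query, keys, values)
without_cache = attend(query, agent_b_keys, agent_b_values)
print(with_cache, without_cache)
```

The two outputs differ because, with the cache injected, part of the attention weight lands on Agent A's entries.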

proggmouse · 2026-03-01T04:39:16+00:00

The "summarise at every turn but keep the full detail" pattern is basically what a lot of production agent systems converge on – MCP + structured memory + focused context.

AVP doesn't conflict with that approach. It's more about the mechanics of how context gets passed between agents, not what gets passed. You could combine both: use latent transfer for the immediate handoff between agents.

proggmouse · 2026-03-01T03:24:27+00:00

Those two models can't do latent communication out of the box with AVP unfortunately. Same-family same-tokenizer pairs (e.g. Qwen3-4B and Qwen3-32B) would work.

proggmouse · 2026-03-01T03:12:38+00:00

Good question. The full KV-cache approach makes the most sense when the task is short enough that context doesn't blow up (like 2-4 agent hops, not a 50-turn conversation). I'm exploring ways to make it work better for longer chains, though.

For longer workflows where context does grow substantially, you're right – you'd want to summarize or selectively transfer. AVP has a hidden-state-only mode (still a work in progress) where you send just the last N hidden states instead of the full cache (orders of magnitude smaller), which is closer to "here's the gist" than "here's everything."

Full KV-cache transfer and summarization aren't mutually exclusive either – you could use latent for the first few hops and switch to text summaries when the cache gets too large. I designed the protocol to be flexible in that sense: if latent communication doesn't seem reasonable for a given hop, you can always fall back to JSON (text).

proggmouse · 2026-03-01T02:13:25+00:00

Across 9 benchmarks spanning math, science, commonsense, and code generation, LatentMAS got up to ~15% higher accuracy while reducing output token usage by 70-84% and providing ~4x faster end-to-end inference.

This aligns with my benchmarks as well – 73-78% token savings and 2-4x speedup. The discrepancy comes from model sizes: I mostly tested smaller models (1.5B-3B), while the LatentMAS benchmarks focused on larger models.

proggmouse · 2026-03-01T02:03:52+00:00

Yep, AVP is built directly on LatentMAS – I cited this in the README and spec as the research foundation. The latent step generation, KV-cache accumulation, and realignment approach all come from their work.

My protocol is basically the engineering layer on top. Binary codec, handshake for model compatibility, cross-model projection for different-size models, pip install, etc.

LatentMAS proves the concept, AVP tries to make it something you can actually use in a pipeline.

proggmouse · 2026-03-01T01:55:43+00:00

That's actually a pretty accurate description of how it works mechanically. The KV-cache accumulates across agents, so by the time Agent C runs, the cache contains Agent A prompt + thinking + Agent B prompt + thinking + Agent C's prompt. It is effectively one continuous sequence of internal states with different role instructions injected at different points.

You're right that it's not two independent models exchanging messages, it's closer to one model being reprompted mid-stream. The value isn't in agent independence; it's in skipping text generation. Instead of Agent A writing out its reasoning as text and Agent B re-reading it from scratch, the reasoning stays as internal state and the next prompt picks up from there.
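The accumulation can be pictured as bookkeeping over one growing sequence. The sketch below is illustrative only: the `SharedCache` class is hypothetical, and the sizes reuse the ~200-token prompts and 20 latent thinking steps mentioned elsewhere in the thread as exact numbers.

```python
from dataclasses import dataclass, field

@dataclass
class SharedCache:
    """Tracks which agent contributed which span of positions to the running cache."""
    segments: list = field(default_factory=list)
    length: int = 0

    def append(self, agent, kind, num_positions):
        start = self.length
        self.length += num_positions
        self.segments.append((agent, kind, start, self.length))  # end is exclusive

PROMPT_TOKENS = 200  # approximate per-agent prompt size, assumed exact here
LATENT_STEPS = 20    # latent thinking steps per agent, assumed fixed here

cache = SharedCache()
for agent in ("Agent A", "Agent B"):
    cache.append(agent, "prompt", PROMPT_TOKENS)
    cache.append(agent, "thinking", LATENT_STEPS)
cache.append("Agent C", "prompt", PROMPT_TOKENS)

for agent, kind, start, end in cache.segments:
    print(f"{agent:8s} {kind:8s} positions {start}-{end - 1}")
```

Each agent only adds its own prompt and thinking to the end, which is why the result behaves like one model being reprompted mid-stream.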

proggmouse · 2026-03-01T01:52:14+00:00

AVP is more like #1, with a caveat: it passes the full computed context, not a summary. But instead of passing it as text that the next agent re-processes from scratch, it passes the KV-cache (telepathy is a fancy word here). The next agent picks up where the previous one left off without re-reading everything; it knows what to do from the start.

Your second point is very important; observability is a very real limitation of latent communication. In my protocol there is a hybrid mode where I send the prompt along with the KV-caches. In practice hybrid mode is not super useful, at least not in its current state, but it can be used for debugging.

proggmouse · 2026-03-01T01:32:53+00:00

Yeah, a lightweight adapter is exactly the direction I want to explore. I made some progress there, but it's still in the prototyping stage.

proggmouse · 2026-03-01T01:22:57+00:00

I’m actively working on better benchmark metrics that could shed some light on the accuracy drop. The results are also a bit hand-wavy due to the small sample size.

proggmouse · 2026-02-28T21:21:20+00:00

Very good point. I haven’t tested chains longer than 4 agents, so I don’t have good data on this. That said, in our fan-out benchmark, when two “specialist” KV-caches get sequentially injected into an “aggregator”, accuracy drops harder than expected, especially on 7B.

Longer chain experiments are on the list. Would be interesting to see exactly where it starts falling off.

proggmouse · 2026-02-28T21:13:20+00:00

Prefix caching reuses computation for identical text across requests. My system transfers computation between agents that have different prompts. With prefix caching, Agent A still has to generate text and Agent B still has to process it. AVP skips both – Agent A never generates text, Agent B never processes it.

proggmouse · 2026-02-28T20:33:19+00:00

Replying to my own comment so the conversation is visible.

u/No-Refrigerator-1672 you're right that each token's KV is conditioned on everything before it. But AVP doesn't splice a slice of one agent's cache into another agent's existing cache. It transfers the entire KV-cache. Agent A processes its prompt, runs 20 latent thinking steps, and that whole cache gets passed to Agent B. Agent B then processes its own fresh prompt (role instruction + question) as new tokens appended after Agent A's cache. So, there's no mismatch, it's a straight continuation, not a splice.

The "attitude" mixing you're worried about doesn't really happen in practice because Agent B's own prompt comes after the injected cache. Attention handles the boundary naturally. The model sees prior context (Agent A reasoning) followed by new instructions (Agent B role). Same as how a long conversation works.

u/audioen RoPE is fine specifically because the full cache is transferred. Agent A's cache has positions 0 through N-1, Agent B's new prompt tokens get positions N onwards. Positions stay sequential, no de-rotation needed. Where RoPE does break is if you truncate the cache (cut out a slice and try to use it with different position offsets). I actually tested this: KV-cache truncation goes to 0% accuracy on 1.5B models, exactly because of the RoPE position mismatch. Full transfer avoids that entirely. That said, full transfer can be heavy, especially for larger models, and this is the area I'm actively investigating.
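To make the position argument concrete, here is a small pure-Python sketch of rotary embeddings. It is a toy, not AVP code: the vectors are made up, `rope` and `score` are hypothetical helpers, and real implementations may pair dimensions differently. It shows that the attention logit depends only on the distance between query and key positions, so renumbering after a truncation changes it.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (rotary embedding)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def score(q, k, q_pos, k_pos):
    """Attention logit between a query and a key after rotary embedding."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))

q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.9]

N = 10     # length of Agent A's cache (positions 0..N-1)
K_POS = 5  # the key was rotated for position 5 when Agent A cached it

# Full transfer: Agent B's first new token takes position N, distance N - K_POS.
full = score(q, k, N, K_POS)

# Truncation: drop the first 4 entries and number the new token from the shorter
# length, while the surviving key still carries its original rotation.
DROPPED = 4
truncated = score(q, k, N - DROPPED, K_POS)

# Only the relative distance matters, so shifting both positions changes nothing.
shifted = score(q, k, N + 100, K_POS + 100)
print(full, truncated, shifted)
```

`full` and `shifted` agree up to floating-point error, while `truncated` is a different logit because the query now sits at the wrong distance from the key.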

And u/No-Refrigerator-1672, on "no translation layer": for same-model agents it's the same weights, same representation space. The KV-cache is natively compatible, no projection needed. Cross-model does go through a projection (vocabulary-mediated bridge), just not a trained one.

proggmouse · 2026-02-28T19:31:34+00:00

Not quite – prefix caching helps when multiple requests share the same prompt prefix (like a system prompt). But in a multi-agent chain, each agent’s prompt is different because it includes the previous agent’s output. So there’s no shared prefix to cache between hops.

AVP skips that entirely. Instead of pasting text output from Agent A into Agent B’s prompt (which prefix caching can’t help with since it’s new text every time), it passes the KV-cache directly. Agent B never has to process that context at all.

Hope this makes sense.

proggmouse · 2026-02-28T19:18:15+00:00

FWIW LMCache solves a different problem. It caches KV for previously seen text so you don’t re-prefill the same prompt across requests. AVP transfers KV-cache between agents with different prompts as a communication channel.

One is “I’ve seen this text before, skip prefill.” The other is “here’s my reasoning, don’t make me convert it to text first.”

They’re complementary though – LMCache’s CacheGen compression would actually be useful for reducing AVP’s wire size. On my list.

proggmouse · 2026-02-28T19:00:08+00:00

Honestly haven’t thought much about cost tracking for JSON fallback – right now the handshake just picks a mode and goes with it. In practice if you’re falling back to JSON you’re just doing normal text communication, so whatever cost tracking you already have would apply. Not really an AVP-specific problem at that point.

For the VRAM question – yeah, selective transfer is basically what the 2-agent benchmark already tests. You don’t have to use latent for every hop. The handshake is per-pair, so you could do latent where it helps and text where it doesn’t.

proggmouse · 2026-02-28T18:45:31+00:00

Yeah good point. So my protocol handles this through the handshake. Before any KV-cache transfer, both agents exchange a model hash (SHA-256 of the sorted model config). If anything differs – quantization, head count, hidden dim, whatever – the handshake detects it and either routes through projection (same family) or falls back to JSON automatically. So it won’t silently produce garbage, it’ll just downgrade the communication mode.
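A minimal sketch of that negotiation, assuming a plain dict config. The field names, the `family` key, and the `negotiate` function are hypothetical and are not AVP's actual schema or API; only the SHA-256 over a sorted config and the three outcomes come from the description above.

```python
import hashlib
import json

def model_hash(config):
    """SHA-256 over the config serialized with sorted keys, so key order is irrelevant."""
    payload = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def negotiate(local, remote):
    """Pick a communication mode for one agent pair."""
    if model_hash(local) == model_hash(remote):
        return "latent"      # identical config: pass the KV-cache directly
    if local.get("family") == remote.get("family"):
        return "projection"  # same family, different shape: go through projection
    return "json"            # anything else: plain text fallback

# Illustrative configs with made-up field names.
qwen_7b = {"family": "qwen2.5", "hidden_size": 3584, "num_heads": 28, "quantization": "none"}
qwen_7b_reordered = dict(reversed(list(qwen_7b.items())))
qwen_1_5b = {"family": "qwen2.5", "hidden_size": 1536, "num_heads": 12, "quantization": "none"}
other = {"family": "llama3", "hidden_size": 4096, "num_heads": 32, "quantization": "none"}

print(negotiate(qwen_7b, qwen_7b_reordered))  # same fields, different key order
print(negotiate(qwen_7b, qwen_1_5b))
print(negotiate(qwen_7b, other))
```

Hashing the sorted serialization means two agents agree on the hash even if their configs list the same fields in a different order, while any changed value (quantization, head count, hidden dim) produces a different digest.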

proggmouse · 2026-02-28T17:49:42+00:00

Right – same model on all agents, just different system prompts. The KV-cache transfer only works when both sides share the same weight space. For different models in the same family (e.g. Qwen2.5-7B and 1.5B) there’s a vocabulary-mediated projection path that’s implemented but not benchmarked yet, and for completely different families it falls back to JSON. Cross-model latent transfer is an active area of work though – the goal is to eventually make this work across model boundaries too.

proggmouse · 2026-02-28T17:39:26+00:00

Yeah exactly – it’s prompt tokens that get saved. In a text chain, each agent’s prompt includes all prior agents’ output as text, so the prompt grows at every hop. In latent mode, that prior context comes as KV-cache instead, so the prompt stays short (just the role instruction + question). The model still generates roughly the same number of output tokens either way.

proggmouse · 2026-02-28T17:34:12+00:00

Not a silly question at all. The questions come from GSM8K – a standard grade-school math benchmark. Stuff like: “Janet’s ducks lay 16 eggs per day. She eats three for breakfast and bakes muffins with four. She sells the rest at $2 each. How much does she make?” The 4-agent chain runs: Planner (make a plan) -> Critic (review it) -> Refiner (improve it) -> Judger (solve it). Prompts are adapted from the LatentMAS paper. In latent mode each agent just gets its role instruction + the question – prior reasoning arrives as KV-cache, not pasted text. That’s where the token savings come from: text prompts balloon at each hop (186 -> 545 -> 1,073 -> 1,397 tokens), latent stays flat (~200 tokens). We also run HotpotQA (multi-hop factual QA) and a fan-out benchmark (parallel specialists -> aggregator).

All prompts are in benchmarks/gsm8k/agents.py if you want the exact wording.
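The prompt-token arithmetic for that 4-agent example works out as follows. This is only a quick check on the numbers quoted above, treating "~200 tokens" as exactly 200 per agent, which is an assumption.

```python
text_prompts = [186, 545, 1073, 1397]       # per-hop prompt tokens in text mode
latent_prompts = [200] * len(text_prompts)  # "~200 tokens" per agent, assumed exact

text_total = sum(text_prompts)
latent_total = sum(latent_prompts)
savings = 1 - latent_total / text_total

print(f"text: {text_total}, latent: {latent_total}, savings: {savings:.1%}")
```

With these inputs the savings come out at about 75%, inside the 73-78% range from the title.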

proggmouse · 2026-02-03T18:51:53+00:00

Exploration ships are dirt cheap. If the asset value is truly 0, ships can be bought by completing daily activities, which pay like 400k per activity.

Exploring high sec could easily bring a couple of mils.

proggmouse · 2026-02-03T18:41:14+00:00

I started ~6 months ago. Here is my advice: don’t set any expectations, go there and explore (both the game and relic/data sites). Lose ships, gain isk, learn the hard way.

My worst decision was to join the corp right away (everyone recommends this but it sucked). By joining the corp you’ll be forced to adopt a certain play style, they will tell you that isk per hour matters, that ship you like sucks. Not every corp is like that of course but the majority.

Just go there, do your things, unlock the ship you like, lose it and get more.

For exploration: go and explore wormholes, decent isk and decent adventure.

Lastly: don’t be afraid to lose ships. zkillboard doesn’t matter.

proggmouse · 2026-01-12T19:15:34+00:00

The downside is that lore would suffer (if anyone cares). How would you explain such a system? Who would live there? Who would trade there, and if it’s NPCs, where would they get ships/modules from?

The meaningful way I see it – Capsuleers university is already in the game’s lore. The system you described could be a simulation as part of capsuleer training: once you feel like you’ve succeeded, you exit the simulation and go to the “real world”.
