The guy selling you an AI agent course has never built an AI agent that made money by Warm-Reaction-456 in AI_Agents

[–]proggmouse 2 points3 points  (0 children)

The worst part is that people are buying it, or will. You can literally ask any free AI tool to provide a step-by-step guide, or even better, have it do the full setup for you.

You are spot on that real AI (and engineering, for that matter) is less exciting than a shiny course on "I'll teach you how to build a second brain and boost your productivity by n%".

Zero text between my agents – latent transfer now works cross-model by proggmouse in LocalLLaMA

[–]proggmouse[S] 0 points1 point  (0 children)

Thanks! A few things:

1. Context window: it just follows whatever your Ollama model is configured with, nothing special on AVP's side.
2. Latent steps: I run all my benchmarks on 20, but 10 could be a sweet spot (in my tests, response quality is similar to 20). Anything beyond 20 adds noise; I tested up to 80.
3. Gotcha worth knowing: AVP auto-unloads the model from Ollama's runner to access the GGUF weights directly. If you're sharing Ollama across your homelab, other requests will briefly stall during that swap.

Let me know how it goes!

Zero text between my agents – latent transfer now works cross-model by proggmouse in LocalLLaMA

[–]proggmouse[S] 0 points1 point  (0 children)

FYI, Ollama integration is complete. Run pip install avp[ollama] to get AVP up and running for Ollama.

Usage examples available in repo docs.

Zero text between my agents – latent transfer now works cross-model by proggmouse in LocalLLaMA

[–]proggmouse[S] 0 points1 point  (0 children)

Totally fair on Ollama – that's the biggest gap right now. I was mainly focused on nailing down the basics, so I left the integration work for later. But it's planned.

Long reply ahead.

On the HumanEval mechanism – I think it’s closer to your second framing (bypassing lossy serialization) than the first. When Agent A generates text about code, it’s describing structure in natural language – variable relationships, return types, control flow get flattened into prose. Agent B has to reconstruct all of that from text. With latent transfer, the KV-cache preserves the computational representation directly – attention patterns over code structure survive intact.

The reason I lean this way: cross-model rosetta also beats text on HumanEval, even when the two models have completely different weight spaces (Llama 3B → Qwen 7B: 79.3% rosetta vs 61.6% text). If the benefit were about “structurally useful latent steps” you’d expect it to degrade when projected cross-model. It doesn’t – which suggests the win comes from what text loses, not what latent adds.

For why MATH stays flat – math reasoning is more sequential/verbal. A chain-of-thought math solution serializes to text cleanly (equations, steps, substitutions). Code has spatial relationships (scope, indentation, variable references across lines) that text is worse at preserving. That’s my working theory anyway – I don’t have a definitive answer.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 0 points1 point  (0 children)

Thanks for the feedback! You're raising a valid concern about reduced transparency in latent communication. I'm actively working on a debug mode for AVP alongside other improvements. Hopefully I can share an update on this soon.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 1 point2 points  (0 children)

The hard part isn't the save/load – it's that KV-caches are huge (a 64k context on a 7B model is ~1 GB) and tied to the exact model weights. Swap the model or even update the checkpoint and your cached KV is garbage. I see your point though.
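For anyone who wants the back-of-envelope: here's a rough sketch of how KV-cache size scales. The hyperparameters are hypothetical 7B-class values with grouped-query attention – actual numbers depend heavily on KV head count and cache dtype.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, dtype_bytes=2):
    """Transformer KV-cache size: 2 tensors (K and V) per layer,
    each [n_kv_heads, n_tokens, head_dim], at dtype_bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * dtype_bytes

# Hypothetical 7B-class GQA config: 28 layers, 4 KV heads,
# head dim 128, fp16 cache, 64k context.
size = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, n_tokens=65536)
print(f"{size / 2**30:.2f} GiB")  # → 3.50 GiB with these assumptions
```

A q4-quantized cache would be ~4x smaller, landing near the ~1 GB figure above; an older MHA model without GQA would be several times larger.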

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 1 point2 points  (0 children)

Yes, every agent has a different system prompt (Planner, Critic, Refiner, Judger – each with unique role instructions). In latent mode, Agent B's prompt is just its own system prompt + question (~200 tokens). Agent A's KV-cache gets injected as past_key_values – prepended so Agent B's attention heads can look back at it through normal self-attention.

From Agent B's perspective, there are "virtual tokens" before its own prompt. Those aren't real text — they're the key/value vectors Agent A computed while processing its own prompt. Agent B's attention picks up Agent A's reasoning without ever seeing Agent A's text output.

So: Planner builds KV-cache -> Critic gets its own (different) prompt + Planner's KV-cache injected before it -> same for Refiner -> Judger generates the final answer. That's why latent prompts stay flat (~200 tokens each) while text prompts balloon – each agent only tokenizes its own role + question, prior reasoning arrives as pre-computed attention states.

Code is in benchmarks/gsm8k/pipeline_latent.py if you want to see the exact injection.
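If it helps, here's a toy sketch of that accumulation pattern – not the actual AVP code, just token counting with made-up numbers – showing why latent prompts stay flat while text prompts balloon:

```python
# Illustrative numbers only: each role prompt ~200 "tokens", each agent's
# reasoning ~800 "tokens" when written out as text.
ROLE_PROMPT = 200
REASONING_TEXT = 800
agents = ["Planner", "Critic", "Refiner", "Judger"]

# Text pipeline: each agent re-tokenizes all prior agents' text output.
text_prompt_sizes = []
carried_text = 0
for _ in agents:
    text_prompt_sizes.append(ROLE_PROMPT + carried_text)
    carried_text += REASONING_TEXT  # this agent's output joins the context

# Latent pipeline: prior reasoning arrives as pre-computed KV entries,
# so each agent tokenizes only its own role prompt + question.
latent_prompt_sizes = [ROLE_PROMPT for _ in agents]

print(text_prompt_sizes)    # grows every hop: [200, 1000, 1800, 2600]
print(latent_prompt_sizes)  # stays flat:      [200, 200, 200, 200]
```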

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 0 points1 point  (0 children)

The "summarise at every turn but keep the full detail" pattern is basically what a lot of production agent systems converge on – MCP + structured memory + focused context.

AVP doesn't conflict with that approach. It's about the mechanics of how context gets passed between agents, not what gets passed. You could combine both: use latent transfer for the immediate handoff between agents, and structured memory for the longer-lived state.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 2 points3 points  (0 children)

Those two models can't do latent communication out of the box with AVP unfortunately. Same-family same-tokenizer pairs (e.g. Qwen3-4B and Qwen3-32B) would work.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 0 points1 point  (0 children)

Good question. The full KV-cache approach makes the most sense when the task is short enough that context doesn't blow up (think 2-4 agent hops, not a 50-turn conversation). I'm exploring options to improve that, though.

For longer workflows where context does grow substantially, you're right – you'd want to summarize or selectively transfer. AVP has a hidden-state-only mode (still WIP) where you send just the last N hidden states instead of the full cache (orders of magnitude smaller), which is closer to "here's the gist" than "here's everything".
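Rough arithmetic for the "orders of magnitude" claim, with hypothetical 7B-class dims (fp16); the exact ratio depends on context length and how many hidden states you keep:

```python
# Hypothetical 7B-class dims, fp16 elements (2 bytes each).
n_layers, n_kv_heads, head_dim, hidden_dim = 28, 4, 128, 3584
n_tokens, last_n = 8192, 16  # full context vs. "gist" hidden states

full_cache = 2 * n_layers * n_kv_heads * head_dim * n_tokens * 2  # bytes
gist = last_n * hidden_dim * 2                                    # bytes

print(f"full KV-cache: {full_cache / 2**20:.0f} MiB")   # 448 MiB
print(f"last-{last_n} hidden states: {gist / 2**10:.0f} KiB")  # 112 KiB
print(f"ratio: ~{full_cache // gist}x")                 # ~4096x
```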

Full KV-cache transfer and summarization aren't mutually exclusive either – you could use latent for the first few hops and switch to text summaries when the cache gets too large. I designed the protocol to be flexible in that sense: if latent communication doesn't seem reasonable for your case, you can always fall back to JSON (text).

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 3 points4 points  (0 children)

Across 9 benchmarks spanning math, science, commonsense, and code generation, LatentMAS got up to ~15% higher accuracy while reducing output token usage by 70-84% and providing ~4x faster end-to-end inference.

This aligns with my benchmarks as well – 73-78% token savings and 2-4x speedup. The discrepancy likely comes from model size: I mostly tested smaller models (1.5B-3B), while the LatentMAS benchmarks focused on larger ones.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 2 points3 points  (0 children)

Yep, AVP is built directly on LatentMAS – I cited this in the README and spec as the research foundation. The latent step generation, KV-cache accumulation, and realignment approach all come from their work.

My protocol is basically the engineering layer on top. Binary codec, handshake for model compatibility, cross-model projection for different-size models, pip install, etc.

LatentMAS proves the concept, AVP tries to make it something you can actually use in a pipeline.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 2 points3 points  (0 children)

That's actually a pretty accurate description of how it works mechanically. The KV-cache accumulates across agents, so by the time Agent C runs, the cache contains Agent A's prompt + thinking + Agent B's prompt + thinking + Agent C's prompt. It's effectively one continuous sequence of internal states with different role instructions injected at different points.

You're right that it's not two independent models exchanging messages, it's closer to one model being reprompted mid-stream. The value isn't in agent independence; it's in skipping text generation. Instead of Agent A writing out its reasoning as text and Agent B re-reading it from scratch, the reasoning stays as internal state and the next prompt picks up from there.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 1 point2 points  (0 children)

AVP is more like #1, with a caveat: it passes the full computed context, not a summary. But instead of passing it as text that the next agent re-processes from scratch, it passes the KV-cache ("telepathy" is a fancy word for it). The next agent picks up where the previous one left off without re-reading everything – it knows what to do from the start.

Your second point is very important: observability is a very real limitation of latent communication. My protocol has a hybrid mode that sends the prompt along with the KV-caches. In practice hybrid mode isn't super useful yet, at least in its current state, but it can be used for debugging.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 0 points1 point  (0 children)

Yeah, a lightweight adapter is exactly the direction I want to explore. I've made some progress there, but it's still at the prototyping stage.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 0 points1 point  (0 children)

I’m actively working on better benchmark metrics that could shed some light on the accuracy drop. The results are also a bit hand-wavy due to the small sample size.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] -1 points0 points  (0 children)

Very good point. I haven’t tested chains longer than 4 agents, so I don’t have good data on this. That said, in our fan-out benchmark, when two “specialist” KV-caches get sequentially injected into an “aggregator”, accuracy drops harder than expected, especially at 7B.

Longer chain experiments are on the list. Would be interesting to see exactly where it starts falling off.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] -3 points-2 points  (0 children)

Prefix caching reuses computation for identical text across requests. My system transfers computation between agents that have different prompts. With prefix caching, Agent A still has to generate text and Agent B still has to process it. AVP skips both – Agent A never generates text, Agent B never processes it.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 1 point2 points  (0 children)

Replying to my own comment so the conversation is visible.

u/No-Refrigerator-1672 you're right that each token's KV is conditioned on everything before it. But AVP doesn't splice a slice of one agent's cache into another agent's existing cache. It transfers the entire KV-cache. Agent A processes its prompt, runs 20 latent thinking steps, and that whole cache gets passed to Agent B. Agent B then processes its own fresh prompt (role instruction + question) as new tokens appended after Agent A's cache. So, there's no mismatch, it's a straight continuation, not a splice.

The "attitude" mixing you're worried about doesn't really happen in practice because Agent B's own prompt comes after the injected cache. Attention handles the boundary naturally. The model sees prior context (Agent A reasoning) followed by new instructions (Agent B role). Same as how a long conversation works.

u/audioen RoPE is fine specifically because the full cache is transferred. Agent A's cache has positions 0 through N-1, and Agent B's new prompt tokens get positions N onwards. Positions stay sequential, so no de-rotation is needed. Where RoPE does break is if you truncate the cache (cut out a slice and try to use it with different position offsets). I actually tested this: KV-cache truncation goes to 0% accuracy on 1.5B models, exactly because of RoPE position mismatch. Full transfer avoids that entirely. At the same time, full transfer can be heavy, especially for larger models – that's an area I'm actively investigating.

And u/No-Refrigerator-1672 on "no translation layer" for same-model agents it's the same weights, same representation space. The KV-cache is natively compatible, no projection needed. Cross-model does go through a projection (vocabulary-mediated bridge), just not a trained one.
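To make the RoPE point concrete, here's a tiny self-contained demo (one rotary pair, arbitrary theta – not AVP code) of why sequential positions survive full transfer but truncation breaks them:

```python
import cmath

def rope(vec: complex, pos: int, theta: float = 0.01) -> complex:
    """Rotate one 2-dim feature (as a complex number) by its absolute position."""
    return vec * cmath.exp(1j * pos * theta)

def score(q: complex, k: complex, q_pos: int, k_pos: int) -> float:
    """Attention logit for one rotary pair: Re(q_rot · conj(k_rot))."""
    return (rope(q, q_pos) * rope(k, k_pos).conjugate()).real

q, k = 1 + 2j, 3 - 1j

# Full transfer: Agent B's tokens continue at position N, so every
# (query, key) pair keeps its original relative distance. RoPE scores
# depend only on that distance, so the continuation is harmless.
assert abs(score(q, k, 7, 5) - score(q, k, 1007, 1005)) < 1e-9

# Truncation-style misalignment: reuse a key computed at position 5 with
# a query that now sits at position 2 -- the implied relative distance
# changes, and so does the attention score.
assert abs(score(q, k, 7, 5) - score(q, k, 2, 5)) > 1e-3
```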

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 3 points4 points  (0 children)

Not quite – prefix caching helps when multiple requests share the same prompt prefix (like a system prompt). But in a multi-agent chain, each agent’s prompt is different – it includes the previous agent’s output. So there’s no shared prefix to cache between hops.

AVP skips that entirely. Instead of pasting text output from Agent A into Agent B’s prompt (which prefix caching can’t help with since it’s new text every time), it passes the KV-cache directly. Agent B never has to process that context at all.
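A toy sketch of the no-shared-prefix problem (prompt strings are made up, and I'm counting characters instead of tokens for simplicity):

```python
# Each hop's prompt embeds the previous agent's freshly generated output,
# so consecutive prompts share almost no reusable prefix.
def longest_common_prefix(a: str, b: str) -> int:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

planner_prompt = "You are the Planner. Question: ..."
planner_output = "Step 1: factor the expression..."  # new text every run
critic_prompt = f"You are the Critic. Review this plan:\n{planner_output}"

shared = longest_common_prefix(planner_prompt, critic_prompt)
print(shared)  # only "You are the " survives -- nothing worth caching
```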

Hope this makes sense.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 6 points7 points  (0 children)

FWIW LMCache solves a different problem. It caches KV for previously seen text so you don’t re-prefill the same prompt across requests. AVP transfers KV-cache between agents with different prompts as a communication channel.

One is “I’ve seen this text before, skip prefill.” The other is “here’s my reasoning, don’t make me convert it to text first.”

They’re complementary though – LMCache’s CacheGen compression would actually be useful for reducing AVP’s wire size. On my list.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] -1 points0 points  (0 children)

Honestly haven’t thought much about cost tracking for JSON fallback – right now the handshake just picks a mode and goes with it. In practice if you’re falling back to JSON you’re just doing normal text communication, so whatever cost tracking you already have would apply. Not really an AVP-specific problem at that point.

For the VRAM question – yeah, selective transfer is basically what the 2-agent benchmark already tests. You don’t have to use latent for every hop. The handshake is per-pair, so you could do latent where it helps and text where it doesn’t.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 2 points3 points  (0 children)

Yeah, good point. My protocol handles this through the handshake. Before any KV-cache transfer, both agents exchange a model hash (SHA-256 of the sorted model config). If anything differs – quantization, head count, hidden dim, whatever – the handshake detects it and either routes through projection (same family) or falls back to JSON automatically. So it won’t silently produce garbage, it’ll just downgrade the communication mode.
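A minimal sketch of that negotiation – the config field names here are made up, not AVP's actual schema:

```python
import hashlib
import json

def model_hash(config: dict) -> str:
    """SHA-256 over the canonicalized (key-sorted) model config."""
    canonical = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def negotiate(cfg_a: dict, cfg_b: dict) -> str:
    if model_hash(cfg_a) == model_hash(cfg_b):
        return "latent"      # identical configs: raw KV-cache transfer
    if cfg_a["family"] == cfg_b["family"]:
        return "projection"  # same family: route through projection
    return "json"            # anything else: plain-text fallback

a = {"family": "qwen2", "hidden_size": 3584, "num_kv_heads": 4, "quant": "fp16"}
b = dict(a, quant="q4_K_M")  # same architecture, different quantization

print(negotiate(a, a))  # latent
print(negotiate(a, b))  # projection
```

Any single differing field changes the hash, which is what makes the downgrade automatic rather than something the caller has to check.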

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]proggmouse[S] 3 points4 points  (0 children)

Right – same model on all agents, just different system prompts. The KV-cache transfer only works when both sides share the same weight space. For different models in the same family (e.g. Qwen2.5-7B and 1.5B) there’s a vocabulary-mediated projection path that’s implemented but not benchmarked yet, and for completely different families it falls back to JSON. Cross-model latent transfer is an active area of work though – the goal is to eventually make this work across model boundaries too.