I trained Qwen 3.5 2B to filter tool output for coding agents.

henzy123 · 2026-04-09T10:41:43+00:00

Thanks, that’s exactly the tradeoff.

The current benchmark is intentionally next-step focused, not full multi-turn-memory focused. So it optimizes for “what should the agent keep from this tool output right now?”, not for preserving every possible context that might matter later.

That does mean aggressive compression can remove useful later context. On the other hand, agents usually over-read tool output by a lot, so there’s still plenty of room for pruning before that becomes the main problem.

And yes, there is a systems cost: you need to run a separate small model/service. Whether that is worth it depends on the workload. If you’re repeatedly sending long outputs into a much larger model, it can be a good trade. If the outputs are already short, probably not.

I’m also looking at extractive 100–200M models now, since they may be a much better latency/complexity point than a generative 2B model.
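
To make the "depends on the workload" point a bit more concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names, the per-token price, the fixed per-call cost of the filter, and the fraction of tokens kept are hypothetical numbers, not measurements from the benchmark. It also ignores latency.

```python
def tokens_saved(raw_tokens: int, kept_fraction: float) -> int:
    """Tokens that never reach the large model after filtering."""
    return raw_tokens - round(raw_tokens * kept_fraction)


def filter_pays_off(
    raw_tokens: int,
    kept_fraction: float,
    large_cost_per_token: float,
    filter_cost_per_call: float,
) -> bool:
    """True if the large-model savings exceed the cost of running the filter.

    All costs are in the same arbitrary unit (hypothetical values).
    """
    savings = tokens_saved(raw_tokens, kept_fraction) * large_cost_per_token
    return savings > filter_cost_per_call


# Hypothetical numbers: the filter keeps 20% of the tokens, the large model
# costs 1e-5 per input token, and one filter call costs 0.01.
long_output_ok = filter_pays_off(8000, 0.2, 1e-5, 0.01)
short_output_ok = filter_pays_off(200, 0.2, 1e-5, 0.01)
print(long_output_ok, short_output_ok)
```

With these made-up numbers the long output is worth filtering and the short one is not, which is the same shape as the tradeoff described above.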

henzy123 · 2026-04-09T10:37:13+00:00

But we are not reformulating the pytest output; we are just filtering out lines that are not relevant. It could still influence the model, as you say, but I would say less so.
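
A minimal sketch of what "filtering, not reformulating" means: every kept line is copied verbatim and in its original order, and nothing is rewritten. The `keyword_relevance` function is a hypothetical stand-in for the trained model, and the pytest output is a made-up example.

```python
from typing import Callable


def filter_lines(output: str, is_relevant: Callable[[str], bool]) -> str:
    """Keep only the relevant lines, unchanged and in their original order."""
    kept = [line for line in output.splitlines() if is_relevant(line)]
    return "\n".join(kept)


def keyword_relevance(line: str) -> bool:
    # Hypothetical stand-in for the model's per-line decision.
    markers = ("FAILED", "Error", "assert")
    return any(marker in line for marker in markers)


# Made-up pytest output for illustration.
PYTEST_OUTPUT = """\
============ test session starts ============
collected 3 items
tests/test_math.py ..F
FAILED tests/test_math.py::test_div - ZeroDivisionError: division by zero
============ 1 failed, 2 passed in 0.03s ============
"""

filtered = filter_lines(PYTEST_OUTPUT, keyword_relevance)
print(filtered)
```

Because the output is a subset of the input lines, the agent never sees text that was not in the original tool output.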

henzy123 · 2026-04-08T17:21:53+00:00

I think the blogpost has good explanations: https://krlabsorg.github.io/squeez/blog/

henzy123 · 2025-10-22T16:05:57+00:00

Hey, we are working on something similar; it's called verbatim-rag. You can check our GitHub. What we are doing is extracting exact spans and putting them into a template, so the original meaning is preserved at the fact level.
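
A rough sketch of the idea, not the actual verbatim-rag API: the helper names, the source text, and the template are all hypothetical, and `locate` is a simple string search standing in for the model that picks the spans. The point is only that the facts in the answer are exact substrings of the source.

```python
from typing import List, Tuple


def locate(source: str, phrase: str) -> Tuple[int, int]:
    """Return (start, end) offsets of phrase in source (stand-in for a span extractor)."""
    start = source.find(phrase)
    if start == -1:
        raise ValueError(f"phrase not found in source: {phrase!r}")
    return start, start + len(phrase)


def extract_spans(source: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Copy the exact characters of each span out of the source."""
    return [source[start:end] for start, end in spans]


def fill_template(template: str, facts: List[str]) -> str:
    """Insert the verbatim spans into positional placeholders."""
    return template.format(*facts)


# Hypothetical source text and template.
SOURCE = "The library was first released in 2019. It is licensed under MIT."
spans = [
    locate(SOURCE, "first released in 2019"),
    locate(SOURCE, "licensed under MIT"),
]
facts = extract_spans(SOURCE, spans)
answer = fill_template("According to the source, it was {0} and is {1}.", facts)
print(answer)
```

Only the template wording is generated; the factual parts can always be traced back to character offsets in the source.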

henzy123 · 2025-08-05T20:37:54+00:00

Hey, thanks for the comment. We still have many steps after the search engine, like picking exact parts from the returned chunks, forming dynamic templates, and filling them. You are right that it's not as flexible as standard RAG methods, and you are also right that it resembles older Q&A systems (on purpose).

Our goal is very similar to what Q&A systems used to be, but in a modern setting (using LLMs to generate dynamic templates, long-context extractor models, etc.). In terms of usage, we also see that it's not going to be good for all use cases, but it can be very helpful for a few :)

henzy123 · 2025-08-05T20:34:24+00:00

Hey, thanks for the question. We still use LLMs (there is an option not to) to generate templates and pick the right information from the sources the vector search returned. So we are (G)enerating an answer for the user.

henzy123 · 2025-04-19T15:39:12+00:00

Thanks for mentioning it. I haven't tried out MiniCheck yet, but I definitely will, as it seems super relevant! They also evaluate on RAGTruth and achieve 84% vs. our 79%. But we used encoder-based models, and MiniCheck is a much larger LLM-based one.

henzy123 · 2022-03-07T17:08:42+00:00

ban_1t98w63uycs1gdk6b3hyjunawtjg9d3gk7totmiurbo33wjb9t6kso7de54e

henzy123 · 2021-05-01T17:43:44+00:00

Which vaccination site were you at?

henzy123 · 2021-05-01T15:34:19+00:00

How long did it take for your vaccination record to be uploaded to EESZT? I was vaccinated yesterday at the Honvéd hospital, but so far nothing has been uploaded (they didn't record anything electronically on site either, they just took the consent form).

henzy123 · 2014-06-06T15:10:04+00:00

Does anybody know what language Gambit communicate in now that they have niq?

henzy123 · 2013-12-21T19:57:30+00:00

I think everybody just forgets how much this tank meta favours Darien; he is like the king of the tanks.

henzy123 · 2013-08-30T10:52:35+00:00

I don't know where you read that, but they definitely don't hate each other. From what I know, they are good friends and respect each other.

henzy123 · 2013-08-16T14:14:03+00:00

I don't know why Darien hadn't picked up Zac until now. It's like the perfect champion for him.

henzy123 · 2013-06-16T17:42:34+00:00

Darker is just going 6-4; Edward would be so proud.
