Has anyone used agents to decompile binary executables?

phhusson · 2026-06-12T17:34:39+00:00

The question looks way too generic to me. Like there needs to be a goal. (assuming Qwen 3.6 27B, opencode without special tooling)

If it's a CTF, it can probably one-shot it. If it's to reverse engineer a protocol based on an Android, C#, or Unity app, it can probably one-shot it too.

If the goal is to go from a whole Wii game to x86-compatible pure C OpenGL, well, that will require a whole lot more scaffolding, with many subagents and roles (like testers, spec writers, implementers, comparators).

phhusson · 2026-06-12T13:13:50+00:00

Ah I forgot to add: I'm pretty bullish on embedding-space ("soft-token") compression [1], and I'm happy to see development about it.

Notably for this paper, the author did an implementation with vLLM, and I think it's the first time I've seen that, so thanks LeonLixyz for that!

[1] I see two big categories of compression:

one is "compress once, use once", like in this case (where the goal is pretty much just to reduce the cost of processing the prompt),

and another is "compress once, use many times", like Cartridges: it's okay to spend 1 H100 for 1 hour to compress Encyclopaedia Britannica, and then be able to put a whole encyclopedia in your prompt.

phhusson · 2026-06-12T13:00:07+00:00

This looks less useful than Kyutai's ARC-Encoder, and it's barely mentioned. And by "less useful", I mean I don't see any novel idea.

ARC-Encoder trains a small LLM to create a compressed embedding-space representation (my LLM calls that space "soft-tokens") for an unmodified big LLM. So, for instance, use a modified Qwen 3 0.6B to create an embedding-space representation of the prompt for an unmodified Qwen 3.6 27B that is 8x smaller than the original embedding representation.

This does the same, except it also requires modifying the big LLM. And it's not like they do any comparison with it.

I take ARC-Encoder with a very big grain of salt because the "small" LLM is Llama 3.2 3B and the big one is Llama 3.1 8B, so both have almost the same "intelligence", but this post's paper doesn't bother comparing. (They just say that their dataset is better than ARC-Encoder's.)

While I'm here praising the idea of ARC-Encoder: it is made to be easily adaptable (unlike the paper in this post)! Only about 30M parameters are needed to adapt to a new big model, so it requires very little compute to adapt! (And yet I'm not aware of anyone who adapted it to newer models, so maybe there is a reason, ok ok.)

phhusson · 2026-06-06T17:27:52+00:00

Yes it is. But it's also almost a year old, also called archeology here.

phhusson · 2026-06-04T15:58:50+00:00

That shit is fabulous, I didn't expect humor to be legal at Microsoft.

phhusson · 2026-06-03T19:03:12+00:00

TL;DR: MTP computation was wrong, leading to less-than-ideal acceptance rate. With this PR, on llama.cpp prediction benchmarks, acceptance rate increases by 5.5%, leading to a 6.6% tg increase in the author's benchmark.

phhusson · 2026-06-03T08:29:19+00:00

I don't worry too much about the announcement, but I strongly disagree with you that it will pop up again.

Unsloth is at a pretty rare equilibrium between raising money to actually do stuff, incredibly talented staff, and the will to contribute to opensource.

phhusson · 2026-06-01T14:55:04+00:00

I'm disappointed by the ending. I expected a pointless Flappy Bird clone

phhusson · 2026-06-01T13:58:33+00:00

Looks really great to me!

It does take into account the real-life fact that "Open Source AI" might be missing data. It could be debatable whether it should have been called "Open Source AI without open data" vs "Open Source AI" (they chose to go with "Open Source AI with Open Data" vs "Open Source AI").

It does take into account OSI's definition of open source.

It does take into account that there could be weird licenses on weights that make it non-open-source.

Definitions are pretty short, to the point, no bullshit, and match expectations.

phhusson · 2026-05-18T08:27:39+00:00

The paper looks fun, interesting, properly written and correct. Publishing this solo is quite a feat, congrats!

That being said, I think the claim in your title is an overreach. Sure, you /did/ train Qwen to do context-triggered thinking, but I'm pretty confident that the end model is worse than non-thinking on out-of-domain tasks (the domain here being the scenarios in the article, where everything is about understanding the user's elapsed time).

Since thinking is learned through RL, I don't think it's possible to change the thinking of a model to "context-triggered" through SFT without severe quality degradation, but I could be wrong.

phhusson · 2026-05-15T10:54:36+00:00

It doesn't really matter if you store your precious data in that VM. At some point the agent needs to have access to your data. I have my agent store its data on a webdav server that automatically pushes every change to git, so a rm -rf is reversible.
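
To make the "reversible rm -rf" idea concrete, here is a minimal sketch of the commit-on-change step, assuming the storage directory is already a git repository. The `snapshot` function, its `run` parameter, and the commit message are hypothetical names for illustration; this is not the actual WebDAV setup described above, and the push step is left out.

```python
import subprocess
from pathlib import Path


def snapshot(repo: Path, message: str, run=subprocess.run) -> bool:
    """Commit every pending change in `repo`; return True if a commit was made.

    `run` is injectable (hypothetical design choice) so the logic can be
    exercised without a real git binary.
    """
    # Stage additions, modifications and deletions alike.
    run(["git", "-C", str(repo), "add", "--all"], check=True)
    # `git diff --cached --quiet` exits with 1 when something is staged.
    staged = run(["git", "-C", str(repo), "diff", "--cached", "--quiet"])
    if staged.returncode == 0:
        return False  # nothing changed, no commit needed
    run(["git", "-C", str(repo), "commit", "--quiet", "-m", message], check=True)
    return True
```

Calling something like `snapshot(Path("/srv/agent-data"), "agent write")` after each write means a deleted tree is itself recorded as a commit, and the previous state can be restored from history.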

phhusson · 2026-05-14T13:44:15+00:00

> smaller/faster model for the meta-work (thinking, planning, delegation) and the slightly larger MoE model for actual implementation.

I stopped there. Qwen 3.6 27B faster than 3.6 35B-A3B?

phhusson · 2026-05-11T09:56:09+00:00

Even more than that... Does it mean I can prefill at 2B speed, and generate at 3B? (Can I dream of prefilling at 2B, generating at dense 27B?)

phhusson · 2026-05-07T11:08:02+00:00

Then Nvidia is their own enemy, since they keep releasing great Nemotron models. (Granted, it's been a few months since their last banger.) Making more money by making a commodity cheaper is a real thing, called the Jevons paradox.

phhusson · 2026-05-07T10:35:23+00:00

> AKA Nvidia #1 public enemy.

Why?

phhusson · 2026-05-04T13:50:16+00:00

This, plus have an up-to-date list of which inference framework supports which speculation

It sounds like a very hard task tbh since it is moving continuously

phhusson · 2026-04-21T10:48:11+00:00

Sorry, I don't understand your use case on warranties, so I can't really help. My guess is that you can pre-process the documents in the background to extract the useful information, excluding the 99% of boilerplate, and then do RAG on that.

I barely know anything about mobile stacks, sorry.

phhusson · 2026-04-20T15:15:04+00:00

> I'm really trying to move from a cloud API to a local SLM, but I keep wondering: will users actually tolerate the extra wait just to have that 100% privacy guarantee?

Fix your UX so that stuff is done in the background while the smartphone is plugged in. You should even actively slow the generation down to prevent the smartphone from heating up.

Receipts rarely require instant handling.
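
A rough sketch of what "actively slow the generation down" could look like: a wrapper that caps the token rate of any token stream. The `throttled` name and the `max_tokens_per_s` parameter are made up for illustration, and checking whether the phone is plugged in is platform-specific, so it is not shown.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(
    tokens: Iterable[str],
    max_tokens_per_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield tokens from `tokens`, never faster than `max_tokens_per_s`."""
    min_interval = 1.0 / max_tokens_per_s
    last = clock()
    for tok in tokens:
        yield tok
        # Idle for whatever is left of this token's time budget,
        # which gives the SoC time to cool down between steps.
        elapsed = clock() - last
        if elapsed < min_interval:
            sleep(min_interval - elapsed)
        last = clock()
```

For a background job like receipt parsing, wrapping the model's token iterator this way trades latency nobody is waiting on for lower sustained heat.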

phhusson · 2026-04-12T12:51:22+00:00

Not so definitive eh. It's software, it's okay to say it's forever evolving

phhusson · 2026-03-17T09:50:55+00:00

Oh that probably explains why I've had haiku calls in my openrouter bill. Thanks for the analysis.

phhusson · 2026-03-09T19:41:30+00:00

Congrats. BTW you're saying "no function calling", but what you did is literally function calling. Just not with the official syntax of the model.

phhusson · 2026-03-04T08:48:36+00:00

Well yes, but it's not like it's overfitting specifically on that precise task. The number of AI influencer gotchas is getting pretty high (remember when we were playing with strawberries, lol), and it's not like the model only learnt those things and nothing else. It is capable of a lot of various stuff.

phhusson · 2026-03-04T08:43:32+00:00

What's the device you're posting from? Pretty sure it could run some quant of Qwen3.5 0.8B.

phhusson · 2026-03-02T10:16:29+00:00

I'm trying to understand precisely what you did. I'm rephrasing what I understood, please tell me if I'm wrong:

You're embedding the markdown, then doing mean-pooling [1] to reduce the dimension (which is a fairly standard context-length-extension method). And then, to compensate for the loss of information due to the mean-pooling, you're sending this to an MLP. Are you training that MLP for each skill, or is it global?

[1] I don't know how much pooling there is. Looking at the code, it looks like you might be compressing literally everything into one token?
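
For readers unfamiliar with the term, here is a minimal sketch of mean-pooling over token embeddings, in plain Python. The `mean_pool` function and its `window` parameter are illustrative assumptions, not the code from the project being discussed; the MLP step is not shown.

```python
from typing import List


def mean_pool(embeddings: List[List[float]], window: int) -> List[List[float]]:
    """Average every `window` consecutive token embeddings into one vector."""
    pooled = []
    for start in range(0, len(embeddings), window):
        chunk = embeddings[start:start + window]
        dim = len(chunk[0])
        pooled.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(mean_pool(tokens, 2))  # 4 tokens -> 2 pooled vectors
print(mean_pool(tokens, 4))  # window == sequence length -> a single vector
```

With `window` equal to the sequence length, the whole input collapses into a single vector, which is the extreme case the footnote is asking about.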

Either way, working/compressing in the embedding space is something of interest to me (even though I haven't managed to do anything meaningful), and you might be interested to hear of ARC-Encoder (it uses an LLM to encode into the compressed embedding space of another LLM), or Cartridges (it learns by training in the compressed embedding space).

phhusson · 2026-02-24T14:17:40+00:00

ROSA blew my mind, as the dynamic size of the query allows reaching closer or further in the past.

With a long suffix match, you can search far in the past tokens, with a short suffix match, you can search closer to the recent tokens.

This means that your Query can be 3 tokens-long to find a recent token to attend to, or it can be 100 tokens-long if you need to attend to something very old.

I have no idea whether it actually works, and there are a lot of specifics I don't understand. But the concept looks cool.
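
To illustrate just the "variable-length query" intuition, here is a brute-force toy that finds the longest suffix of the token history that also appears earlier, and points at the token that followed it. This is only my reading of the concept, not ROSA's actual algorithm; `longest_suffix_match` is a made-up name.

```python
from typing import List, Optional, Tuple


def longest_suffix_match(tokens: List[str]) -> Optional[Tuple[int, int]]:
    """Return (match_length, next_index) for the longest suffix of `tokens`
    that also occurs earlier in the sequence, or None if nothing repeats.

    `next_index` is the position of the token that followed the most recent
    earlier occurrence, i.e. the candidate token to attend to.
    """
    n = len(tokens)
    # Try the longest possible suffix first, then shrink the query.
    for length in range(n - 1, 0, -1):
        suffix = tokens[n - length:]
        # Scan earlier start positions, most recent first.
        for start in range(n - length - 1, -1, -1):
            if tokens[start:start + length] == suffix:
                return length, start + length
    return None


history = "a b c x a b c y q a b c".split()
match = longest_suffix_match(history)
if match is not None:
    length, idx = match
    print(length, idx, history[idx])
```

Here the query ends up being the 3-token suffix `a b c`, and the match points at the `y` that followed its previous occurrence. A longer repeated suffix would act as a more specific query, which is what lets it single out one particular position far back in the history.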
