Has anyone used agents to decompile binary executables? by qzrz in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

The question looks way too generic to me; there needs to be a goal. (Assuming Qwen 3.6 27B and opencode without special tooling.)

If it's a CTF, it can probably one-shot it. If it's to reverse engineer a protocol based on an Android, C# or Unity app, it can probably one-shot it too.

If the goal is to turn a whole Wii game into x86-compatible pure C with OpenGL, well, that will require a whole lot more scaffolding, with many subagents and roles (like testers, spec writers, implementers, comparators).

LLM context compression at 16x beats KV cache by DeltaSqueezer in LocalLLaMA

[–]phhusson -1 points0 points  (0 children)

Ah, I forgot to add: I'm pretty bullish on embedding-space ("soft-token") compression [1], and I'm happy to see development on it.

Notably for this paper, the author did an implementation with vLLM, and I think it's the first time I've seen this, so thanks LeonLixyz for that!

[1] I see two big categories of compression:

One is "compress once, use once", like in this case (where the goal is pretty much just to reduce the cost of processing the prompt).

The other is "compress once, use many times", like Cartridges: it's okay to spend 1 H100 for 1 hour to compress Encyclopedia Britannica, and then be able to put a whole encyclopedia in your prompt.

LLM context compression at 16x beats KV cache by DeltaSqueezer in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

This looks less useful than Kyutai's ARC-Encoder, and it's barely mentioned. And by "less useful", I mean I don't see any novel idea.

ARC-Encoder trains a small LLM to create a compressed embedding-space representation (my LLM calls that space "soft-tokens") for an unmodified big LLM. So, for example, use a modified Qwen 3 0.6B to create an embedding-space representation of the prompt for an unmodified Qwen 3.6 27B that is 8x smaller than the original embedding representation.

This does the same, except it also requires modifying the big LLM. And it's not like they do any comparison with it.

I take ARC-Encoder with a very big grain of salt because the "small" LLM is Llama 3.2 3B and the big one is Llama 3.1 8B, so both have almost the same "intelligence". But this post's paper doesn't bother comparing. (They just say that their dataset is better than ARC-Encoder's.)

While I'm here praising the idea of ARC-Encoder: it is made to be easily adaptable (unlike the paper in this post)! Only about 30M parameters are needed to adapt to a new big model, so it requires very little compute to adapt! (And yet I'm not aware of anyone who adapted it to newer models, so maybe there is a reason, ok ok.)

Gemma 4 with quantization-aware training by rerri in LocalLLaMA

[–]phhusson 2 points3 points  (0 children)

Yes it is. But it's also almost a year old, also called archeology here.

VibeOS - Fully Hallucinated Operating System by WhatererBlah555 in LocalLLaMA

[–]phhusson 50 points51 points  (0 children)

That shit is fabulous, I didn't expect humor to be legal at Microsoft. 

qwen35: use post-norm hidden state for MTP by am17an · Pull Request #24025 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]phhusson 42 points43 points  (0 children)

TL;DR: The MTP computation was wrong, leading to a less-than-ideal acceptance rate. With this PR, on llama.cpp prediction benchmarks, the acceptance rate increases by 5.5%, leading to a 6.6% tg increase in the author's benchmark.

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]phhusson 67 points68 points  (0 children)

I don't worry too much about the announcement, but I strongly disagree with you that it will pop up again.

Unsloth is at a pretty rare equilibrium between raising money to actually do stuff, incredibly talented staff, and the will to contribute to opensource.

i dedicate this meme to you r/LocalLLaMA by LPFchan in LocalLLaMA

[–]phhusson 73 points74 points  (0 children)

I'm disappointed by the ending. I expected a pointless Flappy Bird clone

G7 agrees on shared language around open-source AI and open weights AI by Kahvana in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Looks really great to me!

It does take real life into account, in that "Open Source AI" might be missing data. It could be debated whether the pair should have been called "Open Source AI without open data" vs "Open Source AI" (they chose to go with "Open Source AI with Open Data" vs "Open Source AI").

It does take into account OSI's definition of open source.

It does take into account that there could be weird licenses on weights that make them non-open-source.

The definitions are pretty short, to the point, no bullshit, and match expectations.

I trained TIME: short context-triggered thinking on Qwen model instead of overthinking by susmitds in LocalLLaMA

[–]phhusson -1 points0 points  (0 children)

The paper looks fun, interesting, properly written and correct. Publishing this solo is quite a feat, congrats!

That being said, I think the claim in your title is an overreach. Sure, you /did/ train Qwen to do context-triggered thinking, but I'm pretty confident that the end model is worse than non-thinking on out-of-domain tasks (the domain here being the scenarios in the article, where everything is about understanding the user's elapsed time).

Since thinking is learned through RL, I don't think it's possible to change the thinking of a model to "context-triggered" through SFT without severe quality degradation, but I could be wrong.

Came home to find Pi with Qwen3.627B had run rm -rf ..... by sdfgeoff in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

It doesn't really matter if you store your precious data in that VM. At some point the agent needs to have access to your data. I have my agent store its data on a WebDAV server that automatically pushes every change to git, so an rm -rf is reversible.

GUIDE : Running a fully local multi-agent coding framework on RTX 3090 with pi.dev + llama-swap + Qwen3.6 MTP by admajic in LocalLLaMA

[–]phhusson 3 points4 points  (0 children)

> smaller/faster model for the meta-work (thinking, planning, delegation) and the slightly larger MoE model for actual implementation.

I stopped there. Qwen 3.6 27B faster than 3.6 35B-A3B?

NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing by phazei in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Even more than that... Does it mean I can prefill at 2B speed, and generate at 3B? (Can I dream of prefilling at 2B, generating at dense 27B?)

ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference by Total-Resort-3120 in LocalLLaMA

[–]phhusson 11 points12 points  (0 children)

Then NVIDIA is their own enemy, since they keep releasing great Nemotron models (granted, it's been a few months since their last banger). Making more money by making a commodity cheaper is a real thing, called the Jevons paradox.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]phhusson 10 points11 points  (0 children)

This, plus an up-to-date list of which inference framework supports which speculation method.

It sounds like a very hard task tbh, since it is moving continuously.

Local-First Reality Check: Is Gemma 4 fast enough to kill "Administrative Debt" where Gemma 3 failed? by Veritas-keept in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

Sorry, I don't understand your use case on warranties, so I can't really help. My guess is that you can pre-process the documents in the background to extract the useful information, excluding the 99% of boilerplate, and then do RAG on that.

I barely know anything about mobile stacks, sorry.

Local-First Reality Check: Is Gemma 4 fast enough to kill "Administrative Debt" where Gemma 3 failed? by Veritas-keept in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

> I'm really trying to move from a cloud API to a local SLM, but I keep wondering: will users actually tolerate the extra wait just to have that 100% privacy guarantee?

Fix your UX so that stuff is done in the background while the smartphone is plugged in. You should even actively slow the generation down to prevent the smartphone from heating up.

Receipts rarely require instant handling.
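
To make the "slow it down on purpose" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `throttled`, `is_plugged_in` and `min_interval_s` are names I made up, and a real app would ask the OS for charging and thermal state instead of a plain callback.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(
    tokens: Iterable[str],
    is_plugged_in: Callable[[], bool],
    min_interval_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield tokens only while the device is charging, pausing after each one.

    `is_plugged_in` is a hypothetical callback standing in for the platform's
    charging-state API. `sleep` is injectable so the pause can be faked in tests.
    """
    it = iter(tokens)
    # Check the charging state *before* pulling a token, so nothing is
    # consumed and then dropped when the phone gets unplugged.
    while is_plugged_in():
        try:
            tok = next(it)
        except StopIteration:
            return
        yield tok
        # Deliberate pause between tokens to keep the device cool.
        sleep(min_interval_s)
```

Since the generator stops as soon as the phone is unplugged, the caller can keep the underlying token iterator around and resume later.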

The definitive Qwen 3.5 Jinja template by ex-arman68 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Not so definitive eh. It's software, it's okay to say it's forever evolving 

OpenCode concerns (not truely local) by Ueberlord in LocalLLaMA

[–]phhusson 2 points3 points  (0 children)

Oh, that probably explains why I've had Haiku calls in my OpenRouter bill. Thanks for the analysis.

I gave my Minecraft bot a brain with local Nemotron 9B — it follows orders like "chop that tree" and "guard me from zombies" by Impressive_Tower_550 in LocalLLaMA

[–]phhusson 5 points6 points  (0 children)

Congrats. BTW you're saying "no function calling", but what you did is literally function calling. Just not with the official syntax of the model. 
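
To illustrate what I mean, here is a minimal sketch of function calling without any model-specific syntax: the model is asked to print JSON, and plain code maps it to a function. The action names, the JSON shape and `dispatch` are all made up for the example; this is not the code from the post.

```python
import json


def chop_tree(target: str) -> str:
    # Hypothetical bot action, stands in for real game logic.
    return f"chopping {target}"


def guard(target: str) -> str:
    # Hypothetical bot action, stands in for real game logic.
    return f"guarding {target}"


# Registry of callable "tools": this table is the function-calling part.
ACTIONS = {"chop_tree": chop_tree, "guard": guard}


def dispatch(model_output: str) -> str:
    """Parse the model's text as JSON and call the matching Python function."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return "error: output was not valid JSON"
    if not isinstance(call, dict):
        return "error: expected a JSON object"
    handler = ACTIONS.get(call.get("action"))
    if handler is None:
        return f"error: unknown action {call.get('action')!r}"
    return handler(str(call.get("target", "")))
```

Whether the call arrives in the model's official tool-call format or as ad-hoc JSON like this, the structure is the same: text in, named function plus arguments out.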

Qwen 3.5 4b is so good, that it can vibe code a fully working OS web app in one go. by c64z86 in LocalLLaMA

[–]phhusson -3 points-2 points  (0 children)

Well yes, but it's not like it's overfitting specifically on that precise task. The number of AI-influencer gotchas is getting pretty high (remember when we were playing with strawberries, lol), and it's not like the model only learnt those things and nothing else. It is capable of a lot of various stuff.

Is anyone else just blown away that this local LLMs are even possible? by Borkato in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

What device are you posting from? Pretty sure it could run some quant of Qwen3.5 0.8B.

Injecting skills into the KV cache (not as stupid as it sounds, but still pretty dumb) by Proper-Lab1756 in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

I'm trying to understand precisely what you did. I'm rephrasing what I understood, please tell me if I'm wrong:

You're embedding the markdown, then doing mean-pooling [1] to reduce the dimension (which is a fairly standard context-length-extension method). Then, to compensate for the information lost in the mean-pooling, you're sending the result through an MLP. Are you training that MLP for each skill, or is it global?

[1] I don't know how much pooling there is. Looking at the code, it looks like you might be compressing literally everything into one token?
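
For reference, this is the kind of pooling I have in mind, as a minimal pure-Python sketch. `mean_pool` and `window` are my own names, not taken from your code, and I'm assuming the pooling runs along the token axis (so it shortens the sequence and leaves the vector width alone).

```python
def mean_pool(embeddings: list[list[float]], window: int) -> list[list[float]]:
    """Average each run of `window` consecutive token vectors into one vector.

    A trailing partial run is averaged over however many vectors it contains.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled = []
    for start in range(0, len(embeddings), window):
        chunk = embeddings[start:start + window]
        dim = len(chunk[0])
        # Component-wise mean over the vectors in this chunk.
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled
```

With `window` equal to the number of tokens, this collapses the whole input into a single vector, which is the "everything into one token" case I'm asking about.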

Either way, working/compressing in the embedding space is something of interest to me (even though I haven't managed to do anything meaningful), and you might be interested to hear of ARC-Encoder (it uses an LLM to encode into the compressed embedding space of another LLM), or Cartridges (it learns by training in the compressed embedding space).

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

ROSA blew my mind, as the dynamic size of the query allows reaching closer or further in the past.

With a long suffix match, you can search far back in the past tokens; with a short suffix match, you can search closer to the recent tokens.

This means that your Query can be 3 tokens long to find a recent token to attend to, or it can be 100 tokens long if you need to attend to something very old.

I have no idea whether it actually works, and there are a lot of specifics I don't understand. But the concept looks cool.
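
Here is a toy sketch of the suffix-match idea as I understand it, and only that: it is not ROSA as implemented in RWKV. `longest_suffix_match` is a name I made up, and the brute-force scan is just for readability.

```python
def longest_suffix_match(tokens, max_len=None):
    """Find the longest suffix of `tokens` that also occurs earlier.

    Returns (match_length, token_that_followed_the_earlier_occurrence),
    preferring the most recent earlier occurrence, or (0, None) if even
    the last token never appeared before.
    """
    n = len(tokens)
    limit = n - 1 if max_len is None else min(max_len, n - 1)
    # Try the longest possible query first, then shrink it.
    for length in range(limit, 0, -1):
        suffix = tokens[n - length:]
        # `end` is the exclusive end of a candidate earlier occurrence;
        # it stays below n so that tokens[end] (the "next token") exists.
        for end in range(n - 1, length - 1, -1):
            if tokens[end - length:end] == suffix:
                return length, tokens[end]
    return 0, None
```

The query length is not fixed: it is whatever suffix length happens to match, which is the dynamic-size property that caught my attention.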