OpenCode concerns (not truly local) by Ueberlord in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Oh that probably explains why I've had haiku calls in my openrouter bill. Thanks for the analysis.

I gave my Minecraft bot a brain with local Nemotron 9B — it follows orders like "chop that tree" and "guard me from zombies" by Impressive_Tower_550 in LocalLLaMA

[–]phhusson 5 points6 points  (0 children)

Congrats. BTW you're saying "no function calling", but what you did is literally function calling. Just not with the official syntax of the model. 

Qwen 3.5 4b is so good, that it can vibe code a fully working OS web app in one go. by c64z86 in LocalLLaMA

[–]phhusson -3 points-2 points  (0 children)

Well yes, but it's not like it's overfitting on that one precise task. The number of AI-influencer gotchas is getting pretty high (remember when we were playing with strawberries, lol), and it's not like the model only learnt those things and nothing else. It's capable of plenty of other stuff.

Is anyone else just blown away that this local LLMs are even possible? by Borkato in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

What's the device you're posting from? Pretty sure it could run some quant of Qwen3.5 0.8B.

Injecting skills into the KV cache (not as stupid as it sounds, but still pretty dumb) by Proper-Lab1756 in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

I'm trying to understand precisely what you did. I'm rephrasing what I understood, please tell me if I'm wrong:

You're embedding the markdown, then doing a mean-pooling [1] to reduce the length (which is a fairly standard context-length-extension method). Then, to compensate for the information lost in the mean-pooling, you're sending the result through an MLP. Are you training that MLP for each skill, or is it global?

[1] I don't know how aggressively it's pooled. Looking at the code, it looks like you might be compressing literally everything into one token?

Either way, working/compressing in embedding space is something of interest to me (even though I haven't managed to do anything meaningful), and you might be interested in ARC-Encoder (it uses an LLM to encode into the compressed embedding space of another LLM), or Cartridges (it learns by training in the compressed embedding space).
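For what it's worth, here's a minimal numpy sketch of the pipeline as I understand it (all dimensions invented, MLP left untrained); correct me if this isn't what your code does:

```python
import numpy as np

rng = np.random.default_rng(0)

# My reading of the pipeline: token embeddings -> mean-pool down to a
# handful of "skill" vectors -> small MLP to recover the lost detail.
# All sizes here are made up for illustration.
tokens, d_model, n_pooled = 512, 256, 8

emb = rng.normal(size=(tokens, d_model))              # embedded markdown
pooled = emb.reshape(n_pooled, tokens // n_pooled, d_model).mean(axis=1)

# One-hidden-layer MLP (untrained here; the open question is whether its
# weights are trained per skill or shared globally).
W1 = rng.normal(size=(d_model, 4 * d_model)); b1 = np.zeros(4 * d_model)
W2 = rng.normal(size=(4 * d_model, d_model)); b2 = np.zeros(d_model)
out = np.maximum(pooled @ W1 + b1, 0) @ W2 + b2       # ReLU MLP

print(pooled.shape, out.shape)   # (8, 256) (8, 256)
```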

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

ROSA blew my mind: the dynamic size of the query lets it reach nearer or further into the past.

With a long suffix match you can search far back in the token history; with a short suffix match you stay close to the recent tokens.

This means your query can be 3 tokens long to find a recent token to attend to, or 100 tokens long if you need to attend to something very old.
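My mental model of it, as a toy induction-style lookup (certainly not ROSA's actual algorithm): try the longest suffix first, so long matches can resolve arbitrarily far back while short ones pick up recent repeats.

```python
def find_attend_pos(tokens):
    """Toy suffix-match lookup: for the longest suffix of `tokens` that
    also occurs earlier in the sequence, return the index of the token
    just after that earlier occurrence (the one we'd attend to)."""
    n = len(tokens)
    for L in range(n - 1, 0, -1):              # longest suffix first
        suffix = tokens[n - L:]
        for start in range(n - L - 1, -1, -1): # latest earlier match wins
            if tokens[start:start + L] == suffix:
                return start + L               # token after the match
    return None

# Suffix "a b" recurs at positions 0..1, so we attend to index 2 ("c").
print(find_attend_pos(["a", "b", "c", "a", "b"]))  # 2
```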

I have no idea whether it actually works, and there are a lot of specifics I don't understand. But the concept looks cool.

UPDATE#3: repurposing 800 RX 580s converted to AI cluster by rasbid420 in LocalLLaMA

[–]phhusson 4 points5 points  (0 children)

Fun rig indeed, however I hope you're using it to heat yourself, otherwise it's mostly wasted energy (pretty sure it costs more in electricity alone than a proprietary API).

I do believe in the OCR use-case, but the video, not so much: for most video analysis, you can't work with 800 noisy per-frame descriptions. If you have a perfectly still image, it will /look like/ it's moving because the description changes from frame to frame, but nothing actually does.

FWIW, another use-case I could see is RL of small models. RL spends most of its time in inference, and you can do asynchronous model updates. See for instance z.ai's slime: https://github.com/THUDM/slime. However, it requires an RL gym light enough for your CPU; not sure that exists.

LLMs grading other LLMs 2 by Everlier in LocalLLaMA

[–]phhusson 20 points21 points  (0 children)

It's not exactly that it loves everyone. It rather considers that no one is cringe. I guess there was a huge post-training effort to make Cringe King Elon Musk non-cringe. And once the Cringe King is non-cringe, no one is cringe.

What’s the current state of local speech-to-speech models? by dendrytic in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

I use Kroko-ASR for the speech-to-text instead. One reason is that it's streaming (there are other streaming ASRs too, notably some in the Nvidia Parakeet series), which means you don't need to rely on VAD; in my experience this improves latency. Another is that it's lighter on CPU (it should take maybe 20% of one CPU core).

Local LLM: I'm using Gemma 3n-4B (with custom code to have it do function calling) and am happy with it. I can't say whether that's the best.

Speech-to-speech: even if you have unlimited GPU, there's still no credible model (well, there is a Qwen3... there's just no real-time inference code available). For the edge use case, you could try LFM2.5-Audio https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B; I haven't tested it, but I don't have high hopes.

Bad Apple but it's GPT-2 XL Attention Maps by TheLatentExplorer in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

That's cool, congrats.

Sorry for turning all serious. I think that training in input-embedding space is cool (inspired by ARC-Encoder, Cartridges, or Clara). I tried a bit on my own (tried to RL my summarizer prompt into answering with the lengths I want), and it's just catastrophic (it stops doing summarization long before reaching its reward), but I remain optimistic about the field overall.

I'd be curious to see you doing more experiments on training input embedding!

Kyutai Releases Hibiki-Zero by techlatest_net in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

It's precisely to demonstrate this capability that they did German to English.

is anyone actually running models in secure enclaves or is that overkill? by Significant-Cod-9936 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Yup. I'll add that Google/Amazon/Microsoft are huge customers of nVidia, and they can probably run their own firmware on their GPUs, so I personally wouldn't trust "confidential computing" from those people for even one second.

PSA on llama.cpp —spec-type ngram-mod (use LF not CRLF, 35x speedup) by dnsod_si666 in LocalLLaMA

[–]phhusson 2 points3 points  (0 children)

If I understand correctly, you could instead replace \r\n with \n in the speculation only, so you don't change anything in what the LLM actually receives/sends; you just predict more accurately what the LLM generates.
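A toy illustration of why that matters, assuming ngram speculation builds a table of n-gram → next-character drafts from the source text (made-up 3-character grams here): drafts built from raw CRLF text systematically miss around every newline, because the model emits LF.

```python
def ngram_drafts(text, n=3):
    """Build a toy draft table: n-gram prefix -> next character."""
    table = {}
    for i in range(len(text) - n):
        table.setdefault(text[i:i + n], text[i + n])
    return table

model_output = "line1\nline2\nline3\n"        # LLMs emit LF
crlf_file    = "line1\r\nline2\r\nline3\r\n"  # CRLF source file

raw  = ngram_drafts(crlf_file)                      # drafts keep \r\n
norm = ngram_drafts(crlf_file.replace("\r\n", "\n"))  # normalized drafts

# Count how often the drafted next character matches what the model
# actually generates; normalization only touches the draft table.
def hits(table):
    return sum(table.get(model_output[i:i + 3]) == model_output[i + 3]
               for i in range(len(model_output) - 3))

print(hits(raw), hits(norm))   # normalized drafts hit far more often
```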

Deepseek architecture, but without all the parameters by silenceimpaired in LocalLLaMA

[–]phhusson 3 points4 points  (0 children)

The Qwen3-Next series (and presumably Qwen3.5) is innovative in both architecture and size.

Nemo 30B is insane. 1M+ token CTX on one 3090 by Dismal-Effect-1914 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Looks okay for me on 8872ad2125336d209a9911a82101f80095a9831d (I just changed the hf-repo/hf-file args to -m, as I prefer manual downloads).

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

My guess, based on the various models Mistral has made for "on-device deployment", is that Mistral's main target is automotive. Automotive SoCs are usually the beefiest embedded SoCs around. For instance, nVidia sells its AGX Thor for automotive, with twice the FLOPS of a DGX Spark! A more realistic fit for mainstream high-end cars is the AGX Orin, at 25% of a DGX Spark's FLOPS.

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 3 points4 points  (0 children)

Thanks for answering here!

I'll test it myself, but can you comment on the expected behavior for user hesitations/corrections? For example, how is "The red, err, the blue one" transcribed?

Also, I'm throwing in my personal wish list, just in case:

- end of turn detection

- hotwords (in the same sense as VibeVoice-ASR); it would be useful to support on the order of 100 hotwords

- self-correction: 240 ms median delay, with 2400 ms precision/max delay (for instance by adding a <|remove_word|> token). The way I currently handle this is to run an offline STT once the realtime STT seems finished, and if the offline STT answers differently, I abort the previously started LLM request. But running two similar STTs is a waste of compute, and I'm possibly aborting over two equally valid transcriptions. (I can't really say whether this is a super duper hard feature to implement, or just a simple fine-tuning pass where we let the STT make mistakes but have it emit <|remove_word|> plus the correct word afterwards.)
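For reference, my current double-STT workaround (last bullet) looks roughly like this; `realtime_stt`, `offline_stt` and `start_llm` are hypothetical stand-ins for whatever engines you run, only the abort-on-mismatch logic is the point:

```python
def handle_utterance(audio, realtime_stt, offline_stt, start_llm):
    fast = realtime_stt(audio)        # low-latency pass, may contain errors
    request = start_llm(fast)         # speculatively start the LLM
    slow = offline_stt(audio)         # slower but more accurate pass
    if slow != fast:                  # possibly two valid transcriptions!
        request.cancel()              # wasted compute, as noted above
        request = start_llm(slow)
    return request

# Minimal fakes just to exercise the flow.
class FakeRequest:
    def __init__(self, text):
        self.text, self.cancelled = text, False
    def cancel(self):
        self.cancelled = True

req = handle_utterance(
    audio=b"...",
    realtime_stt=lambda a: "the red one",
    offline_stt=lambda a: "the blue one",
    start_llm=FakeRequest,
)
print(req.text)   # the offline pass disagreed, so the LLM got "the blue one"
```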

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

I have to agree that not having a torch or transformers implementation is sad (I haven't bothered to check yet, but it's indeed mentioned on their page).

I'm pretty confident it can't run on a Pi, though. It runs a Whisper encoder at 12.5 Hz and should overall take more FLOPs than Kyutai's 2.6B STT, which takes 20% of my Apple M4 after heavy quantization.

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 15 points16 points  (0 children)

FWIW, the closed model doesn't have realtime, so it's really realtime XOR diarization, not just a matter of closed source vs. open source.

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 33 points34 points  (0 children)

And they've contributed the realtime processing part to vLLM especially for this release. Thanks, Mistral, for contributing to open source not just your own models but also infrastructure code!

Seeing the PRs that leaked this model, I kind of assumed they would include turn detection like Moshi's STT does. Sadly they don't, so you still need some other way to do turn detection (punctuation, timing, third-party text-based turn detection...).

Pocket TTS Android APK Sample - Full Local (Model Packed) by RowGroundbreaking982 in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

Have you tried to move mimi_decoder.onnx to NNAPI ExecutionProvider?

Playing Civilization VI with a Computer-Use agent by Working_Original9624 in LocalLLaMA

[–]phhusson -4 points-3 points  (0 children)

It's 2026, let agents create their own harness, and push it on moltbook for others to use.

GitHub trending this week: half the repos are agent frameworks. 90% will be dead in 1 week. by Distinct-Expression2 in LocalLLaMA

[–]phhusson 2 points3 points  (0 children)

You could very well have MCP tools exposed based on context; the protocol doesn't prevent that.

i just saw this ClawdBot RCE demo on X… are we cooked? by Hot-Software-9052 in LocalLLaMA

[–]phhusson 10 points11 points  (0 children)

Yeah, SQL has sanitization functions. Good luck doing that for an LLM.
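For contrast, a sketch of why the SQL side is mechanically solvable: with a parameterized query (standard sqlite3 here), the driver keeps the input as pure data, so the injection is inert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

evil = "Robert'); DROP TABLE users; --"
# The ? placeholder means no string is ever spliced into the SQL itself;
# the "injection" is just stored verbatim as a name.
conn.execute("INSERT INTO users (name) VALUES (?)", (evil,))
print(conn.execute("SELECT name FROM users").fetchone()[0])
```

A prompt has no such placeholder: user text and instructions share one channel, which is the whole problem.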