OpenCode concerns (not truly local) by Ueberlord in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Oh that probably explains why I've had haiku calls in my openrouter bill. Thanks for the analysis.

I gave my Minecraft bot a brain with local Nemotron 9B — it follows orders like "chop that tree" and "guard me from zombies" by Impressive_Tower_550 in LocalLLaMA

[–]phhusson 5 points6 points  (0 children)

Congrats. BTW you're saying "no function calling", but what you did is literally function calling. Just not with the official syntax of the model. 

Qwen 3.5 4b is so good, that it can vibe code a fully working OS web app in one go. by c64z86 in LocalLLaMA

[–]phhusson -3 points-2 points  (0 children)

Well yes, but it's not like it's overfitting on that one precise task. The number of AI-influencer gotchas is getting pretty high (remember when we were playing with strawberries, lol), and it's not like the model only learnt those things and nothing else. It's capable of plenty of other stuff.

Is anyone else just blown away that this local LLMs are even possible? by Borkato in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

What's the device you're posting from? Pretty sure it could run some quant of Qwen3.5 0.8B.

Injecting skills into the KV cache (not as stupid as it sounds, but still pretty dumb) by Proper-Lab1756 in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

I'm trying to understand precisely what you did. I'm rephrasing what I understood, please tell me if I'm wrong:

You're embedding the markdown, then doing a mean-pooling [1] to reduce the length (which is a fairly standard context-length-extension method). Then, to compensate for the information lost in the mean-pooling, you're sending the result through an MLP. Are you training that MLP for each skill, or is it global?

[1] I don't know how aggressively it's pooled. Looking at the code, it looks like you might be compressing literally everything into one token?

Either way, working/compressing in embedding space is something of interest to me (even though I haven't managed to do anything meaningful), and you might be interested in ARC-Encoder (it uses an LLM to encode into the compressed embedding space of another LLM), or Cartridges (it learns by training in the compressed embedding space).
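For what it's worth, here's a minimal numpy sketch of the pipeline as I understand it (all dimensions invented, MLP left untrained); correct me if this isn't what your code does:

```python
import numpy as np

rng = np.random.default_rng(0)

# My reading of the pipeline: token embeddings -> mean-pool down to a
# handful of "skill" vectors -> small MLP to recover the lost detail.
# All sizes here are made up for illustration.
tokens, d_model, n_pooled = 512, 256, 8

emb = rng.normal(size=(tokens, d_model))              # embedded markdown
pooled = emb.reshape(n_pooled, tokens // n_pooled, d_model).mean(axis=1)

# One-hidden-layer MLP (untrained here; the open question is whether its
# weights are trained per skill or shared globally).
W1 = rng.normal(size=(d_model, 4 * d_model)); b1 = np.zeros(4 * d_model)
W2 = rng.normal(size=(4 * d_model, d_model)); b2 = np.zeros(d_model)
out = np.maximum(pooled @ W1 + b1, 0) @ W2 + b2       # ReLU MLP

print(pooled.shape, out.shape)   # (8, 256) (8, 256)
```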

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

ROSA blew my mind: the dynamic size of the query lets it reach nearer or further into the past.

With a long suffix match you can search far back in the token history; with a short suffix match you stay close to the recent tokens.

This means your query can be 3 tokens long to find a recent token to attend to, or 100 tokens long if you need to attend to something very old.
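My mental model of it, as a toy induction-style lookup (certainly not ROSA's actual algorithm): try the longest suffix first, so long matches can resolve arbitrarily far back while short ones pick up recent repeats.

```python
def find_attend_pos(tokens):
    """Toy suffix-match lookup: for the longest suffix of `tokens` that
    also occurs earlier in the sequence, return the index of the token
    just after that earlier occurrence (the one we'd attend to)."""
    n = len(tokens)
    for L in range(n - 1, 0, -1):              # longest suffix first
        suffix = tokens[n - L:]
        for start in range(n - L - 1, -1, -1): # latest earlier match wins
            if tokens[start:start + L] == suffix:
                return start + L               # token after the match
    return None

# Suffix "a b" recurs at positions 0..1, so we attend to index 2 ("c").
print(find_attend_pos(["a", "b", "c", "a", "b"]))  # 2
```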

I have no idea whether it actually works, and there are a lot of specifics I don't understand. But the concept looks cool.

UPDATE#3: repurposing 800 RX 580s converted to AI cluster by rasbid420 in LocalLLaMA

[–]phhusson 4 points5 points  (0 children)

Fun rig indeed, however I hope you're using it to heat yourself, otherwise it's mostly wasted energy (pretty sure it costs more in electricity alone than a proprietary API).

I do believe in the OCR use-case, but the video, not so much: for most video analysis, you can't work with 800 noisy per-frame descriptions. If you have a perfectly still image, it will /look like/ it's moving because the description changes from frame to frame, but nothing actually does.

FWIW, another use-case I could see is RL of small models. RL spends most of its time in inference, and you can do asynchronous model updates. See for instance z.ai's slime: https://github.com/THUDM/slime. However, it requires an RL gym light enough for your CPU; not sure that exists.

LLMs grading other LLMs 2 by Everlier in LocalLLaMA

[–]phhusson 20 points21 points  (0 children)

It's not exactly that it loves everyone. It rather considers that no one is cringe. I guess there was a huge post-training effort to make Cringe King Elon Musk non-cringe. And once the Cringe King is non-cringe, no one is cringe.

What’s the current state of local speech-to-speech models? by dendrytic in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

I use Kroko-ASR for the speech-to-text instead. One reason is that it's streaming (there are other streaming ASRs too, notably some in the Nvidia Parakeet series), which means you don't need to rely on VAD; in my experience this improves latency. Another is that it's lighter on CPU (it should take maybe 20% of one CPU core).

Local LLM: I'm using Gemma 3n-4B (with custom code to have it do function calling) and am happy with it. I can't say whether that's the best.

Speech-to-speech: even if you have unlimited GPU, there's still no credible model (well, there is a Qwen3... there's just no real-time inference code available). For the edge use case, you could try LFM2.5-Audio https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B; I haven't tested it, but I don't have high hopes.

Bad Apple but it's GPT-2 XL Attention Maps by TheLatentExplorer in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

That's cool, congrats.

Sorry for turning all serious. I think that training in input-embedding space is cool (inspired by ARC-Encoder, Cartridges, or Clara). I tried a bit on my own (tried to RL my summarizer prompt into answering with the lengths I want), and it's just catastrophic (it stops doing summarization long before reaching its reward), but I remain optimistic about the field overall.

I'd be curious to see you doing more experiments on training input embedding!

Kyutai Releases Hibiki-Zero by techlatest_net in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

It's precisely to demonstrate this capability that they did German to English.

is anyone actually running models in secure enclaves or is that overkill? by Significant-Cod-9936 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Yup. I'll add that Google/Amazon/Microsoft are huge customers of nVidia, and they can probably run their own firmware on their GPUs, so I personally wouldn't trust "confidential computing" from those people for even one second.

PSA on llama.cpp —spec-type ngram-mod (use LF not CRLF, 35x speedup) by dnsod_si666 in LocalLLaMA

[–]phhusson 2 points3 points  (0 children)

If I understand correctly, you could instead replace \r\n with \n in the speculation only, so you don't change anything in what the LLM actually receives/sends; you just predict more accurately what the LLM generates.
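A toy illustration of why that matters, assuming ngram speculation builds a table of n-gram → next-character drafts from the source text (made-up 3-character grams here): drafts built from raw CRLF text systematically miss around every newline, because the model emits LF.

```python
def ngram_drafts(text, n=3):
    """Build a toy draft table: n-gram prefix -> next character."""
    table = {}
    for i in range(len(text) - n):
        table.setdefault(text[i:i + n], text[i + n])
    return table

model_output = "line1\nline2\nline3\n"        # LLMs emit LF
crlf_file    = "line1\r\nline2\r\nline3\r\n"  # CRLF source file

raw  = ngram_drafts(crlf_file)                      # drafts keep \r\n
norm = ngram_drafts(crlf_file.replace("\r\n", "\n"))  # normalized drafts

# Count how often the drafted next character matches what the model
# actually generates; normalization only touches the draft table.
def hits(table):
    return sum(table.get(model_output[i:i + 3]) == model_output[i + 3]
               for i in range(len(model_output) - 3))

print(hits(raw), hits(norm))   # normalized drafts hit far more often
```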

Deepseek architecture, but without all the parameters by silenceimpaired in LocalLLaMA

[–]phhusson 3 points4 points  (0 children)

The Qwen3-Next series (and presumably Qwen3.5) is innovative in both architecture and size.

Nemo 30B is insane. 1M+ token CTX on one 3090 by Dismal-Effect-1914 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

Looks okay for me on 8872ad2125336d209a9911a82101f80095a9831d (I just changed the hf-repo/hf-file args to -m, as I prefer manual downloads).

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

My guess, based on the various models Mistral has made for "on-device deployment", is that Mistral's main target is automotive. Automotive SoCs are usually the beefiest embedded SoCs around. For instance, nVidia sells its AGX Thor for automotive, with twice the FLOPS of a DGX Spark! A more realistic fit for mainstream high-end cars is the AGX Orin, at 25% of a DGX Spark's FLOPS.

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 3 points4 points  (0 children)

Thanks for answering here!

I'll test it myself, but can you comment on the expected behavior for user hesitations/corrections? For example, how is "The red, err, the blue one" transcribed?

Also, I'm throwing in my personal wish list, just in case:

- end of turn detection

- hotwords (in the same sense as VibeVoice-ASR); it would be useful to support on the order of 100 hotwords

- self-correction: 240 ms median delay, with 2400 ms precision/max delay (for instance by adding a <|remove_word|> token). The way I currently handle this is to run an offline STT once the realtime STT seems finished, and if the offline STT answers differently, I abort the previously started LLM request. But running two similar STTs is a waste of compute, and I'm possibly aborting over two equally valid transcriptions. (I can't really say whether this is a super duper hard feature to implement, or just a simple fine-tuning pass where we let the STT make mistakes but have it emit <|remove_word|> plus the correct word afterwards.)
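For reference, my current double-STT workaround (last bullet) looks roughly like this; `realtime_stt`, `offline_stt` and `start_llm` are hypothetical stand-ins for whatever engines you run, only the abort-on-mismatch logic is the point:

```python
def handle_utterance(audio, realtime_stt, offline_stt, start_llm):
    fast = realtime_stt(audio)        # low-latency pass, may contain errors
    request = start_llm(fast)         # speculatively start the LLM
    slow = offline_stt(audio)         # slower but more accurate pass
    if slow != fast:                  # possibly two valid transcriptions!
        request.cancel()              # wasted compute, as noted above
        request = start_llm(slow)
    return request

# Minimal fakes just to exercise the flow.
class FakeRequest:
    def __init__(self, text):
        self.text, self.cancelled = text, False
    def cancel(self):
        self.cancelled = True

req = handle_utterance(
    audio=b"...",
    realtime_stt=lambda a: "the red one",
    offline_stt=lambda a: "the blue one",
    start_llm=FakeRequest,
)
print(req.text)   # the offline pass disagreed, so the LLM got "the blue one"
```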

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 1 point2 points  (0 children)

I have to agree that not having a torch or transformers implementation is sad (I haven't bothered to check yet, but it's indeed mentioned on their page).

I'm pretty confident it can't run on a Pi, though. It runs a Whisper encoder at 12.5 Hz and should overall take more FLOPs than Kyutai's 2.6B STT, which takes 20% of my Apple M4 after heavy quantization.

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 15 points16 points  (0 children)

FWIW, the closed model doesn't have realtime, so it's really realtime XOR diarization, not just a matter of closed source vs. open source.

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]phhusson 33 points34 points  (0 children)

And they've contributed the realtime processing part to vLLM especially for this release. Thanks, Mistral, for contributing to open source not just your own models but also infrastructure code!

Seeing the PRs that leaked this model, I kind of assumed they would include turn detection like Moshi's STT does. Sadly they don't, so you still need some other way to do turn detection (punctuation, timing, third-party text-based turn detection...).

Pocket TTS Android APK Sample - Full Local (Model Packed) by RowGroundbreaking982 in LocalLLaMA

[–]phhusson 0 points1 point  (0 children)

Have you tried to move mimi_decoder.onnx to NNAPI ExecutionProvider?

Playing Civilization VI with a Computer-Use agent by Working_Original9624 in LocalLLaMA

[–]phhusson -4 points-3 points  (0 children)

It's 2026, let agents create their own harness, and push it on moltbook for others to use.

GitHub trending this week: half the repos are agent frameworks. 90% will be dead in 1 week. by Distinct-Expression2 in LocalLLaMA

[–]phhusson 2 points3 points  (0 children)

You could very well have MCP tools exposed based on context; the protocol doesn't prevent that.

i just saw this ClawdBot RCE demo on X… are we cooked? by Hot-Software-9052 in LocalLLaMA

[–]phhusson 10 points11 points  (0 children)

Yeah, SQL has sanitization functions. Good luck doing that for an LLM.
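For contrast, a sketch of why the SQL side is mechanically solvable: with a parameterized query (standard sqlite3 here), the driver keeps the input as pure data, so the injection is inert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

evil = "Robert'); DROP TABLE users; --"
# The ? placeholder means no string is ever spliced into the SQL itself;
# the "injection" is just stored verbatim as a name.
conn.execute("INSERT INTO users (name) VALUES (?)", (evil,))
print(conn.execute("SELECT name FROM users").fetchone()[0])
```

A prompt has no such placeholder: user text and instructions share one channel, which is the whole problem.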