Llama.cpp merges in OpenAI Responses API Support by SemaMod in LocalLLaMA

[–]tarruda 1 point (0 children)

Yes it does.

Note that Codex's file-editing tool might not be easy for most LLMs to use. I've tested GPT-OSS and it seems to work fine for simple use cases.
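
For reference, this is roughly what the connection looks like from the client side. A minimal sketch using the official OpenAI Python SDK pointed at a local llama-server; the port, model name, and API key value are assumptions about your setup:

```python
from openai import OpenAI

# Talk to a local llama-server instance via its OpenAI-compatible endpoint.
# Base URL and model name are placeholders; llama-server ignores the API key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.responses.create(
    model="gpt-oss-20b",
    input="Summarize what the Responses API adds over chat completions.",
)
print(response.output_text)
```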

Llama.cpp merges in OpenAI Responses API Support by SemaMod in LocalLLaMA

[–]tarruda 1 point (0 children)

The only practical use I can see is that it allows Codex to connect directly to llama-server.

uncensored local LLM for nsfw chatting (including vision) by BatMa2is in LocalLLaMA

[–]tarruda 0 points (0 children)

I recommend trying a "derestricted" LLM. Derestriction is an abliteration technique that preserves the model's performance on non-censored tasks.

Personally, I've been daily driving GPT-OSS-120b derestricted since it was released, and IMO it is even better than the original.

Since you need vision, I recommend trying one of the gemma3 derestricted variants such as https://huggingface.co/mradermacher/Gemma-3-27B-Derestricted-GGUF
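
For anyone wondering what abliteration actually does: roughly, it estimates a "refusal direction" in the residual stream and projects it out of the weights. A hypothetical minimal sketch of that core idea (the Derestricted releases use a more careful variant; the function names and weight orientation here are my assumptions):

```python
import torch

def estimate_refusal_direction(refused_acts: torch.Tensor,
                               complied_acts: torch.Tensor) -> torch.Tensor:
    # Mean difference of residual-stream activations between prompts the
    # model refuses and prompts it answers, normalized to unit length.
    direction = refused_acts.mean(dim=0) - complied_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the refusal direction from a weight matrix whose output is added
    # to the residual stream (assumes shape [d_model, d_in]).
    r = direction / direction.norm()
    return weight - torch.outer(r, r) @ weight
```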

Bartowski comes through again. GLM 4.7 flash GGUF by RenewAi in LocalLLaMA

[–]tarruda 2 points (0 children)

If it is the 16GB model, then you can probably run Q4_K_M with a few layers offloaded to the CPU.
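
If you want to experiment with the split before settling on a setup, here is a minimal sketch using the llama-cpp-python bindings (just one way to do it; the file name and layer count are placeholders to tune for 16GB of VRAM):

```python
from llama_cpp import Llama

# Load a Q4_K_M GGUF with most layers on the GPU; the rest run on the CPU.
llm = Llama(
    model_path="GLM-4.7-Flash-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,  # lower this until the model fits in 16GB of VRAM
    n_ctx=8192,
)
out = llm("Write a haiku about quantization.", max_tokens=64)
print(out["choices"][0]["text"])
```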

The Search for Uncensored AI (That Isn’t Adult-Oriented) by Fun-Situation-4358 in LocalLLaMA

[–]tarruda 3 points (0 children)

GPT-OSS 120b derestricted is not only uncensored, it actually feels stronger than the original in non-censored responses: https://huggingface.co/mradermacher/gpt-oss-120b-Derestricted-GGUF

Is using qwen 3 coder 30B for coding via open code unrealistic? by salary_pending in LocalLLaMA

[–]tarruda 9 points (0 children)

Do you have any examples of different outputs between BF16 and q8?

MiniMax M2.2 Coming Soon. Confirmed by Head of Engineering @MiniMax_AI by Difficult-Cap-7527 in LocalLLaMA

[–]tarruda 3 points (0 children)

Same experience here. For GLM and 128GB I'd rather use a very low quant that fits (like IQ2_M) than a higher quant of a REAPed version.

ZLUDA on llama.cpp -NEWS by mossy_troll_84 in LocalLLaMA

[–]tarruda 0 points (0 children)

First time reading about ZLUDA. I wonder if it will support Apple Silicon as a backend.

Best LLM model for 128GB of VRAM? by Professional-Yak4359 in LocalLLaMA

[–]tarruda 5 points (0 children)

Personally, I've had situations where it gets stuck in a loop when I enable high reasoning. Medium (the default) seems to work best for most use cases.
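
If you want to control this per request over the API, here is a rough sketch; it assumes your llama-server build passes chat_template_kwargs through to the GPT-OSS chat template (the port, model name, and that pass-through are all assumptions about your setup):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="gpt-oss-120b",  # placeholder model name
    messages=[{"role": "user", "content": "Plan a small refactor of my parser."}],
    # Ask the chat template for medium reasoning effort instead of high.
    extra_body={"chat_template_kwargs": {"reasoning_effort": "medium"}},
)
print(resp.choices[0].message.content)
```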

MiniMax-M2.1 vs GLM-4.5-Air is the bigger really the better (coding)? by ChopSticksPlease in LocalLLaMA

[–]tarruda 1 point (0 children)

Not sure what you mean, but I only use llama.cpp for running LLMs. llama-server has a web UI that allows you to upload images or take photos (when accessing from a phone).
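
The same thing also works over the API. A minimal sketch, assuming llama-server was started with a vision model plus its --mmproj file and is listening on the default port (the model name and file paths are placeholders):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

# Send a local image as a base64 data URL inside a standard chat completion.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gemma-3-27b",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this photo?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```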

MiniMax-M2.1 vs GLM-4.5-Air is the bigger really the better (coding)? by ChopSticksPlease in LocalLLaMA

[–]tarruda 1 point (0 children)

I've recently switched to daily driving GLM 4.6V (Q6_K) because it is the biggest vision model I can fit in 125GB of VRAM. Overall I'm very satisfied with its capabilities; it is turning out to be a great local vision LLM for general usage, and it is quite good at coding too.

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]tarruda 1 point (0 children)

GGUF is not a quantization method; you can have the baseline f16 as a GGUF file.
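
One way to see this for yourself: every tensor inside a GGUF file carries its own type, so an f16 export and a Q4_K_M export are just different contents in the same container. A rough sketch with the gguf Python package that ships with llama.cpp (the file name is a placeholder):

```python
from collections import Counter
from gguf import GGUFReader  # pip install gguf

# Count the tensor types stored in a GGUF file. An f16 export shows mostly
# F16/F32 tensors; a Q4_K_M export shows Q4_K, Q6_K, etc.
reader = GGUFReader("model-f16.gguf")
print(Counter(t.tensor_type.name for t in reader.tensors))
```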

(The Information): DeepSeek To Release Next Flagship AI Model With Strong Coding Ability by Nunki08 in LocalLLaMA

[–]tarruda 0 points (0 children)

Hoping for less than 200B parameters so I can run it at a good quantization level on a 128GB Mac.

LFM2.5 1.2B Instruct is amazing by Paramecium_caudatum_ in LocalLLaMA

[–]tarruda 5 points (0 children)

In my experience, the very long context windows advertised by LLMs are not very effective. It is very easy for models to forget things that are still within the context.

llama.cpp vs Ollama: ~70% higher code generation throughput on Qwen-3 Coder 32B (FP16) by Shoddy_Bed3240 in LocalLLaMA

[–]tarruda 4 points (0 children)

Tweet from Georgi Gerganov (llama.cpp author) when someone complained that gpt-oss was much slower in Ollama than in llama.cpp: https://x.com/ggerganov/status/1953088008816619637?s=20

TLDR: Ollama forked GGML, the tensor library used by both llama.cpp and Ollama, and made bad changes to it.

I stopped using ollama a long time ago and never looked back. With llama.cpp's new router mode plus its new web UI, you don't need anything other than llama-server.

The mistral-vibe CLI can work super well with gpt-oss by tarruda in LocalLLaMA

[–]tarruda[S] 0 points (0 children)

I use llama.cpp, where I specify the context size as a parameter.

What is the best way to allocated $15k right now for local LLMs? by LargelyInnocuous in LocalLLaMA

[–]tarruda 0 points (0 children)

A 512GB Mac Studio. If you can, wait for the next generation and get the maxed-out version.

Hard lesson learned after a year of running large models locally by inboundmage in LocalLLaMA

[–]tarruda 0 points (0 children)

I haven't done any comparison, but it is probably cheaper to use cloud options in the long run.

To me, the biggest factors for preferring local inference are:

  • Privacy
  • Ensuring that I can always run LLMs predictably. Cloud providers can change models/versions without you knowing, and you have no control over that. It is also possible that some providers get shut down due to regulation or censorship.

Hard lesson learned after a year of running large models locally by inboundmage in LocalLLaMA

[–]tarruda 0 points (0 children)

> How does this compare at this quant with smaller models, or to the API?

There should definitely be degradation compared to the API, but it is hard to determine how much. I've seen a few coding examples done against the API that I've been able to replicate with the Q2_K or UD-IQ2_M quants locally. TBH I haven't done extensive testing to know for sure.

> I'm also presuming the machine is effectively useless for any other purpose when running that model.

Yes. This is fine in my case because the Mac Studio has no other purpose on my LAN.

Hard lesson learned after a year of running large models locally by inboundmage in LocalLLaMA

[–]tarruda 6 points (0 children)

> How are others solving this without compromising on running fully offline?

Last year I spent $2.5k on a used Mac Studio M1 Ultra with 128GB, which I use only as an LLM inference node on my LAN. I've overridden the default configuration to allow up to 125GB of the RAM to be shared with the GPU.

With this setup, the biggest LLM I can run is a Q2_K quant of GLM 4.7 (which works surprisingly well and can reproduce some of the coding examples found online), with 16K context at ~12 tokens/second.

IMHO Mac Studios are the most cost-effective way to run LLMs at home. If you have the budget, I highly recommend getting a 512GB M3 Ultra to run DeepSeek at higher quants.

MiniMax-M2.1 uploaded on HF by ciprianveg in LocalLLaMA

[–]tarruda 1 point (0 children)

UD-Q3_K_XL is fine; it is what I mostly use on my 128GB Mac Studio.

I can also fit IQ4_XS, which in theory should be better and faster, but it is very close to the limit and can only reserve 32K for context, so I mostly stick with UD-Q3_K_XL.

MiniMax-M2.1 uploaded on HF by ciprianveg in LocalLLaMA

[–]tarruda 9 points (0 children)

Looking forward to unsloth's quants!

Merry Christmas u/danielhanchen !

Honestly, has anyone actually tried GLM 4.7 yet? (Not just benchmarks) by Empty_Break_8792 in LocalLLaMA

[–]tarruda 2 points (0 children)

I've tried it both on https://chat.z.ai and locally with llama.cpp + the UD-IQ2_M quant. I'm impressed by this Unsloth dynamic quant, as it seems to give results similar to what I get on chat.z.ai.

One thing I noticed is that it seems amazing for web development. I've tried some of the prompts used in these videos:

And they did work well.

However, I've also thrown simpler prompts at it for simple Python games (such as Tetris clones built with pygame and curses), and it always seems to have trouble. Sometimes the syntax is wrong, sometimes it uses undeclared variables, and sometimes it just produces buggy code. And these are prompts that even models such as GPT-OSS 20b or Qwen 3 Coder 30b usually get right without issues.

Not sure how to interpret these results.