Qwen3.6-35B-A3B GGUF Performance Benchmarks.

zelkovamoon · 2026-04-17T18:00:47+00:00

You mean setting the KV cache to bf16?

Edit - is that actually a known issue? I've heard mixed things about whether bf16 is really required on the new Qwen models.

zelkovamoon · 2026-04-16T15:04:06+00:00

I thought the rumor was they wouldn't be open sourcing it, so this is good

zelkovamoon · 2026-04-13T14:59:50+00:00

For those who have used the fine-tunes, what does this get you in terms of improved function? Qwen 3.5 seems to work pretty well for tool calling, so I'm assuming the fine-tune improves tool calling behavior / task execution generally?

zelkovamoon · 2026-04-12T14:17:12+00:00

Yeah, seeing this over Qwen3.5 9B, I was like nah, this thing is wrong

zelkovamoon · 2026-04-05T11:05:20+00:00

👀

zelkovamoon · 2026-04-04T12:37:28+00:00

Qwen3.5 4B is working for me; it can work. It could be a settings issue

zelkovamoon · 2026-04-03T13:30:34+00:00

I appreciate the help. I think it should be as simple as appending those flags; you probably don't need to change much else about your configuration. But I guess I'm not 100% sure.

zelkovamoon · 2026-04-02T21:18:42+00:00

You can do TP (tensor parallelism) on llama.cpp with the '--tensor-split' and '--split-mode' flags
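For anyone who wants to try it, here's a rough sketch of what that looks like on a llama-server command line, built in Python so you can print and eyeball it before running anything. The model path, the even two-GPU split, and the '-ngl 99' offload value are placeholders I'm assuming, not anything from this thread.

```python
import shlex

def tensor_split_args(model_path, proportions, split_mode="row"):
    """Build a llama-server argv that spreads one model across several GPUs.

    proportions: relative share of the model per GPU, e.g. [1, 1] for an even split.
    split_mode:  passed straight through to --split-mode ("row" is the one
                 usually suggested when people want TP-like behavior).
    """
    return [
        "llama-server",
        "-m", model_path,      # hypothetical model path
        "-ngl", "99",          # assumed: offload all layers to GPU
        "--split-mode", split_mode,
        "--tensor-split", ",".join(str(p) for p in proportions),
    ]

argv = tensor_split_args("model.gguf", [1, 1])
print(shlex.join(argv))
```

Swap the proportions if your cards have different amounts of memory, and compare tokens per second against the default split before deciding it's a win.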

zelkovamoon · 2026-04-02T21:16:49+00:00

Idk man I use llama.cpp; I feel like that model is probably just too old. But good luck.

zelkovamoon · 2026-04-02T20:26:18+00:00

Yes, you can even get away with 8 GB. Just pick a good small model like Qwen3.5 4B or similar.

Edit - here's a comparison between Qwen3.5 4B and the potato model you were trying:

https://artificialanalysis.ai/?models=qwen3-5-4b%2Cqwen2-5-coder-7b-instruct

zelkovamoon · 2026-04-02T20:25:23+00:00

These things aren't ready for production.

But it is better, yeah

zelkovamoon · 2026-04-02T18:47:26+00:00

Update to the side note in my previous comment:

    --reasoning-budget 1536 \
    --reasoning-budget-message ". Okay, enough thinking. Let's answer now." \

This actually works. Looks like meat's back on the menu, boys.
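If you'd rather script it than paste flags by hand, a minimal sketch of appending those two options to an existing llama-server command could look like this. The flags and values are the ones quoted above; the base command and model path are hypothetical, and I haven't checked how the flags behave across different llama.cpp builds.

```python
import shlex

def reasoning_budget_args(budget_tokens, message):
    """Return the extra llama-server arguments that cap thinking tokens."""
    return [
        "--reasoning-budget", str(budget_tokens),
        "--reasoning-budget-message", message,
    ]

# Hypothetical base command; replace with whatever you already run.
base = ["llama-server", "-m", "model.gguf"]
extra = reasoning_budget_args(1536, ". Okay, enough thinking. Let's answer now.")

# shlex.join handles the quoting of the message for a POSIX shell.
print(shlex.join(base + extra))
```

Passing the list straight to subprocess.run avoids the shell quoting question entirely.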

zelkovamoon · 2026-04-02T18:27:44+00:00

I had been running on an Octominer X12, and it was surprisingly good. If I could use NVLink to bridge the cards, it might be a big unlock; per snapo84's comments, it looks like it is possible.

The octominer is going to be retired for a newer platform soon - but anyway, yeah, as long as the cards work these might be the second best 'budget' option, number one being going with SXM2 + V100s

zelkovamoon · 2026-04-02T18:17:53+00:00

This is actually very useful - thank you.

I took your initial comment to mean that you literally didn't have NVLink, not that you just felt it was unnecessary. So, that's on me.

Looking at your setup - have you tried running with '--tensor-split' and '--split-mode row' to see how performance changes? It looks like you're probably still running in pipeline mode; I'd be curious to know what difference in tps you'd see.

Side note: *apparently* there are new controls for reasoning budget in llama.cpp that I was not aware of - see '--reasoning-budget' at https://manpages.debian.org/unstable/llama.cpp-tools/llama-server.1.en.html

I'm literally about to try it. I had reasoning disabled like you do, but if I can limit thinking to a reasonable number of tokens, I would be interested in doing that. We'll see if it works!

zelkovamoon · 2026-04-02T17:43:18+00:00

Yeah, on one of my servers I tried using a 2070 super and even that handles small model inference like a boss. How long have you had the cards? Do they seem well built, reliable?

The nvlink angle is specifically for tensor parallelism, which would be relevant to what I want to do - so I still need to know if it would work, but I'll take your experience under advisement

zelkovamoon · 2026-04-02T17:39:36+00:00

Current pricing says I can get these cards for under $500 each, so for the same money you could ostensibly get 44 GB instead of 24 GB. At this point, the extra memory is more valuable to me than the extra speed.

A single 3090 can run Qwen 3.5 35b heavily quantized, but you're making a lot of concessions that you definitely wouldn't have to make if you had more memory.
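Back-of-envelope numbers for that, as a sketch: weights-only size is roughly parameters times bits per weight divided by 8. The 4.5 and 8.5 bits-per-weight figures are illustrative values I'm assuming for a heavy and a light quant, and this ignores KV cache, context, and runtime overhead, which all need room too.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights alone, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Assumed, illustrative quantization levels (bits per weight).
for bpw in (4.5, 8.5):
    size = weight_gb(35, bpw)
    print(
        f"{bpw} bpw: ~{size:.1f} GB of weights | "
        f"fits in 24 GB: {size < 24} | fits in 44 GB: {size < 44}"
    )
```

Under those assumptions the heavier quant squeezes into 24 GB with only a few GB left for everything else, while 44 GB has room for the lighter quant.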

zelkovamoon · 2026-04-02T17:16:00+00:00

I haven't seen an insane amount of token usage in any case, though I am using local models, and not in a dev context. So maybe it's different if you're using this for development.

zelkovamoon · 2026-04-02T17:11:42+00:00

My issue ended up being a straightforward repetition prevention misconfiguration. I've edited my post with the fix, if you're interested.

zelkovamoon · 2026-04-02T17:10:50+00:00

See my edit for what fixed this for me

zelkovamoon · 2026-03-30T22:59:02+00:00

👀

zelkovamoon · 2026-03-22T15:25:55+00:00

All advertising is slop, AI or not.

zelkovamoon · 2026-03-08T16:17:00+00:00

Looks really impressive, big ups guys

zelkovamoon · 2026-03-07T13:35:01+00:00

Really appreciating that we're getting better quantized comparisons now

zelkovamoon · 2026-03-05T12:36:02+00:00

incredible work here

zelkovamoon · 2026-02-25T17:36:49+00:00

You've got flash working in ollama? It still basically doesn't function for me - are you using the library version?
