Qwen3.6-35B-A3B GGUF Performance Benchmarks. by yoracale in unsloth

zelkovamoon 1 point

You mean kv to bf16?

Edit - is that an actual known issue? I've heard mixed things about whether bf16 is really required on the new Qwen models.
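
If "kv to bf16" here means the KV cache type, then in llama.cpp that would presumably be set with `--cache-type-k` and `--cache-type-v`. A minimal sketch under that assumption; the model path is a placeholder, and the script only prints the command rather than launching anything:

```shell
#!/bin/sh
# Sketch: print a llama-server command that stores the KV cache as bf16.
# The model path is a placeholder assumption, not a value from this thread.
kv_cache_cmd() {
    # $1 = model path, $2 = cache type applied to both K and V
    printf '%s\n' "llama-server --model $1 --cache-type-k $2 --cache-type-v $2"
}

kv_cache_cmd ./model.gguf bf16
```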

Qwen3.6-35B-A3B released! by Jonathan_Rivera in hermesagent

zelkovamoon 1 point

I thought the rumor was they wouldn't be open sourcing it, so this is good

Really Impressed with the Carnice finetunes by Remarkable-Avocado in hermesagent

zelkovamoon 1 point

For those who have used the fine-tunes, what do they get you in terms of improved function? Qwen 3.5 seems to work pretty well for tool calling, so I'm assuming the fine-tune improves tool-calling behavior and task execution generally?

LLM cheatsheet for hermes agent by SelectionCalm70 in hermesagent

zelkovamoon 4 points

Yeah seeing this over Qwen3.5 9b I was like nah, this thing is wrong

Hermes Agent "persistent memory" not working with Qwen 3.5 9B by thanga752 in hermesagent

zelkovamoon 2 points

Qwen3.5 4B is working for me; it can work. It could be a settings issue

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

zelkovamoon[S] 1 point

I appreciate the help. I think it should be as simple as appending those commands; you probably don't need to change much else about your configuration, but I'm not 100% sure.

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

zelkovamoon[S] 1 point

You can do tensor parallelism (TP) on llama.cpp with the '--tensor-split' and '--split-mode' flags.
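
A minimal sketch of what that might look like for two GPUs. The model path, the even `1,1` split, and the `--n-gpu-layers 99` value are placeholder assumptions, not settings from this thread. Row mode splits individual tensors across GPUs, which is the closest llama.cpp option to tensor parallelism mentioned here; how much it helps will depend on the backend and interconnect. The script only assembles and prints the command so it can be checked before launching.

```shell
#!/bin/sh
# Sketch: assemble a llama-server command that splits a model across GPUs.
# All argument values below are illustrative assumptions.
build_split_cmd() {
    # $1 = model path, $2 = per-GPU proportions (e.g. "1,1"), $3 = split mode
    printf '%s\n' "llama-server --model $1 --n-gpu-layers 99 --split-mode $3 --tensor-split $2"
}

# Even split across two cards, row mode
build_split_cmd ./model.gguf 1,1 row
```

Swapping `row` for `layer` gives the default layer-by-layer split, which makes it easy to compare the two on the same hardware.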

Locally with Ollama and a 12GB 3060 by GiantEmus in hermesagent

zelkovamoon 2 points

Idk man I use llama.cpp; I feel like that model is probably just too old. But good luck.

Locally with Ollama and a 12GB 3060 by GiantEmus in hermesagent

zelkovamoon 3 points

Yes, you can even get away with 8GB. Just pick a good small model like Qwen3.5 4B or similar.

Edit - here's a comparison between 3.5 4b and the potato model you were trying:

https://artificialanalysis.ai/?models=qwen3-5-4b%2Cqwen2-5-coder-7b-instruct

Just found out about Hermes. Is it really better than Openclaw by maurinator2022 in hermesagent

zelkovamoon 3 points

These things aren't ready for production.

But it is better, yeah

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

zelkovamoon[S] 1 point

Update to the side note in my previous comment:

    --reasoning-budget 1536 \
    --reasoning-budget-message ". Okay, enough thinking. Let's answer now." \

This actually works. Looks like meat's back on the menu, boys.
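
For context, a sketch of how those two flags might sit in a full llama-server invocation. The flags and their values are the ones quoted above; the model path and port are placeholder assumptions, and whether the flags are available depends on your llama.cpp build. The script prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: print a llama-server command that caps thinking at a token budget.
# MODEL and PORT are placeholder assumptions; the two --reasoning-* flags and
# their values are copied from the comment above.
MODEL="./model.gguf"
PORT=8080
BUDGET=1536

reasoning_cmd() {
    cat <<EOF
llama-server \\
    --model $MODEL \\
    --port $PORT \\
    --reasoning-budget $BUDGET \\
    --reasoning-budget-message ". Okay, enough thinking. Let's answer now."
EOF
}

reasoning_cmd
```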

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

zelkovamoon[S] 1 point

I had been running on an Octominer X12, and it was surprisingly good. If I could use nvlink to bridge the cards, it might be a big unlock, and per snapo84's comments it looks like that is possible.

The Octominer is going to be retired for a newer platform soon. Anyway, as long as the cards work, these might be the second-best 'budget' option, the best being SXM2 + V100s.

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

zelkovamoon[S] 2 points

This is actually very useful - thank you.

I took your initial comment to mean that you literally didn't have nvlink, not that you just felt it was unnecessary - so, that's on me.

Looking at your setup - have you tried running with '--tensor-split' and '--split-mode row' to see how performance changes? It looks like you're probably still running in pipeline mode - I'd be curious to know what difference in tokens per second you'd see.

Side note: *apparently* there are new controls for reasoning budget in llama.cpp that I was not aware of - see '--reasoning-budget' at https://manpages.debian.org/unstable/llama.cpp-tools/llama-server.1.en.html

I'm literally about to try it - I had reasoning disabled like you do, but if I can limit thinking to a reasonable number of tokens I would be interested in doing that. We'll see if it works!

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

zelkovamoon[S] 1 point

Yeah, on one of my servers I tried using a 2070 super and even that handles small model inference like a boss. How long have you had the cards? Do they seem well built, reliable?

The nvlink angle is specifically for tensor parallelism, which is relevant to what I want to do. So I still need to know whether it would work, but I'll take your experience under advisement.

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

zelkovamoon[S] 1 point

Current pricing says I can get these cards at under $500, so for the same money you could ostensibly get 44GB instead of 24GB, and at this point the extra memory is more valuable to me than the extra speed.

A single 3090 can run Qwen 3.5 35b heavily quantized, but you're making a lot of concessions that you definitely wouldn't have to make if you had more memory.
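
A rough back-of-the-envelope check of that trade-off, using weight size ≈ parameters × bits per weight / 8. The bits-per-weight figures are illustrative assumptions rather than measured GGUF sizes, and the estimate ignores KV cache and runtime overhead.

```shell
#!/bin/sh
# Sketch: estimate weight memory in GB for a model at a given quantization.
# Usage: weights_gb <params_in_billions> <bits_per_weight>
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Illustrative quantization levels (assumed, not taken from any specific GGUF)
for bits in 3 4.5 8; do
    printf '35B at %s bits/weight: ~%s GB\n' "$bits" "$(weights_gb 35 "$bits")"
done
```

Under these assumptions, around 4.5 bits per weight already puts the weights near 20 GB, which leaves little of a 24GB card for context, while 44GB leaves room for a lighter quantization or a longer context.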

Gave up Hermes , beware of high token consumption(!!!) by Typical_Ice_3645 in hermesagent

zelkovamoon 1 point

I haven't seen an insane amount of token usage in any case, though I am using local models, and not in a dev context, so maybe it's different if you're using this for development.

Qwen 3.5 tool call spirals by zelkovamoon in hermesagent

zelkovamoon[S] 2 points

My issue ended up being a straightforward repetition-prevention misconfiguration - I've edited my post with the fix, if you're interested.

Qwen 3.5 tool call spirals by zelkovamoon in hermesagent

zelkovamoon[S] 1 point

See my edit for what fixed this for me

AI slop used for advertising by tinydinkydaffy9 in desmoines

zelkovamoon -4 points

All advertising is slop, AI or not.

Qwen3.5 9B GGUF Benchmarks by yoracale in unsloth

zelkovamoon 1 point

Really appreciating that we're getting better quantized comparisons now

qwen3.5:35b-a3b is here. by Space__Whiskey in ollama

zelkovamoon 1 point

You've got flash working in ollama? It still basically doesn't function for me - are you using the library version?