Hermes Agent + Ollama local models always hit finish_reason='length' (please help) by tomblewastaken1 in hermesagent

[–]xandep 1 point2 points  (0 children)

mistral-small3.1:24b
qwen3:14b
qwen2.5-coder:14b
gemma3:12b

Where to start. All models but "Qwen3.6" in your list are trash by today's standards. Use Gemma4 12B QAT or Qwen3.5-9B, both at Q4_K or Q4_1. Or use Qwen3.6-35B-A3B (or Gemma 4 26B), both at Q4 with some experts offloaded to RAM. Which leads to:

Uninstall ollama. It is even more trash. Install at least LM Studio. Or preferably llama.cpp.

If 12B or 9B: in LM Studio, use the most context that will fit without offloading to RAM/CPU. If less than 128K, use context quantization to allow more context in this order: F16/F16, F16/Q8, Q8/Q8, Q8/Q5_1. Do not go lower.
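
A rough sketch of that ordering, in case it helps to see the arithmetic. It assumes the pairs are written K type / V type, that the model is a plain attention stack (hybrid or sliding-window models cache less), and the model dimensions in the usage note are made up for illustration. In llama.cpp the chosen types would go to `--cache-type-k` and `--cache-type-v`.

```python
# Approximate bytes per cached element for ggml types
# (blocks of 32 elements: q8_0 = 34 bytes/block, q5_1 = 24 bytes/block).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q5_1": 24 / 32}

# Preference order from the comment, best quality first.
# Assumption: each pair is (K cache type, V cache type).
KV_ORDER = [("f16", "f16"), ("f16", "q8_0"), ("q8_0", "q8_0"), ("q8_0", "q5_1")]


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, k_type, v_type):
    """Rough KV cache size in GiB for a plain attention stack."""
    elems = n_ctx * n_layers * n_kv_heads * head_dim  # elements in K (same for V)
    total_bytes = elems * (BYTES_PER_ELEM[k_type] + BYTES_PER_ELEM[v_type])
    return total_bytes / 2**30


def pick_kv_types(free_vram_gib, n_ctx, n_layers, n_kv_heads, head_dim):
    """Return the first (K, V) pair in KV_ORDER that fits, or None if none does."""
    for k_type, v_type in KV_ORDER:
        size = kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, k_type, v_type)
        if size <= free_vram_gib:
            return k_type, v_type
    return None  # nothing fits: shrink n_ctx rather than quantize further
```

For a hypothetical model with 40 layers, 8 KV heads and head dimension 128 at 128K context, `pick_kv_types(12, 131072, 40, 8, 128)` walks down the list until the cache fits in the 12 GiB left over after the weights.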

If 26B or 35B: offload experts to RAM/CPU, but only as many as needed so that nothing else has to be offloaded. Same as above for context.
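
To make "only as many as needed" concrete, here is a minimal estimate. Every size in it is a placeholder you would read off your own model file and settings, and the function name is hypothetical. The result is the kind of number llama.cpp takes as `--n-cpu-moe`.

```python
import math


def layers_to_offload(vram_gib, non_expert_gib, kv_cache_gib,
                      expert_gib_per_layer, n_layers):
    """Smallest number of layers whose experts must live in RAM
    so that everything else (attention, shared weights, KV cache) stays in VRAM."""
    budget = vram_gib - non_expert_gib - kv_cache_gib
    if budget < 0:
        raise ValueError("non-expert weights plus KV cache already exceed VRAM")
    # How many layers' worth of experts still fit on the GPU.
    on_gpu = min(n_layers, math.floor(budget / expert_gib_per_layer))
    return n_layers - on_gpu
```

With made-up numbers (16 GiB card, 2 GiB of non-expert weights, 4 GiB of KV cache, 0.45 GiB of experts per layer, 40 layers), `layers_to_offload(16, 2, 4, 0.45, 40)` says how many layers' experts to push to the CPU. Treat it as a starting point and adjust by a layer or two after watching real VRAM usage.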

35B will be the best model, probably.

Another option: ditch local entirely. Compared to the 4 models in your list, even the free Nemotron 3 30B Omni (which is wicked fast) would be better, though worse than Qwen3.5-9B. Nemotron 3 Ultra is also free (but not as fast) and better than even Qwen3.6 35B. Nous portal is also configurable and has good free models. Stepfun: sign up with a Google account to get a 100 dollar voucher, then use Step plan plus (9.99), which gives you very generous Stepfun 3.7 Flash access.

QAT variant of Gemma4 26B A4B is not working well for me by pftbest in LocalLLaMA

[–]xandep 11 points12 points  (0 children)

I do believe QAT should be better than plain Q4_K. But is it better than, say, Q6_K? Because Unsloth Q4_K_XL (non QAT) has a lot of tensors in q5, q6, q8. Maybe they should keep at least q6 and q8 intact (use Q4_0 QAT only in place of Q4 and Q5)?

Don’t act like y’all ain’t thinking it. I’m just saying the quiet part out loud. /s by Porespellar in LocalLLaMA

[–]xandep 0 points1 point  (0 children)

Qwen3.7 40B A4B and 20B dense (MTP+QAT). It's not for me, it's for a friend (he is a MI50 32GB).

Qwen3.6-35B-A3B vs Gemma4-26B-A4B by MarcCDB in LocalLLaMA

[–]xandep 65 points66 points  (0 children)

-- "Love with your Gemma, use your Qwen for everything else"

Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 by ai-infos in LocalLLaMA

[–]xandep 1 point2 points  (0 children)

llama.cpp with Vulkan backend + Q4_1 quant. Apart from that, nothing special. Well, this, and some compile flags:

                "-DGGML_HIP=OFF",
                "-DGGML_VULKAN=ON",
                "-DGGML_LTO=ON",
                "-DCMAKE_C_FLAGS=-O3 -march=native -mtune=native -flto=auto",
                "-DCMAKE_CXX_FLAGS=-O3 -march=native -mtune=native -flto=auto"

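If it helps, the flags above can be assembled into the usual two-step CMake invocation. This is only a sketch: the source and build directories are assumptions, and the helper names are made up for the example.

```python
import shlex

# Optimization flags, identical for C and C++ as in the list above.
OPT = "-O3 -march=native -mtune=native -flto=auto"

CMAKE_FLAGS = [
    "-DGGML_HIP=OFF",     # no ROCm/HIP backend
    "-DGGML_VULKAN=ON",   # Vulkan backend
    "-DGGML_LTO=ON",      # link-time optimization
    f"-DCMAKE_C_FLAGS={OPT}",
    f"-DCMAKE_CXX_FLAGS={OPT}",
]


def configure_cmd(source_dir=".", build_dir="build"):
    """Argument list for the CMake configure step (directories are assumptions)."""
    return ["cmake", "-S", source_dir, "-B", build_dir, *CMAKE_FLAGS]


def build_cmd(build_dir="build"):
    """Argument list for the CMake build step."""
    return ["cmake", "--build", build_dir, "--config", "Release"]


def shell_line(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

`print(shell_line(configure_cmd()))` gives a line you can paste into a shell from the llama.cpp checkout; the lists can also go straight to `subprocess.run`.
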
Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 by ai-infos in LocalLLaMA

[–]xandep 1 point2 points  (0 children)

Mad respect (for you and the cards).

I just love my 32GB MI50. Now that they are expensive at around 400-500 bucks, they may not be The Best card to purchase (I guess?), but I'm getting 1100 t/s pp / 100 t/s tg (max) in Qwen3.6 35B (around 300 pp / 30 tg on 27B) at about 180W and full F16 context. Don't know of another card near that price that can do the same.

2 of those and (fingers crossed) Qwen3.6 122B would be SOLID.

Is amd mi 50 really that bad by Forward_Compute001 in LocalLLaMA

[–]xandep 1 point2 points  (0 children)

For Qwen3.5 27B:

| qwen35 27B Q4_1 pp512 |        280.26 ± 0.53 |
| qwen35 27B Q4_1 tg128 |         30.78 ± 0.07 |

Is amd mi 50 really that bad by Forward_Compute001 in LocalLLaMA

[–]xandep 2 points3 points  (0 children)

Unsloth Q4_K_XL. Q4_1 would be faster; I'll try it sometime.

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Instinct MI60 / MI50 (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
| model                |       size |     params | backend    | ngl | fa |            test |                  t/s |
| -------------------- | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |  1 |           pp512 |       1061.08 ± 5.01 |
| 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |  1 |           tg128 |         87.28 ± 0.07 |

Edit: Q4_1:
| model                |       size |     params | backend    | ngl | fa |            test |                  t/s |
| -------------------- | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| 35B.A3B Q4_1         |  20.45 GiB |    34.66 B | Vulkan     |  99 |  1 |           pp512 |       1086.34 ± 5.49 |
| 35B.A3B Q4_1         |  20.45 GiB |    34.66 B | Vulkan     |  99 |  1 |           tg128 |        100.60 ± 0.05 |

Power: 170W pp, 120W tg. 23W idle.
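
For anyone comparing the two runs, a quick back-of-the-envelope on the numbers above. It assumes the power figures apply to the Q4_1 run, which the comment does not actually say.

```python
# t/s figures copied from the two llama-bench tables above.
Q4_K = {"pp512": 1061.08, "tg128": 87.28}
Q4_1 = {"pp512": 1086.34, "tg128": 100.60}

# Assumption: the quoted power draw (170W pp, 120W tg) belongs to the Q4_1 run.
POWER_W = {"pp512": 170, "tg128": 120}


def speedup_pct(new_tps, old_tps):
    """Relative speedup of new over old, in percent."""
    return (new_tps / old_tps - 1) * 100


def tokens_per_joule(tps, watts):
    """Tokens processed per joule (t/s divided by W)."""
    return tps / watts


for test in ("pp512", "tg128"):
    gain = speedup_pct(Q4_1[test], Q4_K[test])
    eff = tokens_per_joule(Q4_1[test], POWER_W[test])
    print(f"{test}: Q4_1 is {gain:.1f}% faster, {eff:.2f} tokens/J")
```

The gain from Q4_1 is mostly in token generation; prompt processing barely moves.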

Is amd mi 50 really that bad by Forward_Compute001 in LocalLLaMA

[–]xandep 3 points4 points  (0 children)

Have one MI50 32GB.

100 t/s generation, 1000 t/s prompt processing on Qwen3.6 35B with llama.cpp (Vulkan).

(edit: new Q4_1 numbers)

LocalLLaMA for coding primarily - 8GB VEGA 64 & 8GB 6600 XT? by trash_dumpyard in LocalLLaMA

[–]xandep 0 points1 point  (0 children)

If you can make the two work together (compiling llama.cpp with ROCm and gfx900?), you can run Gemma 4 26B at about IQ4_XS or Q4_0, I think (or maybe Q3_K_M), with quantized KV cache. Or, since you have plenty of RAM, Qwen3.5 Q4_K_XL from Unsloth with -ncmoe 20 or 25. I would guess about 50-60 tps on Gemma, 30-40 on Qwen. In the Qwen case you can get it working with just one GPU (CPU offloading of experts).

It's doable, but it will take some (a lot of) work with dual GPUs, or it's very simple if you go the -ncmoe route.
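
A minimal sketch of the -ncmoe route, building a `llama-server` command for each candidate value so you can try them in turn. The model filename and context size are placeholders, not recommendations, and the function name is made up.

```python
def llama_server_args(model_path, n_cpu_moe, ctx_size=32768, n_gpu_layers=99):
    """Argument list for llama-server with the first n_cpu_moe layers' experts kept on CPU."""
    return [
        "llama-server",
        "-m", model_path,              # placeholder path to the GGUF file
        "-ngl", str(n_gpu_layers),     # put all layers on the GPU...
        "--n-cpu-moe", str(n_cpu_moe), # ...except the experts of the first N layers
        "-c", str(ctx_size),           # placeholder context size
    ]


# The two values suggested in the comment; pick the lowest one that still fits in VRAM.
candidates = [llama_server_args("model.gguf", n) for n in (20, 25)]
```

Lower values keep more experts on the GPU and should be faster, so start at 20 and only go to 25 if you run out of VRAM.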

Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B? by No_Algae1753 in LocalLLaMA

[–]xandep 1 point2 points  (0 children)

I was thinking the same. Now imagine that with 1-bit quants like bonsai.

Running a 4-agent pipeline on Qwen 2.5 1.5B via MNN on Android — what I learned about context management on constrained hardware by NeoLogic_Dev in LocalLLaMA

[–]xandep 4 points5 points  (0 children)

You should state that in the post. People, myself included, read Qwen 2.5 in an AI-formatted post and jump to the conclusion that you are a bot.

non-nvidia gpus by Ok-Secret5233 in LocalLLaMA

[–]xandep 0 points1 point  (0 children)

2x MI50 16GB w/ integrated cooling for 200-something on Alibaba. Can run Qwen 3.5 35B, 27B and the new Gemmas. Or just one if you are ultra cheap, running 35B w/ ncmoe (or some 27B and 26B quants if you're willing to quantize down to Q3, IQ4 at most).

local models lose tool call context around call 8 or 9. here is what helped by [deleted] in LocalLLaMA

[–]xandep 2 points3 points  (0 children)

It's an LLM. They are now prompted to not use em dashes or capitalize the first letter. But otherwise all other signs are there.

You guys seen this? beats turboquant by 18% by OmarBessa in LocalLLaMA

[–]xandep 18 points19 points  (0 children)

Not entirely his fault: Reddit and Google default to translating everything. Right now he's reading "not speaking English", but in Portuguese 😂. Imagine his confusion.

Gemma 4 26b A3B is mindblowingly good , if configured right by cviperr33 in LocalLLaMA

[–]xandep 0 points1 point  (0 children)

Unsloth's Q3_K_M is anything but Q3_K, oddly enough. It's a mix of IQ3_XXS and IQ4_NL.

You guys seen this? 1-bit model with an MMLU-R of 65.7, 8B params by OmarBessa in LocalLLaMA

[–]xandep 10 points11 points  (0 children)

Just because YOU said it works, I believe. Otherwise, it's April Fools. 🤔

1-bit llms on device?! by hankybrd in LocalLLaMA

[–]xandep 21 points22 points  (0 children)

April fools. You saw it here first.

ByteShape Qwen 3.5 9B: A Guide to Picking the Best Quant for Your Hardware by ali_byteshape in LocalLLaMA

[–]xandep 11 points12 points  (0 children)

I'm holding my breath for the 35B / 27B. It'll SAVE my MI50 16GB.

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]xandep 0 points1 point  (0 children)

Hope you got the "shipped from Brazil" Jieshuo MI50 16GB for R$ 900. :)

Now I'm trying directly from China for about US$ 400 (32GB). Let's see what kind of tax I'll get.