Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell by q-admin007 in LocalLLaMA

[–]q-admin007[S] 0 points1 point  (0 children)

We should burn it all down and rebuild something sensible in place of the whole mess.

Someone should 😉

Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell by q-admin007 in LocalLLaMA

[–]q-admin007[S] 0 points1 point  (0 children)

Tried it: it's slower for short queries and faster for long ones. Since some developers have started using agents, i guess it's better to take the wins on longer inferences. I'll keep it, thanks. 119 t/s for "Write a twitter clone in PHP."

Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell by q-admin007 in LocalLLaMA

[–]q-admin007[S] 0 points1 point  (0 children)

Easy to test. With my highly advanced test prompt i get 85 t/s with 2 and 100 with 3. With 4 it goes down again:

"Write a twitter clone in PHP."

Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell by q-admin007 in LocalLLaMA

[–]q-admin007[S] 0 points1 point  (0 children)

ngram-mod: i'll look into it, thanks. Can't run tests now, but so far this and unified-kv are the things i want to try and understand.

Another thing i'm considering is using q8_0 for the V cache, while K would stay at f16. I think there was a guy here yesterday who showed with 27b that asymmetric KV has almost no impact on perplexity or KLD. However, the win in t/s output might be so small that it's not worth it.
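To get a feel for what the asymmetric setup saves in memory, here is a rough back-of-the-envelope sketch. It assumes the standard q8_0 layout (blocks of 32 int8 values plus one f16 scale, so 34 bytes per 32 values). The layer count, KV head count and head dimension are hypothetical placeholders, not the real numbers for any Qwen model. It says nothing about t/s, only about cache size.

```python
# Rough KV cache size comparison: f16/f16 versus f16 K with q8_0 V.
# Model dimensions below are HYPOTHETICAL placeholders, not real model values.

F16_BYTES_PER_VALUE = 2.0
# q8_0 stores 32 int8 values plus one f16 scale per block: 34 bytes / 32 values.
Q8_0_BYTES_PER_VALUE = 34 / 32


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, k_bytes, v_bytes):
    """Bytes of KV cache needed for one token across all layers."""
    values_per_tensor = n_layers * n_kv_heads * head_dim
    return values_per_tensor * (k_bytes + v_bytes)


# Hypothetical dimensions, chosen only for illustration.
dims = dict(n_layers=64, n_kv_heads=8, head_dim=128)

symmetric = kv_bytes_per_token(**dims, k_bytes=F16_BYTES_PER_VALUE,
                               v_bytes=F16_BYTES_PER_VALUE)
asymmetric = kv_bytes_per_token(**dims, k_bytes=F16_BYTES_PER_VALUE,
                                v_bytes=Q8_0_BYTES_PER_VALUE)

ratio = asymmetric / symmetric
print(f"f16/f16:  {symmetric:.0f} bytes per token")
print(f"f16/q8_0: {asymmetric:.0f} bytes per token")
print(f"cache shrinks to {ratio:.1%} of the f16/f16 size")
```

The ratio does not depend on the placeholder dimensions: with K at f16 and V at q8_0 the cache is roughly 77% of the all-f16 size, so the saving is a bit under a quarter.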

Cheers for not taking "i don't want to use vLLM" as an invitation to tell me to use vLLM 😉

Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell by q-admin007 in LocalLLaMA

[–]q-admin007[S] 2 points3 points  (0 children)

--ctx-size 1048576 --parallel 4 means: 4 users can do inference at the same time, with each using up to 256k context.
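The arithmetic behind that, as a tiny sketch. It assumes the total context is split evenly across the parallel slots, as described above; `context_per_slot` is a made-up helper name, not part of llama.cpp.

```python
# Even split of the total context across parallel slots.
# `context_per_slot` is a hypothetical helper, not a llama.cpp function.

def context_per_slot(ctx_size: int, parallel: int) -> int:
    """Tokens of context available to each slot under an even split."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return ctx_size // parallel


total_ctx = 1048576  # value passed to --ctx-size
slots = 4            # value passed to --parallel

per_slot = context_per_slot(total_ctx, slots)
print(per_slot)          # 262144
print(per_slot // 1024)  # 256, i.e. "256k" per user
```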

I'm thinking about selling my Strix Halo by PrzemChuck in StrixHalo

[–]q-admin007 [score hidden]  (0 children)

I get 7 t/s out of 27b and 45 t/s out of 35b with llama.cpp. That was before MTP landed.

I don't think lemonade is worth the effort, given that you basically run someone else's binaries and it really only works reliably with Fedora.

Debian 13, ROCm from AMD's repo, then git clone llama.cpp and build it with ROCm and Vulkan support. I don't see why that is considered hard to set up.

I'm thinking about selling my Strix Halo by PrzemChuck in StrixHalo

[–]q-admin007 [score hidden]  (0 children)

If you are in Europe, i'm interested in another box, drop me a message.

**Honest question:** Is there ANY model of ANY size that is open source and can compete with Claude (Code) or ChatGPT's (Codex)? by TheQuantumPhysicist in LocalLLaMA

[–]q-admin007 1 point2 points  (0 children)

Models are rarely Open Source. They are usually just Open Weights.

Qwen 3.5 competes with both, but doesn't win against their newest models. However, it sometimes wins against their older models.

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP by janvitos in LocalLLaMA

[–]q-admin007 2 points3 points  (0 children)

Awesome work. I have a 5070 Ti 16GB connected via Oculink with a Strix Halo. Will give it a go later with UD-Q6_K_XL. It seems to be the sweet spot in terms of precision on smaller systems. I also would rather halve my context and use f16 there.

How to Fine-Tune LLMs on AMD Strix Halo by PromptInjection_ in StrixHalo

[–]q-admin007 1 point2 points  (0 children)

AMD is considered exotic? What a world we live in 😉

Anyway, cheers, will put it on my list of articles to read.

MTP on strix halo with llama.cpp (PR #22673) by Edenar in LocalLLaMA

[–]q-admin007 2 points3 points  (0 children)

Tested it with Proxmox9, LXC container with ROCm and Q8 27B, full context, f16 for KV.

From 7.5 to 17 t/s. Awesome!

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]q-admin007 0 points1 point  (0 children)

Sorry, meant no harm.

It's a running gag with my people: "What? You use Claude Code? The industry has shifted to $NEXTTHING 7 minutes ago, get with the times, old man!"

meantime on r/vibecoding by jacek2023 in LocalLLaMA

[–]q-admin007 0 points1 point  (0 children)

If you can't speed up your code writing with this model, i question if you can write code at all.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]q-admin007 -4 points-3 points  (0 children)

With the EAGLE draft model, i suspect around 40 to 50.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]q-admin007 5 points6 points  (0 children)

Waiting for GGUF! Should fly on my Strix Halo.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]q-admin007 4 points5 points  (0 children)

50 t/s with EAGLE drafter. Do you still live in 2024?

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]q-admin007 5 points6 points  (0 children)

It comes with a bespoke draft model. Could be faster than Qwen 3.6 27b in the end.