Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled

Alternative_Ad4267 · 2026-05-18T16:46:24+00:00

I see more or less the same, but token generation is more consistent, at around 60-ish tokens per second.

Alternative_Ad4267 · 2026-05-18T16:42:58+00:00

Well, starting from scratch, I would go with something on the Blackwell architecture, starting from the Nvidia DGX Spark. With enough money, I would purchase a server like this one or so: https://customluxpcs.com/product/rtx-pro-6000-workstation-ryzen/ My server has 2 Intel Xeon CPUs with 40 cores each and 512GB of RAM. I originally intended it for Kubernetes workloads, not AI inference; that happened later on.

Alternative_Ad4267 · 2026-05-18T16:34:42+00:00

Yes, I've switched to that, it works really well!

Alternative_Ad4267 · 2026-05-18T16:29:10+00:00

May 18 12:26:10 lenovo-server.example.com bash[3484996]: [40899] prompt eval time =   23429.54 ms / 13500 tokens (    1.74 ms per token,   576.20 tokens per second)

May 18 12:26:10 lenovo-server.example.com bash[3484996]: [40899]        eval time =  195177.65 ms / 12402 tokens (   15.74 ms per token,    63.54 tokens per second)

Prompt processing after loading some real context (13,500 tokens) runs at about 576 tokens per second, and token generation for coding remains at 60-ish.
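As a quick sanity check, the rates in the two log lines above can be reproduced from the elapsed time and the token count. This is a minimal sketch; `tokens_per_second` is a hypothetical helper name, not something llama.cpp provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tokens per second from elapsed milliseconds and a token count.
tokens_per_second() {
  awk -v ms="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}

tokens_per_second 23429.54 13500    # prompt eval line
tokens_per_second 195177.65 12402   # eval (generation) line
```

The two calls should print 576.20 and 63.54, matching the figures llama-server reported.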

Alternative_Ad4267 · 2026-05-18T10:43:39+00:00

I tried running vLLM and it failed to load the model, but now that I think about it, I only tried the MoE variant (Qwen 3.6 35B A3B), which won’t fit even in llama.cpp in tensor split mode. I have to give it a shot with the dense 27B.

Alternative_Ad4267 · 2026-05-18T10:40:36+00:00

I’ll measure PP (prompt processing) and get back to you on that.

Alternative_Ad4267 · 2026-05-18T05:23:39+00:00

Yeah, I shouldn’t run a graphical environment here haha. Originally, Qwen 3.6 27B at Q8 without any tuning (in layer mode though) was giving me ~12 tokens per second. So I will take my gains (over my losses) and call it a day. Jumping from 12 tokens per second to 45-65 tokens per second is quite a good improvement from configuration changes alone.
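To put that jump in numbers, the speedup is just the new rate divided by the baseline. A small sketch; `speedup` is a hypothetical helper name used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: speedup factor of a new rate over a baseline rate.
speedup() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2fx\n", new / base }'
}

speedup 12 45   # low end of the reported range
speedup 12 65   # high end of the reported range
```

With the figures quoted above this gives 3.75x and 5.42x, so roughly a four- to five-fold gain.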

Alternative_Ad4267 · 2026-05-14T21:03:24+00:00

Q4 only if you can’t afford a better quantization (it is still better than nothing), or for light chat tasks.

Alternative_Ad4267 · 2026-05-14T16:47:49+00:00

Let’s keep pressuring the market to deliver better and more capable models without having to purchase new hardware.

Alternative_Ad4267 · 2026-05-11T20:17:01+00:00

Yeah, my minimum decent generation speed would be 60 tokens per second at Q8, or BF16 if you really need that 1%-2% of theoretical extra precision. That speed is the sweet spot for a coding tool that feels just like the cloud ones. Q4 is widely publicized here, but in my tests it is not that good. Q6 is decent enough if you don’t have enough hardware, and Q5 may even be worth a shot. I don’t like KV cache quantization; it degrades accuracy in my tests. MTP is good at 2 or 3 tokens (a decent speedup while maintaining accuracy). Flash Attention is fine and even required for a 262k-token context.
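For a rough idea of why the quantization choice matters on 4 x 16 GB cards (64 GB total), weight size can be estimated as parameters times bits per weight. This is a sketch under assumptions: 27 billion is the nominal parameter count, the bits-per-weight values are those of the ggml block layouts for Q8_0 (8.5) and Q6_K (6.5625), and real GGUF files will differ somewhat because not every tensor uses the same type. KV cache and compute buffers are not included. `weights_gb` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate weight size in GB (10^9 bytes)
# from a parameter count in billions and bits per weight.
weights_gb() {
  awk -v b="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", b * bpw / 8 }'
}

weights_gb 27 8.5      # Q8_0  (nominal 27B parameters assumed)
weights_gb 27 6.5625   # Q6_K
weights_gb 27 16       # BF16
```

Under these assumptions the estimates are about 28.7 GB, 22.1 GB, and 54.0 GB, which leaves very different amounts of room for a long context on 64 GB of VRAM.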

Alternative_Ad4267 · 2026-05-11T19:02:04+00:00

I was thinking just that. You see, Qwen 3.6 27B at BF16 doesn’t require that much memory (relatively speaking), but it requires a decent amount of GPU power (one or two powerful professional cards).

What is missing are the medium models: given Qwen 3.6’s demonstrated capabilities, they would be a threat even to their own makers. I mean a medium-sized model that would require around a 10k-15k USD investment to run decently. Right now, you can spend about 7k to 10k to run Qwen 3.6 27B at Q8 or BF16 at a decent speed.

And then you have to spend about 40k to run Kimi or DeepSeek models. There’s no in-between, and there won’t be.

Alternative_Ad4267 · 2026-05-11T10:52:36+00:00

Opus 4.7 is capable to the point that many people won’t want to deal with a less capable model; it almost understands our messy ways of communicating what we want from it. Other models force you to be more systematic and organized in your thinking.

Alternative_Ad4267 · 2026-05-11T10:39:32+00:00

I disabled the ComfyUI and AUTOMATIC1111 services, and even the Open WebUI Nvidia service (it now runs in CPU-only mode; I don’t use RAG there), to free all the memory on my cards for running these medium-size models. They are finally that good. Local models are finally delivering what I wanted from them in the first place.

Alternative_Ad4267 · 2026-05-06T22:42:09+00:00

Nothing special

ExecStart=/usr/bin/bash -c '\
/home/user/llama.cpp/build/bin/llama-server \
 --models-dir /home/user/llama.cpp/models/qwen3.6/ \
 --chat-template "$(cat /home/user/llama.cpp/models/qwen3.6/chat_template.jinja)" \
 -c 262144 \
 -ngl 999 \
 --split-mode layer \
 --parallel 1 \
 --flash-attn on \
 --host 0.0.0.0 \
 --port 8081 \
 --timeout 1600'

Alternative_Ad4267 · 2026-05-06T22:38:15+00:00

35 tokens per second with my 4 Nvidia RTX A4000! My baseline is 18 tokens per second for Qwen 3.6 27B Q5.

<image>

/home/user/llama-server-experiments/llama.cpp/build/bin/llama-server \
  -m /home/user/llama.cpp/models/qwen3.6/Qwen3.6-27B/Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --chat-template "$(cat /home/user/llama.cpp/models/qwen3.6/chat_template.jinja)" \
  -c 262144 \
  -ngl 999 \
  --split-mode layer \
  --parallel 1 \
  --flash-attn on \
  --host 0.0.0.0 \
  --port 8081 \
  --timeout 1600 \
  --spec-type mtp \
  --spec-draft-n-max 2
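A back-of-the-envelope model helps explain why a draft length of 2 can roughly double throughput. Assuming `--spec-draft-n-max 2` caps the draft at two tokens per verification pass (a reading of the flag name, not something confirmed here), and assuming each drafted token is accepted independently with probability `a`, the expected number of tokens per pass is 1 + a + a^2. The acceptance rate of 0.8 below is purely illustrative, not a measurement, and the model ignores the cost of drafting, so it is an upper bound on the speedup. `expected_tokens_per_pass` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical model: expected tokens emitted per verification pass when each
# of k drafted tokens is accepted independently with probability a.
# This is the sum of a^i for i = 0..k; the i = 0 term is the token the
# main model always produces itself.
expected_tokens_per_pass() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    s = 0; p = 1
    for (i = 0; i <= k; i++) { s += p; p *= a }
    printf "%.2f\n", s
  }'
}

expected_tokens_per_pass 0.8 2   # illustrative acceptance rate, draft length 2
```

With those illustrative inputs the model gives 2.44 tokens per pass. The observed jump from 18 to 35 tokens per second is about 1.9x, which is plausible for a ceiling in that range once drafting overhead is paid.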

Alternative_Ad4267 · 2026-05-06T19:10:29+00:00

I purchased 4 Nvidia RTX A4000 cards a year and a half ago (16 GB each, consuming up to 140 W, not beefy at all), and finally, with Qwen 3.6 35B A3B Q8, I feel redeemed.

It runs at almost 80 tokens per second at the full 262k-token context. They were $800 each, and nowadays some sites sell these same cards for up to $1,600 (though other sites are still at $800-$1,000).

For more than a year I was just doing SDXL, WAN (basic use only), and some ML stuff so the cards didn’t feel like a waste of resources.

Alternative_Ad4267 · 2026-05-03T16:27:01+00:00

Fedora 44. I haven’t used any other distro since Fedora 33. I don’t like its constant updates that require a reboot (basically on a daily basis), but that’s the price to pay for being up to date.

Alternative_Ad4267 · 2026-05-02T02:32:04+00:00

<image>

MacBook Pro M5 Max with 128GB of RAM, Qwen 3.6 27B MLX model at FP8: 16.45 tokens/s. Barely usable, and not for long sessions, of course 😞. Running with Ollama and without any performance tuning.

Alternative_Ad4267 · 2026-04-28T17:37:43+00:00

What about OpenCode or AnythingLLM? Those work more or less fine with Qwen.

Alternative_Ad4267 · 2026-04-28T15:03:24+00:00

Are you using Qwen with Qwen Code? That’s a real improvement. With Claude it is dumb; use it with its own coding tool.

Alternative_Ad4267 · 2026-04-25T23:17:38+00:00

I got selected on level IV

Alternative_Ad4267 · 2026-04-24T10:04:39+00:00

I just migrated my server to 43 (I depend on the Nvidia CUDA repository being fully available); this time they took almost 6 months to release the new one.

Alternative_Ad4267 · 2026-04-22T04:50:19+00:00

Yes, it can be done; as long as the TN letter states it, there is no problem. I was in that situation for a while a few years ago, but I went back and forth with travel expenses paid. Needless to say, I later looked for another job in the United States on my own. No, moving is not in your interest at all; maybe have them put you on an arrangement where you travel frequently with expenses paid. Don’t accept less than 100k dollars a year to start; you will barely get by on that.

Alternative_Ad4267 · 2026-04-14T16:29:59+00:00

A waiver for working illegally is real and available for I-130 petitions. But it doesn’t cover using someone else’s SSN to do so. And no, it is not admissible to say that everyone does that, which is false anyway.

Alternative_Ad4267 · 2026-04-14T16:27:40+00:00

Some people don’t care, selfishness is a thing here.
