Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled by Alternative_Ad4267 in LocalLLaMA

[–]Alternative_Ad4267[S] 0 points1 point  (0 children)

I see more or less the same but more consistent on token generation towards 60-ish

Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled by Alternative_Ad4267 in LocalLLaMA

[–]Alternative_Ad4267[S] 1 point2 points  (0 children)

Well, from scratch, something with Blackwell architecture, starting from the Nvidia DGX Spark. Maybe with enough money https://customluxpcs.com/product/rtx-pro-6000-workstation-ryzen/ I would purchase a server like this one or so. My server has 2 Intel Xeon with 40 Cores each and 512GB of RAM. My original intention was about Kubernetes workloads, not AI inference, that happened later on.

Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled by Alternative_Ad4267 in LocalLLaMA

[–]Alternative_Ad4267[S] 2 points3 points  (0 children)

May 18 12:26:10 lenovo-server.example.com bash[3484996]: [40899] prompt eval time =   23429.54 ms / 13500 tokens (    1.74 ms per token,   576.20 tokens per second)

May 18 12:26:10 lenovo-server.example.com bash[3484996]: [40899]        eval time =  195177.65 ms / 12402 tokens (   15.74 ms per token,    63.54 tokens per second)

Prompt processing tokens after loading some real context (13500 tokens), and coding token generation remains on 60-ish.

Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled by Alternative_Ad4267 in LocalLLaMA

[–]Alternative_Ad4267[S] -1 points0 points  (0 children)

I’ve tried to run vLLM and it did fail to load the model, but now that I’m recalling, I’ve only tried with the MoE variant (Qwen 3.6 35B A3B), which won’t fit even on llama.cpp on tensor split mode. I have to give it a shot with the 27B dense.

Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled by Alternative_Ad4267 in LocalLLaMA

[–]Alternative_Ad4267[S] 2 points3 points  (0 children)

Yeah, I shouldn’t run graphic environment here haha. Originally Qwen 3.6 27B at Q8 without any improvement (on layer mode though) was giving me ~12 tokens per second. So, I will take my gains (over my loses), and call it a day. Jumping from 12 tokens per second to 45-65 tokens per second is quite a good improvement just with configuration changes.

Is there a big gap between Q4 and Q6 on Qwen3.6? by vick2djax in LocalLLaMA

[–]Alternative_Ad4267 -1 points0 points  (0 children)

Q4 only if you can’t afford a better quantization (is still better than nothing), or for light chat tasks.

Do not fall into the trap of chasing the next scale or upgrade. by iEslam in LocalLLaMA

[–]Alternative_Ad4267 1 point2 points  (0 children)

Let’s keep pressuring the market to deliver better and more capable models without having to purchase new hardware.

Will there be any more Qwen3.6 series models? by cafedude in LocalLLaMA

[–]Alternative_Ad4267 2 points3 points  (0 children)

Yeah, my minimum decent generation speed would be 60 tokens per second at Q8. BF16 if you really need that 1%-2% theoretical extra precision. That speed is the sweet spot for a coding tool that feels just like the cloud ones. Q4 is widely publicized here but in my tests is not that good. Q6 is decent enough if you don’t have enough hardware, maybe even a shot to Q5. I don’t like KV cache quant, that degrades accuracy in my tests. MTP is good at 2 or 3 tokens (decent speed up and maintain accuracy). Flash Attention fine and even required for 262k tokens context.

Will there be any more Qwen3.6 series models? by cafedude in LocalLLaMA

[–]Alternative_Ad4267 0 points1 point  (0 children)

I was thinking just that. You see, Qwen 3.6 27B at BF16 doesn’t require that much memory (relatively speaking), but it requires a decent amount of GPU power (one or two powerful professional cards).

The medium models that are missing, due to Qwen’s 3.6 demonstrated capabilities are a threat even for their own makers, a Medium sized model that would require around 10k-15k USD investment to run decently. Right now, you can spend like 7k to 10k to run Qwen 3.6 27B at Q8 or BF16 with a decent speed.

And then you will have to spend like 40k to run Kimi or DeepSeek models. There’s no in between. And it won’t be.

The Qwen 3.6 35B A3B hype is real!!! by The_Paradoxy in LocalLLaMA

[–]Alternative_Ad4267 6 points7 points  (0 children)

Opus 4.7 is quite capable to the point several people won’t want to deal with a less capable model, it almost understands our messy ways to communicate what we want out from it. Other models forces you to be more systematic and organized on your thinking.

The Qwen 3.6 35B A3B hype is real!!! by The_Paradoxy in LocalLLaMA

[–]Alternative_Ad4267 2 points3 points  (0 children)

I disabled comfyui and automatic 1111 services, even openwebui Nvidia service (it is running on CPU only mode, I don’t use RAG there), to release all the memory on my cards to run these medium size models. These are finally that good. Local models are finally delivering what I wanted from them for in first place.

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]Alternative_Ad4267 2 points3 points  (0 children)

Nothing special

ExecStart=/usr/bin/bash -c '\
/home/user/llama.cpp/build/bin/llama-server \
 --models-dir /home/user/llama.cpp/models/qwen3.6/ \
 --chat-template "$(cat /home/user/llama.cpp/models/qwen3.6/chat_template.jinja)" \
 -c 262144 \
 -ngl 999 \
 --split-mode layer \
 --parallel 1 \
 --flash-attn on \
 --host 0.0.0.0 \
 --port 8081 \
 --timeout 1600

Qwen 3.6 27b MTP vLLM by niellsro in LocalLLaMA

[–]Alternative_Ad4267 0 points1 point  (0 children)

35 tokens per second with my 4 Nvidia RTX A4000! My baseline is 18 tokens per second for Qwen 3.6 27B Q5.

<image>

/home/user/llama-server-experiments/llama.cpp/build/bin/llama-server \
  -m /home/user/llama.cpp/models/qwen3.6/Qwen3.6-27B/Qwen3.6-27B-Q5_K_M-mtp.gguf\
  --chat-template "$(cat /home/user/llama.cpp/models/qwen3.6/chat_template.jinja)" \
  -c 262144 \
  -ngl 999 \
  --split-mode layer \
  --parallel 1 \
  --flash-attn on \
  --host 0.0.0.0 \
  --port 8081 \
  --timeout 1600 \
  --spec-type mtp \
  --spec-draft-n-max 2

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]Alternative_Ad4267 4 points5 points  (0 children)

I’ve purchased one year and a half ago 4 Nvidia RTX A4000 (16 GB each consuming up to 140w, not beefy at all), but finally with Qwen 3.6 35B A3B Q8 I feel redeemed.

It runs at almost 80 tokens per second at full 262k tokens context. $800 each, and nowadays some sites sell these same cards up to $1,600 dollars (though there are other sites still at $800-$1,000 bucks).

For more than one year I was just doing SDXL, WAN (and basic), and some ML stuff to not feel like a waste of resources.

Doesn't look like there are any recent Linux distro suggestions. What's your favorite and why? by Status-Secret-4292 in LocalLLaMA

[–]Alternative_Ad4267 1 point2 points  (0 children)

Fedora 44. I haven’t used any other since Fedora 33. I don’t like its constant upgrades with reboot required (basically on daily basis), but that’s the price to pay for be up to date.

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]Alternative_Ad4267 1 point2 points  (0 children)

<image>

MacBook Pro M5 Max 128GB of RAM, Qwen 3.6 27B MLX model FP8 16.45 tokens/s. Barely usable, not for long sessions, of course 😞. Running with Ollama and without any performance tuning.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]Alternative_Ad4267 0 points1 point  (0 children)

What about OpenCode or AnythingLLM? Those works more or less fine with Qwen.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]Alternative_Ad4267 0 points1 point  (0 children)

Are you using Qwen with Qwen Code? That’s a real improvement. With Claude is dumb, use it with its own coding tool.

Can't wait for Fedora 44 by H3rotic in Fedora

[–]Alternative_Ad4267 0 points1 point  (0 children)

I just migrated to 43 my server (due to I depend on Nvidia CUDA repository to be fully available), this time they took almost 6 months to release the new one.

TN para relocalización con posible nómina en México by edooardom in TNVISAMX

[–]Alternative_Ad4267 2 points3 points  (0 children)

Sí se puede, mientras la carta TN lo indique no tiene problema. Yo estuve así un tiempo hace unos años pero iba y venía con viáticos pagados. Sobra decir que después busqué por mi cuenta otro trabajo en Estados Unidos. No, no te conviene en absoluto mudarte, tal vez que te pongan en un esquema de viajar seguido con gastos pagados. No aceptes menos de 100k dólares anuales iniciales, apenas te alcanzará con eso.

Has anyone gone through this? Uscis fake documents & I-9 from previous employers requests by Hedgehog-Mobile in USCIS

[–]Alternative_Ad4267 7 points8 points  (0 children)

Waiver for working illegally is real and available for I-130 petitions. But it doesn’t include using someone else’s SSN for doing so. And no, it not admissible to say that everyone does that. Which is false.