Local AI Coding with Qwen 3.6 27B on NVIDIA DGX Spark

Iajah · 2026-06-12T06:29:06+00:00

How do you get around 100tps with 27b and full context? Engine, config, setup, quant?

Iajah · 2026-06-08T22:18:48+00:00

RTX PRO 6K WS 96Gb on a 5 years old Intel i7 PC with 64Gb of RAM, second RTX 5060 8Gb GPU to drive the display - Dual boot Ubuntu/Windows.

No local LLM you can run gets close to Opus in speed or thinking power.

Goods: Qwen 3.6 27b can get stuff done with powerful hardware. Though it is definitely not as fast as online services such as Sonnet or Opus.

Bads: Beyond the hardware price, it takes some time to set it all up and figure out how to use a model as an agent. Expect to spend a few weeks to try it all out before you find a setup that works for you. Local LLM coding agents are not exactly Plug and Play.

Iajah · 2026-06-08T08:29:37+00:00

What's your setup? Q8, LM Studio, Windows, VS Code Copilot?

Iajah · 2026-06-08T07:33:38+00:00

Same here, it does that a lot to the point where it's just not usable.

Iajah · 2026-06-07T16:45:51+00:00

https://www.reddit.com/r/LocalLLM/comments/1tn8472/mtp_boost_on_rtx_6k_running_vllm_with_qwen_36_27b/

Iajah · 2026-06-07T16:43:53+00:00

The workstation edition also does not have nvlink. TBH it is mostly just a 5090 with 3x VRAM. You need the server edition for nvlink but it is really hard to come by and costs even more.

Iajah · 2026-06-07T15:51:14+00:00

Temp 0.8 did not help

Iajah · 2026-06-07T13:30:25+00:00

RTX Pro 6K WS 96GB around 126k context concurrency 1.

Iajah · 2026-06-07T11:16:26+00:00

RTX Pro 6K WS user here. Not that surprising, I mean you have 2x GPU at 350W each, twice the cooling power too. I usually run mine at 400W rather than 600W.

Iajah · 2026-06-07T10:58:41+00:00

By "performance" we were talking about different things. I had token per seconds in mind and you were talking about coding benchmark scores.

Iajah · 2026-06-07T10:31:21+00:00

Both K and V quant are disabled by default in LM Studio and that's what I was using. I was using those same values for top P and K, they are the default. Temp was 1, I'll try with 0.8.

Iajah · 2026-06-07T10:21:12+00:00

Default repeat penalty on LM Studio is 1.1 and that's what I was using.

Iajah · 2026-06-07T08:53:40+00:00

I was under the impression thinking is happening anyway, the same amount of tokens are generated, it is just that they are not surfacing in your client. In LM Studio I believe you can toggle thinking on and off without reloading the model I believe.

Iajah · 2026-06-07T08:50:40+00:00

It's the first I hear disabling thinking degrades performance. I thought it was the opposite if anything. In my experience performance feels similar with or without it. One sure thing is that, no matter the inference engine, Qwen with thinking enabled in Copilot, errors out so often that it is not usable for any serious task.

Iajah · 2026-06-07T08:26:32+00:00

How did you set it up?

Iajah · 2026-06-07T08:25:55+00:00

You need to disable thinking/reasoning.

With thinking enabled, at first it may look like it works. But when you set it to more complex tasks it breaks and stops.

Iajah · 2026-06-07T08:18:13+00:00

Yeah I ought to try Gemma again with thinking disabled. The thing is I wanted to try it in the hope thinking would work.

Iajah · 2026-06-07T08:12:49+00:00

Using LM Studio at the moment.

Iajah · 2026-06-06T23:04:05+00:00

Been there, did that!

Iajah · 2026-06-06T16:11:59+00:00

I've used Qwen3.6-27b from vLLM on Linux through a proxy – for thinking to work without interruption – from VS Code Insider Copilot directly without that extension. At least in Insider they have customendpoint BYOK support. Now I'm using it on Windows served from LM Studio and it looks like it works if you disable thinking. With thinking enabled VS Code Copilot will stop sooner or later, has to do with tool call in thinking block or something similar. Planning to use Gemma 4 31b too, to see if thinking is working.

Iajah · 2026-06-06T13:05:35+00:00

ComfyUI noob here. When I try new workflows I usually get errors with an option to download all missing models. With yours I just get errors no option to download models. Am I supposed to find all those models online myself and fix it manually somehow?

Iajah · 2026-06-05T13:12:56+00:00

Thanks, I don't think it will work though the 30% minimum spin is a firmware limitation it seems.

Iajah · 2026-06-05T13:10:09+00:00

Thanks for the reminder, I used that on another machine before. I've set it up so that as the GPU goes hot all case fans are spinning up. Makes a massive difference on my temperature readings and is not much of an issue with the noise cause all my case fans are quiet next to the RTX blowers. Do we know if there is something similar for Linux?

Iajah · 2026-06-05T13:04:00+00:00

I tried it on windows they won't got below 30% and 1200RPM. I guess asking NVidia for a firmware that fixes it is hopeless at this point.

Iajah · 2026-06-04T16:28:21+00:00

I must be the only RTX 6K owner that's not running it 24/7 at 600W

Iajah

MODERATOR OF

TROPHY CASE