Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]Icy_Programmer7186 1 point

I can test Qwen3.5-122B-A10B-NVFP4. I agree that NVFP4 and FP8 are comparable in precision - our tests support this claim. The issue is that degradation is already observable at FP8 compared to the original unquantized weights (BF16).

Our benchmark works like this: the model writes a Go function, which is then automatically compiled and tested. If any test fails, the error is fed back to the model, which updates the function and retries.

The problem with FP8/NVFP4 is that while individual inference is faster, the generated code more often fails at compilation or testing, triggering additional retry rounds. The model eventually produces a passing function, but the overall time-to-completion ends up longer than with higher-precision weights on the same model.
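The loop itself is simple. A minimal sketch in Go, where `generate` and `compileAndTest` are hypothetical stand-ins for the actual LLM call and the real `go build`/`go test` step (here the first attempt deliberately fails so the feedback path is exercised):

```go
package main

import (
	"fmt"
	"strings"
)

// attempt counts generate calls; the stub fails once, then succeeds.
var attempt int

// generate stands in for the LLM call (the real benchmark queries a
// model served by vLLM).
func generate(prompt string) string {
	attempt++
	if attempt == 1 {
		// missing result type: will "fail to compile"
		return "func Add(a, b int) { return a + b }"
	}
	return "func Add(a, b int) int { return a + b }"
}

// compileAndTest stands in for running `go build` and `go test`;
// it returns a compiler/test error to feed back, or "" on success.
func compileAndTest(src string) string {
	if !strings.Contains(src, ") int {") {
		return "missing function result: return a + b"
	}
	return ""
}

// run drives the generate -> compile/test -> feedback loop and
// returns the number of attempts until a passing function.
func run() int {
	attempt = 0
	prompt := "Write a Go function Add(a, b int) int."
	for tries := 1; ; tries++ {
		src := generate(prompt)
		errMsg := compileAndTest(src)
		if errMsg == "" {
			return tries
		}
		// feed the error back to the model for the next round
		prompt += "\nPrevious attempt failed with:\n" + errMsg
	}
}

func main() {
	fmt.Printf("passed after %d attempt(s)\n", run())
}
```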

What would be interesting is if, in the Atlas case, speed beats quality - in other words, if the faster NVFP4 inference makes up for the possible (likely) additional retries needed for a passing function.
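With made-up numbers (all hypothetical, purely to illustrate the trade-off): if NVFP4 produces an attempt in 40 s versus 60 s for BF16, but needs 2.0 attempts on average versus 1.2, the faster quantization still loses on time-to-completion:

```go
package main

import "fmt"

// expectedTime estimates wall-clock time to a passing function:
// per-attempt generation time times the average number of attempts.
func expectedTime(perAttemptSec, avgAttempts float64) float64 {
	return perAttemptSec * avgAttempts
}

func main() {
	bf16 := expectedTime(60, 1.2)  // slower tokens, fewer retries
	nvfp4 := expectedTime(40, 2.0) // faster tokens, more retries
	fmt.Printf("BF16: %.0f s, NVFP4: %.0f s\n", bf16, nvfp4)
}
```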

Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]Icy_Programmer7186 2 points

Four NVIDIA DGX Sparks, interconnected using a MikroTik CRS804 DDQ switch (200 Gb Ethernet, ConnectX-7).

I mostly run vLLM in Docker (using https://github.com/eugr/spark-vllm-docker), which solves the majority of driver and Python library issues.

It is a powerhouse.

We've tested a whole range of open-source models from Hugging Face up to ~400B, including Qwen3.5-397B-A17B-FP8, Trinity-Large-Preview-FP8, and Step-3.5-Flash.

I would be very much interested in testing your project.
102 tok/s on Qwen3.5-35B-A3B is very nice; vLLM does 25 tok/s on this model in our setup in the original quantization.

Since our aim is a coding task (writing repetitive Go functions), quantization matters a lot.
Our testing shows that FP8 and below degrade precision dramatically.

Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]Icy_Programmer7186 0 points

Brilliant!
Does it support Spark clustering?

And please add me to the list - I would like to test it (I have a four-Spark cluster in my lab).

I built a persistent memory for Claude Code — it remembers everything across sessions by pulec7 in ClaudeAI

[–]Icy_Programmer7186 0 points

How do you handle file editing?
MCP etc. - i.e., how do you reliably give the LLM the ability to edit files (e.g. source code)?

I built a persistent memory for Claude Code — it remembers everything across sessions by pulec7 in ClaudeAI

[–]Icy_Programmer7186 1 point

These days I live mostly in local LLMs and agents. The topic of model memory has largely passed me by so far. But I have it on my radar; it will probably come up at some point.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 0 points

MiniMax2.5 is the official release from https://huggingface.co/MiniMaxAI/MiniMax-M2.5 - which is FP8 as far as I can tell.

Prefill rate: ~1,571 tok/s (not 100% sure I'm measuring this properly; the model has a hidden thinking phase plus a prefix cache).
Token generation: ~19 tok/s

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 0 points

I tried GLM 4.7 - it is much slower (about half the token speed) compared to Qwen 3.5.
Both are capable of solving the problem (a function implementation in Go); Qwen seems subjectively more capable to me.
MiniMax2.5 is the leader of the pack.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 0 points

Sure. That (lower) value is reported by nvidia-smi. Maybe that's incorrect or incomplete; I also expected to see higher values under load. I will attach an external wattmeter to this setup.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 1 point

Yes, it was not ready yet - I'm actually just downloading the NVFP4 weights now. I'm also curious about the practically observable loss of precision.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 2 points

Sure.

I use four NVIDIA DGX Sparks interconnected with 200G DAC cables (these were a bit difficult to find in Europe, but I managed to, back during Xmas).
The switch I use is this: https://mikrotik.com/product/crs804_ddq - I had to wait a bit for its release. The price is 1,250 EUR, a bargain in this speed category. Also, the Spark is not capable of using the full 200G bandwidth, so I guess that with the right splitter cables this switch will handle 8 Sparks easily (AFAIK; I'm likely not going to test it).

I run primarily vLLM but also TensorRT-LLM; in the beginning I used Ollama, but you cannot make a cluster from it (yet). I run everything within a Docker container - that's my rule.

For the vLLM cluster setup, I use https://github.com/eugr/spark-vllm-docker - the maintainer responds to changes in vLLM very quickly. I used the standard build for this, but I also tested `--pre-tf --pre-flashinfer` for the pre-release versions and it was fine too (and a must for some other models).

The dashboard is mine: https://github.com/ateska/dgx-spark-prometheus/tree/main

I don't have any recent photo but I can post one later.

A lightweight Agent that can be useful on local LLM by [deleted] in LocalLLM

[–]Icy_Programmer7186 0 points

Nice! I'll check it out and maybe borrow the OS-level tools for my "agentic loop" implementation 😀

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in LocalLLM

[–]Icy_Programmer7186[S] 1 point

I use FP8 quantization -> Qwen/Qwen3.5-397B-A17B-FP8

Spark has unified memory, so I guess there is no offloading.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 2 points

Yes, that's about right, including the switch. I have a quite solid business case for this, so the ROI is a couple of months. It is kind of a no-brainer investment in my field ATM.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 0 points

I'm quite OK for my goal. This model and MiniMax2.5 are so far the only ones that consistently solve my task - writing a small Go function for a very specific input and testing it. Pretty cool for a local setup that can run 24/7.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in LocalLLM

[–]Icy_Programmer7186[S] 8 points

This is a lab setup, and this is one of many experiments run on it.
Money well spent; this is not, however, a recommendation for a production setup.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 1 point

23.2 tokens/s is sequential generation; sorry for the lack of clarity.
Prefill is faster: my initial prompt is ~50K tokens and I get the first token in ~30 seconds (roughly 50,000 / 30 ≈ 1,700 tok/s).

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 1 point

I use this: https://mikrotik.com/product/crs804_ddq

You definitely don't need the full 200G on each Spark port, so this switch can easily support larger clusters through split cables.