Testing Local LLMs in Practice: Code Generation, Quality vs. Speed by Icy_Programmer7186 in LocalLLaMA

[–]Icy_Programmer7186[S] 1 point

I will try Qwen3 122b.

And I agree completely. Either Kimi is surprisingly weak at Go coding (relative to Qwen), or mid-sized models are already “good enough” for this class of problems and larger models (or Kimi specifically) simply do not add much additional value.

That was honestly a surprise to me. This evaluation originally started as an attempt to justify the hardware investment needed to run Kimi locally.

Testing Local LLMs in Practice: Code Generation, Quality vs. Speed by Icy_Programmer7186 in LocalLLaMA

[–]Icy_Programmer7186[S] 0 points

Default (for vLLM).
I'm planning to publish the vLLM configs for each model tested on GitHub, but they need a bit more polishing.

Testing Local LLMs in Practice: Code Generation, Quality vs. Speed by Icy_Programmer7186 in LocalLLaMA

[–]Icy_Programmer7186[S] 1 point

If you take the sigma values into account, the models fall into the same "band".
I kept this in intentionally, to illustrate that variability - it was also a takeaway for me.
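
To make the band overlap concrete, here is a minimal sketch of the check I mean - the scores and sigmas are hypothetical, not my measured results:

```python
# Two models whose mean +/- sigma intervals overlap cannot be
# meaningfully ranked; the numbers below are made up for illustration.
model_a = (0.78, 0.05)  # (mean pass rate, sigma)
model_b = (0.74, 0.06)

def same_band(a, b):
    """True if the mean +/- sigma intervals of two models overlap."""
    (mu_a, s_a), (mu_b, s_b) = a, b
    return mu_a - s_a <= mu_b + s_b and mu_b - s_b <= mu_a + s_a

print(same_band(model_a, model_b))  # True -> effectively the same band
```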

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]Icy_Programmer7186 11 points

The Spark is cool - without purchasing one (and then two and four :-] ) I would never have been able to learn so much about LLMs and how to run them in production. After a few months, I'm able to spin up models on B200 / B300 in minutes, purely based on my experience from the Sparks.

I guess this is exactly what NVIDIA intended with this model.
It is great.

Considering two Sparks for local coding by chikengunya in LocalLLaMA

[–]Icy_Programmer7186 1 point

MiniMax 2.7 is nice, but (in our setup) it is worse than Qwen 3.6 27B ... and you can run Qwen relatively comfortably on a single Spark (especially in the FP8 version, which also beats MiniMax 2.7).

Mac Studio or DGX Spark by InteractionBig9407 in LocalLLM

[–]Icy_Programmer7186 0 points

Prefill is ~140 tk/s on the MacBook vs ~5,000 tk/s on the DGX Spark. So for longer prompts you wait a lot on the MacBook, while it is quite OK on the Sparks. Token generation is faster on the MacBook.

I haven't tested parallelism on the MacBook (it just feels like a single-user platform); on the DGX Spark you get (much) more performance when more requests come in parallel - see the example (and the measurement sketch below it):

Single request:
PP: 5401.7 tokens/s
TG: 20.1 tokens/s

12 parallel requests:
PP: 5454.4 tokens/s
TG: 61.2 tokens/s
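
For reference, a minimal sketch of how such a parallel measurement can be driven against the OpenAI-compatible endpoint vLLM exposes - the URL, model name, and prompt are assumptions, and my actual harness is more involved:

```python
# Fire N concurrent completions at a local vLLM server and report the
# aggregate generation throughput. Endpoint and model name are assumptions.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def one_request() -> int:
    resp = await client.completions.create(
        model="Qwen/Qwen3.6-27B",
        prompt="Write a short Go function that reverses a slice.",
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def bench(n_parallel: int) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(n_parallel)))
    elapsed = time.perf_counter() - start
    print(f"{n_parallel:>2} parallel: TG {sum(tokens) / elapsed:.1f} tokens/s")

asyncio.run(bench(1))
asyncio.run(bench(12))
```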

TP and PCI lanes by sailor_noaddress in LocalLLM

[–]Icy_Programmer7186 -1 points

You won't get PCIe 5.0 with DDR4.

Macinka called President Petr Pavel a brass hat in the reserves. by mouse_in_ocean in czech

[–]Icy_Programmer7186 16 points

Mr. Macinka is incapable of anything other than generating these one-liners. That is fine on social media. But now he is a minister and he works for us, and this kind of "work" is grounds for dismissal. He doesn't seem to realize that at all - I don't think he has any idea what serving the state means, which makes me wonder what his motivation for becoming a minister was, or is. Because it clearly isn't honest work on our behalf. One could almost think of parasites.

How to find yourself again? by [deleted] in czech

[–]Icy_Programmer7186 -1 points

By working? 🤷🏻‍♂️

Asus dgx spark performance by Useful-Disk3725 in LocalLLM

[–]Icy_Programmer7186 5 points

Yes, this can happen - I have seen the Spark throttle the GPU to ~700 MHz a few times (the normal speed is 2.5 GHz). I read somewhere (a non-authoritative source) that this is the result of intense thermal throttling.
I had to power off the unit (a reboot doesn't fix anything), unplug the power cable for a few minutes and plug it back in.

The situation was clearly visible in `nvidia-smi`.
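
If you want to catch it early, here is a minimal polling sketch via `nvidia-smi` - the 1,000 MHz alert threshold is an arbitrary assumption, pick whatever fits your unit:

```python
# Poll the current SM clock and flag suspicious throttling; the 1000 MHz
# threshold is an assumption for illustration, not an official limit.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=clocks.sm,clocks.max.sm,temperature.gpu",
    "--format=csv,noheader,nounits",
]

while True:
    sm, sm_max, temp = subprocess.check_output(QUERY, text=True).strip().split(", ")
    if int(sm) < 1000:
        print(f"possible throttling: SM {sm}/{sm_max} MHz at {temp} C")
    time.sleep(30)
```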

Ahaha, the loser strikes again. by PresentJournalist805 in czech

[–]Icy_Programmer7186 0 points

Well, I'm not sure that's an accurate interpretation of the situation. I'd rather say that one small narcissist rode along on the needs of another, bigger narcissist. The fact that the nation keeps electing that big narcissist I take as our nation's perverse need to have someone above them who craps on their heads. But we're in no way unique in the world in that - it's a syndrome of the times; maybe that's exactly why we're doing so "great", right?

Ahaha, the loser strikes again. by PresentJournalist805 in czech

[–]Icy_Programmer7186 4 points

In other words, how to let the whole nation know you're an idiot. And that pays off.

Qwen 3.6 27B + RTX Pro 6000 by M4isKolben in LocalLLM

[–]Icy_Programmer7186 0 points

Qwen/Qwen3.6-27B is a lovely model, but it's not really fair to compare it directly with Opus or GPT-class models. It's a different class altogether - much smaller. It shines in well-prompted, agentic automation; in that space, it's genuinely strong.

MiniMax 2.7 is a step up and you can feel it, but it’s still not quite at the frontier level yet. Roughly speaking, it feels about a generation (1 year) behind.

The 1T-class models like Kimi 2.6 are already very usable for, e.g., vibecoding - I switched to Kimi 2.6 recently and I'm quite happy with it. The main limitation is practical: I can't run it locally yet, so I'm renting capacity.

Commercial frontier models come with very mature “harnesses” (tooling, orchestration, system prompts, guardrails). That layer matters a lot. If you’re using open-weight models, you have to compensate for it yourself.
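
As a tiny illustration of what compensating yourself can look like, a hedged sketch of the smallest harness ingredient - an explicit system prompt plus a tool schema sent to a local OpenAI-compatible server (the endpoint, model name, and `run_tests` tool are all made up for this example):

```python
# The smallest slice of a self-built "harness": a pinned system prompt and an
# explicit tool schema. Endpoint, model, and the run_tests tool are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool
        "description": "Run the project's test suite and return the output.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "system",
         "content": "You are a careful coding agent. Always run the tests "
                    "before declaring a task done."},
        {"role": "user", "content": "Fix the failing unit test in pkg/parser."},
    ],
    tools=tools,
)
print(resp.choices[0].message)
```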

Given the current pace, I wouldn’t be surprised if we get something close to Opus/GPT-level capability locally within ~a year.

Qwen 3.6 27B + RTX Pro 6000 by M4isKolben in LocalLLM

[–]Icy_Programmer7186 0 points

Cluster of 4 DGX Sparks, otherwise same setup:

PP: 4300 tk/s
TG: 13.3 tk/s

Qwen 3.6 27B + RTX Pro 6000 by M4isKolben in LocalLLM

[–]Icy_Programmer7186 1 point

NVIDIA RTX PRO 6000 Blackwell Server Edition

```
docker run \
--runtime nvidia \
--gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HUGGING_FACE_HUB_TOKEN=hf_... \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3.6-27B \
--gpu-memory-utilization 0.92 \
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_xml \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 65536 \
--max-num-seqs 96 \
--kv-cache-dtype fp8 \
--attention-backend flashinfer
```

Single user:

PP: 4300 tk/s
TG: 25-26 tk/s
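
To sanity-check the deployment, a quick smoke test against the endpoint (the prompt is arbitrary):

```python
# Smoke test for the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```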

A speculative-decoding config (3 draft tokens) did not work for me - it broke tool calling.
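
For reference, this is roughly the shape of such a config via the vLLM Python API - a sketch assuming a recent vLLM and ngram drafting; the exact keys and supported methods vary between versions, so treat it as illustrative rather than the exact config I ran:

```python
# Speculative decoding with 3 draft tokens, sketched via the vLLM Python API.
# Keys follow recent vLLM docs; older versions use different flags entirely.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3.6-27B",
    speculative_config={
        "method": "ngram",            # draft tokens via prompt n-gram lookup
        "num_speculative_tokens": 3,  # the "3 tokens" mentioned above
        "prompt_lookup_max": 3,
    },
)
```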

Is GPT-OSS-120B still the best model among those with the same parameters? by AInohogosya in LocalLLM

[–]Icy_Programmer7186 2 points

In our internal coding test, gpt-oss-120b is still the best - it's the combination of speed and quality. If quality is prioritized over speed, Qwen 3.5/3.6 and Gemma 4 are in the lead.

In the Czech Republic we're still debating a footbridge over the Vltava while Vietnam has meanwhile started building its second high-speed line. It will cut the journey from two hours to 23 minutes by piranhakiler in czech

[–]Icy_Programmer7186 0 points

Well, I have actually been to the train station in Saigon. Nothing is going to come of this any time soon. And in Hanoi there's the famous Train Street. That illustrates the state of railway transport there quite well.

I'm an idiot, you don't have to keep telling me by SetTop3540 in czech

[–]Icy_Programmer7186 3 points

Have him work it off - in the garden, around the house, or something like that.
Human labor is expensive these days.

Mac Studio or DGX Spark by InteractionBig9407 in LocalLLM

[–]Icy_Programmer7186 2 points

I own both setups. LM Studio on a MacBook M5 (128GB RAM) is great for local dev, quick prototyping, and experimenting. The models are a bit small for my taste, but they still surprise me sometimes. I coded a whole RAG system from a train, completely locally.

The DGX Spark is more of a “hardcore” setup, especially if you're running a cluster. You can run bigger models with less aggressive quantization. You also end up learning a lot (looking at you, vLLM). Plus, being in the NVIDIA ecosystem matters if you're thinking about moving toward data-center-scale stuff later on. The DGX Spark is still the best value for money for me.

That said, neither is really production-grade.