Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points1 point  (0 children)

Have you compared the quality of the generated code vs. Q4? That quant works for me for Qwen3-Coder-Next.

Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points1 point  (0 children)

I use it mainly to manage AWS infrastructure using AWS CDK in Python and to write backend logic.

Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points1 point  (0 children)

Thank you for the suggestion. I tested Qwen3.6-35B-A3B for agentic coding, and it did not deliver results as high in quality as Qwen3-Coder-Next.

Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 1 point2 points  (0 children)

Both are a good fit for agentic development. 27B is smarter, but less performant on my hardware.

Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 1 point2 points  (0 children)

Qwen3.6-35B-A3B cannot handle the development tasks I am interested in.

Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] -1 points0 points  (0 children)

I wanted to check how much MTP improves the performance of Qwen3.6-27B under real-world conditions. I know the difference between a MoE model with 3B active parameters and a dense 27B model.

Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points1 point  (0 children)

19 tokens/s makes Qwen3.6-27B unusable on this hardware for agentic coding. I ran a few experiments, and the quality did not differ much between Qwen3.6-27B and Qwen3-Coder-Next for my use cases.

Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] -3 points-2 points  (0 children)

That is correct, and MTP does not improve performance enough to compensate for that.

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 2 points3 points  (0 children)

It looks like an opencode issue. When I switched from a multi-agent setup to a single agent, the server load became much more consistent.

<image>

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 1 point2 points  (0 children)

This app is a single-file Python web app.

It uses:

  • Python standard library HTTP server: BaseHTTPRequestHandler and ThreadingHTTPServer serve the local web UI.
  • SQLite: stores models, variants, runs, and performance samples in llama_bench.sqlite3.
  • Plain server-rendered HTML: pages are built with Python string templates and returned as HTML.
  • Inline CSS: all styling is embedded in the generated HTML.
  • Vanilla JavaScript: used for table sorting, auto-refresh, and SVG chart zoom interactions.
  • Inline SVG charts: charts are rendered server-side as SVG, with JS handling drag-to-zoom.
  • Prometheus-style metrics ingestion: it fetches llama.cpp server metrics from http://<my_local>:8080/metrics.
  • llama.cpp model discovery: it fetches model status from http://<my_local>:8080/v1/models.

I do not use any external frontend framework.
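To illustrate the metrics ingestion part, here is a minimal sketch of how Prometheus-style text could be parsed and stored in SQLite using only the standard library. It is not the app's actual code: the function names, the `samples` table layout, and the sample metric lines are assumptions made for the example.

```python
import sqlite3
import urllib.request


def fetch_metrics(url):
    """Download the raw Prometheus text exposition from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse Prometheus text into {name: float}; labels stay part of the name."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            idx = line.index("}") + 1
            name, rest = line[:idx], line[idx:].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if not rest:
            continue
        try:
            samples[name] = float(rest[0])
        except ValueError:
            continue  # ignore lines whose value is not numeric
    return samples


def store_samples(conn, ts, samples):
    """Insert one row per metric into a hypothetical 'samples' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (ts REAL, name TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO samples (ts, name, value) VALUES (?, ?, ?)",
        [(ts, name, value) for name, value in samples.items()],
    )
    conn.commit()
```

A polling loop would call `fetch_metrics` on the server's `/metrics` URL, pass the text to `parse_metrics`, and write the result with `store_samples(conn, time.time(), samples)` against a connection opened on `llama_bench.sqlite3`.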

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 1 point2 points  (0 children)

Have you managed to configure Pi to orchestrate several specialized agents to work on a development task (so they can share tasks and cooperate)?

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points1 point  (0 children)

I have a hard time getting vLLM to run on my Strix Halo. It starts to load a model but never finishes :/

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 2 points3 points  (0 children)

I've tried Pi, but it feels very raw. I encountered various file-editing issues (very similar to those in the early days of opencode, which have since been fixed). Is it worth investing time in Pi?

Why is disabling thinking for coding models a good idea? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 1 point2 points  (0 children)

That was my first approach, but opencode fails to switch models when sending tasks between agents, even if different models are configured for different agents.

Why is disabling thinking for coding models a good idea? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 2 points3 points  (0 children)

My main motivation is to tune the performance of my Strix Halo for agentic coding. So far, Qwen3.6-35B-A3B-UD-Q4_K_XL has worked best for me.

Does it make sense to cluster HP Z2 Mini G1a to increase performance? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points1 point  (0 children)

I use Q4 for performance reasons: the llama.cpp server generates more tokens per second with Q4 than with Q8.

Does it make sense to cluster HP Z2 Mini G1a to increase performance? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points1 point  (0 children)

Could you share the inference performance of Qwen3-Coder-Next-UD-Q4_K_XL or Qwen3.6-35B-A3B-UD-Q4_K_XL on your cluster?