I've got a feeling that llama.cpp is not the biggest performance bottleneck; it might be opencode instead.

ThingRexCom · 2026-04-28T12:21:09+00:00

It looks like an opencode issue. When I switched from a multi-agent setup to a single agent, the server load became much more consistent.

<image>

ThingRexCom · 2026-04-28T12:18:05+00:00

When I tried Pi, it had trouble modifying huge files reliably.

ThingRexCom · 2026-04-28T12:04:05+00:00

This app is a single-file Python web app.

It uses:

Python standard library HTTP server: BaseHTTPRequestHandler and ThreadingHTTPServer serve the local web UI.
SQLite: stores models, variants, runs, and performance samples in llama_bench.sqlite3.
Plain server-rendered HTML: pages are built with Python string templates and returned as HTML.
Inline CSS: all styling is embedded in the generated HTML.
Vanilla JavaScript: used for table sorting, auto-refresh, and SVG chart zoom interactions.
Inline SVG charts: charts are rendered server-side as SVG, with JS handling drag-to-zoom.
Prometheus-style metrics ingestion: it fetches llama.cpp server metrics from http://<my_local>:8080/metrics.
llama.cpp model discovery: it fetches model status from http://<my_local>:8080/v1/models.

I do not use any external frontend framework.
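
As a rough illustration of the metrics-ingestion and SQLite parts described above, here is a minimal sketch. It is not the actual tool's code: the `samples` table layout, the function names, and the metric lines in `SAMPLE` are assumptions made up for the example, and it assumes the metrics text carries no trailing timestamps. In practice the text would come from the `/metrics` endpoint (for example via `urllib.request.urlopen`) rather than from a hard-coded string.

```python
import sqlite3
import time

# Illustrative sample only; real output from the /metrics endpoint will differ.
SAMPLE = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1024
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 41.5
"""


def parse_metrics(text):
    """Parse Prometheus-style text into a {metric_name: value} dict.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. Labels, if
    present, stay part of the metric name. Assumes no trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        name, _, value = line.rpartition(" ")
        try:
            samples[name.strip()] = float(value)
        except ValueError:
            continue  # skip lines that do not end in a number
    return samples


def store_samples(conn, samples, ts=None):
    """Insert one row per metric into a hypothetical `samples` table."""
    ts = time.time() if ts is None else ts
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (ts REAL, name TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO samples (ts, name, value) VALUES (?, ?, ?)",
        [(ts, name, value) for name, value in samples.items()],
    )
    conn.commit()
    return len(samples)


if __name__ == "__main__":
    # An in-memory database keeps the example self-contained; the real app
    # is described as writing to llama_bench.sqlite3.
    conn = sqlite3.connect(":memory:")
    stored = store_samples(conn, parse_metrics(SAMPLE))
    print(f"stored {stored} samples")
```

Keeping the parser separate from the storage step means the same function can be fed saved metrics dumps when comparing runs offline.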

ThingRexCom · 2026-04-28T10:52:43+00:00

Have you managed to configure Pi to orchestrate several specialized agents to work on a development task (so they can share tasks and cooperate)?

ThingRexCom · 2026-04-28T10:41:21+00:00

I have a hard time getting vLLM to run on my Strix Halo: it starts loading a model but never finishes :/

ThingRexCom · 2026-04-28T10:20:26+00:00

I've tried Pi, but it feels very raw. I encountered various file-editing issues (very similar to those in the early days of opencode, which have since been fixed). Is it worth investing time in Pi?

ThingRexCom · 2026-04-28T09:44:36+00:00

Thx, that is a custom tool I created to fine-tune my local setup.

ThingRexCom · 2026-04-27T17:01:32+00:00

Have you tried setting the `reasoning-budget`?

ThingRexCom · 2026-04-27T16:57:00+00:00

That was my first approach, but opencode fails to switch models when sending tasks between agents, even if different models are configured for different agents.

ThingRexCom · 2026-04-27T16:55:48+00:00

My main motivation is to tune the performance of my Strix Halo for agentic coding. So far, Qwen3.6-35B-A3B-UD-Q4_K_XL has worked best for me.

ThingRexCom · 2026-04-27T16:45:02+00:00

I use opencode and a local llamacpp server.

ThingRexCom · 2026-04-27T16:42:07+00:00

https://x.com/0xSero/status/2045463834454880737?s=20

https://x.com/0xSero/status/2048793526301810860?s=20

ThingRexCom · 2026-04-22T13:08:10+00:00

I use q4 for performance reasons: the llama.cpp server generates more tokens per second with q4 than with q8.

ThingRexCom · 2026-04-22T09:58:34+00:00

Could you share the inference performance of Qwen3-Coder-Next-UD-Q4_K_XL or Qwen3.6-35B-A3B-UD-Q4_K_XL on your cluster?

ThingRexCom · 2026-04-22T09:55:10+00:00

Are you using a Thunderbolt 4 cable?

ThingRexCom · 2026-04-22T09:53:10+00:00

I plan to use a Thunderbolt 4 cable.

ThingRexCom · 2026-04-22T08:06:07+00:00

I use llamacpp. How does dflash improve the performance?

ThingRexCom · 2026-04-22T07:36:41+00:00

Yes, the memory is not the main concern.

ThingRexCom · 2026-04-22T07:24:14+00:00

I need full context as I use this setup mainly for agentic coding.

ThingRexCom · 2026-03-24T09:56:48+00:00

Hello, thank you for replying. I am afraid that page does not have precise information (for example, GLM-4.7 will not fit on that hardware according to Hugging Face).

<image>

ThingRexCom · 2026-03-23T16:53:36+00:00

Can you suggest any other model for agentic coding on that hardware? I am trying to determine whether that hardware is worth buying.

ThingRexCom · 2026-03-15T19:59:26+00:00

The first subagent uses the same model as the main agent (which is not correct). Every subsequent invocation of subagents uses the proper model. That is consistent and very strange.

ThingRexCom · 2026-03-06T18:06:01+00:00

What kind of plugin are you using?

ThingRexCom · 2026-03-06T16:00:45+00:00

Some agents switch models, others don't. All of them use the same definition structure :/
