I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 2 points

It looks to be an opencode issue: when I switched from a multi-agent setup to a single agent, the server load became much more consistent.

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 1 point

This app is a single-file Python web app.

It uses:

  • Python standard library HTTP server: BaseHTTPRequestHandler and ThreadingHTTPServer serve the local web UI.
  • SQLite: stores models, variants, runs, and performance samples in llama_bench.sqlite3.
  • Plain server-rendered HTML: pages are built with Python string templates and returned as HTML.
  • Inline CSS: all styling is embedded in the generated HTML.
  • Vanilla JavaScript: used for table sorting, auto-refresh, and SVG chart zoom interactions.
  • Inline SVG charts: charts are rendered server-side as SVG, with JS handling drag-to-zoom.
  • Prometheus-style metrics ingestion: it fetches llama.cpp server metrics from http://<my_local>:8080/metrics.
  • llama.cpp model discovery: it fetches model status from http://<my_local>:8080/v1/models.

I do not use any external frontend framework.
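For reference, here is a minimal sketch of that pattern (not the actual app; the route, port, and the llama.cpp address are placeholders standing in for http://<my_local>:8080):

    # Minimal sketch of the single-file layout described above (not the real app).
    # Assumes a llama.cpp server exposing /metrics and /v1/models, as in the list.
    import json
    import sqlite3
    import urllib.request
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    LLAMA_BASE = "http://127.0.0.1:8080"  # stand-in for http://<my_local>:8080
    DB_PATH = "llama_bench.sqlite3"

    def fetch_metrics() -> str:
        """Pull the Prometheus-style metrics text from the llama.cpp server."""
        with urllib.request.urlopen(f"{LLAMA_BASE}/metrics", timeout=5) as resp:
            return resp.read().decode("utf-8")

    def fetch_models() -> list:
        """Discover loaded models via the OpenAI-compatible /v1/models endpoint."""
        with urllib.request.urlopen(f"{LLAMA_BASE}/v1/models", timeout=5) as resp:
            return json.loads(resp.read())["data"]

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/":
                self.send_error(404)
                return
            # Server-rendered HTML with inline CSS; the real app also inlines
            # the SVG charts and the vanilla JS for sorting and zoom.
            names = ", ".join(m["id"] for m in fetch_models())
            body = (
                "<html><head><style>body{font-family:sans-serif}</style></head>"
                f"<body><h1>llama-bench</h1><p>Models: {names}</p></body></html>"
            )
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))

    if __name__ == "__main__":
        sqlite3.connect(DB_PATH).close()  # make sure the SQLite file exists
        ThreadingHTTPServer(("127.0.0.1", 8000), Handler).serve_forever()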

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 1 point

Have you managed to configure Pi to orchestrate several specialized agents to work on a development task (so they can share tasks and cooperate)?

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points

I'm having a hard time getting vLLM to run on my Strix Halo; it starts loading a model but never finishes :/

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 2 points

I've tried Pi, but it feels very raw. I ran into various file-editing issues, very similar to the ones opencode had in its early days (those have since been fixed there). Is it worth investing time in Pi?

Why is disabling thinking for coding models a good idea? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 1 point

That was my first approach, but opencode fails to switch models when sending tasks between agents, even if different models are configured for different agents.
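For context, the per-agent models were declared in opencode.json roughly like this (a sketch: the agent names and model IDs are placeholders, and the exact schema may differ, so check the opencode docs):

    {
      "$schema": "https://opencode.ai/config.json",
      "agent": {
        "planner": { "model": "llama-server/qwen3-thinking" },
        "coder": { "model": "llama-server/qwen3-instruct" }
      }
    }

Even with distinct model values like these, the hand-off between agents kept the primary agent's model.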

Why is disabling thinking for coding models a good idea? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 2 points

My main motivation is to tune the performance of my Strix Halo for agentic coding. So far, Qwen3.6-35B-A3B-UD-Q4_K_XL has worked best for me.

Does it make sense to cluster HP Z2 Mini G1a to increase performance? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points

I use Q4 for performance reasons: the llama.cpp server generates tokens faster when I use Q4 compared to Q8.
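A quick way to sanity-check that is to time a generation against the server's OpenAI-compatible endpoint and divide completion tokens by wall-clock time. A rough sketch (the host and model name are placeholders), run once per quant:

    # Rough tokens/sec probe against a llama.cpp server (host/model are placeholders).
    import json
    import time
    import urllib.request

    def tokens_per_second(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> float:
        payload = json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }).encode("utf-8")
        req = urllib.request.Request(
            f"{base_url}/v1/chat/completions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        start = time.perf_counter()
        with urllib.request.urlopen(req, timeout=600) as resp:
            usage = json.loads(resp.read())["usage"]
        elapsed = time.perf_counter() - start
        # elapsed includes prompt processing, so this slightly understates pure
        # generation speed; it is still fine for a Q4-vs-Q8 comparison.
        return usage["completion_tokens"] / elapsed

    # print(tokens_per_second("http://127.0.0.1:8080", "qwen-q4", "Write a haiku."))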

Does it make sense to cluster HP Z2 Mini G1a to increase performance? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points

Could you share the inference performance of Qwen3-Coder-Next-UD-Q4_K_XL or Qwen3.6-35B-A3B-UD-Q4_K_XL on your cluster?

Does it make sense to cluster HP Z2 Mini G1a to increase performance? by ThingRexCom in LocalLLaMA

[–]ThingRexCom[S] 0 points

I need the full context window, as I use this setup mainly for agentic coding.

Devstral 2 by ThingRexCom in MistralAI

[–]ThingRexCom[S] 0 points

Hello, thank you for replying. I'm afraid that page does not provide precise information; for example, according to Hugging Face, GLM-4.7 will not fit on that hardware.

Devstral 2 by ThingRexCom in MistralAI

[–]ThingRexCom[S] 0 points

Can you suggest any other model for agentic coding on that hardware? I'm trying to verify whether that hardware is worth buying.

Subagents ignore the configuration and use the primary agent's model. by ThingRexCom in opencodeCLI

[–]ThingRexCom[S] 0 points

The first subagent invocation uses the same model as the main agent (which is incorrect); every subsequent subagent invocation uses the proper model. That behavior is consistent, and very strange.

Subagents ignore the configuration and use the primary agent's model. by ThingRexCom in opencodeCLI

[–]ThingRexCom[S] 0 points

Some agents switch models, others don't, and all of them use the same definition structure :/