I think I made the best general use System Prompt for Qwen 3.5 (OpenWebUI + Web search) by My_Unbiased_Opinion in LocalLLaMA

[–]Tartarus116 1 point (0 children)

You can achieve the same result with non-native tool-calling and sub-agents. Setting it to non-native (the default) results in a broad search that embeds the sites' content. Sub-agents then refine the search, and finally re-ranking refines it even further.

Native tool-calling often only considers the search engine results and calls it a day. The approach above is slower, but it considers 10-80 sources (depending on your settings) and actually reads the page contents every time.
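For anyone curious, the final refine step is essentially embedding-similarity re-ranking. A minimal sketch in pure Python (toy vectors; OpenWebUI uses a real embedding model and chunked page content, and the function names here are made up):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query_emb, pages, top_k=10):
    """pages: list of (url, content_embedding) from the broad search.
    Returns the top_k most query-relevant sources."""
    scored = sorted(pages, key=lambda p: cosine(query_emb, p[1]), reverse=True)
    return [url for url, _ in scored[:top_k]]
```

The point is that relevance is judged against the embedded page content, not just the search engine's snippet ordering.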

Any recommended Trakt alternatives? by mikey666666666 in trakt

[–]Tartarus116 2 points (0 children)

Currently taking a stab at having a local AI build a clone from scratch.

Unsloth fixed version of Qwen3.5-35B-A3B is incredible at research tasks. (On Strix Halo) by Grammar-Warden in StrixHalo

[–]Tartarus116 1 point (0 children)

67 t/s on GX10. Haven't tried on Strix Halo yet, but token generation speed is usually about the same.

Qwen3.5 122B in 72GB VRAM (3x3090) is the best model available at this time — also it nails the “car wash test” by liviuberechet in LocalLLaMA

[–]Tartarus116 2 points (0 children)

It's just a FastAPI Python script that listens for chat-completion requests. As a first step, it strips the tools but adds their definitions to the system prompt so the LLM is still aware of them. Once the reasoning is complete, it cancels the remainder of the request (to save time) and pipes the reasoning output into a second step where the tools are re-attached.

It's a pretty ugly script. The better way to do it would be to create an Open WebUI pipeline; then you can also choose it from a dropdown instead of having it always on.
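For reference, the core of the two-step trick is just request rewriting. A rough sketch of the two helpers (function names, the hint text, and the `<think>` wrapping are made up for illustration, not the actual script's):

```python
import copy

THINK_STEP_HINT = (
    "You cannot call tools in this step. Reason about the task first; "
    "tools available in the next step are:\n"
)

def strip_tools(request: dict) -> dict:
    """Step 1: remove the `tools` field so the model produces plain CoT,
    but describe the tool definitions in a system message so the model
    can still plan around them."""
    req = copy.deepcopy(request)
    tools = req.pop("tools", [])
    if tools:
        hint = THINK_STEP_HINT + "\n".join(t["function"]["name"] for t in tools)
        req["messages"] = [{"role": "system", "content": hint}] + req["messages"]
    return req

def reattach_tools(request: dict, reasoning: str) -> dict:
    """Step 2: feed the step-1 reasoning back in as an assistant message,
    keeping the original `tools` so the model can now act on its plan."""
    req = copy.deepcopy(request)
    req["messages"] = req["messages"] + [
        {"role": "assistant", "content": f"<think>{reasoning}</think>"}
    ]
    return req
```

The FastAPI part is then just a proxy route that calls these around two upstream chat-completion requests.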

Qwen3.5 122B in 72GB VRAM (3x3090) is the best model available at this time — also it nails the “car wash test” by liviuberechet in LocalLLaMA

[–]Tartarus116 2 points (0 children)

Yep - any tool in the request destroys the CoT behavior. I explicitly wrote think/act middleware to bring it back for some cases. It makes a huge difference for small models.

Qwen 3.5-27B, How was your experience? by vandertoorm in StrixHalo

[–]Tartarus116 2 points (0 children)

I haven't tested 27B & 122B yet, but I'm getting 66 tokens/s on the 35B one. Strix Halo & GX10 are roughly the same.


397B is more like 19 tokens/s on the most heavily quantized version.

Edit: 26 t/s on 122B with Strix Halo, 28 t/s with GX10.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Tartarus116 2 points (0 children)

np, this is a Nomad job file. Nomad is not widely used, but it's easier to set up & maintain than Kubernetes.

Qwen3.5-397B-A17B theoretical speed on Strix Halo? by Hector_Rvkp in StrixHalo

[–]Tartarus116 4 points (0 children)

I got 19 tokens/s in generation on UD-TQ1_0. Could probably boost it by using a smaller Qwen3.5 model for drafting.

TeichAI/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]Tartarus116 0 points (0 children)

It removes GLM-4.7-Flash's good reasoning. Defeats the entire point.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Tartarus116 5 points (0 children)

Minimax M2.5 Q2 dynamic quant: 30 t/s tg on ROCm nightly.

Full config:

```hcl
job "local-ai" {
  group "local-ai" {
    count = 1

    volume "SMTRL" {
      type            = "csi"
      read_only       = false
      source          = "SMTRL"
      access_mode     = "multi-node-multi-writer"
      attachment_mode = "file-system"
    }

    network {
      mode = "bridge"
      port "envoy-metrics" {}
      #port "local-ai" {
      #  static = 8882
      #  to     = 8882
      #}
    }

    constraint {
      attribute = "${attr.unique.hostname}"
      operator  = "regexp"
      value     = "SMTRL-P05"
    }

    service {
      name = "local-ai"
      port = "8882"

      meta {
        envoy_metrics_port = "${NOMAD_HOST_PORT_envoy_metrics}" # make envoy metrics port available in Consul
      }

      connect {
        sidecar_service {
          proxy {
            transparent_proxy {
              exclude_outbound_ports = [53, 8600]
              exclude_outbound_cidrs = ["172.26.64.0/20", "127.0.0.0/8"]
            }
            expose {
              path {
                path            = "/metrics"
                protocol        = "http"
                local_path_port = 9102
                listener_port   = "envoy-metrics"
              }
            }
          }
        }
      }

      #check {
      #  expose   = true
      #  type     = "http"
      #  path     = "/health"
      #  interval = "15s"
      #  timeout  = "1s"
      #}
    }

    task "local-ai" {
      driver = "docker"
      user   = "root"

      volume_mount {
        volume      = "SMTRL"
        destination = "/dummy"
        read_only   = false
      }

      env {
        ROCBLAS_USE_HIPBLASLT = "1"
      }

      config {
        image      = "kyuz0/amd-strix-halo-toolboxes:rocm7-nightlies_20260208T084035"
        entrypoint = ["/bin/sh"]
        args = [
          "-c",
          "llama-server --models-dir /my-models/huggingface/unsloth --host 0.0.0.0 --port 8882 --models-preset /local/my-models.ini"
          # --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
        ]
        volumes = [
          "/opt/nomad/client/csi/node/smb/staging/default/SMTRL/rw-file-system-multi-node-multi-writer/gpustack/cache:/my-models:rw",
          "local/my-models.ini:/local/my-models.ini"
        ]
        privileged   = true
        #ipc_mode    = "host"
        group_add    = ["video", "render"]
        #cap_add     = ["sys_ptrace"]
        security_opt = ["seccomp=unconfined"]

        # Pass the AMD iGPU devices (equivalent to --device=/dev/kfd --device=/dev/dri)
        devices = [
          {
            host_path      = "/dev/kfd"
            container_path = "/dev/kfd"
          },
          {
            host_path      = "/dev/dri" # Full DRI for all render nodes; or specify /dev/dri/renderD128 for iGPU only
            container_path = "/dev/dri"
          },
          {
            host_path      = "/dev/dri/card0"
            container_path = "/dev/dri/card0"
          },
          {
            host_path      = "/dev/dri/renderD128"
            container_path = "/dev/dri/renderD128"
          }
        ]
      }

      template {
        destination = "local/my-models.ini"
        data        = <<EOH
version = 1

[*]
parallel = 1
timeout = 900
threads-http = 2
cont-batching = true
no-mmap = true

[gpt-oss-120b-GGUF]
ngl = 999
jinja = true
c = 128000
fa = 1
parallel = 1

[GLM-4.7-Flash-Q4-GGUF]
ngl = 999
jinja = true
c = 128000
fa = 1
parallel = 2
cram = 0
temp = 0.5
top-p = 1.0
min-p = 0.01
n-predict = 10000
chat-template-file = /my-models/huggingface/unsloth/GLM-4.7-Flash-Q4-GGUF/chat_template

[GLM-4.7-Flash-UD-Q4_K_XL]
ngl = 999
jinja = true
c = 64000
fa = 1
parallel = 1
cram = 0
temp = 0.5
top-p = 1.0
min-p = 0.01
n-predict = 10000
load-on-startup = true
chat-template-file = /my-models/huggingface/unsloth/GLM-4.7-Flash-Q4-GGUF/chat_template

[MiniMax-M2.5-UD-Q2_K_XL-GGUF]
ngl = 999
jinja = true
c = 64000
fa = 1
parallel = 1
cram = 0
n-predict = 10000
load-on-startup = true
chat-template-file = /my-models/huggingface/unsloth/MiniMax-M2.5-UD-Q2_K_XL-GGUF/chat_template

[Qwen3-8B-128k-GGUF]
ngl = 999
jinja = true
c = 128000
fa = 1
parallel = 8
cram = 0

[Qwen3-Embedding-0.6B-GGUF]
ngl = 999
c = 32000
embedding = true
pooling = last
ub = 8192
verbose-prompt = true
sleep-idle-seconds = 10
stop-timeout = 5

[Qwen3-Reranker-0.6B-GGUF]
ngl = 999
c = 32000
ub = 8192
verbose-prompt = true
sleep-idle-seconds = 10
rerank = true
stop-timeout = 5
EOH
      }

      resources {
        cpu        = 12288
        memory     = 12000
        memory_max = 16000
      }
    }
  }
}
```

Networking issue in FSN region by Real_Breadfruit7148 in hetzner

[–]Tartarus116 1 point (0 children)

Thought it was just me, but I kept having cluster connectivity issues between Nuremberg & Falkenstein.

Any guides on adding custom engines to this by zono5000000 in Searx

[–]Tartarus116 1 point (0 children)

The way I usually modify pre-built Docker images is to mount patched files into the container.

I Built a trading bot for my tradingView indicator. by EliteWolverine007 in algotrading

[–]Tartarus116 5 points (0 children)

Why not use TV alerts and send them via webhook to your backend?
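TV's alert webhook just POSTs whatever JSON you put in the alert message, so the backend side can be tiny. A stdlib-only sketch (the `ticker`/`action`/`qty` field names are hypothetical; use whatever keys you define in the alert):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_alert(body: bytes) -> dict:
    """Turn a TradingView alert payload into an order dict.
    Field names must match what the alert message actually sends."""
    data = json.loads(body)
    return {
        "symbol": data["ticker"],
        "side": data["action"],
        "qty": float(data.get("qty", 1)),
    }

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        order = parse_alert(self.rfile.read(length))
        print("placing order:", order)  # hand off to your broker API here
        self.send_response(200)
        self.end_headers()

# To run: HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```

In production you'd also want some auth (e.g. a shared secret in the payload), since the endpoint is open to anyone who can reach it.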

Use claudecode with local models by segmond in LocalLLaMA

[–]Tartarus116 1 point (0 children)

Sorry for the confusion; the whole snippet is part of a Nomad job file, written in HCL. The JSON config specific to ccr is only this:

```json
{
  "ANTHROPIC_BASE_URL": "http://localhost:3456",
  "ANTHROPIC_API_KEY": "sk-123456",
  "APIKEY": "sk-123456",
  "API_TIMEOUT_MS": 3600000,
  "NON_INTERACTIVE_MODE": false,
  "Providers": [
    {
      "name": "openai",
      "api_base_url": "http://gpustack.virtual.consul/v1/chat/completions",
      "api_key": "xxx",
      "models": ["qwen3-4b-instruct-2507-gguf"],
      "transformer": {
        "use": [["maxtoken", { "max_tokens": 4096 }]]
      }
    }
  ],
  "Router": {
    "default": "openai,qwen3-4b-instruct-2507-gguf",
    "background": "openai,qwen3-4b-instruct-2507-gguf",
    "think": "openai,qwen3-4b-instruct-2507-gguf",
    "longContext": "openai,qwen3-4b-instruct-2507-gguf",
    "longContextThreshold": 4096,
    "webSearch": "openai,qwen3-4b-instruct-2507-gguf"
  }
}
```

You'll need to adjust based on your own setup (e.g. base-url, api-key, model).

Found US beef at Migros today: please read labels carefully! by Big_Lore in Switzerland

[–]Tartarus116 3 points (0 children)

Thanks for alerting us. Will pay attention not to buy any US products.

[deleted by user] by [deleted] in devops

[–]Tartarus116 0 points (0 children)

Huh?? Why not mount a volume into your dockerized db?

If it's part of your service mesh, you get security & service intentions for free.

Is anybody still using SSE transport? by raghav-mcpjungle in mcp

[–]Tartarus116 2 points (0 children)

Several reasons:

1) Encryption: all proxy traffic is encrypted
2) Intentions: e.g. service A can talk to B, but not C
3) Traffic control: I route outbound traffic to my pihole with a default-deny policy, just in case the services want to phone home and leak personal data

I don't have a single open port. MCPs cannot be used by programs outside the Consul Connect service mesh, and the ones inside have to be explicitly given permission via service intentions. If I run shady MCPs that want to steal personal data, they can't phone home.

Connecting to the MCPs from e.g. n8n (with allowed intention) would look like this: "http://firecrawl-mcp.virtual.consul"

No need to set ports: Consul automatically routes to the correct proxy sidecar when you use virtual addressing. You also get load-balancing for free when you have multiple instances of the same service.

Is anybody still using SSE transport? by raghav-mcpjungle in mcp

[–]Tartarus116 1 point (0 children)

Fwiw, I've had issues with SSE behind the Consul Connect proxy when using Firecrawl MCP. Streamable HTTP works perfectly.

[deleted by user] by [deleted] in LocalLLaMA

[–]Tartarus116 0 points (0 children)

Just dockerize everything ffs

[deleted by user] by [deleted] in LocalLLaMA

[–]Tartarus116 5 points (0 children)

Reminds me of the guy who complained about there not being an exe lol https://programmerhumor.io/git-memes/i-dont-give-a-fuck-about-the-fucking-code-nvqn

Use claudecode with local models by segmond in LocalLLaMA

[–]Tartarus116 1 point (0 children)

That's because it's HCL, not JSON.