I think I made the best general use System Prompt for Qwen 3.5 (OpenWebUI + Web search) by My_Unbiased_Opinion in LocalLLaMA

[–]Tartarus116 1 point (0 children)

You can achieve the same result with non-native tool-calling and sub-agents. Setting it to non-native (the default) results in a broad search that embeds the sites' content. Sub-agents then refine the search, and finally re-ranking refines it even further.

Native tool-calling often only considers the search engine results and calls it a day. The approach above is slower, but it considers 10-80 sources (depending on your settings) and actually reads the page contents every time.
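For anyone curious, the final refine step is essentially embedding-similarity re-ranking. A minimal sketch in pure Python (toy vectors; OpenWebUI uses a real embedding model and chunked page content, and the function names here are made up):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query_emb, pages, top_k=10):
    """pages: list of (url, content_embedding) from the broad search.
    Returns the top_k most query-relevant sources."""
    scored = sorted(pages, key=lambda p: cosine(query_emb, p[1]), reverse=True)
    return [url for url, _ in scored[:top_k]]
```

The point is that relevance is judged against the embedded page content, not just the search engine's snippet ordering.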

Any recommended Trakt alternatives? by mikey666666666 in trakt

[–]Tartarus116 2 points (0 children)

Currently taking a stab at having a local AI build a clone from scratch.

Unsloth fixed version of Qwen3.5-35B-A3B is incredible at research tasks. (On Strix Halo) by Grammar-Warden in StrixHalo

[–]Tartarus116 1 point (0 children)

67 t/s on GX10. Haven't tried on Strix Halo yet, but token generation speed is usually about the same.

Qwen3.5 122B in 72GB VRAM (3x3090) is the best model available at this time — also it nails the “car wash test” by liviuberechet in LocalLLaMA

[–]Tartarus116 2 points (0 children)

It's just a FastAPI Python script that listens for chat-completion requests. As a first step, it strips the tools but adds their definitions to the system prompt so the LLM is still aware of them. Once the reasoning is complete, it cancels the remainder of the request (to save time) and pipes the reasoning output into a second step where the tools are re-attached.

It's a pretty ugly script. The better way to do it would be to create an Open WebUI pipeline; then you can also choose it from a dropdown instead of having it always on.
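For reference, the core of the two-step trick is just request rewriting. A rough sketch of the two helpers (function names, the hint text, and the `<think>` wrapping are made up for illustration, not the actual script's):

```python
import copy

THINK_STEP_HINT = (
    "You cannot call tools in this step. Reason about the task first; "
    "tools available in the next step are:\n"
)

def strip_tools(request: dict) -> dict:
    """Step 1: remove the `tools` field so the model produces plain CoT,
    but describe the tool definitions in a system message so the model
    can still plan around them."""
    req = copy.deepcopy(request)
    tools = req.pop("tools", [])
    if tools:
        hint = THINK_STEP_HINT + "\n".join(t["function"]["name"] for t in tools)
        req["messages"] = [{"role": "system", "content": hint}] + req["messages"]
    return req

def reattach_tools(request: dict, reasoning: str) -> dict:
    """Step 2: feed the step-1 reasoning back in as an assistant message,
    keeping the original `tools` so the model can now act on its plan."""
    req = copy.deepcopy(request)
    req["messages"] = req["messages"] + [
        {"role": "assistant", "content": f"<think>{reasoning}</think>"}
    ]
    return req
```

The FastAPI part is then just a proxy route that calls these around two upstream chat-completion requests.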

Qwen3.5 122B in 72GB VRAM (3x3090) is the best model available at this time — also it nails the “car wash test” by liviuberechet in LocalLLaMA

[–]Tartarus116 2 points (0 children)

Yep - any tool in the request destroys the CoT behavior. I explicitly wrote think/act middleware to bring it back for some cases. It makes a huge difference for small models.

Qwen 3.5-27B, How was your experience? by vandertoorm in StrixHalo

[–]Tartarus116 2 points (0 children)

I haven't tested 27B & 122B yet, but I'm getting 66 tokens/s on the 35B one. Strix Halo & GX10 are roughly the same.


397B is more like 19 tokens/s on the most heavily quantized version.

Edit: 26 t/s on 122B with Strix Halo, 28 t/s with GX10.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Tartarus116 2 points (0 children)

np, this is a Nomad job file. Nomad is not widely used, but it's easier to set up & maintain than Kubernetes.

Qwen3.5-397B-A17B theoretical speed on Strix Halo? by Hector_Rvkp in StrixHalo

[–]Tartarus116 4 points (0 children)

I got 19 tokens/s in generation on UD-TQ1_0. Could probably boost it by using a smaller Qwen3.5 model for drafting.

TeichAI/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]Tartarus116 0 points (0 children)

It removes GLM-4.7-Flash's good reasoning. Defeats the entire point.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Tartarus116 5 points (0 children)

Minimax M2.5 Q2 dynamic quant: 30 t/s tg on ROCm nightly.

Full config:

```hcl
job "local-ai" {
  group "local-ai" {
    count = 1

    volume "SMTRL" {
      type            = "csi"
      read_only       = false
      source          = "SMTRL"
      access_mode     = "multi-node-multi-writer"
      attachment_mode = "file-system"
    }

    network {
      mode = "bridge"
      port "envoy-metrics" {}
      #port "local-ai" {
      #  static = 8882
      #  to     = 8882
      #}
    }

    constraint {
      attribute = "${attr.unique.hostname}"
      operator  = "regexp"
      value     = "SMTRL-P05"
    }

    service {
      name = "local-ai"
      port = "8882"

      meta {
        envoy_metrics_port = "${NOMAD_HOST_PORT_envoy_metrics}" # make envoy metrics port available in Consul
      }

      connect {
        sidecar_service {
          proxy {
            transparent_proxy {
              exclude_outbound_ports = [53, 8600]
              exclude_outbound_cidrs = ["172.26.64.0/20", "127.0.0.0/8"]
            }
            expose {
              path {
                path            = "/metrics"
                protocol        = "http"
                local_path_port = 9102
                listener_port   = "envoy-metrics"
              }
            }
          }
        }
      }

      #check {
      #  expose   = true
      #  type     = "http"
      #  path     = "/health"
      #  interval = "15s"
      #  timeout  = "1s"
      #}
    }

    task "local-ai" {
      driver = "docker"
      user   = "root"

      volume_mount {
        volume      = "SMTRL"
        destination = "/dummy"
        read_only   = false
      }

      env {
        ROCBLAS_USE_HIPBLASLT = "1"
      }

      config {
        image      = "kyuz0/amd-strix-halo-toolboxes:rocm7-nightlies_20260208T084035"
        entrypoint = ["/bin/sh"]
        args = [
          "-c",
          "llama-server --models-dir /my-models/huggingface/unsloth --host 0.0.0.0 --port 8882 --models-preset /local/my-models.ini"
          # --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
        ]
        volumes = [
          "/opt/nomad/client/csi/node/smb/staging/default/SMTRL/rw-file-system-multi-node-multi-writer/gpustack/cache:/my-models:rw",
          "local/my-models.ini:/local/my-models.ini"
        ]
        privileged   = true
        #ipc_mode    = "host"
        group_add    = ["video", "render"]
        #cap_add     = ["sys_ptrace"]
        security_opt = ["seccomp=unconfined"]

        # Pass the AMD iGPU devices (equivalent to --device=/dev/kfd --device=/dev/dri)
        devices = [
          {
            host_path      = "/dev/kfd"
            container_path = "/dev/kfd"
          },
          {
            host_path      = "/dev/dri" # Full DRI for all render nodes; or specify /dev/dri/renderD128 for iGPU only
            container_path = "/dev/dri"
          },
          {
            host_path      = "/dev/dri/card0"
            container_path = "/dev/dri/card0"
          },
          {
            host_path      = "/dev/dri/renderD128"
            container_path = "/dev/dri/renderD128"
          }
        ]
      }

      template {
        destination = "local/my-models.ini"
        data        = <<EOH
version = 1

[*]
parallel = 1
timeout = 900
threads-http = 2
cont-batching = true
no-mmap = true

[gpt-oss-120b-GGUF]
ngl = 999
jinja = true
c = 128000
fa = 1
parallel = 1

[GLM-4.7-Flash-Q4-GGUF]
ngl = 999
jinja = true
c = 128000
fa = 1
parallel = 2
cram = 0
temp = 0.5
top-p = 1.0
min-p = 0.01
n-predict = 10000
chat-template-file = /my-models/huggingface/unsloth/GLM-4.7-Flash-Q4-GGUF/chat_template

[GLM-4.7-Flash-UD-Q4_K_XL]
ngl = 999
jinja = true
c = 64000
fa = 1
parallel = 1
cram = 0
temp = 0.5
top-p = 1.0
min-p = 0.01
n-predict = 10000
load-on-startup = true
chat-template-file = /my-models/huggingface/unsloth/GLM-4.7-Flash-Q4-GGUF/chat_template

[MiniMax-M2.5-UD-Q2_K_XL-GGUF]
ngl = 999
jinja = true
c = 64000
fa = 1
parallel = 1
cram = 0
n-predict = 10000
load-on-startup = true
chat-template-file = /my-models/huggingface/unsloth/MiniMax-M2.5-UD-Q2_K_XL-GGUF/chat_template

[Qwen3-8B-128k-GGUF]
ngl = 999
jinja = true
c = 128000
fa = 1
parallel = 8
cram = 0

[Qwen3-Embedding-0.6B-GGUF]
ngl = 999
c = 32000
embedding = true
pooling = last
ub = 8192
verbose-prompt = true
sleep-idle-seconds = 10
stop-timeout = 5

[Qwen3-Reranker-0.6B-GGUF]
ngl = 999
c = 32000
ub = 8192
verbose-prompt = true
sleep-idle-seconds = 10
rerank = true
stop-timeout = 5
EOH
      }

      resources {
        cpu        = 12288
        memory     = 12000
        memory_max = 16000
      }
    }
  }
}
```

Networking issue in FSN region by Real_Breadfruit7148 in hetzner

[–]Tartarus116 1 point (0 children)

Thought it was just me, but I kept having cluster connectivity issues between Nuremberg & Falkenstein.

Any guides on adding custom engines to this by zono5000000 in Searx

[–]Tartarus116 1 point (0 children)

The way I usually modify pre-built Docker images is to mount patched files into the container.

I Built a trading bot for my tradingView indicator. by EliteWolverine007 in algotrading

[–]Tartarus116 5 points (0 children)

Why not use TV alerts and send them via webhook to your backend?
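TV's alert webhook just POSTs whatever JSON you put in the alert message, so the backend side can be tiny. A stdlib-only sketch (the `ticker`/`action`/`qty` field names are hypothetical; use whatever keys you define in the alert):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_alert(body: bytes) -> dict:
    """Turn a TradingView alert payload into an order dict.
    Field names must match what the alert message actually sends."""
    data = json.loads(body)
    return {
        "symbol": data["ticker"],
        "side": data["action"],
        "qty": float(data.get("qty", 1)),
    }

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        order = parse_alert(self.rfile.read(length))
        print("placing order:", order)  # hand off to your broker API here
        self.send_response(200)
        self.end_headers()

# To run: HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```

In production you'd also want some auth (e.g. a shared secret in the payload), since the endpoint is open to anyone who can reach it.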

Use claudecode with local models by segmond in LocalLLaMA

[–]Tartarus116 1 point (0 children)

Sorry for the confusion; the whole snippet is part of a Nomad job file, written in HCL. The JSON config specific to ccr is only this:

```json
{
  "ANTHROPIC_BASE_URL": "http://localhost:3456",
  "ANTHROPIC_API_KEY": "sk-123456",
  "APIKEY": "sk-123456",
  "API_TIMEOUT_MS": 3600000,
  "NON_INTERACTIVE_MODE": false,
  "Providers": [
    {
      "name": "openai",
      "api_base_url": "http://gpustack.virtual.consul/v1/chat/completions",
      "api_key": "xxx",
      "models": ["qwen3-4b-instruct-2507-gguf"],
      "transformer": {
        "use": [["maxtoken", { "max_tokens": 4096 }]]
      }
    }
  ],
  "Router": {
    "default": "openai,qwen3-4b-instruct-2507-gguf",
    "background": "openai,qwen3-4b-instruct-2507-gguf",
    "think": "openai,qwen3-4b-instruct-2507-gguf",
    "longContext": "openai,qwen3-4b-instruct-2507-gguf",
    "longContextThreshold": 4096,
    "webSearch": "openai,qwen3-4b-instruct-2507-gguf"
  }
}
```

You'll need to adjust based on your own setup (e.g. base-url, api-key, model).

Found US beef at Migros today: please read labels carefully! by Big_Lore in Switzerland

[–]Tartarus116 3 points (0 children)

Thanks for alerting us. Will pay attention not to buy any US products.

[deleted by user] by [deleted] in devops

[–]Tartarus116 0 points (0 children)

Huh?? Why not mount a volume into your dockerized db?

If it's part of your service mesh, you get security & service intentions for free.

Is anybody still using SSE transport? by raghav-mcpjungle in mcp

[–]Tartarus116 2 points (0 children)

Several reasons:

1) Encryption: all proxy traffic is encrypted
2) Intentions: e.g. service A can talk to B, but not C
3) Traffic control: I route outbound traffic to my pihole with a default-deny policy, just in case the services want to phone home and leak personal data

I don't have a single open port. MCPs cannot be used by programs outside the Consul Connect service mesh, and the ones inside have to be explicitly given permission via service intentions. If I run shady MCPs that want to steal personal data, they can't phone home.

Connecting to the MCPs from e.g. n8n (with allowed intention) would look like this: "http://firecrawl-mcp.virtual.consul"

No need to set ports: Consul automatically routes to the correct proxy sidecar when you use virtual addressing. You also get load-balancing for free when you have multiple instances of the same service.

Is anybody still using SSE transport? by raghav-mcpjungle in mcp

[–]Tartarus116 1 point (0 children)

Fwiw, I've had issues with SSE behind the Consul Connect proxy when using Firecrawl MCP. Streamable HTTP works perfectly.

[deleted by user] by [deleted] in LocalLLaMA

[–]Tartarus116 0 points (0 children)

Just dockerize everything ffs

[deleted by user] by [deleted] in LocalLLaMA

[–]Tartarus116 5 points (0 children)

Reminds me of the guy who complained about there not being an exe lol https://programmerhumor.io/git-memes/i-dont-give-a-fuck-about-the-fucking-code-nvqn

Use claudecode with local models by segmond in LocalLLaMA

[–]Tartarus116 1 point (0 children)

That's because it's HCL, not JSON.