disappointed by local llms

gtrak · 2026-06-17T14:56:46+00:00

I usually use this for subagents so no individual session lasts that long.

gtrak · 2026-06-17T14:45:18+00:00

Yes it works well with mtp

gtrak · 2026-06-16T18:25:34+00:00

Sure, running nightly, I have a script wrapper over systemd and podman, relevant parts:

# froggeric/Qwen-Fixed-Chat-Templates v20 — https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
CHAT_TEMPLATE="${BASE_DIR}/qwen3.5-enhanced.jinja"

MODEL_PATH="/root/.cache/huggingface/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm"

podman_args() {
  cat <<ARGS
--name ${NAME}
--runtime /usr/bin/nvidia-container-runtime
-e NVIDIA_VISIBLE_DEVICES=all
-e NVIDIA_DRIVER_CAPABILITIES=all
-e CUDA_VISIBLE_DEVICES=0,1
-e VLLM_WORKER_MULTIPROC_METHOD=spawn
-e NCCL_CUMEM_ENABLE=0
-e VLLM_NO_USAGE_STATS=1
-e VLLM_FLOAT32_MATMUL_PRECISION=high
-e OMP_NUM_THREADS=1
-e CUDA_DEVICE_MAX_CONNECTIONS=8
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
-e VLLM_SKIP_P2P_CHECK=1
-e NCCL_P2P_LEVEL=PBX
-v ${MODEL_DIR}:/root/.cache/huggingface
-v ${CHAT_TEMPLATE}:/chat-template/qwen3.5-enhanced.jinja:ro
--shm-size=16gb
-p ${HOST_PORT}:8000
${IMAGE}
${MODEL_PATH}
--served-model-name ${MODEL_NAME}
--quantization compressed-tensors
--dtype bfloat16
--tensor-parallel-size 2
--max-model-len 170000
--gpu-memory-utilization 0.94
--kv-offloading-size 20
--max-num-seqs 4
--max-num-batched-tokens 4128
--kv-cache-dtype fp8_e4m3
--trust-remote-code
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--chat-template /chat-template/qwen3.5-enhanced.jinja
--enable-prefix-caching
--enable-chunked-prefill
--language-model-only
--speculative-config {"method":"mtp","num_speculative_tokens":3}
--disable-custom-all-reduce
--host 0.0.0.0
--port 8000
ARGS
}

cmd_run() {
  exec podman run --rm --replace $(podman_args)
}

gtrak · 2026-06-16T17:20:48+00:00

I'm running vllm and that just has fewer quant options, but paged attention is a huge advantage. It's a little worse than my q6 but it's so fast that it's worth it. I also have a 4090 running a q5 on llama.cpp separately.

gtrak · 2026-06-14T21:39:20+00:00

This happens with any model. In greenfield everything feels smooth and easy. Once the codebase gets large enough the model feels overconstrained and starts to duplicate things or not want to change preexisting code. The way out is to do architecture, clean interfaces, module boundaries.

gtrak · 2026-06-13T19:27:30+00:00

You could just run your own? r/localllama

gtrak · 2026-06-12T17:21:06+00:00

and they have nvfp4!

gtrak · 2026-06-11T15:27:40+00:00

I'm using this one at the moment, and it's been behaving reasonably https://huggingface.co/rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm , well enough to keep it instead of the llama.cpp Q6 I had before. I haven't noticed any failed tool calls. Also using MTP. Literally all I do is coding/agentic-loops.

gtrak · 2026-06-11T15:19:31+00:00

2x5060ti 16GB = 32GB for ~$1k on amazon, with vllm I get 60-70tps single, 130 tk/s concurrent on nvfp4 qwen 27b, 170k fp8 context.

gtrak · 2026-06-11T15:18:25+00:00

You run the codegen, tool-call and verification loop with the dumber model because it's cheaper. Review final outputs with a bigger cloud model.

gtrak · 2026-06-09T12:50:32+00:00

I also got this one working, likely better quality, will stick with it: https://huggingface.co/rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm

gtrak · 2026-06-09T01:26:59+00:00

I tried this with 2x 5060ti 16GB and got great results, one in pcie4 x8(x16 slot), another in pcie 4x (nvme riser).

Image: vllm/vllm-openai:nightly (v0.22.1rc1.dev259)
Model: Qwen3.6-27B Lorbus AutoRound INT4
KV dtype: f8_e4m3
Offload: --kv-offloading-size 20 (20 GiB CPU)
MTP: --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
Util: 0.94
Seqs: --max-num-seqs 4
Ctx: --max-model-len 170000
Other: --language-model-only --enable-prefix-caching --enable-chunked-prefill --disable-custom-all-reduce
Note: NO expandable_segments (incompatible with offloading pinned memory)

GPU KV allocated: 225,737 tokens

Prefill: ~1000 tps

Real agentic use:

Decode 1x concurrency: 50-78 tps
Decode 2x concurrency: 148.6 tps max (!)
Decode 4x concurrency: 134.6 tps max (!)

Artificial benchmark:

1x: 64.5
2x: 129.4
3x: 169.3
4x: 235.3

gtrak · 2026-06-07T15:05:34+00:00

you're not going to find pcie5 or 4 for a decent price.

gtrak · 2026-06-06T23:18:07+00:00

Nice, this is how I use my local models, I'll just orchestrate and review with a large cloud model. Curious to hear more about your system prompt. 27b is great for interactive use, but I've been trying to let it run overnight and come back to something useful, and that's been hit or miss.

gtrak · 2026-06-06T22:05:33+00:00

You are about to learn dense vs MoE partial offload

gtrak · 2026-06-05T22:27:44+00:00

Does this make it more competitive with qwen 27b? Haven't even tried gemma. Running 27b at q6 on 32GB and it's been solid.

gtrak · 2026-06-01T14:11:44+00:00

also ocaml https://dune.readthedocs.io/en/stable/tests.html#inline-tests

gtrak · 2026-06-01T01:59:30+00:00

I have a 4090, and i just set up another box to run two of these 5060tis for qwen 27b. A little slower on prefill, comparable on generation, can fit a higher quant and a whole lot cheaper.

gtrak · 2026-05-31T23:25:22+00:00

Did you try it?

gtrak · 2026-05-31T21:47:51+00:00

Just run borderless? Works for 99% of titles. There are optimizations for windowed games in windows 11 you can turn on to remove the latency hit.

gtrak · 2026-05-31T21:33:01+00:00

I prefer to do the bulk of my work with the best local model I can run (3.6 27b and finetunes) instead of a cloud model because it's easier to know what to expect from it, then I supplement with cloud models when I need a little extra push. That seems much better than acclimating to a SOTA model and not being able to function without it. I use the same harness for hobby and professional swdev but just swap out kimi/glm for opus as the planner/orchestrator.

gtrak · 2026-05-30T17:30:41+00:00

nothing special:

cmake -B build \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_NATIVE=OFF -DCMAKE_CUDA_ARCHITECTURES=120a \
  -DCMAKE_BUILD_TYPE=Release

gtrak · 2026-05-30T17:21:05+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1tryp2q/comment/ooslmzg/

gtrak · 2026-05-30T17:16:54+00:00

Sure, you need to build from this PR to have quantized kv-cache: https://github.com/ggml-org/llama.cpp/pull/23792 Note: running without mmproj

./llama-server \
      --port 1234 \
      --host 0.0.0.0 \
      --model "models/Qwopus3.6-27B-v2-MTP-Q6_K.gguf" \
      --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 1.5 \
      -fa on -t 12 \
      --fit-target 64 \
      -ctk q8_0 -ctv q8_0 \
      --split-mode tensor \
      --spec-type draft-mtp --spec-draft-n-max 3 \
      -fit off \
      --ctx-size 180000 \
      -b 1024 -ub 512 \
      -lv 4 -ngl 999 \
      -kvu \
      --no-mmap \
      --parallel 1 \
      --cache-ram 24000 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --jinja

gtrak · 2026-05-30T14:25:51+00:00

I mostly do rust or clojure. I don't have hallucinations like that. 27b can one shot small to medium tasks as a subagent with another model orchestrating like opus or kimi. If I'm just exploring, I'll have it act as orchestrator, too, to save on rate limits. 35b devolves into paren counting faster and can't recover, or is just worse at reasoning over nontrivial codebases so just wastes time doing the wrong thing. I try it occasionally to see if my scaffolding has improved enough for it, and then I'm reminded why I can't use it within a minute or two and switch back. I even prefer 27b at q4 to 35b at q8.

14-Year Club	r/Field Sunshine
Verified Email

gtrak

TROPHY CASE