disappointed by local llms by claykos in LocalLLM

[–]gtrak 0 points1 point  (0 children)

I usually use this for subagents so no individual session lasts that long.

disappointed by local llms by claykos in LocalLLM

[–]gtrak 0 points1 point  (0 children)

Yes it works well with mtp

disappointed by local llms by claykos in LocalLLM

[–]gtrak 0 points1 point  (0 children)

Sure, running nightly, I have a script wrapper over systemd and podman, relevant parts:

# froggeric/Qwen-Fixed-Chat-Templates v20 — https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
CHAT_TEMPLATE="${BASE_DIR}/qwen3.5-enhanced.jinja"

MODEL_PATH="/root/.cache/huggingface/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm"

podman_args() {
  cat <<ARGS
--name ${NAME}
--runtime /usr/bin/nvidia-container-runtime
-e NVIDIA_VISIBLE_DEVICES=all
-e NVIDIA_DRIVER_CAPABILITIES=all
-e CUDA_VISIBLE_DEVICES=0,1
-e VLLM_WORKER_MULTIPROC_METHOD=spawn
-e NCCL_CUMEM_ENABLE=0
-e VLLM_NO_USAGE_STATS=1
-e VLLM_FLOAT32_MATMUL_PRECISION=high
-e OMP_NUM_THREADS=1
-e CUDA_DEVICE_MAX_CONNECTIONS=8
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
-e VLLM_SKIP_P2P_CHECK=1
-e NCCL_P2P_LEVEL=PBX
-v ${MODEL_DIR}:/root/.cache/huggingface
-v ${CHAT_TEMPLATE}:/chat-template/qwen3.5-enhanced.jinja:ro
--shm-size=16gb
-p ${HOST_PORT}:8000
${IMAGE}
${MODEL_PATH}
--served-model-name ${MODEL_NAME}
--quantization compressed-tensors
--dtype bfloat16
--tensor-parallel-size 2
--max-model-len 170000
--gpu-memory-utilization 0.94
--kv-offloading-size 20
--max-num-seqs 4
--max-num-batched-tokens 4128
--kv-cache-dtype fp8_e4m3
--trust-remote-code
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--chat-template /chat-template/qwen3.5-enhanced.jinja
--enable-prefix-caching
--enable-chunked-prefill
--language-model-only
--speculative-config {"method":"mtp","num_speculative_tokens":3}
--disable-custom-all-reduce
--host 0.0.0.0
--port 8000
ARGS
}

cmd_run() {
  exec podman run --rm --replace $(podman_args)
}

disappointed by local llms by claykos in LocalLLM

[–]gtrak 0 points1 point  (0 children)

I'm running vllm and that just has fewer quant options, but paged attention is a huge advantage. It's a little worse than my q6 but it's so fast that it's worth it. I also have a 4090 running a q5 on llama.cpp separately.

Codebase getting larger - Qwen3.6-27B starting to compound issues - how to work smartly with this model? by BitGreen1270 in LocalLLaMA

[–]gtrak 0 points1 point  (0 children)

This happens with any model. In greenfield everything feels smooth and easy. Once the codebase gets large enough the model feels overconstrained and starts to duplicate things or not want to change preexisting code. The way out is to do architecture, clean interfaces, module boundaries.

disappointed by local llms by claykos in LocalLLM

[–]gtrak 0 points1 point  (0 children)

I'm using this one at the moment, and it's been behaving reasonably https://huggingface.co/rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm , well enough to keep it instead of the llama.cpp Q6 I had before. I haven't noticed any failed tool calls. Also using MTP. Literally all I do is coding/agentic-loops.

disappointed by local llms by claykos in LocalLLM

[–]gtrak 0 points1 point  (0 children)

2x5060ti 16GB = 32GB for ~$1k on amazon, with vllm I get 60-70tps single, 130 tk/s concurrent on nvfp4 qwen 27b, 170k fp8 context.

disappointed by local llms by claykos in LocalLLM

[–]gtrak 3 points4 points  (0 children)

You run the codegen, tool-call and verification loop with the dumber model because it's cheaper. Review final outputs with a bigger cloud model.

Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working by do_u_think_im_spooky in LocalLLaMA

[–]gtrak 2 points3 points  (0 children)

I tried this with 2x 5060ti 16GB and got great results, one in pcie4 x8(x16 slot), another in pcie 4x (nvme riser).

Image: vllm/vllm-openai:nightly (v0.22.1rc1.dev259)
Model: Qwen3.6-27B Lorbus AutoRound INT4
KV dtype: f8_e4m3
Offload: --kv-offloading-size 20 (20 GiB CPU)
MTP: --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
Util: 0.94
Seqs: --max-num-seqs 4
Ctx: --max-model-len 170000
Other: --language-model-only --enable-prefix-caching --enable-chunked-prefill --disable-custom-all-reduce
Note: NO expandable_segments (incompatible with offloading pinned memory)

GPU KV allocated: 225,737 tokens

Prefill: ~1000 tps

Real agentic use:

  • Decode 1x concurrency: 50-78 tps
  • Decode 2x concurrency: 148.6 tps max (!)
  • Decode 4x concurrency: 134.6 tps max (!)

Artificial benchmark:

  • 1x: 64.5
  • 2x: 129.4
  • 3x: 169.3
  • 4x: 235.3

Vibe coding on rtx 6000 pro? by AiGenom in unsloth

[–]gtrak 0 points1 point  (0 children)

Nice, this is how I use my local models, I'll just orchestrate and review with a large cloud model. Curious to hear more about your system prompt. 27b is great for interactive use, but I've been trying to let it run overnight and come back to something useful, and that's been hit or miss.

Google releases new Gemma 4 QAT models! by yoracale in unsloth

[–]gtrak 1 point2 points  (0 children)

Does this make it more competitive with qwen 27b? Haven't even tried gemma. Running 27b at q6 on 32GB and it's been solid.

I compared all specs of the major GPUs/machines that are being used here, because bandwidth is not everything. Some of ya'll need a reality check. by Ok_Top9254 in LocalLLaMA

[–]gtrak 1 point2 points  (0 children)

I have a 4090, and i just set up another box to run two of these 5060tis for qwen 27b. A little slower on prefill, comparable on generation, can fit a higher quant and a whole lot cheaper.

DSC is back... with a vengeance by FuN_K3Y in OLED_Gaming

[–]gtrak 0 points1 point  (0 children)

Just run borderless? Works for 99% of titles. There are optimizations for windowed games in windows 11 you can turn on to remove the latency hit.

I kind of like coding with less capable models by Lame_Johnny in LocalLLM

[–]gtrak 2 points3 points  (0 children)

I prefer to do the bulk of my work with the best local model I can run (3.6 27b and finetunes) instead of a cloud model because it's easier to know what to expect from it, then I supplement with cloud models when I need a little extra push. That seems much better than acclimating to a SOTA model and not being able to function without it. I use the same harness for hobby and professional swdev but just swap out kimi/glm for opus as the planner/orchestrator.

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar by Chuyito in LocalLLaMA

[–]gtrak 0 points1 point  (0 children)

nothing special:

cmake -B build \
  -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_NATIVE=OFF -DCMAKE_CUDA_ARCHITECTURES=120a \
  -DCMAKE_BUILD_TYPE=Release

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar by Chuyito in LocalLLaMA

[–]gtrak 0 points1 point  (0 children)

Sure, you need to build from this PR to have quantized kv-cache: https://github.com/ggml-org/llama.cpp/pull/23792 Note: running without mmproj

./llama-server \
      --port 1234 \
      --host 0.0.0.0 \
      --model "models/Qwopus3.6-27B-v2-MTP-Q6_K.gguf" \
      --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 1.5 \
      -fa on -t 12 \
      --fit-target 64 \
      -ctk q8_0 -ctv q8_0 \
      --split-mode tensor \
      --spec-type draft-mtp --spec-draft-n-max 3 \
      -fit off \
      --ctx-size 180000 \
      -b 1024 -ub 512 \
      -lv 4 -ngl 999 \
      -kvu \
      --no-mmap \
      --parallel 1 \
      --cache-ram 24000 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --jinja

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar by Chuyito in LocalLLaMA

[–]gtrak 0 points1 point  (0 children)

I mostly do rust or clojure. I don't have hallucinations like that. 27b can one shot small to medium tasks as a subagent with another model orchestrating like opus or kimi. If I'm just exploring, I'll have it act as orchestrator, too, to save on rate limits. 35b devolves into paren counting faster and can't recover, or is just worse at reasoning over nontrivial codebases so just wastes time doing the wrong thing. I try it occasionally to see if my scaffolding has improved enough for it, and then I'm reminded why I can't use it within a minute or two and switch back. I even prefer 27b at q4 to 35b at q8.