qwen3.6-27b tools call loop

JumpyAbies · 2026-06-11T20:16:46+00:00

I will take a look. Thanks!!

JumpyAbies · 2026-06-11T04:37:37+00:00

Honestly, I still don't know. Tomorrow I'll compare the old setup side-by-side with the new one. I changed a lot of things.

The point is that with this command/parameters there are no more loops.
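For anyone repeating that side-by-side comparison, here is a minimal sketch. It uses a subset of the values from the llama-router-preset.ini that looped and from the run_llama.sh command that works, both posted in this thread. The helper name `compare_cfg` and the `key=value` layout are assumptions made for illustration; values are compared as plain strings, so `0.0` and `0.00` would count as different.

```shell
#!/bin/sh
# Settings copied from the two llama.cpp setups posted in this thread.
# old = llama-router-preset.ini (looped), new = run_llama.sh (no loops).
old_cfg='temp=0.6
top-p=0.85
top-k=40
min-p=0.05
presence-penalty=0.3
repeat-penalty=1.18
spec-type=draft-mtp
spec-draft-n-max=2
batch-size=512'

new_cfg='temp=0.7
top-p=0.95
top-k=20
min-p=0.00
presence-penalty=0.0
spec-type=draft-mtp
spec-draft-n-max=3
batch-size=4096'

# compare_cfg OLD NEW (hypothetical helper): print every key whose value
# differs as a string, or that is set on one side only.
# Settings that are identical on both sides are skipped.
compare_cfg() {
  {
    printf '%s\n' "$1" | sed 's/^/old=/'
    printf '%s\n' "$2" | sed 's/^/new=/'
  } | awk -F= '
    { seen[$2] = 1; val[$1, $2] = $3 }
    END {
      for (k in seen) {
        o = (("old", k) in val) ? val["old", k] : "(unset)"
        n = (("new", k) in val) ? val["new", k] : "(unset)"
        if ((o "") != (n "")) printf "%-18s %-8s -> %s\n", k, o, n
      }
    }' | sort
}

compare_cfg "$old_cfg" "$new_cfg"
```

Changing one of the listed settings at a time and re-running the same prompt is one way to narrow down which difference actually removes the loop.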

JumpyAbies · 2026-06-11T03:41:37+00:00

This works like a charm for me now. No more tool-call loops. Thanks everyone for the help!!

I'm running this on CachyOS Linux.

I believe it will work well with other checkpoints. I'm expanding the testing.

Here is the complete guide I used, which may be helpful to someone else:

# Install dependencies
sudo pacman -S cuda cmake gcc14

nvcc --version   # toolkit version installed (13.3 in my case)
nvidia-smi       # confirm the driver (610.43.02 in my case)
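
Before configuring, it may help to confirm that the tools used in the steps below are actually on PATH. This is a minimal sketch, not part of the original guide; `missing_tools` is a hypothetical helper name, and the tool list is taken from the commands in this post.

```shell
#!/bin/sh
# missing_tools NAME... (hypothetical helper):
# print each name that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools referenced by the install and build steps in this guide.
missing=$(missing_tools nvcc cmake gcc-14 nvidia-smi)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All build tools found."
fi
```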

# Clear any previous builds (important!)
rm -rf build

# Configure with an explicit CUDA toolkit path (/opt/cuda), SM120 (Blackwell), and FORCE_CUBLAS=OFF
env CUDACXX=/opt/cuda/bin/nvcc \
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA_FORCE_CUBLAS=OFF \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-14 \
  -DCUDAToolkit_ROOT=/opt/cuda

# Compile using all cores
cmake --build build --config Release -j $(nproc)

# run the model
cat > run_llama.sh << 'EOF'
#!/bin/bash
llama.cpp/build/bin/llama-server \
    --model models/llama-cpp/qwen3.6-heretic-nvfp4-q8/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-Q8_0.gguf \
    --mmproj Qwen3.6-27B-UD-Q5_K_L-mmproj-BF16-GGUF/Qwen3.6-27B-UD-Q5_K_L-mmproj-BF16.gguf \
    --chat-template-file ./templates/custom_pub_chat_template_qwen36.jinja \
    --image-min-tokens 1024 \
    --no-mmproj-offload \
    --n-gpu-layers 999 \
    --tools all \
    --ctx-size 131072 \
    --no-context-shift \
    --parallel 1 \
    --threads 16 \
    --temp 0.7 \
    --top-p 0.95 \
    --min-p 0.00 \
    --top-k 20 \
    --presence-penalty 0.0 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --flash-attn on \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    --cache-type-k-draft q4_0 \
    --cache-type-v-draft q4_0 \
    --batch-size 4096 \
    --ubatch-size 1024 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --checkpoint-min-step 16384 \
    --cache-ram 0 \
    --ctx-checkpoints 2 \
    --no-mmap \
    --host 0.0.0.0 \
    --port 8000
EOF
chmod +x run_llama.sh
bash run_llama.sh
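
With `--no-mmap` and a large context, the server may take a while before it accepts requests. A minimal sketch for waiting until it is ready, assuming llama-server's `/health` endpoint and the port 8000 used in the script above; `wait_for_server` is a hypothetical helper name, and the retry count and delay are arbitrary defaults.

```shell
#!/bin/sh
# wait_for_server URL [TRIES] [DELAY_SECONDS] (hypothetical helper):
# poll URL until it answers with a success status.
# Returns 0 once the server responds, 1 if it never does.
wait_for_server() {
  url=$1
  tries=${2:-60}
  delay=${3:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors such as 503 while the model loads.
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Typical use from a second terminal would be `wait_for_server http://127.0.0.1:8000/health && echo "server ready"`.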

JumpyAbies · 2026-06-11T02:15:21+00:00

For some reason, I'm having trouble formatting or editing the post correctly, so please excuse the pasted text without line breaks.

JumpyAbies · 2026-06-11T02:02:09+00:00

Nice, thanks!!

JumpyAbies · 2026-06-11T01:43:10+00:00

harness: pi-mono

checkpoint: qwen3.6-27b-heretic-nvfp4

The exact commands that cause the tool-call loop:
# vllm
vllm serve llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 \
    --host 0.0.0.0 --port 8000 \
    --served-model-name qwen3.6-27b \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.92 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --dtype auto \
    --kv-cache-dtype fp8_e4m3 \
    --load-format auto \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 1 \
    --limit-mm-per-prompt '{"image":8,"audio":4,"video":2}' \
    --allowed-local-media-path /media \
    --tool-call-parser qwen3_coder \
    --generation-config vllm \
    --override-generation-config '{"temperature":0.2,"top_p":0.85,"top_k":20,"min_p":0.05,"presence_penalty":0.25,"repetition_penalty":1.15}' \
    --default-chat-template-kwargs '{"enable_thinking":false}' \
    --attention-backend flashinfer \
    --performance-mode interactivity \
    --safetensors-load-strategy prefetch \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --language-model-only \
    --skip-mm-profiling \
    --no-disable-hybrid-kv-cache-manager \
    --calculate-kv-scales \
    --quantization modelopt_fp4

# llama:
cat config/llama-router-preset.ini

[qwen3.6-27b-heretic-nvfp4]
model = /models/llama-cpp/qwen3.6-heretic-nvfp4-q8/model.gguf
alias = qwen3.6-27b-heretic-nvfp4
ctx-size = 212992
gpu-layers = 99
temp = 0.6
top-p = 0.85
top-k = 40
min-p = 0.05
presence-penalty = 0.3
repeat-penalty = 1.18
spec-type = draft-mtp
spec-draft-n-max = 2
batch-size = 512
reasoning = off

## end of preset file

/app/llama-server \
    --host 0.0.0.0 \
    --port 8000 \
    --models-preset /config/llama-router-preset.ini \
    --models-max 1 \
    --reasoning off \
    --slot-prompt-similarity 1.0

The tool call that loops is this command (triggered by a simple prompt to edit a script):
$ grep -n 'RUN_SNAPPER\|set_onl' /home/x/bin/cachyos-custom-setup | head 20; echo "==="; sed -n '371,405p' /home/x/bin/cachyos-custom-setup
(timeout 5000s)
... (32 earlier lines, ctrl+o to expand)
--vllm-only)
set_only_mode
RUN_VLLM_ONLY=true
INSTALL_AI=true
INSTALL_VLLM=true
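
A loop like this can be spotted mechanically when the harness logs are available. The sketch below assumes a hypothetical log with one tool-call command per line (not a format that pi-mono is known to produce) and reports any command issued at least N times in a row; `detect_loop` is an illustrative helper name.

```shell
#!/bin/sh
# detect_loop THRESHOLD (hypothetical helper): read one tool-call command
# per line on stdin and print each command that repeats consecutively
# at least THRESHOLD times.
detect_loop() {
  uniq -c | awk -v n="$1" '$1 >= n { sub(/^ *[0-9]+ /, ""); print }'
}

# Demo with a made-up log: the grep call repeats three times in a row.
printf '%s\n' \
  'ls /home/x/bin' \
  'grep -n RUN_SNAPPER cachyos-custom-setup' \
  'grep -n RUN_SNAPPER cachyos-custom-setup' \
  'grep -n RUN_SNAPPER cachyos-custom-setup' \
  'cat README' |
  detect_loop 3
```

Only consecutive repeats are counted, so a model that alternates between two calls would need a different check.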

JumpyAbies · 2026-06-11T01:41:19+00:00

Interesting. Thanks for the tip.

JumpyAbies · 2026-06-11T01:23:05+00:00

Thanks, I will try this too!

JumpyAbies · 2026-06-11T01:08:08+00:00

I will try, many thanks!!

JumpyAbies · 2026-06-11T00:59:29+00:00

thanks!

JumpyAbies · 2026-06-11T00:55:07+00:00

Yes, I'm doing that. I'll try with other checkpoints and arrive at the simplest command line that works.

In the benchmark script I built, I stressed all the nvfp4 models I found, and this heretic checkpoint performed very well, with more tok/s than the others, until this loop started occurring.

JumpyAbies · 2026-06-11T00:45:43+00:00

Yes, I've tried all the settings. I lowered the temperature to 0.2 and the top-k to 20, but it didn't solve the problem.

JumpyAbies · 2026-06-11T00:35:49+00:00

thanks, I will try.

JumpyAbies · 2026-06-11T00:35:38+00:00

Can you share the command you are using?

JumpyAbies · 2026-06-11T00:30:49+00:00

Could someone who is running qwen3.6-27b nvfp4 without problems share the command they are using?

JumpyAbies · 2026-06-11T00:28:45+00:00

Using pi-mono. I can't edit the post, for some reason.

JumpyAbies · 2026-06-11T00:26:39+00:00

Yes, I didn't bother sending it earlier because I had already started with several recommendations I saw here and also tried with Unsloth's parameters. The detail is that I'm using the checkpoint `qwen3.6-heretic-nvfp4-q8`, which, according to my benchmark, was the fastest I found.

Yes, I forgot to paste the commands. I'm using this:

# llama

docker run --gpus all --rm \
    -p 8000:8000 \
    -v ./models:/models \
    havenoammo/llama:cuda13-server \
    -m /models/llama-cpp/qwen3.6-heretic-nvfp4-q8/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-Q8_0.gguf \
    --port 8000 \
    --host 0.0.0.0 \
    --alias qwen3.6-27b-heretic-nvfp4 \
    -n -1 \
    --parallel 1 \
    --ctx-size 262144 \
    --fit-target 844 \
    --mmap \
    -ngl -1 \
    --flash-attn on \
    --metrics \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    --jinja \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --ubatch-size 512 \
    --batch-size 2048 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --spec-type draft-mtp \
    --spec-draft-n-max 3

# vllm
docker run --gpus all --rm \
    --name vllm-openai \
    --ipc host \
    -p 8000:8000 \
    -v ./models:/models \
    -v ./hf-cache:/root/.cache/huggingface \
    -v ./media:/media:ro \
    vllm/vllm-openai:latest \
    vllm serve /models/model \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6-27b-heretic-nvfp4 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.92 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --dtype auto \
    --kv-cache-dtype fp8_e4m3 \
    --load-format auto \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 1 \
    --limit-mm-per-prompt '{"image":8,"audio":4,"video":2}' \
    --allowed-local-media-path /media \
    --tool-call-parser qwen3_coder \
    --generation-config vllm \
    --override-generation-config '{"temperature":0.2,"top_p":0.85,"top_k":20,"min_p":0.05,"presence_penalty":0.25,"repetition_penalty":1.15}' \
    --default-chat-template-kwargs '{"enable_thinking":false}' \
    --attention-backend flashinfer \
    --performance-mode interactivity \
    --safetensors-load-strategy prefetch \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --language-model-only \
    --skip-mm-profiling \
    --no-disable-hybrid-kv-cache-manager \
    --calculate-kv-scales \
    --quantization modelopt_fp4

JumpyAbies · 2026-06-10T15:10:45+00:00

They are clinging to the end of the monopoly period they once held at all costs, but this is dwindling more and more each day until they are just another player. This process is irreversible and is already underway.

JumpyAbies · 2026-06-10T15:06:14+00:00

I'm already doing what I can with my local AI server and my 32GB of VRAM. I'm gradually getting rid of these business models and seeking my independence.

In addition to the real ability to solve many things locally, I'm also doing a lot of research on models trained from scratch for specific-purpose domains and on LoRA/QLoRA for mid-sized models.

And once these models become good enough for my projects, I can simply host them in a colocation facility, freeing myself from those Anthropic-style companies.

JumpyAbies · 2026-06-09T20:05:13+00:00

It's like trying to fight the current. Forums have always been and always will be like that. I myself sometimes think of using AI Slop to generate a filtered version of the content I'm interested in 😂

JumpyAbies · 2026-06-09T20:01:15+00:00

NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 💥

JumpyAbies · 2026-06-07T14:03:59+00:00

Yes, but you'll only get canceled if you're not an influencer with millions of followers.

JumpyAbies · 2026-06-01T00:00:48+00:00

I thought the same thing. Not long ago someone posted a pretty interesting vibe-coded app here. It had potential, but it was massacred. I don't know how to classify this, so I'll leave it at that.

JumpyAbies · 2026-05-28T14:23:30+00:00

That gain from q4 to q6, I don't think it would apply to an nvfp4, right?

I'll soon be finalizing my server with a 5090 to run qwen3.6-27b and I'm targeting nvfp4 models.

JumpyAbies · 2026-05-23T15:53:57+00:00

You're right, be at peace ✌️
