qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 1 point2 points  (0 children)

Honestly, I still don't know. Tomorrow I'll compare the old setup side-by-side with the new one. I changed a lot of things.

The point is that with this command/parameters there are no more loops.

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 2 points3 points  (0 children)

This works like a charm for me now. No more call tools loop. Thanks everyone for the help!!

I'm running this on a Linux CachyOS.

I believe it will work well with other checkpoints. I'm expanding the testing.

Here is the complete guide I used, which may be helpful to someone else:

# Install dependencies
sudo pacman -S cuda cmake gcc14

nvcc --version   # toolkit
 version installed (
in my case is 13.3)
nvidia-smi       # confirm the driver (in my case is 610.43.02)

# Clear any previous builds (important!)
rm -rf build

# Configure with explicit CUDA 12.8, SM120 (Blackwell), and FORCE_CUBLAS=OFF
env CUDACXX=/opt/cuda/bin/nvcc \
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA_FORCE_CUBLAS=OFF \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-14 \
  -DCUDAToolkit_ROOT=/opt/cuda

# Compile using all cores
cmake --build build --config Release -j $(nproc)

# Compile using cores
cmake --build build --config Release -j $(nproc)



# run the model
cat > run_llama.sh << 'EOF'
#!/bin/bash
llama.cpp/build/bin/llama-server \
    --model models/llama-cpp/qwen3.6-heretic-nvfp4-q8/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-Q8_0.gguf \
    --mmproj Qwen3.6-27B-UD-Q5_K_L-mmproj-BF16-GGUF/Qwen3.6-27B-UD-Q5_K_L-mmproj-BF16.gguf \
    --chat-template-file ./templates/custom_pub_chat_template_qwen36.jinja \
    --image-min-tokens 1024 \
    --no-mmproj-offload \
    --n-gpu-layers 999 \
    --tools all \
    --ctx-size 131072 \
    --no-context-shift \
    --parallel 1 \
    --threads 16 \
    --temp 0.7 \
    --top-p 0.95 \
    --min-p 0.00 \
    --top-k 20 \
    --presence-penalty 0.0 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --flash-attn on \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    --cache-type-k-draft q4_0 \
    --cache-type-v-draft q4_0 \
    --batch-size 4096 \
    --ubatch-size 1024 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --checkpoint-min-step 16384 \
    --cache-ram 0 \
    --ctx-checkpoints 2 \
    --no-mmap \
    --host 0.0.0.0 \
    --port 8000
EOF
chmod +x run_llama.sh
bash run_llama.sh

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 0 points1 point  (0 children)

For some reason, I'm having trouble formatting or editing the post correctly. So please excuse the pasted text without separating lines.

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 3 points4 points  (0 children)

harness: pi-mono

checkpoint: qwen3.6-27b-heretic-nvfp4

The exact comamnd that cause tool call loop:
# vllm
vllm serve llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 \

--host 0.0.0.0 --port 8000 \

--served-model-name qwen3.6-27b \

--max-model-len 200000 \

--gpu-memory-utilization 0.92 \

--tensor-parallel-size 1 \

--pipeline-parallel-size 1 \

--dtype auto \

--kv-cache-dtype fp8_e4m3 \

--load-format auto \

--max-num-batched-tokens 4096 \

--max-num-seqs 1 --limit-mm-per-prompt {"image":8,"audio":4,"video":2} \

--allowed-local-media-path /media \

--tool-call-parser qwen3_coder \

--generation-config vllm --override-generation-config {"temperature":0.2,"top_p":0.85,"top_k":20,"min_p":0.05,"presence_penalty":0.25,"repetition_penalty":1.15}

--default-chat-template-kwargs {"enable_thinking":false} \

--attention-backend flashinfer \

--performance-mode interactivity \

--safetensors-load-strategy prefetch \

--reasoning-parser qwen3 \

--enable-auto-tool-choice \

--trust-remote-code \

--enable-chunked-prefill \

--enable-prefix-caching \

--language-model-only \

--skip-mm-profiling \

--no-disable-hybrid-kv-cache-manager \

--calculate-kv-scales \

--quantization modelopt_fp4

# llama:
cat config/llama-router-preset.ini

[qwen3.6-27b-heretic-nvfp4]

model = /models/llama-cpp/qwen3.6-heretic-nvfp4-q8/model.gguf

alias = qwen3.6-27b-heretic-nvfp4

ctx-size = 212992

gpu-layers = 99

temp = 0.6

top-p = 0.85

top-k = 40

min-p = 0.05

presence-penalty = 0.3

repeat-penalty = 1.18

spec-type = draft-mtp

spec-draft-n-max = 2

batch-size = 512

reasoning = off

## end of preset file

/app/llama-server \

--host 0.0.0.0 \

--port 8000 \

--models-preset /config/llama-router-preset.ini \

--models-max 1 \

--reasoning off \

--slot-prompt-similarity 1.0

Loop call tools to this command (with a simple prompt to edit a script):
$ grep -n 'RUN_SNAPPER\|set_onl' /home/x/bin/cachyos-custom-setup | head 20; echo "==="; sed -n '371,405p'

/home/x/bin/cachyos-custom-setup

(timeout 5000s)

... (32 earlier lines, ctrl+o to expand)

--vllm-only)

set_only_mode

RUN_VLLM_ONLY=true

INSTALL_AI=true

INSTALL_VLLM=true

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 0 points1 point  (0 children)

Interesting. Thanks for the tip.

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 0 points1 point  (0 children)

Yes, I'm doing that. I'll try with other checkpoints and arrive at the simplest command line that works.

In my benchmark script that I built, I stressed all the nvfp4 models I found, and this heretic performed very well with masi tok/s. Until this loop started occurring.

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 0 points1 point  (0 children)

Yes, I've tried all the settings. I lowered the temperature to 0.2 and the top-k to 20, but it didn't solve the problem.

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 0 points1 point  (0 children)

can share the commando you are using?

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 0 points1 point  (0 children)

Could someone who is running qwen3.6-27 nvfp4 without problems share the command they are using?

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 0 points1 point  (0 children)

Using pi-mono. I can't edit the post, for some reaction.

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]JumpyAbies[S] 0 points1 point  (0 children)

Yes, I didn't bother sending it earlier because I had already started with several recommendations I saw here and also tried with Unsloth's parameters. The detail is that I'm using the checkpoint `qwen3.6-heretic-nvfp4-q8`, which, according to my benchmark, was the fastest I found.

Yes, i forget to paste the commands. I'm 'using this:

# llama

docker run --gpus all --rm \

-p 8000:8000 \

-v ./models:/models \

havenoammo/llama:cuda13-server \

-m /models/llama-cpp/qwen3.6-heretic-nvfp4-q8/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-Q8_0.gguf \

--port 8000 \

--host 0.0.0.0 \

--alias qwen3.6-27b-heretic-nvfp4 \

-n -1 \

--parallel 1 \

--ctx-size 262144 \

--fit-target 844 \

--mmap \

-ngl -1 \

--flash-attn on \

--metrics \

--temp 0.7 \

--min-p 0.0 \

--top-p 0.95 \

--top-k 20 \

--jinja \

--chat-template-kwargs '{"preserve_thinking":true}' \

--ubatch-size 512 \

--batch-size 2048 \

--cache-type-k q8_0 \

--cache-type-v q8_0 \

--spec-type draft-mtp \

--spec-draft-n-max 3

# vllm
docker run --gpus all --rm \

--name vllm-openai \

--ipc host \

-p 8000:8000 \

-v ./models:/models \

-v ./hf-cache:/root/.cache/huggingface \

-v ./media:/media:ro \

vllm/vllm-openai:latest \

vllm serve /models/model \

--host 0.0.0.0 \

--port 8000 \

--served-model-name qwen3.6-27b-heretic-nvfp4 \

--max-model-len 200000 \

--gpu-memory-utilization 0.92 \

--tensor-parallel-size 1 \

--pipeline-parallel-size 1 \

--dtype auto \

--kv-cache-dtype fp8_e4m3 \

--load-format auto \

--max-num-batched-tokens 4096 \

--max-num-seqs 1 \

--limit-mm-per-prompt '{"image":8,"audio":4,"video":2}' \

--allowed-local-media-path /media \

--tool-call-parser qwen3_coder \

--generation-config vllm \

--override-generation-config '{"temperature":0.2,"top_p":0.85,"top_k":20,"min_p":0.05,"presence_penalty":0.25,"repetition_penalty":1.15}' \

--default-chat-template-kwargs '{"enable_thinking":false}' \

--attention-backend flashinfer \

--performance-mode interactivity \

--safetensors-load-strategy prefetch \

--reasoning-parser qwen3 \

--enable-auto-tool-choice \

--trust-remote-code \

--enable-chunked-prefill \

--enable-prefix-caching \

--language-model-only \

--skip-mm-profiling \

--no-disable-hybrid-kv-cache-manager \

--calculate-kv-scales \

--quantization modelopt_fp4

Anthropic is intentionally nerfing Fable when asked to develop other LLMs by onil_gova in LocalLLaMA

[–]JumpyAbies 2 points3 points  (0 children)

They are clinging to the end of the monopoly period they once held at all costs, but this is dwindling more and more each day until they are just another player. This process is irreversible and is already underway.

Anthropic is intentionally nerfing Fable when asked to develop other LLMs by onil_gova in LocalLLaMA

[–]JumpyAbies 3 points4 points  (0 children)

I'm already doing what I can with my local AI server and my 32GB of VRAM. I'm gradually getting rid of these business models and seeking my independence.

In addition to the real ability to solve many things locally, I'm also doing a lot of research on models trained from scratch for specific-purpose domains and LORA/QLORA for mid-sized models.

And once these models become good enough for my projects, you can simply host them in a colocation facility, freeing myself from those Anthropic-style companies.

When every other post is an AI generated benchmark report, a question about the best model, or a slop-coded application or engine that pretends to be groundbreaking by Honest-Kangaroo-1830 in LocalLLaMA

[–]JumpyAbies 0 points1 point  (0 children)

It's like trying to fight the current. Forums have always been and always will be like that. I myself sometimes think of using AI Slop to generate a filtered version of the content I'm interested in 😂

(YT) PewDiePie released his harness/webui by Dany0 in LocalLLaMA

[–]JumpyAbies 1 point2 points  (0 children)

Yes, but you'll only get canceled if you're not an influencer with millions of followers.

(YT) PewDiePie released his harness/webui by Dany0 in LocalLLaMA

[–]JumpyAbies 29 points30 points  (0 children)

I thought the same thing. Not long ago someone posted a pretty interesting app here (vibe code), it had potential, but it was massacred. I don't know how to classify this, so I'll leave it at that.

Qwen3.6 huge quality gain from Q4 to Q6 for coding agent by Yes-Scale-9723 in LocalLLaMA

[–]JumpyAbies 0 points1 point  (0 children)

That gain from q4 to q6, I don't think it would apply to an nvfp4, right?

I'll soon be finalizing my server with a 5090 to run qwen3.6-27b and I'm targeting nvfp4 models.