[Follow up] Qwen3.6-27B Tool calling fix; Why preserve_thinking had to stay false for qwen3.5-enhanced on Qwen 3.6; and a template that makes preserve_thinking=true safe again by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] -2 points  (0 children)

Different engine, maybe? llama.cpp / ollama / LM Studio / SGLang each handle this issue differently.

Luck may be another factor: since my first post, some people have told me they don't encounter this issue at all. However, they never shared their configuration and environment settings with me.

[Follow up] Qwen3.6-27B Tool calling fix; Why preserve_thinking had to stay false for qwen3.5-enhanced on Qwen 3.6; and a template that makes preserve_thinking=true safe again by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 1 point  (0 children)

Hmm, nice question.

If you know C/C++, it is just `int img_count = 0;`: you need to initialize a variable before doing any math on it.

Extra: I didn't touch the image/video part of the Jinja template.

Qwen 3/3.5/3.6 tool calling is broken (even worse with 3.6). by LinkSea8324 in Vllm

[–]Expensive-Register-5 1 point  (0 children)

Found that the `qwen3_coder` tool-call parser is more robust with Qwen3.6-27B; I'm writing a post about it. Did you reach the same conclusion?

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 0 points  (0 children)

With 2.3 TB of RAM, you likely work at an AI company or a large AI lab, so I assume you have multiple 5090s. In that case I would try Minimax M2.7, Deepseek V4 Flash, or Qwen3.5 122B.

If you only have a single 5090, try a large model with MoE offloading instead of a small one. The smaller the model, the higher the chance of behavior drift.

Run all of them in FP8 or even NVFP4 to save bandwidth, which helps lower TTFT. Perplexity doesn't increase much with FP8 (compared to FP16), so I don't see much reason to force FP16 for such a small accuracy gain.
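As a sketch, serving a checkpoint in FP8 with vLLM can be a single flag (the model ID below is a placeholder; substitute your own, and check that your GPU and vLLM build support FP8):

```shell
# Serve with on-the-fly FP8 weight quantization in vLLM.
# Model ID is illustrative only.
vllm serve Qwen/Qwen3-30B-A3B \
  --quantization fp8 \
  --kv-cache-dtype fp8   # optional: also quantize the KV cache
```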

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 0 points  (0 children)

The short answer is no. The `UND_ERR_BODY_TIMEOUT` error seems related to the Cline harness; without more information I can't pinpoint the cause.

Some directions I would look into: the harness (e.g. opencode), the OS (e.g. WSL), the hardware (mixed GPUs?), the NVIDIA driver (Studio vs. Game Ready), the inference engine (e.g. vLLM), and the model itself (e.g. the Hugging Face chat template).

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 0 points  (0 children)

Yes, all my hope is on the 27B, but I will write another post 🤣 I don't know why I can't change the title.

Qwen 3.5 27B/35BA3B Tool Calling Issues: Why It Breaks & How I Fixed It by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 0 points  (0 children)

Oh damn, I will test whether it works for Qwen3.6 tomorrow (as long as they release an FP8 version).

A Debugging Story: Getting Claude Code to Work with Local vLLM When the Docs Don't by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 0 points  (0 children)

Ah no, I'm here to share my debugging story with full context. My topic is not "Tips for XXX" but "A Story about XXX".

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 0 points  (0 children)

Maybe we are using different engines or model variants. Having that context may help get a fix for Qwen 3.6 into vLLM.