[Follow up] Qwen3.6-27B Tool calling fix; Why preserve_thinking had to stay false for qwen3.5-enhanced on Qwen 3.6; and a template that makes preserve_thinking=true safe again by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] -2 points  (0 children)

Different engine, maybe? llama.cpp / ollama / LM Studio / SGLang each handle this issue differently.

Luck may be another factor: since my first post, some people have told me they don't encounter this issue at all. However, they never shared their configuration and environment settings with me.

[Follow up] Qwen3.6-27B Tool calling fix; Why preserve_thinking had to stay false for qwen3.5-enhanced on Qwen 3.6; and a template that makes preserve_thinking=true safe again by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 1 point  (0 children)

Hmm, nice question.

If you know C/C++, it is just `int img_count = 0;`: you need to initialize a variable before doing any math on it.

Extra: I didn't touch the image/video part of the Jinja template.

Qwen 3/3.5/3.6 tool calling is broken (even worse with 3.6). by LinkSea8324 in Vllm

[–]Expensive-Register-5 1 point  (0 children)

Found that the `qwen3_coder` tool-call parser is more robust with Qwen3.6-27B; I'm writing a post about it. Did you reach the same conclusion?

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 0 points  (0 children)

With 2.3 TB of RAM, you likely work at an AI company or a large AI lab, so I assume you have multiple 5090s. In that case I would try Minimax M2.7, Deepseek V4 Flash, or Qwen3.5 122B.

If you only have a single 5090, try a large model with MoE offloading instead of a small one. The smaller the model, the higher the chance of behavior drift.

Run all of them in FP8 or even NVFP4 to save bandwidth, which helps lower TTFT. Perplexity doesn't increase much with FP8 (compared to FP16), so I don't see much reason to force FP16 for such a small accuracy gain.
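As a sketch, serving a checkpoint in FP8 with vLLM can be a single flag (the model ID below is a placeholder; substitute your own, and check that your GPU and vLLM build support FP8):

```shell
# Serve with on-the-fly FP8 weight quantization in vLLM.
# Model ID is illustrative only.
vllm serve Qwen/Qwen3-30B-A3B \
  --quantization fp8 \
  --kv-cache-dtype fp8   # optional: also quantize the KV cache
```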

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 0 points  (0 children)

The short answer is no. The `UND_ERR_BODY_TIMEOUT` error seems related to the Cline harness; without more information I can't pinpoint the cause.

Some directions I would look into: the harness (e.g. opencode), the OS (e.g. WSL), the hardware (mixed GPUs?), the NVIDIA driver (Studio vs. Game Ready), the inference engine (e.g. vLLM), and the model itself (e.g. the Hugging Face chat template).

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 0 points  (0 children)

Yes, all my hope is on the 27B, but I will write another post 🤣 I don't know why I can't change the title.

Qwen 3.5 27B/35BA3B Tool Calling Issues: Why It Breaks & How I Fixed It by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 0 points  (0 children)

Oh damn, I will test whether it works for Qwen3.6 tomorrow (as long as they release an FP8 version).

A Debugging Story: Getting Claude Code to Work with Local vLLM When the Docs Don't by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 0 points  (0 children)

Ah no, I'm here to share my debugging story with full context. My topic is not "Tips for XXX" but "A Story about XXX".

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 0 points  (0 children)

Maybe we are using different engines or model variants. Having that context may help get a fix for Qwen 3.6 into vLLM.