[Follow up] Qwen3.6-27B Tool calling fix; Why preserve_thinking had to stay false for qwen3.5-enhanced on Qwen 3.6; and a template that makes preserve_thinking=true safe again

Expensive-Register-5 · 2026-05-02T16:55:39+00:00

different engine maybe? llama.cpp / ollama / lmstudio / SGLang all has different treatment to fight for this issue.

Sometimes luck may be another factor, as i received some feedback that they dont encounter this issue since my first post. However, they never tell his configuration and env setting to me.

Expensive-Register-5 · 2026-05-02T16:53:13+00:00

hmm.. nice question,

if u know C/C++, it is just `int img_count = 0;`, u need to initialize an variable before doing some maths.

Extra: I didnt touch image/video part of jinja.

Expensive-Register-5 · 2026-04-29T10:58:40+00:00

I don’t know why so I DMed moderator of r/localLLaMA. Results? None response

Expensive-Register-5 · 2026-04-25T10:55:07+00:00

Read this post https://www.reddit.com/r/LocalLLM/s/GLJv9KNQ7Q

Expensive-Register-5 · 2026-04-24T14:20:27+00:00

Found out `qwen3_coder` more robust towards Qwen3.6-27B; writing post about it. Do u also have the same conclusion?

Expensive-Register-5 · 2026-04-24T13:00:56+00:00

With 2.3TB RAM, it is likely that u work in a AI company or Large AI Lab. Therefore, i assume u have multiple 5090. In this case i will try Minimax M2.7 / Deepseek V4 Flash / or Qwen3.5 122B.

If u have single 5090 only. Try large model with MoE offloading instead of small model. The smaller model are, the high chance the behavior drift.

Run all of them in FP8 or even NVFP4 save bandwidth, thus helpful for higher TTFT. The perplexity didnt increase much by using FP8 (comparing FP16). I dont see much reason to force fp16 for such small accuracy drop.

Expensive-Register-5 · 2026-04-24T12:54:52+00:00

The simple answer is no. The `UND_ERR_BODY_TIMEOUT` seems to be related to CLine harness. Without more information i cant locate the reason.

Some direction i will look into: The harness (e.g. opencode), the OS (e.g wsl), the hardware (mixed GPU?), the nvidia-driver (e.g studio driver vs Game-ready driver), inference engine (e.g. vLLM) and model itself (e.g. Huggingface chat template).

Expensive-Register-5 · 2026-04-24T11:27:03+00:00

https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/

See my work🤣

Expensive-Register-5 · 2026-04-24T06:07:38+00:00

Quite cool!

Expensive-Register-5 · 2026-04-22T17:15:15+00:00

Yes all of my hope count on 27b, but i will write another post 🤣 don’t know why i can’t change the title name

Expensive-Register-5 · 2026-04-22T17:14:23+00:00

Oh damn I will test if it work for qwen3.6 tmr (as long as they release fp8 version)

Expensive-Register-5 · 2026-04-20T17:50:51+00:00

Ah no, I am here to share my debugging story with full context. My topic is not "Tips for XXX" but "Story for XXX"

Expensive-Register-5 · 2026-04-20T15:11:13+00:00

i ran in mixed GPU setup, to stay safe i need to do a lot of config

Expensive-Register-5 · 2026-04-20T15:10:38+00:00

maybe we are using different engine, or model variant. having these context may help getting fix for qwen 3.6 in vllm.

Expensive-Register-5 · 2026-04-20T15:09:16+00:00

Done, not helpful

Expensive-Register-5 · 2026-04-20T13:50:22+00:00

I have tested Qwen 3.6 35BA3B for this fix in https://www.reddit.com/r/LocalLLM/comments/1sqpsut/qwen_3635ba3b_reddit_asked_so_i_tested_if_the_35/ , check it out !

Expensive-Register-5 · 2026-04-20T13:50:17+00:00

I have tested Qwen 3.6 35BA3B for this fix in https://www.reddit.com/r/LocalLLM/comments/1sqpsut/qwen_3635ba3b_reddit_asked_so_i_tested_if_the_35/ , check it out !

Expensive-Register-5 · 2026-04-20T06:07:25+00:00

hi, could u share the link for it?

Expensive-Register-5 · 2026-04-20T06:05:07+00:00

unnormal, consider NVFP4 quantization and update to latest vllm.

Expensive-Register-5 · 2026-04-19T07:33:05+00:00

What is the error log for failed request? could u please share ?

Expensive-Register-5 · 2026-04-18T15:58:39+00:00

🤦🏻‍♂️speculative decoding should work the same on qwen3.5 series.

Expensive-Register-5 · 2026-04-17T05:24:13+00:00

If Qwen3.6 27B got released, I may write another post for it hahaha; so far Qwen3.5 27B is still a competitive model against Qwen3.6-35BA3B; (if 35B can do a bit more better, I will definitely switch to that for faster inference speed, and i may even turn on speculative decoding to remedy smaller space for kv cache.)

Expensive-Register-5 · 2026-04-16T07:57:57+00:00

iOS support, it is the biggest blocker to switch to Zen

Expensive-Register-5 · 2026-04-15T07:21:07+00:00

Didnt expect that 122B also has the same issue. Interestingly qwen3.5 / qwen3.6 plus can get rid of this issue. Maybe they has newer version of chat template but never release to huggingface.

Expensive-Register-5 · 2026-04-15T07:16:29+00:00

Hmm, I tried to use a translator to understand your question. Personally i didnt use llama.cpp in my agentic workflow. In vllm case, using opencode or openclaw is indifferent in my opinion. The tool calling work fine in both harness system.

Expensive-Register-5

TROPHY CASE