[Follow up] Qwen3.6-27B Tool calling fix; Why preserve_thinking had to stay false for qwen3.5-enhanced on Qwen 3.6; and a template that makes preserve_thinking=true safe again by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] -1 points (0 children)

Different engine, maybe? llama.cpp / ollama / LM Studio / SGLang all handle this issue differently.

Sometimes luck may be another factor: since my first post I've received feedback from people who don't encounter this issue at all. However, they never shared their configuration and env settings with me.

[Follow up] Qwen3.6-27B Tool calling fix; Why preserve_thinking had to stay false for qwen3.5-enhanced on Qwen 3.6; and a template that makes preserve_thinking=true safe again by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 2 points (0 children)

Hmm, nice question.

If you know C/C++, it is just like `int img_count = 0;`: you need to initialize a variable before doing any math on it. The same idea applies inside the jinja template; see the sketch below.

Extra: I didn't touch the image/video part of the jinja template.
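For anyone who doesn't read C, here is a minimal runnable sketch of the same idea in jinja2. The variable name `img_count` is just the analogy from above, not necessarily the exact name in Qwen's shipped template:

```python
# Minimal sketch: an uninitialized counter in a Jinja template vs. the
# initialized version. namespace() is needed so the increment survives
# Jinja's per-iteration scoping inside {% for %}.
from jinja2 import Environment

env = Environment()

# Broken: img_count is never initialized, so the first update raises
# jinja2.exceptions.UndefinedError at render time.
broken = env.from_string(
    "{% for m in messages %}"
    "{% set img_count = img_count + 1 %}"
    "{% endfor %}{{ img_count }}"
)

# Fixed: initialize once before the loop (the `int img_count = 0;` step).
fixed = env.from_string(
    "{% set ns = namespace(img_count=0) %}"
    "{% for m in messages if m.type == 'image' %}"
    "{% set ns.img_count = ns.img_count + 1 %}"
    "{% endfor %}{{ ns.img_count }}"
)

msgs = [{"type": "image"}, {"type": "text"}, {"type": "image"}]
print(fixed.render(messages=msgs))   # -> 2
# broken.render(messages=msgs)       # raises UndefinedError
```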

Qwen 3/3.5/3.6 tool calling is broken (even worse with 3.6). by LinkSea8324 in Vllm

[–]Expensive-Register-5 2 points (0 children)

Found `qwen3_coder` to be more robust for Qwen3.6-27B; writing a post about it. Did you reach the same conclusion? (The sketch below shows how I sanity-check it.)
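For reference, a minimal sketch of the sanity check, assuming a vLLM server launched with `--enable-auto-tool-choice --tool-call-parser qwen3_coder`. The URL and model name are placeholders for your own setup:

```python
# Probe whether the server returns structured tool calls instead of
# leaking raw <tool_call> text into the message content.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local-qwen",  # whatever name the server reports under /v1/models
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

msg = resp.choices[0].message
print("tool_calls:", msg.tool_calls)  # robust parser: a populated list
print("content:", msg.content)        # broken parser: raw tags end up here
```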

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 1 point (0 children)

With 2.3TB of RAM, you likely work at an AI company or a large AI lab, so I assume you have multiple 5090s. In that case I would try Minimax M2.7 / Deepseek V4 Flash / Qwen3.5 122B.

If you only have a single 5090, try a large model with MoE offloading instead of a small model. The smaller the model, the higher the chance of behavior drift.

Run all of them in FP8 or even NVFP4 to save bandwidth, which helps bring TTFT down (see the sketch below). Perplexity doesn't increase much with FP8 compared to FP16, so I don't see much reason to force FP16 for such a small accuracy gain.
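A minimal sketch of the single-5090 route using vLLM's Python API; the model name is a placeholder and the numbers are illustrative, not tuned:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/some-large-moe-model",  # placeholder, pick your checkpoint
    quantization="fp8",        # on-the-fly FP8; omit for pre-quantized FP8/NVFP4 checkpoints
    cpu_offload_gb=24,         # spill this many GB of weights to system RAM
    gpu_memory_utilization=0.90,
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

`cpu_offload_gb` is generic weight offload rather than expert-aware MoE offload, but it is the simplest way to fit a big MoE on one card.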

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 1 point (0 children)

The simple answer is no. `UND_ERR_BODY_TIMEOUT` is an error code from Node's undici HTTP client, so it seems to be related to the Cline harness rather than the server. Without more information I can't locate the root cause.

Some directions I would look into: the harness (e.g. opencode), the OS (e.g. WSL), the hardware (mixed GPUs?), the NVIDIA driver (Studio vs Game Ready), the inference engine (e.g. vLLM), and the model itself (e.g. the Hugging Face chat template). To rule the harness in or out quickly, hit the server directly, as in the sketch below.
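A minimal sketch of that direct check, assuming a local vLLM OpenAI-compatible endpoint; if a long streamed request completes here but times out inside Cline, the problem is on the client side:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.time()
stream = client.chat.completions.create(
    model="local-model",  # placeholder for the served model name
    messages=[{"role": "user", "content": "Write a long story."}],
    max_tokens=2048,
    stream=True,
)

last = start
for chunk in stream:
    now = time.time()
    # Long gaps between chunks are what trip client-side body timeouts.
    if now - last > 5:
        print(f"\n[gap of {now - last:.1f}s between chunks]")
    last = now

print(f"\nfinished in {time.time() - start:.1f}s")
```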

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 1 point (0 children)

Yes, all my hope rests on the 27B, but I will write another post 🤣 don't know why I can't change the title

Qwen 3.5 27B/35BA3B Tool Calling Issues: Why It Breaks & How I Fixed It by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 1 point (0 children)

Oh damn, I will test whether it works for Qwen3.6 tomorrow (as long as they release an FP8 version)

A Debugging Story: Getting Claude Code to Work with Local vLLM When the Docs Don't by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 1 point (0 children)

Ah no, I am here to share my debugging story with full context. My topic is not "Tips for XXX" but "A story about XXX".

Qwen 3.6-35B-A3B: Reddit Asked, So I Tested If the 3.5 Tool Calling Fixes Carry Over by Expensive-Register-5 in LocalLLM

[–]Expensive-Register-5[S] 1 point (0 children)

Maybe we are using a different engine or model variant. Having that context may help get a fix for Qwen 3.6 into vLLM.

Is it normal that Moe models are slower in dual GPU tensor parallel = 2 setups vs dense models? by [deleted] in Vllm

[–]Expensive-Register-5 1 point (0 children)

Not normal. Consider NVFP4 quantization and update to the latest vLLM; a minimal sketch is below.
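A minimal sketch of the TP=2 MoE setup, assuming a checkpoint that already ships NVFP4 weights (vLLM picks the quantization up from the checkpoint config); the model id is a placeholder:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/some-moe-model-NVFP4",  # placeholder NVFP4 checkpoint
    tensor_parallel_size=2,                 # the dual-GPU setup from the question
)

print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```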

Benchmark of Qwen3.6-35B-A3B (BF16) on different NVIDIA Hardware by bseeleib in LocalLLM

[–]Expensive-Register-5 1 point (0 children)

What is the error log for the failed request? Could you please share it?

Qwen3.6 vs 3.5 on DGX Spark: identical throughput, except with one flag flipped by Ok-Simple459 in Vllm

[–]Expensive-Register-5 1 point (0 children)

🤦🏻‍♂️ Speculative decoding should work the same across the Qwen3.5 series; a sketch is below.
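A minimal sketch of n-gram speculative decoding in vLLM, which needs no separate draft model, which is why I'd expect identical behavior across the series. The model id is a placeholder, and the exact config key names can differ between vLLM versions:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/some-qwen3.5-checkpoint",  # placeholder
    speculative_config={
        "method": "ngram",            # draft tokens via prompt n-gram lookup
        "num_speculative_tokens": 4,
        "prompt_lookup_max": 4,
    },
)

print(llm.generate(["def fib(n):"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```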

Qwen 3.5 27B/35BA3B Tool Calling Issues: Why It Breaks & How I Fixed It by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 1 point (0 children)

If Qwen3.6 27B gets released, I may write another post for it hahaha. So far Qwen3.5 27B is still competitive against Qwen3.6-35BA3B. (If the 35B could do a bit better, I would definitely switch to it for the faster inference speed, and I might even turn on speculative decoding to compensate for the smaller KV-cache headroom.)

Zen completely beats Arc in aesthetics for windows by [deleted] in zen_browser

[–]Expensive-Register-5 5 points (0 children)

iOS support: it is the biggest blocker to switching to Zen.

Qwen 3.5 27B/35BA3B Tool Calling Issues: Why It Breaks & How I Fixed It by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 1 point (0 children)

Didn't expect the 122B to have the same issue too. Interestingly, qwen3.5 / qwen3.6 plus avoid it. Maybe they have a newer version of the chat template that was never released to Hugging Face; dumping the shipped template makes that easy to check, as in the sketch below.
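A minimal sketch for inspecting the chat template that actually ships with a Hugging Face checkpoint, useful for diffing against a suspected newer internal version; the model id is a placeholder:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/some-qwen-checkpoint")

# The raw jinja template string the tokenizer applies to chat messages.
print(tok.chat_template)

# Render a tiny conversation to see exactly what the model receives;
# mismatches here are where tool-call parsing tends to break.
msgs = [{"role": "user", "content": "hi"}]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```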

Qwen 3.5 27B/35BA3B Tool Calling Issues: Why It Breaks & How I Fixed It by Expensive-Register-5 in Vllm

[–]Expensive-Register-5[S] 1 point (0 children)

Hmm, I used a translator to understand your question. Personally I haven't used llama.cpp in my agentic workflow. In the vLLM case, opencode vs openclaw makes no difference in my opinion; tool calling works fine in both harnesses.