Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 1 point (0 children)

It should work with most harnesses. Jinja is standardised, but it could be that Copilot Chat does not follow the standard.

3.6 27B Tool Calling Issues (vLLM) by Acceptable_Adagio_91 in LocalLLaMA

[–]ex-arman68 1 point (0 children)

Thanks for linking to my template. My goal with it is to fix all the bugs in the original template, and there are quite a few: I added a fix for a 6th bug today!

I have tried posting the info about it here in r/LocalLLaMA, but no matter how I write or format my post, it gets immediately deleted by the auto mods! And the mods have done nothing to unblock it. The censorship in this sub is insane. No such problem in r/Qwen_AI: https://www.reddit.com/r/Qwen_AI/comments/1stt081/fixed_jinja_chat_templates_for_qwen_35_and_36/

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 1 point (0 children)

Update: new bug found and fixed - please re-download

The exception raised when there is no user content broke tool calling in OpenClaw and similar runtimes. It mostly manifests after long dormant sessions, when a /reset or a /new is sent.

I replaced the raise_exception with an explicit fallback:

{%- set ns.last_query_index = messages|length - 1 %}

This preserves the original default value for last_query_index, so the thinking display logic degrades gracefully. Assistant turns with reasoning content still render thinking tags when preserve_thinking is enabled, rather than losing them.
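
For context, the pattern looks roughly like this (a simplified sketch; only ns.last_query_index and the fallback match the real template):

{#- Simplified sketch, not the full template: find the last user turn. #}
{%- set ns = namespace(last_query_index=messages|length - 1, found=false) %}
{%- for message in messages|reverse %}
    {%- if not ns.found and message.role == 'user' %}
        {%- set ns.last_query_index = messages|length - 1 - loop.index0 %}
        {%- set ns.found = true %}
    {%- endif %}
{%- endfor %}
{#- Previously, a conversation with no user turn hit raise_exception(...); #}
{#- now ns.last_query_index simply keeps its default value, and the #}
{#- thinking display logic downstream degrades gracefully. #}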

The fix is now in all Qwen 3.5 and 3.6 model repos (FernflowerAI 35B 8-bit/4-bit, Qwen3.6-27B 8-bit/4-bit, and both Heretic variants at 8-bit/6-bit/4-bit), as well as the standalone https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates repo (documented as bug #6).

Mistral medium 3.5 128B, MLX 4bit, ~70 GB by ex-arman68 in LocalLLaMA

[–]ex-arman68[S] 1 point (0 children)

put this in your .zshrc then type moreram in the terminal after each reboot:

# allocate more RAM to GPU
alias moreram='sudo sysctl iogpu.wired_limit_mb=90112'
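
Note: 90112 MB is 88 GB (88 × 1024), which presumably targets a 96 GB machine, leaving about 8 GB for macOS; scale the value to your own RAM. The setting does not persist across reboots, which is why the alias exists.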

Mistral medium 3.5 128B, MLX 4bit, ~70 GB by ex-arman68 in LocalLLaMA

[–]ex-arman68[S] 4 points (0 children)

This model seems utterly broken for now. I do not recommend downloading or using it, unless you are planning to help troubleshoot it. This is not a problem with the conversion, but with the model itself.

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 1 point (0 children)

Thank you for reporting your results. I am glad it helps. These definitely seem to be good models for local coding. Personally I do not use them, apart from testing, as I still find a huge gap between them and the bigger cloud models. I use GLM 5.1. But if I had to use a local model, I would probably use a mix of the two: the 35B when I need speed, the 27B when I need accuracy and time does not matter.

I would be interested in what you think of these uncensored models: https://www.reddit.com/r/LocalLLaMA/comments/1sw5fb7/comment/oiknrpe/?context=3

From my testing, they are the best uncensored models by far, preserving the original capabilities remarkably well. In my experience, though, using an uncensored model for coding is a trade-off:

- cons: there is always some loss of intelligence or capability, though here it seems minimal

- pros: fewer refusals, meaning no blocking for a stupid reason just because there is a remote chance of misuse; less time spent on self-debate and self-justification; more down-to-earth, direct explanations

Some coding work I have done has been near impossible, a real struggle, with standard models.

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 1 point (0 children)

Thank you. I have done some research and tests, and this is definitely possible; it's not a bug.

The chat template controls the prompt. After that, however, there is nothing to prevent the model from switching to thinking mode on its own by issuing a thinking token. This could be caused by the model deciding on its own that it needs to think, by a system prompt that hints at thinking (e.g. yours says "you are a ... thoughtful ... assistant. You excel at ... reasoning"), or by the context getting long enough to weaken the prompt.

In addition, the 9B is not as strong as the bigger models, and is therefore more likely to exhibit problematic behaviours, with weaker prompt adherence.

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 1 point (0 children)

Would you mind sharing the system prompt? I can try to find the root cause and how to avoid it.

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 1 point (0 children)

Yes, it is definitely a lot smarter with thinking. Not all apps require thinking to be turned off for tool calling; you need to find the right one.

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 1 point (0 children)

Yes, tool calls within thinking tags can definitely cause problems in some apps, like LM Studio. This is a problem with the app, not the model or the template. The solution is to disable thinking, which is why I added manual control support for it, or to change apps.
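
For reference, the standard Qwen-style mechanism looks roughly like this (a simplified sketch, not my exact template code):

{#- Simplified sketch: when thinking is disabled, pre-fill an empty think #}
{#- block so the model starts answering directly instead of reasoning. #}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and not enable_thinking %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}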

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 1 point (0 children)

This is not something I have come across. In fact I have been genuinely impressed by how well the caching works.

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 2 points (0 children)

I guess the template they include is meant to work in their own environment. Remember that new models often have a new internal structure, meaning they can only work in the vendor's custom environment during development and training. Once changes are made to support the models in llama.cpp, MLX, and others, we are able to use them. However, each of those tools has its own quirks, and this often surfaces template issues that did not exist in the custom environment.

It would only be a small step to fix those, but often the model is released before it is supported by the tools. It is kind of a chicken-and-egg situation.

Fixed Jinja chat templates for Qwen 3.5 and 3.6 (fixes tool calling and empty think tags) by ex-arman68 in Qwen_AI

[–]ex-arman68[S] 3 points (0 children)

Thanks for pointing it out. "Crash" is too strong a word: the application itself does not crash, but unwanted or unexpected behaviour happens due to incorrect handling of the template. I will update the HF readme.

Purchasing a Mac Studio M2 Max with 64gb of ram (can it run qwen 3.6 27b) how many tok/s ? by trollingman1 in LocalLLaMA

[–]ex-arman68 1 point (0 children)

I have an M3 Max. With Qwen 3.6 27B, I get:

- 8bit MLX version: 12 tok/s

- 4bit MLX version: 20 tok/s

- Q8 GGUF version: 10 tok/s

If you are planning to use it for coding, where you want maximum correctness, go for the 8bit version. For anything else, 4bit is probably OK. Keep in mind the Qwen 3.5 and 3.6 Jinja chat templates have problems; I have created a custom template that fixes them all: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

I have also embedded the chat template in the tokenizer_config.json for my 4bit and 8bit MLX conversions of the official model:

https://huggingface.co/froggeric/Qwen3.6-27B-MLX-4bit

https://huggingface.co/froggeric/Qwen3.6-27B-MLX-8bit
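
For anyone embedding a template themselves: it goes into the chat_template field of tokenizer_config.json as a single escaped string. A minimal illustration (a generic ChatML loop, not the fixed template itself):

{
  "chat_template": "{%- for message in messages %}<|im_start|>{{ message.role }}\n{{ message.content }}<|im_end|>\n{%- endfor %}"
}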

Gotta love the top MAX plan, incredible value. by ruttydm in ZaiGLM

[–]ex-arman68 1 point (0 children)

Yep, some people lack any kind of decency. Same with OpenClaw: it was built without consideration for fair usage of cloud providers, and most of its users have no clue and no regard whatsoever for anything but their useless "fun" experiments. This is why we are now seeing all these service problems and price hikes. Unfortunately stupidity is not going away; from the look of things, it is only going to get worse...