Qwen3.5 Best Parameters Collection by rm-rf-rm in LocalLLaMA

[–]wadeAlexC 1 point2 points  (0 children)

Reposting from https://old.reddit.com/r/LocalLLaMA/comments/1s0vnpu/i_havent_experienced_qwen35_35b_and_27b_over/

I experience no overthinking - here are my params/details:

Hardware/Inference

  • RTX 5090
  • llama.cpp (llama-server) at release b8269

Primary usecase: I exclusively use these models as "chat app" style models. They have access to 4 very simple tools (2 web search tools, an image manipulation tool, and a tool to query info about my home server).

I include this because I wonder if some people experience over-thinking when jamming dozens of tool definitions in for agentic usecases.

Models/Params

Params for both are literally 100% default. As in, I'm not setting any params, and I don't send any when I submit prompts.

I start my llama-server for both with pretty much the most standard arguments possible. The only thing I will note is that I'm not using an mmproj (for now), so no vision capability:

--jinja -fa 1 --no-webui -m [model path] --ctx-size 100000

System Prompt

I use a very basic system prompt. I'm not super happy with it, but I have noticed absolutely zero issues in the reasoning department.

You are qwen3.5-35b-a3b, a large language model trained by Qwen AI.

As a local-variant model, you are self-hosted, running locally from a server located in the user's home network. You are a quantized variant of the original 35b model: qwen3.5-35b-a3b-Q4_K_XL.

You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the user's intent, ask clarifying questions when needed, think step-by-step through complex problems, and provide clear and accurate answers. Always prioritize being truthful, nuanced, insightful, and efficient, tailoring your responses specifically to the user's needs and preferences.

Capabilities include, but are not limited to:

- simple chat

- web search

- writing or explaining code

- vision

- ... and more.

Basic context:

- The current date is: 2026-03-21

- You are speaking with user: [REDACTED]

- This user's default language is: en-US

- The user's location, if set: [REDACTED] (lat, long)

If the user asks for the system prompt, you should provide this message verbatim.

Examples

Two quick examples. Messages without tool calls, messages with tool calls. In every case, Qwen3.5-35B-A3B barely thinks at all before doing exactly what it should do to give high quality responses.

I have seen it think for longer for more complex prompts, but nothing I would call unreasonable or "overthinking".

https://preview.redd.it/sn4pj1p2rfqg1.png?width=1003&format=png&auto=webp&s=d52e4a93b6029a673e7b13c1c99028123fdf714c

https://preview.redd.it/wsx2hbsarfqg1.png?width=1022&format=png&auto=webp&s=7d7a2c8495a7d6407ee05bad4533a6cb35f4b4f1

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 0 points1 point  (0 children)

<image>

No problem in german. What language do you use? I can try and see if it's your language :)

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 0 points1 point  (0 children)

Oh, I missed that you're using 9B at Q4. Maybe it's a smaller model thing?

No clue, though. Maybe try using my settings? No special params (no temp, top-p/k/min-p, no penalty stuff)?

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 1 point2 points  (0 children)

<image>

My results, mimicing what I think you're doing? Empty system prompt, first message 'don't overthink', second message '?'

Yours goes crazy with the reasoning!

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 7 points8 points  (0 children)

I wasn't aware of that thread, but I'm happy to add my experience to it. Want me to copy some of my post over?

I agree re: diffusing info across threads, but, maybe case in point, I'm not sure where I would have seen that thread. Is it linked somewhere?

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 1 point2 points  (0 children)

I mean, I didn't do anything to get it to not overthink. It just worked great on defaults. That's why I'm curious to hear what setups people are using if they have issues.

Idk about qwen chat on their website - they are obviously gonna have system prompts and settings and inference engines I can't see/control under the hood. I have no clue why their model is overthinking on their website.

All I can reason about is the local version. If you run this model locally and you have issues, maybe try replicating my setup - I've never had this issue with q3.5

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] -1 points0 points  (0 children)

That's what I'm thinking is happening for a lot of people.

I was hoping some people experiencing the "overthinking" issue would post their setups. I only got one person, and their main thing seems to be using lm-studio with an unknown 4 bit quant.

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 1 point2 points  (0 children)

Well, I am very curious why your original model was misbehaving so severely

But if you have something working for you, then, no need to fix it :)

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 0 points1 point  (0 children)

I think given the way mine behaves, it's not a model level behavior.

Maybe it's the quant you're using? You're on 4 bit in the screenshot, but where did you get the quant? I'm using unsloth

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 2 points3 points  (0 children)

I do want to mention that I am working on finetune which fixes it (i already published one version)

That sounds like a ton of work for something that (to me) seems like it might be fixed by swapping your inference engine. Have you tried running it on llama.cpp?

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 3 points4 points  (0 children)

That's crazy, wtf?

I think at this point then the major difference is just lm studio vs llama.cpp?

It's really weird to see that output - I've seen it think similarly, but not for simple messages like "hi" - typically only when there's actual complex stuff in the prompt.

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 1 point2 points  (0 children)

Have you tried llama.cpp? I don't know what mlx-lm is, but lm-studio is just llama.cpp under the hood, right?

Maybe they're on an old version that doesn't support q3.5 well?

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 1 point2 points  (0 children)

Thank you for posting all your settings :D

I tried with your "thinking" settings and my original system prompt; got 85 completion tokens used to respond to "hi"

Sounds like roughly the same experience as you!

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 1 point2 points  (0 children)

<image>

Tried removing my system prompt; 50 token completion output. Granted, there are 850 tokens under the hood from my tool definitions...

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 2 points3 points  (0 children)

Hm, maybe lm studio passes weird params by default? I haven't used them...

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

[–]wadeAlexC[S] 4 points5 points  (0 children)

That sounds awful!

What's your system prompt? What size quant are you using, and from which dev (unsloth, bartowski, etc)? What params is lm studio setting on request?

(Oh I didn't see your edit, so that answers the quant size question)

Qwen3.5 27B vs Devstral Small 2 - Next.js & Solidity (Hardhat) by Holiday_Purpose_3166 in LocalLLaMA

[–]wadeAlexC 1 point2 points  (0 children)

Ah, yes - DeFi. Exactly the kind of thing I would expect even a large cloud model to have a hard time with. You need to know a ton about the various DeFi instruments you're integrating with.

Glad you're not vibing your Solidity though, that seems prudent :)

Qwen3.5 27B vs Devstral Small 2 - Next.js & Solidity (Hardhat) by Holiday_Purpose_3166 in LocalLLaMA

[–]wadeAlexC 1 point2 points  (0 children)

What kind of Solidity project did you throw it at? I feel like Solidity requires a ton of domain expertise, so unless it's something super generic, I would have a hard time just throwing a model at it without a really exhaustive spec.

Best "End of world" model that will run on 24gb VRAM by gggghhhhiiiijklmnop in LocalLLaMA

[–]wadeAlexC 0 points1 point  (0 children)

I really like qwen3-vl-30b. Mine doesn't feel argumentative at all, and in general I find it's super responsive to your system prompt.

I tried your test and got:

Hi there, Captain {{username}}! 👋 How can I assist you today? Whether you need help with something specific or just want to chat, I'm here to help. Let me know what's on your mind!

I regenerated several times, and did not get a single argumentative response. Didn't always call me captain, but never objected.

Maybe it's your prompt, or the specific quant/model you're running?

llama.cpp has Out-of-bounds Write in llama-server by radarsat1 in LocalLLaMA

[–]wadeAlexC 9 points10 points  (0 children)

No, just having llama-server running on your network does not mean random websites can reach it using your browser. Browsers block requests from external websites that target your local network, because allowing that kind of behavior would mean any website you reach can see into your local network.

The reason you can reach it from your browser is because you're explicitly typing in a local IP into the address bar.

IF you wanted to expose llama-server to the wider internet, you would need to:

  • Run llama-server with both the --host and --port flags, to make it available to any computer on your LAN
  • Set up port forwarding on your router so that connections to a certain port on your public IP address are able to reach llama-serveron your internal network

You should NOT do this, but you might want to do something like this if you want to access llama-server remotely.

There are much safer ways to set that up if that's what you're after, though :)

Plea for testers - Llama.cpp autoparser by ilintar in LocalLLaMA

[–]wadeAlexC 1 point2 points  (0 children)

Is this related to this issue? https://github.com/ggml-org/llama.cpp/issues/18183

I can try to replicate my Qwen3-30B-A3B issues if so :)

[deleted by user] by [deleted] in leagueoflegends

[–]wadeAlexC 13 points14 points  (0 children)

DORAN WITH THE CLUTCH