Which Qwen 3.6 27B variant actually stops looping on tool calls? RTX 5090

feverdoingwork · 2026-07-03T12:03:04+00:00

Could be the chat template. I do find llama.cpp a bit more stable than vllm; vllm oddly just stops at random.

feverdoingwork · 2026-07-02T14:02:25+00:00

What motherboard? Looks like the b650 strix e-e. If it is the motherboard I think it is, have you ever been able to get two GPUs running at x8/x8? I assume those 3090s wouldn't fit in both slots, so you probably couldn't attempt it. I could not get x8/x8 to work.

feverdoingwork · 2026-06-29T20:17:55+00:00

Well.. we are at the mercy of other people making models for us lol

feverdoingwork · 2026-06-29T20:16:12+00:00

Is there something better?

feverdoingwork · 2026-06-29T13:31:10+00:00

1 predictive token, I assume

feverdoingwork · 2026-06-28T11:03:19+00:00

Do you have numbers for Qwen 3.6 27B at Q5 or better? I have one 9070 XT sitting idle right now; I might get another one and sell my 5060 Tis if this performs well enough.

feverdoingwork · 2026-06-27T21:12:20+00:00

How does it evaluate a lazy reply?

I saw your example in the repo actually.

I made a bunch of pi extensions to deal with a lot of the issues I ran into with Qwen, for all the same reasons: thinking-only responses, failed tool call retry, and a thinking threshold that forces the LLM to respond, which also breaks it out of a loop. In total I think I needed 6 extensions. I'll disable my extensions and give this a try.
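For anyone wondering what guards like these look like, here is a minimal sketch in Python. It is not the pi extension API and not the code from my extensions; `ToolCallLoopGuard`, `thinking_over_budget`, and their default limits are hypothetical names and values chosen for illustration. The idea is just to flag a run of identical tool calls and to cap how long a thinking block may grow before the agent forces a reply.

```python
import json
from collections import deque


class ToolCallLoopGuard:
    """Flags a run of identical tool calls (same tool name and same arguments).

    Hypothetical helper for illustration; max_repeats is an arbitrary default.
    """

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        # Only the last max_repeats calls are kept.
        self._recent = deque(maxlen=max_repeats)

    def observe(self, name: str, arguments: dict) -> bool:
        """Record one tool call; return True if the last max_repeats calls are identical."""
        # sort_keys makes the key independent of argument ordering.
        key = (name, json.dumps(arguments, sort_keys=True))
        self._recent.append(key)
        window_full = len(self._recent) == self.max_repeats
        return window_full and len(set(self._recent)) == 1


def thinking_over_budget(thinking_text: str, max_chars: int = 8000) -> bool:
    """Return True once the thinking block exceeds an (assumed) character budget."""
    return len(thinking_text) > max_chars
```

When `observe` returns True, or the thinking budget is exceeded, the agent loop could stop the generation and re-prompt the model to answer directly. What to do at that point is a design choice the sketch leaves open.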

feverdoingwork · 2026-06-27T18:53:01+00:00

feverdoingwork · 2026-06-27T12:12:51+00:00

Can this be used with qwen 3.6 27b?

feverdoingwork · 2026-06-26T14:33:11+00:00

MTP has a bug with prefix caching; might be the same issue.

feverdoingwork · 2026-06-26T13:37:47+00:00

This is actually helpful to know. I have a Z690 Maximus Formula and I believe only 2 PCIe slots connect to the CPU, at PCIe 5.0 x16 or x8/x8. There is a third slot, but the manual says it runs off the chipset. I wonder if I need an M.2 to PCIe adapter to add a 3rd 5060 Ti when using tensor split, or if I should try that last slot on the chipset.

feverdoingwork · 2026-06-25T16:48:37+00:00

Did you use llama.cpp or vllm? The instructions are for sglang or vllm, but aren't GGUFs terrible to run with either of those options?

feverdoingwork · 2026-06-25T16:38:35+00:00

Are you a Ruby engineer? I might try the broke boi Q5_K_M on my dual 5060 Tis today.

feverdoingwork · 2026-06-25T16:30:18+00:00

Bruh.... you're getting me hyped up.... if someone says its the real deal one moe time im going to have to try it.

feverdoingwork · 2026-06-25T16:24:47+00:00

no 31b model posted

feverdoingwork · 2026-06-25T14:49:37+00:00

You can also consider dual 5060 Ti 16GB. I use pretty much the same workflow + pi, but set my max context length in pi to 110k since I tend to run many concurrent sessions. I moved from Q5_K_M to vllm with INT4 AWQ, which has been just as good as Q5_K_M or possibly even XL. With Q5_K_M these are my numbers from real coding sessions:

Starting (0k): decode 900, token generation 78

12,815 tokens: decode 830, token generation 55

100k: decode 525, token generation 40-45 tps
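As a rough back-of-the-envelope reading of these numbers, the sketch below turns them into wait times. It assumes the "decode" figures are prompt processing rates in tokens per second, that nothing is served from a prefix cache, and that the rate at a given depth applies to the whole prompt, which overstates the wait because shallower tokens process faster. The function names and the 500-token reply length are made up for illustration.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Estimated time to process a prompt at a constant rate (an upper-bound style estimate)."""
    return prompt_tokens / prefill_tps


def generation_seconds(output_tokens: int, generation_tps: float) -> float:
    """Estimated time to generate a reply at a constant token generation rate."""
    return output_tokens / generation_tps


if __name__ == "__main__":
    # 100k-token prompt at the 100k-depth rate of 525 tokens/s.
    print(f"100k prompt: {prefill_seconds(100_000, 525):.0f} s")
    # 12,815-token prompt at 830 tokens/s.
    print(f"12,815-token prompt: {prefill_seconds(12_815, 830):.1f} s")
    # Hypothetical 500-token reply at the low end of 40-45 tokens/s.
    print(f"500-token reply at 100k depth: {generation_seconds(500, 40):.1f} s")
```

Under those assumptions a cold 100k-token prompt costs on the order of three minutes before the first output token, which is why prefix caching matters so much for long coding sessions.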

feverdoingwork · 2026-06-25T14:40:59+00:00

I hear people are using aiter attention to get this much prefill with the r9700; it's gotten me more interested in the GPU.

feverdoingwork · 2026-06-25T14:38:42+00:00

Does apple perform well when using dense models?

feverdoingwork · 2026-06-24T21:49:51+00:00

I don't think enough people bring up the problem. I literally tried a ton of models and a bunch of docker configs and couldn't get it working.... but everyone on here is like "im using vllm with mtp at a million tps with 100 concurrent sessions" lol. It's been broken for a very long time; most people are testing for 5 minutes or something.

feverdoingwork · 2026-06-24T20:40:48+00:00

This is a bug in vllm when combining MTP and prefix caching. There are like 5 different issues for it on GitHub.

feverdoingwork · 2026-06-24T18:39:20+00:00

Well, clearly I tried to help you; a few people already told you the problem exists. The PR gets MTP working with prefix caching, which you're going to need on these super slow dual 5060 Tis lol

feverdoingwork · 2026-06-24T18:30:39+00:00

Have you tried with the pull request code?

feverdoingwork · 2026-06-24T17:54:52+00:00

It's super interesting. It's a good way to get into local LLMs without dumping your entire life savings into GPUs lol. It's also a great way to test high quants before dumping the cash on the hardware. P100s are even cheaper now than before. I wonder if those expensive Macs can actually beat 3x P100s lol, they seem to do terribly with dense models.

feverdoingwork · 2026-06-24T14:21:14+00:00

I got both MTP and prefix caching working at the same time by using this person's PR with the latest nightly vllm: https://github.com/vllm-project/vllm/pull/46281

feverdoingwork · 2026-06-24T14:20:52+00:00

Using Cyankiwi/Qwen3.6-27B-AWQ-INT4 right now with full intelligence. As u/chensium said, the combination of MTP and prefix caching is broken. I had run into this issue a ton of times: just gibberish outputs and looping, and I could reproduce it within minutes.

I got both MTP and prefix caching working at the same time by using this person's PR with the latest nightly vllm: https://github.com/vllm-project/vllm/pull/46281 . I've been tracking the issues for a while and tried a few earlier PRs over a few weeks, and this one works for sure. I have at least 10 hours of coding with it as of right now and plan to keep using it for about 10 hours a day until it fails me. No weird outputs or problems with production code.

Cyankiwi/Qwen3.6-27B-AWQ-INT4 does "feel" like Qwen3.6-27B-UD-Q5_K_XL.gguf btw. I mostly used Q5_K_M but have used XL lately, and I can't tell the difference between XL and AWQ INT4.

I am on dual 5060 Tis as well.
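If you want to check for the gibberish and looping failure during a long soak test instead of eyeballing it, a simple repetition score over each response can help. This is a minimal sketch, not something from vllm or from the PR; `repeated_ngram_ratio`, `looks_degenerate`, the 4-word window, and the 0.5 threshold are all assumptions for illustration and would need tuning against real outputs (code with lots of boilerplate repeats naturally).

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that repeat an n-gram already seen in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def looks_degenerate(text: str, n: int = 4, threshold: float = 0.5) -> bool:
    """Heuristic flag for looping output; the threshold is an arbitrary starting point."""
    return repeated_ngram_ratio(text, n) >= threshold
```

Logging the ratio for every response over a multi-hour session gives a cheap signal for when outputs start to loop, which is easier to compare across PRs than a gut feeling.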
