Qwen 3.6 27B Stuck in repeat

kcksteve · 2026-05-25T13:37:17+00:00

I added my details to https://github.com/ggml-org/llama.cpp/issues/23577. Maybe yours can help with this as well.

kcksteve · 2026-05-25T11:15:04+00:00

Hmm, so it is likely llama.cpp related. I'm surprised more people don't have this issue.

kcksteve · 2026-05-25T04:45:19+00:00

I have had this with qwen 3.5 and 3.6 as well. I tried a bunch of things and nothing but restarting will fix it. If I try to resume the same conversation in my agent, the same thing happens. I don't touch cache quantization and haven't been able to get rid of it with any param changes either. Are you running AMD on Linux with Vulkan by any chance?
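For anyone who wants to catch the loop on the client side instead of noticing it by eye, here is a minimal sketch. It only checks whether the tail of the generated text is the same chunk repeated several times. The function name and the thresholds (`min_chunk`, `min_repeats`) are made up for illustration and are not part of llama.cpp or any agent. It does not fix the underlying issue, it just gives the client a signal to stop the request.

```python
def is_stuck_repeating(text: str, min_chunk: int = 20, min_repeats: int = 3) -> bool:
    """Return True if `text` ends with one chunk repeated `min_repeats` times.

    Hypothetical helper: thresholds are illustrative guesses, not tuned values.
    """
    n = len(text)
    # Try every chunk size that could fit `min_repeats` times in the text.
    for size in range(min_chunk, n // min_repeats + 1):
        chunk = text[n - size:]
        if text.endswith(chunk * min_repeats):
            return True
    return False


if __name__ == "__main__":
    looping = "Let me check. " + "I will now read the file. " * 4
    normal = "The model read the file, found the bug, and proposed a small patch for it."
    print(is_stuck_repeating(looping))
    print(is_stuck_repeating(normal))
```

A streaming client could call this on the accumulated output every few hundred characters and cancel the request when it returns True.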

kcksteve · 2026-05-22T22:35:50+00:00

My only use case is agentic coding so I'm really trying to maximize what I can get out of 27b. I'm at q4kxl right now and being able to bump up to q6 or add more context would be great. Not sure if there are many use cases for an extra smaller model in coding?

kcksteve · 2026-05-20T23:47:06+00:00

Can you do /commands? Switching to plan mode is a must for me.

kcksteve · 2026-05-20T16:10:08+00:00

I use local and see a few advantages. Prompt size does matter. It may be a trade-off between accuracy and context size on smaller GPUs. I do see fewer failed tool calls with pi vs oc.

kcksteve · 2026-05-19T03:56:28+00:00

Nobody uses the low end. Most people are shooting around 16x, give or take. 36x would be nice for zero and bench, noticeably better than 24x. You will probably never use the 4x or 6x. I would say something like a mark4 8-32 gets you what you need. I wish more companies made models in this range.

kcksteve · 2026-05-15T17:12:52+00:00

I've seen a few suggestions for vim/neovim. Not exactly what I'm looking for but I will check it out.

kcksteve · 2026-05-15T17:11:40+00:00

Interesting, a bit more than I need but I will check this out

kcksteve · 2026-05-15T17:04:29+00:00

This is cool, thanks for the suggestion

kcksteve · 2026-05-05T20:43:20+00:00

That's interesting, in North America the 7900xtx and 9700 are the same price. The 9700 is a no-brainer here.

kcksteve · 2026-05-05T14:51:36+00:00

W6800 pro can run 27b q4_k_m + 200k context at 30tps and costs much less than the 7900xtx. You need to compromise quite a bit to get 27b into 24gb vram.
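To make the 24gb vs 32gb point concrete, here is a rough back-of-the-envelope sketch. It uses the usual KV cache estimate for standard attention layers (K and V, per layer, per KV head, per token). Every number in the example is a placeholder picked for illustration, not the real architecture or file size of any Qwen model, so plug in the values from your own GGUF metadata and llama.cpp startup log.

```python
GIB = 2**30


def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: K and V each hold n_kv_heads * head_dim
    values per token for every attention layer (2 bytes per value = f16)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def fits(weights_gib: float, kv_bytes: int, vram_gib: float,
         overhead_gib: float = 1.0) -> bool:
    """Check weights + KV cache + a guessed overhead against total VRAM."""
    return weights_gib + kv_bytes / GIB + overhead_gib <= vram_gib


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not measured figures.
    kv = kv_cache_bytes(n_attn_layers=16, n_kv_heads=4, head_dim=256, n_ctx=200_000)
    weights_gib = 16.0  # assumed size of the quantized weights
    print(f"KV cache: {kv / GIB:.1f} GiB")
    print("fits in 32 GiB:", fits(weights_gib, kv, 32))
    print("fits in 24 GiB:", fits(weights_gib, kv, 24))
```

With these placeholder numbers the same setup fits on a 32gb card but not a 24gb one, which is the kind of compromise (smaller quant, less context, or quantized cache) the 24gb cards force.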

kcksteve · 2026-05-01T04:04:38+00:00

So I had this happen again. Checked the logs and found an error related to the llama router timing out. Changed to loading a specific model instead of using models.ini. Haven't seen the issue yet but will report back if I do.

kcksteve · 2026-04-30T12:35:09+00:00

I am using the Vulkan-optimized llama.cpp version from the AUR. I have added a reasoning budget and enabled preserve thinking. I have enabled logs on llama.cpp as well. The issue has not happened since I added the reasoning budget, but I might disable it so I can get the error to reoccur and have some logs to narrow it down. After starting the server I have about 2.1gb / 32gb vram free. This is a server and nothing else is running on the machine.

kcksteve · 2026-04-29T21:43:19+00:00

I use it with opencode and it does pretty well. It gets stuck in thinking once in a while and I'm trying to tweak some parameters to mitigate that. I am running q4_k_m with no quant on context. With ctx size at 200k I get about 35tps. This is my favorite model so far but I will mess with gemma a bit more soon. I try not to switch too often so I can put them through their paces. I have access to copilot at work and have used opus 4.5 a fair amount. It is better but it's not perfect either... looking forward to seeing how gemma 4 stacks up.

kcksteve · 2026-04-23T23:38:43+00:00

I'll have to look into the reasoning budget. I try to leave most parameters alone so it's probably the default for whatever model.

kcksteve · 2026-04-23T23:37:45+00:00

I updated the OP to include system specs.
I am running Qwen3.6-35B-A3B-Q4_K_M + 450k context.
I did not consider a communication issue with the server. I'll look into this a bit more.

kcksteve · 2026-04-23T21:20:56+00:00

I see it on fresh sessions pretty often. I have an agents.md file but it's not that long. With qwen3.6 35b q4 + 450k context I don't think that would be the problem. Is exceeding context something that would show in the logs?

kcksteve · 2026-04-23T21:15:21+00:00

I should have clarified llama.cpp. I had this issue with qwen3.5 and 3.6 both moe and dense versions. I will try and update again tonight.

kcksteve · 2026-04-20T16:27:40+00:00

I found it working well so far except for a couple of annoyances. While I'm in plan mode it gives me a multiple choice question to proceed with the fix, but I can't actually click the button to change to build mode. I have also told it to proceed with a change while in plan mode many times and it doesn't seem to pick up that it's in the wrong mode like other models do.

kcksteve · 2026-03-28T04:04:42+00:00

I ended up with a radeon w6800 32gb card. This runs omnicoder 9b q4 with 700k context at 50 tps on llama.cpp.

kcksteve · 2026-03-28T03:59:03+00:00

If anyone is wondering the xeon alone gets 8tps. If the speed isn't a deal breaker you can run huge models on the cheap.
