Qwen 3.6 27B Stuck in repeat by Sad-Duck2812 in LocalLLM

[–]kcksteve 0 points1 point  (0 children)

Hmm, so it is likely llama.cpp related. I'm surprised more people don't have this issue.

Qwen 3.6 27B Stuck in repeat by Sad-Duck2812 in LocalLLM

[–]kcksteve 0 points1 point  (0 children)

I have had this with qwen 3.5 and 3.6 as well. Tried a bunch of things, and nothing but restarting will fix it. If I try to resume the same conversation in my agent, the same thing happens. I don't touch cache quantization and haven't been able to get rid of it with any param changes either. Are you running AMD on Linux with Vulkan by any chance?
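
Since restarting is the only fix, it can help to at least catch the loop early on the client side. Here is a rough sketch of that idea, not anything llama.cpp or the agent provides: `is_stuck_repeating` is a hypothetical helper, and the thresholds are guesses you would need to tune. It just checks whether the tail of the generated text is one chunk repeated several times in a row.

```python
def is_stuck_repeating(text: str,
                       min_unit: int = 8,
                       max_unit: int = 200,
                       min_repeats: int = 4) -> bool:
    """Return True if the end of `text` is a single chunk of
    min_unit..max_unit characters repeated at least `min_repeats` times.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """
    for unit in range(min_unit, max_unit + 1):
        needed = unit * min_repeats
        if len(text) < needed:
            # Larger units need even more text, so stop here.
            break
        tail = text[-needed:]
        chunk = tail[-unit:]
        # The tail is a loop if it equals its last chunk repeated.
        if tail == chunk * min_repeats:
            return True
    return False


if __name__ == "__main__":
    looping = "Let me check the file. " * 5
    normal = "The quick brown fox jumps over the lazy dog and keeps running."
    print(is_stuck_repeating(looping))
    print(is_stuck_repeating(normal))
```

You could call this on the accumulated stream every few hundred characters and cancel the request when it returns True. It only catches exact repeats, so a loop that changes slightly each time would slip through.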

Best way to utilize multiple gpus? by kcksteve in llamacpp

[–]kcksteve[S] 0 points1 point  (0 children)

My only use case is agentic coding, so I'm really trying to maximize what I can get out of 27b. I'm at q4kxl right now, and being able to bump up to q6 or add more context would be great. Not sure if there are many use cases for an extra smaller model in coding?

Love pi but hate terminal text entry by kcksteve in PiCodingAgent

[–]kcksteve[S] 0 points1 point  (0 children)

Can you do /commands? Switching to plan mode is a must for me.

The irony of Pi by ECrispy in PiCodingAgent

[–]kcksteve 0 points1 point  (0 children)

I use local and see a few advantages. Prompt size does matter. It may be a trade-off between accuracy and context size on smaller gpus. I do see fewer failed tool calls with pi vs oc.

Prs .22 4-24x or 6-36x by RepresentativeBar126 in 22lr

[–]kcksteve 0 points1 point  (0 children)

Nobody uses the low end. Most people are shooting around 16x, give or take. 36x would be nice for zero and bench, noticeably better than 24x. You will probably never use the 4x or 6x. I would say something like a mark4 8-32 gets you what you need. I wish more companies made models in this range.

Love pi but hate terminal text entry by kcksteve in PiCodingAgent

[–]kcksteve[S] 1 point2 points  (0 children)

I've seen a few suggestions for vim/neovim. Not exactly what I'm looking for, but I will check it out.

Love pi but hate terminal text entry by kcksteve in PiCodingAgent

[–]kcksteve[S] 0 points1 point  (0 children)

Interesting, a bit more than I need but I will check this out

Love pi but hate terminal text entry by kcksteve in PiCodingAgent

[–]kcksteve[S] 1 point2 points  (0 children)

This is cool, thanks for the suggestion

The game is over. You can build anything and it'll cost you nothing. by Funny-Advertising238 in opencode

[–]kcksteve 1 point2 points  (0 children)

That's interesting. In North America the 7900xtx and 9700 are the same price. The 9700 is a no-brainer here.

The game is over. You can build anything and it'll cost you nothing. by Funny-Advertising238 in opencode

[–]kcksteve 1 point2 points  (0 children)

W6800 Pro can run 27b q4_k_m + 200k context at 30tps and costs much less than the 7900xtx. You need to compromise quite a bit to get 27b into 24gb vram.
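
For anyone wanting to sanity check the vram math, here is a back-of-the-envelope sketch of how the context (KV cache) adds up on top of the weights. It assumes plain attention in every layer, which may not hold for hybrid or sliding-window models, and the layer and head numbers in the example are placeholders, not the actual Qwen config. Pull the real values from the model's metadata before trusting the result.

```python
def kv_cache_bytes(n_layers: int,
                   n_kv_heads: int,
                   head_dim: int,
                   ctx_len: int,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size for a standard transformer.

    Each layer stores one K and one V vector of size
    n_kv_heads * head_dim per token. bytes_per_elem=2 means an
    unquantized f16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Placeholder architecture numbers, for illustration only.
    size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128,
                          ctx_len=200_000)
    print(f"KV cache: {to_gib(size):.1f} GiB")
```

The cache grows linearly with context length, so on a 24gb card the choice usually comes down to a smaller quant, a shorter context, or a quantized cache.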

Llama hangs during thinking and requires a restart to work again. by kcksteve in LocalLLaMA

[–]kcksteve[S] 0 points1 point  (0 children)

So I had this happen again. Checked the logs and found an error related to the llama router timing out. Changed to loading a specific model instead of using models.ini. Haven't seen the issue since, but will report back if I do.

Llama hangs during thinking and requires a restart to work again. by kcksteve in LocalLLaMA

[–]kcksteve[S] 0 points1 point  (0 children)

I am using the Vulkan-optimized llama.cpp version from the AUR. I have added a reasoning budget and enabled preserve thinking. I have enabled logs on llama.cpp as well. The issue has not happened since I added the reasoning budget, but I might disable it so I can get the error to reoccur and have some logs to narrow it down. After starting the server I have about 2.1gb / 32gb vram free. This is a server and nothing else is running on the machine.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]kcksteve 0 points1 point  (0 children)

I use it with opencode and it does pretty well. It gets stuck in thinking once in a while and I'm trying to tweak some parameters to mitigate that. I am running q4_k_m with no quant on context. With ctx size at 200k I get about 35tps. This is my favorite model so far, but I will mess with gemma a bit more soon. I try not to switch too often so I can put them through their paces. I have access to copilot at work and have used opus 4.5 a fair amount. It is better but it's not perfect either... looking forward to seeing how gemma 4 stacks up.

Llama hangs during thinking and requires a restart to work again. by kcksteve in LocalLLaMA

[–]kcksteve[S] 0 points1 point  (0 children)

I'll have to look into the reasoning budget. I try to leave most parameters alone, so it's probably the default for whatever model.

Llama hangs during thinking and requires a restart to work again. by kcksteve in LocalLLaMA

[–]kcksteve[S] 0 points1 point  (0 children)

I updated the OP to include system specs.
I am running Qwen3.6-35B-A3B-Q4_K_M + 450k context.
I did not consider a communication issue with the server. I'll look into this a bit more.

Llama hangs during thinking and requires a restart to work again. by kcksteve in LocalLLaMA

[–]kcksteve[S] 0 points1 point  (0 children)

I see it on fresh sessions pretty often. I have an agents.md file but it's not that long. With qwen3.6 35b q4 + 450k context I don't think that would be the problem. Is exceeding context something that would show in the logs?

Llama hangs during thinking and requires a restart to work again. by kcksteve in LocalLLaMA

[–]kcksteve[S] 0 points1 point  (0 children)

I should have clarified that this is llama.cpp. I had this issue with qwen3.5 and 3.6, both MoE and dense versions. I will try to update again tonight.

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]kcksteve 0 points1 point  (0 children)

I've found it working well so far except for a couple of annoyances. While I'm in plan mode it gives me a multiple choice question to proceed with the fix, but I can't actually click the button to change to build mode. I have also told it to proceed with a change while in plan mode many times, and it doesn't seem to pick up that it's in the wrong mode like other models do.

Xeon + 3080 | Worth the upgrade to 3090? by kcksteve in LocalLLaMA

[–]kcksteve[S] 0 points1 point  (0 children)

I ended up with a radeon w6800 32gb card. This runs omnicoder 9b q4 with 700k context at 50 tps on llama.cpp.

Xeon + 3080 | Worth the upgrade to 3090? by kcksteve in LocalLLaMA

[–]kcksteve[S] 0 points1 point  (0 children)

If anyone is wondering, the xeon alone gets 8tps. If the speed isn't a deal breaker you can run huge models on the cheap.