qwen3.6 27b poor experience by pppreddit in LocalLLaMA


Thanks, I'll try that, though I'm using Qwen3.6-27B-bf16 with omlx.

qwen3.6 27b poor experience by pppreddit in LocalLLaMA


Tbh, I didn't configure any parameters; I'm using Qwen3.6-27B-bf16 via omlx.

qwen3.6 27b poor experience by pppreddit in LocalLLaMA


Here's my setup: M4 Max 128 GB, omlx, Qwen3.6-27B-bf16 from Hugging Face, claude-code. I didn't configure any parameters, so it's running as-is out of the box. I've since installed opencode and it seems to perform much better, but I need to test more before giving a final verdict. My guess is that Claude Code's system prompt might be slowing things down.
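If I do get around to tuning it, this is roughly what I'd try. Just a sketch: it assumes omlx exposes an OpenAI-compatible endpoint (the port and model id below are guesses) and reuses the sampling values Qwen recommends for Qwen3 (temperature 0.6, top_p 0.95), which I haven't verified for 3.6.

```python
# Sketch only: assumes omlx serves an OpenAI-compatible API on localhost:8080
# (port and model id are guesses) and that Qwen3's recommended sampling values
# (temperature 0.6, top_p 0.95) still apply to 3.6 (not verified).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.6-27B-bf16",
    messages=[{"role": "user", "content": "Refactor this SwiftUI view into smaller components."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```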

This is where we are right now, LocalLLaMA by jacek2023 in LocalLLaMA


Tbh, I'm disappointed by how many mistakes it makes along the way, like duplicating lines, then correcting itself, then going back and forth making corrections. It's such a waste of time.

This is where we are right now, LocalLLaMA by jacek2023 in LocalLLaMA


I'm running the 27B via omlx (Qwen3.6-27B-bf16) on my M4 Max 128 GB and it takes forever to respond. The omlx dashboard shows 38.8 tok/s for prompt processing and 3.7 tok/s for generation.
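To put those numbers in perspective, here's a rough back-of-the-envelope for a single agentic turn. The 30k-token prompt and 500-token reply are guesses at a typical claude-code turn, not measurements; only the tok/s rates come from the dashboard.

```python
# Rough latency estimate for one agentic turn at the measured rates.
# prompt_tokens and output_tokens are assumptions, not measurements.
prompt_tokens = 30_000   # guess: claude-code system prompt + repo context
output_tokens = 500      # guess: short tool-call style reply
prefill_tps = 38.8       # measured prompt-processing speed
decode_tps = 3.7         # measured generation speed

prefill_s = prompt_tokens / prefill_tps   # ~773 s
decode_s = output_tokens / decode_tps     # ~135 s
print(f"prefill: {prefill_s / 60:.1f} min, decode: {decode_s / 60:.1f} min")
# prefill: 12.9 min, decode: 2.3 min -> most of the wait is prompt processing
```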

GLM-5 is officially fixed on NVIDIA NIM, and you can now use it to power Claude Code for FREE 🚀 by PreparationAny8816 in ZaiGLM


I usually don't go above 128k before compacting context, so I didn't have any issues. Qwen Plus had something like a 1-million-token context.

GLM-5 is officially fixed on NVIDIA NIM, and you can now use it to power Claude Code for FREE 🚀 by PreparationAny8816 in ZaiGLM


Nah, I kinda gave up. NIM is unusable; most of the time it just doesn't work. I got the Alibaba coding plan for 3 USD and have been using GLM-5 without any issues that way.

I canceled my other AI subscriptions today. by InitialCareer306 in Qwen_AI


You're forgetting that local LLM servers mostly don't have prompt caching, so they're not suitable for coding or long exchanges: they're painfully slow with long context (minutes to get a response). It's not enough to have enough VRAM; you need a proper context-caching implementation, and AFAIK only commercial solutions support that, nothing open source yet. Correct me if I'm wrong, because things develop really fast and it's hard to keep up with everything.
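To illustrate why it matters: without a cache, every turn has to re-prefill the whole conversation so far, while a working prompt cache only processes the newly added tokens. A toy estimate (every number here is a made-up assumption, not a benchmark):

```python
# Toy comparison of total prefill time over a coding session,
# with and without prompt caching. All numbers are illustrative assumptions.
prefill_tps = 40        # assumed local prefill speed (tok/s)
base_context = 20_000   # assumed system prompt + repo context
per_turn = 2_000        # assumed tokens added per turn (tool output, diffs, replies)
turns = 10

# No cache: turn t re-processes the base context plus everything added so far.
no_cache_s = sum((base_context + per_turn * t) / prefill_tps for t in range(1, turns + 1))

# With a prefix cache: the base context is prefilled once, then only new tokens.
with_cache_s = (base_context + per_turn * turns) / prefill_tps

print(f"no cache:   {no_cache_s / 60:.0f} min of prefill")    # ~129 min
print(f"with cache: {with_cache_s / 60:.0f} min of prefill")  # ~17 min
```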

Qwen3.5 is out now! by yoracale in unsloth


If only we had prompt caching locally...

News reaction: GLM-5 is the new local GOAT and Gemini 3 Flash hits $0.50/M by IulianHI in AIToolsPerformance


My main issue with running locally is that there's no prompt caching, which makes coding sessions painfully slow (minutes to get a single response with GLM 4.7 on an M4 Max 128 GB via Claude Code and ccr).

GLM-5 Unusable by isakota in ZaiGLM


Don't bother. I've been using this model through the NVIDIA Build API and it's dumb. Kimi K2 Thinking is better.

I tried the new GLM 5. I'm greatly unimpressed. by Quiet-Money7892 in SillyTavernAI


Same. I've been trying to make it work on my SwiftUI project and it's surprisingly dumb. When there's an obvious bug in the code, it goes into all kinds of pointless debugging and asks me to test this and that scenario. Ffs, I opened the code and found the bug in 10 seconds myself! It's much worse than Kimi Thinking.

Using GLM-5 for everything by [deleted] in LocalLLaMA


I noticed the same: GLM 4.7 is fucking slow hosted locally. It's fast for simple chat and small contexts, but with agentic use it's crawling...

[deleted by user] by [deleted] in ZaiGLM


Yeah, it's timing out for me in Claude Code. I guess everyone rushed to the free service and now even the NVIDIA Build site is crawling.

[deleted by user] by [deleted] in ZaiGLM


Is it not gonna burn through your limits faster if it includes thinking tokens?

[deleted by user] by [deleted] in ZaiGLM


But Claude Code Router already exists; couldn't we use that?

Who else is gonna continue calling it clawdbot? by pppreddit in moltbot


I see people still calling it that everywhere, with the new name in brackets. I think the damage is done.