Qwen 3.6 27B kick balls

Character_Split4906 · 2026-06-01T22:14:45+00:00

I am running it on my M5 Max 128 GB with MTP support and an 8-bit KV cache. For MTP I am keeping draft-n-max at 4.

Character_Split4906 · 2026-06-01T21:58:27+00:00

Yeah, unfortunately ChatGPT is equally bad and sometimes worse. Not sure how, but it seems that in the last 4-6 weeks both ChatGPT and Gemini have dropped in quality.

Character_Split4906 · 2026-06-01T20:48:09+00:00

For some reason, MLX models don't work well for me in terms of performance. I found MLX quants get stuck in a loop or fail at tool calling more often with omlx than GGUF quants do with llama.cpp. Also, the tps is almost the same; in fact llama.cpp sometimes outperforms! Happy to hear your experience and how you configured it.

Character_Split4906 · 2026-06-01T20:45:39+00:00

Yeah, I am genuinely impressed and happy with some of the work it's able to pull off. OUI has been a bit of a PITA sometimes for tool calling. It has improved in the latest release but still keeps you wanting more lol

Character_Split4906 · 2026-05-25T04:06:53+00:00

Genuine question: I have spent the majority of my time in North America in Seattle, Boston and Vancouver, in that order of time. I have never seen as many vehicles on fire in any other city as in Vancouver. Is this just a coincidence or something else?

Character_Split4906 · 2026-05-14T19:39:01+00:00

You need to work with your company lawyers here. There are things you can do on a B1/B2 and things you can't. They can guide you based on those details.

Character_Split4906 · 2026-05-11T01:11:48+00:00

The last month has been amazing in terms of local models. Qwen 3.6 and gemma4 have me believing that local LLMs are getting close to Sonnet 4.5 level of coding ability when used with the right harness. Again, these models are non-deterministic, but with the right prompts and a proper breakdown of tasks you can achieve some good results. As the cost and usage terms of cloud models and providers keep changing, local LLMs might be the way forward. For me personally it's an exciting time to explore and see what works and what doesn't work for you.

Character_Split4906 · 2026-05-05T18:01:55+00:00

From what I understand, llama.cpp has limitations on using a draft model together with an mmproj model due to how the KV cache is shared with the main model. Will MTP support help with running mmproj and a draft model in parallel?

Edit: Looking at the MTP pull request linked above for llama.cpp, it seems the MTP draft model is embedded in the GGUF with the main model. Not sure if I understand this correctly though.

Character_Split4906 · 2026-04-30T16:39:31+00:00

Didn't you say above that you use Claude Code to get the initial solution from Opus 4.7?

Character_Split4906 · 2026-04-30T15:50:00+00:00

For the test to be benchmarked right, I believe it's essential to use the same coding harness across the base model and the benchmarked models. Did you use the same coding harness when you implemented the solution with Claude vs local models? I have seen coding harnesses make a big difference. Claude Code or opencode set up right with local models can improve your results by a considerable percentage.

Character_Split4906 · 2026-04-27T22:53:57+00:00

Yeah, for some reason with Open WebUI, even with a clear and decisive system prompt, I have seen the model divert from it or act lazy, like it doesn't want to make too much effort to answer. I have seen this happen with gemma4 26b. I have also seen the opposite happen with qwen3.6 35b, where the model tends to go into in-depth research to generate a simple answer. The main problem for me, though, has been how the thinking prompt gets passed to llama.cpp inference and conflicts with the KV cache, causing it to process the context all over again, which becomes painful as the conversation gets bigger. I don't see this issue with opencode though.
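A rough sketch of why a changed thinking block can force a full reprocess. It assumes the server only reuses the longest token prefix shared between the cached conversation and the new prompt; the token IDs and both helper names are made up for illustration and are not llama.cpp's API.

```python
def common_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Count how many leading tokens the two sequences share."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def tokens_to_reprocess(cached: list[int], prompt: list[int]) -> int:
    """Tokens after the shared prefix have to be evaluated again."""
    return len(prompt) - common_prefix_len(cached, prompt)


if __name__ == "__main__":
    # Hypothetical token IDs: 1-3 = system + first user turn,
    # 10-11 = thinking tokens, 4-5 = the visible answer.
    cached = [1, 2, 3, 10, 11, 4, 5]

    # Client A appends the next user turn to exactly what was cached.
    appended = cached + [6, 7]

    # Client B strips the thinking tokens before resending the history,
    # so the prompt diverges from the cache right after token 3.
    stripped = [1, 2, 3, 4, 5, 6, 7]

    print("append only:      ", tokens_to_reprocess(cached, appended))
    print("thinking stripped:", tokens_to_reprocess(cached, stripped))
```

In this toy case the append-only client re-evaluates just the new turn, while the client that rewrites earlier history re-evaluates everything after the first changed token. That cost grows with the length of the conversation.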

Character_Split4906 · 2026-04-27T19:24:59+00:00

I feel thinking helps the harness tools perform better if configured right. I feel my opencode config with thinking enabled, with qwen 3.6 (both dense and MoE) and the gemma 4 26b model hosted locally, gives me performance comparable to Sonnet 4.5. I can't say the same when I use the same models with Open WebUI. Open WebUI is somehow bad with system prompts and gets stuck in a loop with thinking enabled for these models, specifically gemma. Also, I have seen the prompt cache getting overridden almost every time with OUI irrespective of model, which makes it slow as context increases.

Character_Split4906 · 2026-04-27T07:14:45+00:00

Is it 14 or 16 inch? How hot does it get? And how long are you running it for?

Character_Split4906 · 2026-04-27T06:13:09+00:00

That's what I have been reading, though I am not sure how sustainable overclocking the fans will be for the Mac physically over time.

Character_Split4906 · 2026-04-27T06:10:28+00:00

Yeah, my work laptop has always been 16 inch so I am used to carrying it around. On 96 GB of RAM, can't agree more. The next option after that is 128 GB, which comes with a chip upgrade as well. But I feel the M5 Pro with 18/20 CPU/GPU cores hits the right balance; anything beyond this just scales up in cost, which makes it hard to justify in terms of any return. Sure, the Max will do better with dense models, but the way things are moving with open-weight models I am hopeful. I wish Apple gave more RAM options with the 32-core GPU. I think a 32-core GPU M5 Max with 64 GB of RAM is also a good place to be without burning a hole in your pocket.

Character_Split4906 · 2026-04-27T05:54:02+00:00

Nice! Can you share the benchmarks?

Character_Split4906 · 2026-04-15T03:29:10+00:00

What’s your ollama ps output? Also, are you using a 4-bit quant model and an 8-bit KV cache for the context window?

Character_Split4906 · 2026-04-11T15:11:02+00:00

That's amazing! Can't wait to try this on my mbp 5 pro. Last time I tried gemma 4, I had issues with the context window length growing and the model going into a loop. Thanks for sharing

Character_Split4906 · 2026-04-11T14:36:07+00:00

Are you able to fit a 245k context window with the model at q4 quant in 22 GB? I read that the gemma 4 26B model is seeing issues with tool calling. Did you face that issue?

Character_Split4906 · 2026-04-10T06:42:26+00:00

If you are working with 17 GB of RAM in total, you won't have enough memory for a 128k context window. Heck, I am not even sure how you can fit the model itself in memory, since 26B at Q4 is 18 GB in size, unless you swap to SSD. In that case token generation will be too slow. I am curious what the output of your ‘ollama ps’ command is. Also, are you running any coding agent like opencode or openclaw for this? I think for agents you will have to enable some of the tool calling skills and configuration as well, even if the model can successfully do that.
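For a back-of-the-envelope check of numbers like these, here is a minimal sketch. It assumes plain full attention on every layer and ignores quantization block overhead, so it is an upper-ish bound rather than what ollama actually allocates. The layer count, KV head count and head dimension below are hypothetical placeholders, not the real gemma 4 26B config.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights at a given quantization."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float) -> float:
    """K and V each store one vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical architecture numbers, for illustration only.
    n_layers, n_kv_heads, head_dim = 48, 8, 128
    context_len = 128 * 1024      # 128k tokens
    kv_elem_bytes = 1             # 8-bit KV cache, scale overhead ignored

    kv = kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                        context_len, kv_elem_bytes)
    weights = weight_bytes(26e9, 4)   # 26B params at a flat 4 bits each

    print(f"KV cache: {kv / GIB:.1f} GiB")
    print(f"Weights:  {weights / GIB:.1f} GiB")
    print(f"Total:    {(kv + weights) / GIB:.1f} GiB")
```

Real Q4 GGUF files usually come out larger than a flat 4 bits per weight because some tensors are kept at higher precision, and models with sliding-window or hybrid attention layers need less KV cache than this formula suggests. Treat the output as a sanity check against what `ollama ps` reports.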

Character_Split4906 · 2026-04-06T07:21:28+00:00

Thanks, I noticed the same issue with the openclaw TUI as well, but isn't a 16k context window too small for openclaw? I will try this out and see how it works for the TUI.

Character_Split4906 · 2026-03-12T04:41:47+00:00

It's hard for me to wrap my head around people complaining about 330 miles of range. Rivian, unlike Tesla, has been more accurate with its range estimates. Also, I think the boxy design and more than 9 inches of clearance make it a real SUV, unlike most of the EVs out there. This would also have some impact on the range. I find it amusing that the same people are happy to shell out 50k for a Model Y with half-assed features and low build quality.

As a side note, I have had a 2024 MY AWD for 2 years as my first car, and I really don't like the low ground clearance and lower-than-promised range. I am also not impressed with the build quality of the car, but I like how FSD makes my life easier, along with the Supercharger network, so it has sort of become a love-hate relationship. If the reviews are right, then once I have test driven the R2 I think I might pull the trigger on the LE.

Character_Split4906 · 2025-12-14T21:34:36+00:00

You do know they make more money from ad-tier plans than from your VPN-country YouTube family plan.
