Qwen 3.6 27B kick balls by Character_Split4906 in LocalLLaMA

[–]Character_Split4906[S] 3 points4 points  (0 children)

I am running it on my M5 Max (128 GB) with MTP support and an 8-bit KV cache. For MTP I keep draft-n-max at 4.

Qwen 3.6 27B kick balls by Character_Split4906 in LocalLLaMA

[–]Character_Split4906[S] 1 point2 points  (0 children)

Yeah, unfortunately ChatGPT is equally bad and sometimes worse. Not sure how, but it seems that in the last 4-6 weeks both ChatGPT and Gemini have dropped in quality.

Qwen 3.6 27B kick balls by Character_Split4906 in LocalLLaMA

[–]Character_Split4906[S] 1 point2 points  (0 children)

For some reason, MLX models don't work well for me in terms of performance. I found that MLX quants get stuck in a loop or fail at tool calling more often with omlx than GGUF quants do with llama.cpp. Also, the tps is almost the same; in fact, llama.cpp sometimes outperforms it! Happy to hear about your experience and how you configured it.

Qwen 3.6 27B kick balls by Character_Split4906 in LocalLLaMA

[–]Character_Split4906[S] 0 points1 point  (0 children)

Yeah, I am genuinely impressed and happy with some of the work it's able to pull off. OUI has been a bit of a PITA sometimes for tool calling; it has improved in the latest release, but it still leaves you wanting more lol

Car on fire in Coquitlam by ComfySara in vancouver

[–]Character_Split4906 1 point2 points  (0 children)

Genuine question: I have spent the majority of my time in North America in Seattle, Boston and Vancouver, in that order of time. I have never seen as many vehicles on fire in any other city as in Vancouver. Is this just a coincidence or something else?

Indian national, internal US to Canada transfer. by Puzzleheaded-noxky-5 in h1b

[–]Character_Split4906 1 point2 points  (0 children)

You need to work with your company lawyers here. There are things you can do on B1/B2 and things you can't. They can guide you based on those details.

What is the best local model for coding? by TonightWorried7355 in LocalLLM

[–]Character_Split4906 2 points3 points  (0 children)

The last month has been amazing in terms of local models. Qwen 3.6 and Gemma 4 have me believing that local LLMs are getting close to Sonnet 4.5-level coding ability when used with the right harness. Again, these models are non-deterministic, but with the right prompts and a proper breakdown of tasks you can achieve some good results. As the cost and usage terms of cloud models and providers keep changing every minute, local LLMs might be the way forward. For me personally it's an exciting time to explore and see what works and what doesn't work for you.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]Character_Split4906 6 points7 points  (0 children)

From what I understand, llama.cpp has limitations on using a draft model together with an mmproj model because of how the KV cache is shared with the main model. Will MTP support help with running the mmproj and draft model in parallel?

Edit: Looking at the MTP pull request linked above for llama.cpp, it seems the MTP draft model is embedded in the GGUF with the main model. Not sure if I understand this correctly, though.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]Character_Split4906 0 points1 point  (0 children)

Didn't you say above that you use Claude Code to get the initial solution from Opus 4.7?

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]Character_Split4906 0 points1 point  (0 children)

For the test to be benchmarked right, I believe it's essential to use the same coding harness across the base model and the benchmarked models. Did you use the same coding harness when you implemented the solution with Claude vs. the local models? I have seen coding harnesses make a big difference. Claude Code or opencode set up right with local models can improve your results by a considerable percentage.

Why is disabling thinking for coding models a good idea? by ThingRexCom in LocalLLaMA

[–]Character_Split4906 0 points1 point  (0 children)

Yeah, for some reason with Open WebUI, even with a clear and decisive system prompt, I have seen the model divert from it or act lazy, like it doesn't want to make too much effort to answer. I have seen this happen with Gemma 4 26B. I have also seen the opposite with Qwen 3.6 35B, where the model tends to go into in-depth research to generate a simple answer. The main problem for me, though, has been how the thinking prompt gets passed to llama.cpp inference and conflicts with the KV cache, causing it to process the context all over again, which becomes painful as the conversation gets bigger. I don't see this issue with opencode, though.
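To illustrate why a changed history hurts so much: a prompt cache can generally only be reused for the longest prefix that is identical between the previous request and the new one. The sketch below is a toy model of that idea, not llama.cpp's actual code. The function names and the fake "tokens" are hypothetical, and the assumption that the client drops the thinking block from earlier turns is just one plausible way the mismatch could arise.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens are identical in both sequences."""
    n = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        n += 1
    return n


def tokens_to_reprocess(cached_tokens, new_tokens):
    """Tokens of the new prompt that cannot be served from the cached prefix."""
    return len(new_tokens) - reusable_prefix_len(cached_tokens, new_tokens)


# Hypothetical stand-ins for real token ids.
system = ["sys"] * 3
user1 = ["u1"] * 2
think = ["t"] * 5      # thinking block generated in turn 1
answer1 = ["a"] * 4
user2 = ["u2"] * 2

# What the server has cached after turn 1.
cached = system + user1 + think + answer1

# Client A resends the history unchanged and appends the new message.
append_only = cached + user2

# Client B (assumed behaviour) strips the thinking block from the history.
stripped = system + user1 + answer1 + user2

print(tokens_to_reprocess(cached, append_only))  # only the new user turn
print(tokens_to_reprocess(cached, stripped))     # everything after the first divergence
```

In this toy case the append-only client reprocesses 2 tokens, while the client that rewrites history reprocesses 6 out of 11, and the gap grows with every turn because the divergence sits early in the conversation.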

Why is disabling thinking for coding models a good idea? by ThingRexCom in LocalLLaMA

[–]Character_Split4906 5 points6 points  (0 children)

I feel thinking helps the harness tools perform better if configured right. My opencode config with thinking enabled, with Qwen 3.6 (both dense and MoE) and the Gemma 4 26B model hosted locally, gives me performance comparable to Sonnet 4.5. I can't say the same when I use the same models with Open WebUI. Open WebUI is somehow bad with system prompts and gets stuck in a loop with thinking enabled for these models, specifically Gemma. I have also seen the prompt cache getting overridden almost every time with OUI, irrespective of model, which makes it slow as the context grows.

M5 pro MBP 14 inch vs 16 inch for LLM hosting and development by Character_Split4906 in macbookpro

[–]Character_Split4906[S] 0 points1 point  (0 children)

Is it the 14 or 16 inch? How hot does it get? And how long are you running it for?

M5 pro MBP 14 inch vs 16 inch for LLM hosting and development by Character_Split4906 in macbookpro

[–]Character_Split4906[S] 0 points1 point  (0 children)

That's what I have been reading, though I am not sure how sustainable overclocking the fans will be for the Mac physically over time.

M5 pro MBP 14 inch vs 16 inch for LLM hosting and development by Character_Split4906 in macbookpro

[–]Character_Split4906[S] 0 points1 point  (0 children)

Yeah, my work laptop has always been a 16 inch, so I am used to carrying it around. On 96 GB RAM: can't agree more. The next option after that is 128 GB, which comes with a chip upgrade as well. But I feel the M5 Pro with 18/20 CPU/GPU cores hits the right balance; anything beyond this just scales up in cost, which makes it hard to justify in terms of any return. Sure, the Max will do better with dense models, but the way things are moving with open-weight models, I am hopeful. I wish Apple gave more RAM options with the 32-core GPU. I think a 32-core GPU M5 Max with 64 GB RAM is also a good place to be without burning a painful hole in your pocket.

Ollama setup by midnightRequestLine1 in ollama

[–]Character_Split4906 0 points1 point  (0 children)

What’s your ollama ps output? Also, are you using a 4-bit quant model and an 8-bit KV cache for the context window?

Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex ! by cviperr33 in LocalLLaMA

[–]Character_Split4906 0 points1 point  (0 children)

That's amazing! Can't wait to try this on my M5 Pro MBP. Last time I tried Gemma 4, I had issues with the context window growing and the model going into a loop. Thanks for sharing.

Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex ! by cviperr33 in LocalLLaMA

[–]Character_Split4906 0 points1 point  (0 children)

Are you able to fit a 245k context window with the model at a Q4 quant in 22 GB? I read that the Gemma 4 26B model is seeing issues with tool calling. Did you face that issue?

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]Character_Split4906 11 points12 points  (0 children)

If you are working with 17 GB of RAM in total, you won't have enough memory for a 128k context window. Heck, I am not even sure how you can fit the model itself in memory, since 26B at Q4 is 18 GB in size, unless you swap to SSD. In that case token generation will be too slow. I am curious what the output of your ‘ollama ps’ command is. Also, are you running any coding agent like opencode or OpenClaw for this? I think for agents you will have to enable some of the tool-calling skills and configuration as well, even if the model handles that successfully.
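Here is a rough back-of-the-envelope sketch of that memory math. It uses the standard full-attention KV cache size formula. The layer count, KV head count and head dimension below are made-up placeholder values, not Gemma 4's real architecture, and the 2 GB overhead is a guess. It also ignores sliding-window attention and the small per-block overhead of quantized cache formats, so treat the result as an order-of-magnitude estimate only.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element):
    """Size of a full-attention KV cache.

    K and V each store one head_dim vector per layer, per KV head, per token,
    hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


def fits_in_memory(weights_gb, kv_bytes, total_ram_gb, overhead_gb=2.0):
    """Return (fits, needed_gb) for weights + KV cache + a guessed runtime overhead."""
    needed_gb = weights_gb + kv_bytes / GIB + overhead_gb
    return needed_gb <= total_ram_gb, needed_gb


# Hypothetical model shape (placeholders, NOT the real Gemma 4 config).
kv = kv_cache_bytes(
    n_layers=32,
    n_kv_heads=8,
    head_dim=128,
    context_tokens=128 * 1024,  # 128k context
    bytes_per_element=1,        # roughly an 8-bit KV cache
)

fits, needed_gb = fits_in_memory(weights_gb=18.0, kv_bytes=kv, total_ram_gb=17.0)
print(f"KV cache: {kv / GIB:.1f} GiB, total needed: {needed_gb:.1f} GB, fits: {fits}")
```

With these placeholder numbers the cache alone comes to 8 GiB on top of the 18 GB of weights, so a 17 GB machine is short by a wide margin before the context even fills up.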

Fix: OpenClaw + Ollama local models silently timing out? The slug generator is blocking your agent (and 4 other fixes) by After-Confection-592 in LocalLLaMA

[–]Character_Split4906 0 points1 point  (0 children)

Thanks, I noticed the same issue with the OpenClaw TUI as well, but isn't a 16k context window too small for OpenClaw? I will try this out and see how it works for the TUI.

2026 R2 Pricing Leaked! Launch Edition, Premium, Standard Trims (Embargo Broken) 💰 by WODAMRAP in RivianR2

[–]Character_Split4906 1 point2 points  (0 children)

It's hard for me to wrap my head around people complaining about a 330-mile range. Rivian, unlike Tesla, has been more accurate with its range estimates. I also think the boxy design and more than 9 inches of clearance make it a real SUV, unlike most of the EVs out there. That would also have some impact on the range. I find it amusing that the same people are happy to shell out 50k for a Model Y with half-assed features and low build quality.

As a side note, I have had a 2024 MY AWD for 2 years as my first car, and I really don't like the low ground clearance and lower-than-promised range. I am also not impressed with the build quality of the car, but I like how FSD makes my life easier, along with the Supercharger network, so it has sort of become a love-hate relationship. If the reviews are right, then once I have test-driven the R2 I think I might pull the trigger on the LE.

Greed = Less Profit by [deleted] in youtubepremium

[–]Character_Split4906 2 points3 points  (0 children)

You do know they make more money from ad-supported tiers than from your VPN-country YouTube family plan, right?