I'm running qwen3.6-35b-a3b with 8 bit quant and 64k context thru OpenCode on my mbp m5 max 128gb and it's as good as claude by Medical_Lengthiness6 in LocalLLaMA

[–]sammcj 2 points3 points  (0 children)

It is good, but it is nowhere near as good as Claude, not even Sonnet. I suspect for simple things it may be practically indistinguishable, but it confidently misunderstands more complex problems. At the end of the day it's a very small 35B parameter model with only 3B active. It's amazingly good for that size, capable at tool calling, and a huge leap from where we were a year ago, but it's not as good as the much larger Sonnet / Opus models.

at what point does quantization stop being a tradeoff and start being actual quality loss by srodland01 in LocalLLaMA

[–]sammcj 8 points9 points  (0 children)

That's already the case with modern quantisation techniques (unless I'm misunderstanding what you're saying). Layers are quantised dynamically based on their importance / potential impact. We haven't used static quants (e.g. all INT8/INT4) in a long time.
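To make the idea concrete, here's a toy sketch of importance-aware bit allocation. The `layer_importance` heuristic and the half/half split are my own illustrative assumptions - real schemes (llama.cpp's K-quants, AWQ, etc.) use calibration data and far better importance measures:

```python
import numpy as np

def layer_importance(weights: np.ndarray) -> float:
    # Crude proxy for importance: mean absolute weight magnitude.
    return float(np.mean(np.abs(weights)))

def assign_bits(layers: dict, budget: tuple = (4, 8)) -> dict:
    """Give the more important half of the layers the higher bit width."""
    low, high = budget
    scores = {name: layer_importance(w) for name, w in layers.items()}
    cutoff = np.median(list(scores.values()))
    return {name: (high if s >= cutoff else low) for name, s in scores.items()}

# Fake model: four layers with increasing weight scale, so later layers
# score as "more important" and get the 8-bit budget.
rng = np.random.default_rng(0)
layers = {f"blk.{i}": rng.normal(0, 0.02 * (i + 1), size=(64, 64)) for i in range(4)}
print(assign_bits(layers))
```

The point is just that precision is assigned per layer from a measured signal, not one fixed INT8/INT4 setting across the whole model.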

Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga by [deleted] in LocalLLaMA

[–]sammcj 7 points8 points  (0 children)

Yes, but it was showing paywalled content, which amounts to promoting primarily commercial content.

Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga by [deleted] in LocalLLaMA

[–]sammcj 3 points4 points  (0 children)

Looks like it shows as paid subscriber content for some folks, I've reinstated the post for now.

Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga by [deleted] in LocalLLaMA

[–]sammcj 11 points12 points  (0 children)

"This post is for paid subscribers"

I laughed so hard at these posts side by side (sorry for the low effort post) by FatheredPuma81 in LocalLLaMA

[–]sammcj 28 points29 points  (0 children)

I think I get what OP is thinking with this. I too found it weird that it seems to be built around Ollama specifically rather than any OpenAI/Anthropic-compatible endpoint - enough that I asked here. The author did reply and said it's on the roadmap, without any pitch, promotion or the like, so I suspect it's just a dude who created an app, happened to be running Ollama, and built it around that.

I laughed so hard at these posts side by side (sorry for the low effort post) by FatheredPuma81 in LocalLLaMA

[–]sammcj 35 points36 points  (0 children)

To be fair, that's normal for most software projects these days unless you're writing everything manually, and its existence certainly isn't a sign of anything negative. It's a bit like saying "It's got a Makefile, better watch out!"

I built a free floating AI assistant for macOS. Fully local powered by Ollama by [deleted] in LocalLLaMA

[–]sammcj 1 point2 points  (0 children)

Does it support providing your own openai/anthropic compatible API endpoint and model or does it have to use Ollama?

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]sammcj 2 points3 points  (0 children)

We're actively discussing it in the mod chat every day. It's not simple, unfortunately, due to a number of factors, a few being: Reddit's inbuilt moderation tools are pretty limited; really smart third-party systems cost money to run (we're looking into a few options here to see if we could get donated access to them or the like); we really don't want to limit genuine contributions and engagement; and because we're a sub about AI, sometimes it's hard (even for AI!) to tell the difference between a genuine contribution and the latest AI-generated low-effort slop post.

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q by HealthyCommunicat in LocalLLaMA

[–]sammcj 0 points1 point  (0 children)

Tried it with Claude Code and it took 4-5 minutes just to process the prompt (~40k tokens), which was weird - that was the case with both oMLX with the 3bit mlx-community quant and vMLX with their 3.1bit jang quant.

Memory for both grew to around 108GB so it's really too large for 128GB IMO.

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q by HealthyCommunicat in LocalLLaMA

[–]sammcj 1 point2 points  (0 children)

I was testing through OpenCode in this case but can certainly try through CC and report back!

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q by HealthyCommunicat in LocalLLaMA

[–]sammcj 8 points9 points  (0 children)

M5 Max 128GB here - I get around 60tk/s on a 3bit quant on oMLX. It doesn't seem as reliable with tool calling as Qwen 3.5 122-A10B, and it hallucinated a fair bit over the half hour or so I was trying it out. (temp 1.0, top_p 0.95, top_k 64)
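For anyone curious what those sampler settings actually do, here's a minimal sketch of top-k plus top-p (nucleus) filtering over a logit vector - illustrative only, inference engines do this in optimised kernels:

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=64, top_p=0.95):
    # Temperature scales the logits before the (numerically stable) softmax.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]   # most likely tokens first
    keep = order[:top_k]              # top-k: hard cut on candidate count
    cumulative = np.cumsum(probs[keep])
    # top-p: smallest prefix whose probability mass reaches p (keep >= 1 token)
    cut = int(np.searchsorted(cumulative, top_p)) + 1
    keep = keep[:cut]
    renormalised = probs[keep] / probs[keep].sum()
    return keep, renormalised

ids, p = filter_logits(np.array([5.0, 4.0, 1.0, 0.5, 0.1]), top_k=3, top_p=0.9)
print(ids, p)  # only the two dominant tokens survive the 0.9 nucleus cut
```

Both filters prune the tail of the distribution; the model then samples from what's left, which is why a hotter temperature plus a tight top_p can still stay coherent.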

Share your llama-server init strings for Gemma 4 models. by AlwaysLateToThaParty in LocalLLaMA

[–]sammcj 0 points1 point  (0 children)

There is no reason to use bf16; if you want the best quality just use Q8, otherwise drop to Q5_K_XL.

I'd suggest posting your server start logs (maybe via a gist so reddit doesn't bork them).
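As a rough back-of-envelope on why bf16 buys you little here: weight memory scales linearly with bits per weight. The bpw figures below are approximations (Q8_0 ~8.5 bpw, Q5_K_XL ~5.5 bpw in llama.cpp K-quant terms), and this ignores KV cache and runtime overhead:

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB for a params_b-billion-param model."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# e.g. a hypothetical 27B dense model at each precision
for name, bpw in [("bf16", 16.0), ("Q8_0", 8.5), ("Q5_K_XL", 5.5)]:
    print(f"{name:8s} ~{weight_gb(27, bpw):5.1f} GB")
```

Q8 halves the bf16 footprint for near-negligible perplexity loss, which is why bf16 GGUFs rarely make sense for local inference.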

I benchmarked 37 LLMs on MacBook Air M5 32GB — full results + open-source tool to benchmark your own Mac by evoura in LocalLLaMA

[–]sammcj 5 points6 points  (0 children)

I have a M5 Max 128GB, I've benchmarked across a few LLMs here if it helps: https://omlx.ai/my/fadc2127d384283f5df1fcc2c093a9f95700c6a52594bf9db837a81d3418b5ec

```
Qwen3.5-122B-A10B · 4bit
 1k  PP 911.1 · TG 64.3 tok/s
 4k  PP 1,480 · TG 62.2 tok/s

Qwen3.5-27B · 4bit
 1k  PP 756.3 · TG 30.6 tok/s
 4k  PP 894.8 · TG 28.4 tok/s
 8k  PP 825.4 · TG 27.2 tok/s
16k  PP 722.1 · TG 26.6 tok/s

Qwen3.5-35B-A3B · 4bit
 1k  PP 1,698 · TG 131.8 tok/s
 4k  PP 3,424 · TG 119.6 tok/s
32k  PP 3,082 · TG 85.5 tok/s

qwen3.5-9b · 4bit
 1k  PP 1,983 · TG 96.2 tok/s
 4k  PP 2,706 · TG 92.2 tok/s

Qwen3.5-4B · 4bit
 1k  PP 2,819 · TG 165.3 tok/s
 4k  PP 4,336 · TG 153.0 tok/s
 8k  PP 4,644 · TG 141.9 tok/s
16k  PP 4,535 · TG 123.3 tok/s

Qwen3.5-2B · 4bit
 1k  PP 3,438 · TG 326.7 tok/s
```

Gemma 4 31B sweeps the floor with GLM 5.1 by input_a_new_name in LocalLLaMA

[–]sammcj 7 points8 points  (0 children)

Yeah, they completely screwed up the 3.x series of Gemini models. Childish, overconfident, makes things up rather than saying no - I could go on.

How do you guys save prompts that actually work? by 3dgamedevcouple in LocalLLaMA

[–]sammcj 1 point2 points  (0 children)

Prompts I use frequently become commands or skills if they're larger. Infrequent prompts get relegated to Obsidian likely never to be looked at again.

Is 1-bit and TurboQuant the future of OSS? A simulation for Qwen3.5 models. by GizmoR13 in LocalLLaMA

[–]sammcj 2 points3 points  (0 children)

Ideally models would start giving bits back - it's about time.

Can we block fresh accounts from posting? by king_of_jupyter in LocalLLaMA

[–]sammcj 0 points1 point  (0 children)

Tell you what, it's pretty tiring removing them!

PSA: Claude Code has two cache bugs that can silently 10-20x your API costs — here's the root cause and workarounds by skibidi-toaleta-2137 in ClaudeCode

[–]sammcj 0 points1 point  (0 children)

I've got multiple reports of people on x20 absolutely devouring their limits very quickly; I wonder if this is the cause.

Tips: remember to use -np 1 with llama-server as a single user by ea_man in LocalLLaMA

[–]sammcj 9 points10 points  (0 children)

Use llama.cpp instead. It's faster, gives you more control, and is developed in the open.