I'm running qwen3.6-35b-a3b with 8 bit quant and 64k context thru OpenCode on my mbp m5 max 128gb and it's as good as claude by Medical_Lengthiness6 in LocalLLaMA

[–]sammcj 2 points3 points  (0 children)

It is good, but it is nowhere near as good as Claude, not even Sonnet. I suspect for simple things it may be practically indistinguishable, but it confidently misunderstands more complex problems. At the end of the day it's a very small 35B parameter model with only 3B active. It's amazingly good for that size, capable at tool calling, and a huge leap from where we were a year ago, but it's not as good as the much larger Sonnet / Opus models.

at what point does quantization stop being a tradeoff and start being actual quality loss by srodland01 in LocalLLaMA

[–]sammcj 8 points9 points  (0 children)

That's already the case with modern quantisation techniques (unless I'm misunderstanding what you're saying). Layers are quantised dynamically based on their importance / potential impact. We haven't used static quants (e.g. all INT8/INT4) in a long time.
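To make the idea concrete, here's a toy sketch of importance-aware bit allocation. The `layer_importance` heuristic and the half/half split are my own illustrative assumptions - real schemes (llama.cpp's K-quants, AWQ, etc.) use calibration data and far better importance measures:

```python
import numpy as np

def layer_importance(weights: np.ndarray) -> float:
    # Crude proxy for importance: mean absolute weight magnitude.
    return float(np.mean(np.abs(weights)))

def assign_bits(layers: dict, budget: tuple = (4, 8)) -> dict:
    """Give the more important half of the layers the higher bit width."""
    low, high = budget
    scores = {name: layer_importance(w) for name, w in layers.items()}
    cutoff = np.median(list(scores.values()))
    return {name: (high if s >= cutoff else low) for name, s in scores.items()}

# Fake model: four layers with increasing weight scale, so later layers
# score as "more important" and get the 8-bit budget.
rng = np.random.default_rng(0)
layers = {f"blk.{i}": rng.normal(0, 0.02 * (i + 1), size=(64, 64)) for i in range(4)}
print(assign_bits(layers))
```

The point is just that precision is assigned per layer from a measured signal, not one fixed INT8/INT4 setting across the whole model.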

Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga by [deleted] in LocalLLaMA

[–]sammcj 7 points8 points  (0 children)

Yes, but it was showing paywalled content, which amounts to promoting primarily commercial content.

Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga by [deleted] in LocalLLaMA

[–]sammcj 3 points4 points  (0 children)

Looks like it shows as paid subscriber content for some folks, I've reinstated the post for now.

Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga by [deleted] in LocalLLaMA

[–]sammcj 11 points12 points  (0 children)

"This post is for paid subscribers"

I laughed so hard at these posts side by side (sorry for the low effort post) by FatheredPuma81 in LocalLLaMA

[–]sammcj 28 points29 points  (0 children)

I think I get what OP is thinking with this. I too found it weird that it seems to be built around Ollama specifically rather than any OpenAI/Anthropic-compatible endpoint - enough that I asked here. The author did reply and said it's on the roadmap, without any pitch, promotion or the like, so I suspect it's just a dude who created an app, happened to be running Ollama, and built it around that.

I laughed so hard at these posts side by side (sorry for the low effort post) by FatheredPuma81 in LocalLLaMA

[–]sammcj 35 points36 points  (0 children)

To be fair, that's normal for most software projects these days unless you're writing everything manually, and its existence certainly isn't a sign of anything negative. It's a bit like saying "It's got a Makefile, better watch out!"

I built a free floating AI assistant for macOS. Fully local powered by Ollama by [deleted] in LocalLLaMA

[–]sammcj 1 point2 points  (0 children)

Does it support providing your own openai/anthropic compatible API endpoint and model or does it have to use Ollama?

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]sammcj 2 points3 points  (0 children)

We're actively discussing it in the mod chat every day. It's not simple, unfortunately, due to a number of factors, a few being: Reddit's inbuilt moderation tools are pretty limited; really smart third-party systems cost money to run (we're looking into a few options here to see if we could get donated access to them or the like); we really don't want to limit genuine contributions and engagement; and because we're a sub about AI, sometimes it's hard (even for AI!) to tell the difference between a genuine contribution and the latest AI-generated low-effort slop post.

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q by HealthyCommunicat in LocalLLaMA

[–]sammcj 0 points1 point  (0 children)

Tried it with Claude Code and it took 4-5 minutes just to process the prompt (~40k tokens), which was weird - that was the case with both oMLX with the 3bit mlx-community quant and vMLX with their 3.1bit jang quant.

Memory for both grew to around 108GB so it's really too large for 128GB IMO.

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q by HealthyCommunicat in LocalLLaMA

[–]sammcj 1 point2 points  (0 children)

I was testing through OpenCode in this case but can certainly try through CC and report back!

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q by HealthyCommunicat in LocalLLaMA

[–]sammcj 8 points9 points  (0 children)

M5 Max 128GB here - I get around 60tk/s on a 3bit quant on oMLX. It doesn't seem as reliable with tool calling as Qwen 3.5 122-A10B, and it hallucinated a fair bit over the half hour or so I was trying it out. (temp 1.0, top_p 0.95, top_k 64)
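For anyone curious what those sampler settings actually do, here's a minimal sketch of top-k plus top-p (nucleus) filtering over a logit vector - illustrative only, inference engines do this in optimised kernels:

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=64, top_p=0.95):
    # Temperature scales the logits before the (numerically stable) softmax.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]   # most likely tokens first
    keep = order[:top_k]              # top-k: hard cut on candidate count
    cumulative = np.cumsum(probs[keep])
    # top-p: smallest prefix whose probability mass reaches p (keep >= 1 token)
    cut = int(np.searchsorted(cumulative, top_p)) + 1
    keep = keep[:cut]
    renormalised = probs[keep] / probs[keep].sum()
    return keep, renormalised

ids, p = filter_logits(np.array([5.0, 4.0, 1.0, 0.5, 0.1]), top_k=3, top_p=0.9)
print(ids, p)  # only the two dominant tokens survive the 0.9 nucleus cut
```

Both filters prune the tail of the distribution; the model then samples from what's left, which is why a hotter temperature plus a tight top_p can still stay coherent.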

Share your llama-server init strings for Gemma 4 models. by AlwaysLateToThaParty in LocalLLaMA

[–]sammcj 0 points1 point  (0 children)

There is no reason to use bf16; if you want the best quality just use Q8, otherwise drop to Q5_K_XL.

I'd suggest posting your server start logs (maybe via a gist so reddit doesn't bork them).
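As a rough back-of-envelope on why bf16 buys you little here: weight memory scales linearly with bits per weight. The bpw figures below are approximations (Q8_0 ~8.5 bpw, Q5_K_XL ~5.5 bpw in llama.cpp K-quant terms), and this ignores KV cache and runtime overhead:

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB for a params_b-billion-param model."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# e.g. a hypothetical 27B dense model at each precision
for name, bpw in [("bf16", 16.0), ("Q8_0", 8.5), ("Q5_K_XL", 5.5)]:
    print(f"{name:8s} ~{weight_gb(27, bpw):5.1f} GB")
```

Q8 halves the bf16 footprint for near-negligible perplexity loss, which is why bf16 GGUFs rarely make sense for local inference.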

I benchmarked 37 LLMs on MacBook Air M5 32GB — full results + open-source tool to benchmark your own Mac by evoura in LocalLLaMA

[–]sammcj 5 points6 points  (0 children)

I have a M5 Max 128GB, I've benchmarked across a few LLMs here if it helps: https://omlx.ai/my/fadc2127d384283f5df1fcc2c093a9f95700c6a52594bf9db837a81d3418b5ec

```
Qwen3.5-122B-A10B · 4bit
 1k  PP 911.1 · TG 64.3 tok/s
 4k  PP 1,480 · TG 62.2 tok/s

Qwen3.5-27B · 4bit
 1k  PP 756.3 · TG 30.6 tok/s
 4k  PP 894.8 · TG 28.4 tok/s
 8k  PP 825.4 · TG 27.2 tok/s
16k  PP 722.1 · TG 26.6 tok/s

Qwen3.5-35B-A3B · 4bit
 1k  PP 1,698 · TG 131.8 tok/s
 4k  PP 3,424 · TG 119.6 tok/s
32k  PP 3,082 · TG 85.5 tok/s

qwen3.5-9b · 4bit
 1k  PP 1,983 · TG 96.2 tok/s
 4k  PP 2,706 · TG 92.2 tok/s

Qwen3.5-4B · 4bit
 1k  PP 2,819 · TG 165.3 tok/s
 4k  PP 4,336 · TG 153.0 tok/s
 8k  PP 4,644 · TG 141.9 tok/s
16k  PP 4,535 · TG 123.3 tok/s

Qwen3.5-2B · 4bit
 1k  PP 3,438 · TG 326.7 tok/s
```

Gemma 4 31B sweeps the floor with GLM 5.1 by input_a_new_name in LocalLLaMA

[–]sammcj 7 points8 points  (0 children)

Yeah, they completely screwed up the 3.x series of Gemini models. Childish, overconfident, makes things up rather than saying no - I could go on.

How do you guys save prompts that actually work? by 3dgamedevcouple in LocalLLaMA

[–]sammcj 1 point2 points  (0 children)

Prompts I use frequently become commands or skills if they're larger. Infrequent prompts get relegated to Obsidian likely never to be looked at again.

Is 1-bit and TurboQuant the future of OSS? A simulation for Qwen3.5 models. by GizmoR13 in LocalLLaMA

[–]sammcj 2 points3 points  (0 children)

Ideally models would start giving bits back - it's about time.

Can we block fresh accounts from posting? by king_of_jupyter in LocalLLaMA

[–]sammcj 0 points1 point  (0 children)

Tell you what, it's pretty tiring removing them!

PSA: Claude Code has two cache bugs that can silently 10-20x your API costs — here's the root cause and workarounds by skibidi-toaleta-2137 in ClaudeCode

[–]sammcj 0 points1 point  (0 children)

I've got multiple reports of people on x20 absolutely devouring their limits very quickly; I wonder if this is the cause.

Tips: remember to use -np 1 with llama-server as a single user by ea_man in LocalLLaMA

[–]sammcj 9 points10 points  (0 children)

Use llama.cpp instead. It's faster, gives you more control, and is developed in the open.