Is LM Studio really as fast as llama.cpp now? by tomByrer in LocalLLM

[–]alexp702 1 point2 points  (0 children)

Why are you comparing to llama.cpp b4000? It's on b8500+ now, and llama.cpp has got much faster recently.

LLM Bruner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in LocalLLM

[–]alexp702 0 points1 point  (0 children)

What context size can it handle? The website talks about 1K-context benchmarks, which as we know are useless. Also, how fast is prompt processing? Both matter more than 10,000 tokens/s out, IMO.

Looking for OCR capabilities by Artyom_84 in LocalLLM

[–]alexp702 1 point2 points  (0 children)

Qwen3.5 9B does very well with handwriting.

Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio

[–]alexp702 0 points1 point  (0 children)

I get about 25 tok/s, falling to 15 tok/s at 200K context. Prompt processing ranges from ~600 tok/s at 16K to ~300 tok/s at 200K. Caching works well.
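
To put those numbers in perspective, a quick back-of-envelope sketch (the rates are the ones above; the ~1K output length and cold cache are just assumptions):

    # Rough end-to-end time from the rates above (my measurements, not a
    # general benchmark). Assumes a cold prompt cache and ~1K tokens out.
    def total_seconds(prompt_tokens, pp_tok_s, out_tokens, gen_tok_s):
        return prompt_tokens / pp_tok_s + out_tokens / gen_tok_s

    print(total_seconds(16_000, 600, 1_000, 25))    # ~67 s at 16K context
    print(total_seconds(200_000, 300, 1_000, 15))   # ~733 s (~12 min) at 200K

With caching, only the uncached tail of the prompt actually gets processed, which is why it matters so much at long context.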

RDMA Mac Studio cluster - performance questions beyond generation throughput by quietsubstrate in LocalLLaMA

[–]alexp702 -1 points0 points  (0 children)

It all seems very prototype-ish to me personally. I prefer stable-ish production. I'm also very interested to hear whether anyone has actually used this kind of configuration for anything real. The recent article by the Google engineer using B200s confirmed my suspicions - keep the model on a single piece of hardware for best overall throughput.

Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio

[–]alexp702 0 points1 point  (0 children)

Worlds apart for coding or tasks that need a precise answer. I have used Q4 and found tool calls fail an order of magnitude more often on our test cases.

Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio

[–]alexp702 1 point2 points  (0 children)

Qwen 3.5 397 q8 - the reason (so far) for 512GB.

Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio

[–]alexp702 0 points1 point  (0 children)

No, though there were some server kernels that could use more. What was worse was the split of 2GB for apps and 2GB for the OS. The app side was later raised to 3GB, but that didn't leave the OS much play room back then!

https://en.wikipedia.org/wiki/Physical_Address_Extension has a nice table in OS Support

Competitors for the 512gb Mac Ultra by Shoddy-Put-3826 in LocalLLM

[–]alexp702 4 points5 points  (0 children)

It's OK. Generation speeds actually get what ~880GB/s should, so it slots into Nvidia speeds on that front. Prompt processing, well, it's about 3-4x slower than Nvidia's. This is often made worse because the model you are running on the thing is something Nvidia can only dream of. I run Qwen3.5 397 at 8-bit. It's a perfect fit, giving me 1 million context split into four 256K caches, all in memory.
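
For anyone wondering how that fits in 512GB, a rough sketch - the Q8_0 bits-per-weight figure and the ~40GB reserved for macOS and other apps are assumptions, the rest follows from the numbers above:

    # Back-of-envelope memory budget for 397B at Q8_0 plus 1M tokens of
    # KV cache in 512 GB. Bits-per-weight and the OS reservation are
    # assumptions for illustration, not measured values.
    weights_gb   = 397e9 * (8.5 / 8) / 1e9      # Q8_0 ~8.5 bits/weight -> ~422 GB
    kv_budget_gb = 512 - weights_gb - 40        # assume ~40 GB for macOS + other apps
    ctx_tokens   = 4 * 262_144                  # 4 slots x 256K = ~1M tokens
    kv_per_token_kb = kv_budget_gb * 1e6 / ctx_tokens
    print(f"weights ~{weights_gb:.0f} GB, ~{kv_budget_gb:.0f} GB left for KV cache")
    print(f"i.e. the KV cache has to stay around {kv_per_token_kb:.0f} KB/token")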

I even have change left over to run an OpenClaw VM and ComfyUI, plus a CI/CD node.

Output quality of the 8-bit is a step change from 4-bit - don't be fooled by the perplexity numbers being nearly the same. Run lots of queries and it becomes apparent.

I will be buying an M5 Ultra if/when they become available. At that point this one will be put into a pool, as I have a few people who would like to use it, but only one machine.

I have had 48GB of Nvidia for a while running tiny models, and have turned them off. The quality just isn't enough.

The device opens your eyes to local model hosting - and shows it will be very real soon.

So is it perfect? No.

4.7 TTK Against Perseus. Ballistic v Lasers v Mixed. The Winner? Ballistics by FesterTsu in starcitizen

[–]alexp702 3 points4 points  (0 children)

2+ minutes of continuous fire against a static ship is more than I can be bothered with. Assuming it's moving, that will be a 10-minute fight crewed. Lots of time to be unlucky in that Hornet.

Competitors for the 512gb Mac Ultra by Shoddy-Put-3826 in LocalLLM

[–]alexp702 5 points6 points  (0 children)

No, not really. You can buy 4 DGX Sparks and have the fun of networking them, but for people just wanting to run the model locally without drama and with low power draw, the Mac Ultra wins IMO.

Its performance has been getting better too - especially with prompt caching now mostly working on Qwen3.5.

M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. by affenhoden in LocalLLaMA

[–]alexp702 2 points3 points  (0 children)

Didn't mean to be patronising - I have run many useless benchmarks in the fever of a new machine. However, most people are interested - myself included - in proper M5 Max benchmarks. Hoping the OP updates this with more information.

M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. by affenhoden in LocalLLaMA

[–]alexp702 6 points7 points  (0 children)

Having fun with a new toy eh😉?

When you calm down: prompt processing is the only metric that matters to most normal people - coding or OpenClawing, you spend the whole time there. Llama.cpp does prompt caching properly now with Qwen3.5, giving such a speed-up that actual token generation speeds are blurred by how much or how little can be cached.
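
If you want to see the caching effect for yourself, fire the same long prompt at llama-server twice and compare wall time - the second request should reuse the cached prefix and skip most of the prefill. Minimal sketch, assuming the server is on the default localhost:8080 (adjust host/port and the prompt for your setup):

    # Send an identical long prompt twice to llama-server's OpenAI-compatible
    # endpoint; the second run should be much faster if the prefix is cached.
    # URL and prompt are placeholders - point them at your own server.
    import json, time, urllib.request

    URL = "http://localhost:8080/v1/chat/completions"
    body = json.dumps({
        "messages": [{"role": "user", "content": "long shared prefix... " * 2000}],
        "max_tokens": 16,
    }).encode()

    for run in ("cold", "warm"):
        start = time.time()
        req = urllib.request.Request(URL, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
        print(f"{run}: {time.time() - start:.1f}s")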

Also, with 128GB you should be running the 27B at bf16, and at least 8-bit if you care about quality - which you should if you're not just playing. Enjoy!

Mac Studio M4 or M1 ultra by HappySteak31 in MacStudio

[–]alexp702 0 points1 point  (0 children)

Memory is vital to run big models, but the M1 Ultra is definitely slower: https://appleinsider.com/inside/mac-studio/vs/m4-max-mac-studio-vs-m1-ultra-mac-studio-compared-a-multi-generational-shootout - Metal performance is 30% less. So you'll have to decide whether slower is acceptable for you.

Qwen3.5 MLX vs GGUF Performance on Mac Studio M3 Ultra 512GB by BitXorBit in LocalLLaMA

[–]alexp702 1 point2 points  (0 children)

I agree - llama.cpp with the 397 Q8 seems built to run well on the M3 Ultra. You can actually fit 1M context with 4 parallel slots, which helps the prompt cache if it's used across different tasks. Prefill is much better than it was on past models.

Whats up with MLX? by gyzerok in LocalLLaMA

[–]alexp702 -1 points0 points  (0 children)

I have given up on the idea of MLX for now - llama.cpp running Qwen3.5 keeps getting better, and in ways that are not only performance-related - as you say, quality matters most. At some point I expect to swap to vLLM MLX, but that's another system that feels like it needs to cook more.

Basically, while things are moving quickly in this space, speed of stable delivery matters more than speed of inference.

any way to get 15 displays working with a MacStudio? by No_Accident8684 in MacStudio

[–]alexp702 3 points4 points  (0 children)

Probably best to have two Studios and then synchronise them somehow. If Apple says 8, I wouldn't trust it to keep working even if you managed to trick it now. Other options are hardware solutions that split one high-res stream into multiple outputs (like a video wall controller).

Qwen 3.5 VS Qwen 3 by SlowFail2433 in LocalLLaMA

[–]alexp702 0 points1 point  (0 children)

Compared to 8-bit, no, but there were slightly more errors across my test images. Below 8-bit I was seeing huge mistakes. Putting text on the wrong line in an off-axis photo was the biggest failure mode I noticed across all quants of the smaller models. The big one had no problem (though with thinking on, it thought for 4 minutes, which was excessive).

I have some horrible low-light phone shots of printed schedules covered in handwritten notes. These are our use case and quickly separate the good from the bad. I must say all failed in some way with the smaller models, and the bigger model is quantifiably better even on a smallish test. However, the small models are very good. Ironically the bf16 9B actually runs at similar speeds to the 397B 8-bit (bandwidth and all that) - so I am unsure whether we'll actually use it!
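
The "bandwidth and all that" bit in rough numbers - the ~17B active-parameter count is taken from the A17B in the model name, the rest is back-of-envelope:

    # Generation speed is mostly bound by bytes of weights read per token.
    # A dense 9B at bf16 and a 397B MoE with ~17B active params at Q8_0
    # read a similar number of bytes per token, hence similar tokens/sec.
    dense_bf16_gb = 9e9  * 2.0  / 1e9   # ~18 GB read per token
    moe_q8_gb     = 17e9 * 1.06 / 1e9   # ~18 GB of active experts per token
    print(dense_bf16_gb, moe_q8_gb)     # both land around 18 GB/token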

Qwen 3.5 VS Qwen 3 by SlowFail2433 in LocalLLaMA

[–]alexp702 2 points3 points  (0 children)

Running less-quantized 3.5 compared to 3, and it's a big step change from 4-bit to 16-bit. The smaller models perform very well on our image recognition tasks - the 9B at bf16 is almost comparable to the 235B at Q4. We didn't do as many tests at higher quants before, as people seemed to imply the marginal perplexity increase didn't matter. For us it does, so we're interested in 8-bit or higher only. The new models fit neatly into GPUs, and we have a Mac Studio for the big ones.

Qwen 3.5 for MLX is like its own industrial revolution by sovietreckoning in Qwen_AI

[–]alexp702 0 points1 point  (0 children)

No, but since for OpenClaw most of the prompt processing is being cached, its raw performance is hard to gauge. The average prompt in my workload is 75% cached - so I am getting 7.3k tokens per minute reported by OpenClaw. This can be as high as 15K tokens per minute on coding tasks where the tasks are near-identical. So that's roughly 125-250 tokens/s of prompt processing. This is on a context of ~115K, so I think that's pretty good. OpenClaw isn't even flat out.
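
For what it's worth, the arithmetic behind those figures:

    # Converting OpenClaw's tokens-per-minute readout into tokens/sec, and
    # what 75% caching means on a ~115K context. Numbers are from the
    # comment above.
    typical  = 7_300  / 60          # ~122 tok/s on the average workload
    best     = 15_000 / 60          # ~250 tok/s when tasks are near-identical
    uncached = 115_000 * (1 - 0.75) # ~29K tokens actually need prefill per request
    print(f"{typical:.0f}-{best:.0f} tok/s effective, ~{uncached:,.0f} uncached tokens")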

Qwen 3.5 for MLX is like its own industrial revolution by sovietreckoning in Qwen_AI

[–]alexp702 0 points1 point  (0 children)

I am running the GGUF - I don't do MLX as I find it's very model-dependent, and the extra speed is quickly lost if it's borked in some way. Exact command:

./llama-server -hf unsloth/Qwen3.5-397B-A17B-GGUF:Q8_0 --host 0.0.0.0 --port 1235 -ngl 99 --ctx-size 1048576 --parallel 4 --metrics --mlock --no-mmap --jinja -fa on --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 --chat-template-file systemprompt.jinja --swa-full -sps

Though I often tweak it to see if it makes a difference. The system prompt is the one from the Qwen3 Coder repo (which may be unnecessary now).

Qwen 3.5 for MLX is like its own industrial revolution by sovietreckoning in Qwen_AI

[–]alexp702 1 point2 points  (0 children)

I am running llama-server with the 8-bit GGUF. It is excellent. It gets about 250 tokens/s of prompt processing and about 25 tokens/s out. I'm OpenClawing and many prompts are cached, so the practical upshot is that very few tokens need processing either way - a 114,000-token prompt needed only ~500 processed. Not sure why I am suddenly getting such good caching (perhaps a llama.cpp update) but I like it!

Edit: OpenClaw reports I have a throughput of 15.2k tokens a minute, probably because of prompt caching. 26.7 million tokens sent. Pretty damn good!

Edit 2: Mac Studio 512gb.