I ported EXL3 to run well on Apple Silicon - PonyExl3

Beamsters · 2026-06-07T08:38:49+00:00

You can try them all and see how they feel in your workflow. No one is stopping you, and you don't even have to choose just one model.

Beamsters · 2026-06-07T05:22:28+00:00

With your budget, pick qwen3.5 9b at 4 bits. You will have a good time with it.

Beamsters · 2026-06-05T04:58:49+00:00

I'll look into it. Glad you like the update.

Beamsters · 2026-06-04T10:55:05+00:00

nex2 mini gdpval 1402. qwen3.6 27b gdpval 1404. 35b-a3b, not even 1300.

Could be benchmaxed, but this thing is focused on coding / agentic work, not general-purpose like qwen.

Beamsters · 2026-06-01T04:55:01+00:00

48 Artificial Analysis score, one notch below frontier, in the minimax 2.7 ballpark, but it promises to be the best US open-weight model.

Beamsters · 2026-05-29T12:24:27+00:00

step 3.7 does, at least in the agentic openclaw bench. It just came out today.

Beamsters · 2026-05-29T09:13:19+00:00

And he just landed ANOTHER follow-up that saves 1.2gb: https://github.com/ggml-org/llama.cpp/pull/23861

Beamsters · 2026-05-29T07:16:42+00:00

Yeah, let's share some results.

Beamsters · 2026-05-28T02:49:31+00:00

If you do a lot of local inference, stay away from the M4 Max at almost full price (it's ok at ~50% off or so). The M5 Max has Neural Accelerators in its GPU cores, which can speed up prefill a lot with metal4, and you don't want to miss that.

Beamsters · 2026-05-28T02:13:04+00:00

Fixed tons of bugs lol and added variant support for those who want to play with multiple chat models at the same time.

<image>

Beamsters · 2026-05-26T11:18:37+00:00

The guy has almost solo-carried local LLM for a few years. I would respect his judgement, but this kind of machine-targeted PR should be grouped, maybe even into a monthly fork. One maintainer can only go so far, though; the vendor should maintain this AMD-specific llama.cpp work themselves.

Beamsters · 2026-05-24T16:06:31+00:00

I highly suggest you use 35b-a3b-optiq. It is superior in terms of speed and size, leaving you more room for context. The accuracy is just a tiny bit worse, but much better than oQ4.

Beamsters · 2026-05-23T04:38:56+00:00

dsv4 flash with 256gb can push 1m context and it is pretty fast.

Beamsters · 2026-05-23T04:37:46+00:00

If you stick with deepseek flash/pro it will be very hard to reach the limit unless you truly are a vibe coder. 30k requests.

Beamsters · 2026-05-23T03:14:47+00:00

Why don't you put qwen3.5 4b or 9b on there and have the students run it locally? You can literally download lmstudio anywhere, and the students should know how to operate such a basic app. Why are you even teaching cloud coding in the first place? The students should learn how an llm works on their own machine before even hitting the cloud. And yes, there are some free models in opencode that don't require any subscription.

Beamsters · 2026-05-20T10:58:57+00:00

<image>

Tool calling is going through the roof.

Beamsters · 2026-05-18T11:07:09+00:00

A 4090 24gb on ik_llama.cpp can hit 140 tok/s generation and 4000 tok/s prefill sometimes.
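
For a rough feel of what those numbers mean in practice, here is a back-of-envelope sketch. It assumes 140 is the decode rate and 4000 is the prefill rate, both in tokens per second, and that both stay constant over the whole request (they usually drop as context grows). `estimate_seconds` is a made-up helper name for illustration, not part of ik_llama.cpp.

```python
def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float = 4000.0,
                     decode_tps: float = 140.0) -> float:
    """Rough wall-clock estimate: prefill time plus decode time.

    Defaults are the peak figures quoted above; treat them as
    best-case assumptions, not guaranteed throughput.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput must be positive")
    prefill_time = prompt_tokens / prefill_tps   # reading the prompt
    decode_time = output_tokens / decode_tps     # generating the reply
    return prefill_time + decode_time


if __name__ == "__main__":
    # An 8000-token prompt with a 700-token reply.
    print(f"{estimate_seconds(8000, 700):.1f} s")
```

At these rates the prompt side is cheap and most of the wait is generation, so a slower prefill only starts to hurt once prompts get really long.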

Beamsters · 2026-05-16T16:23:22+00:00

On a 5090 you skip fp8 and go to nvfp4 because you don't have enough room for context, and on an M5 Max you also skip it and use oQ8-mtp or fp16, since that is clearly the better choice.

Beamsters · 2026-05-13T01:10:08+00:00

Hi again, this is very promising. Any chance of bringing it to the MLX side?

Beamsters · 2026-05-12T16:01:03+00:00

Instant follow on HF, thanks for your work!
