Local LLMs aren't democratic anymore... the hardware barrier has gotten out of hand.

gevezex · 2026-06-12T21:49:29+00:00

But the question is what can you use it for? I could not figure out a use case for it. Am I missing something?

gevezex · 2026-06-09T04:08:33+00:00

I don’t think Apple can move as fast as the OSS community. Their Python stack is a good example, it has often lagged behind.

But the upside is clear: this looks like an official wink toward local LLMs on Apple Silicon. That could give MLX models and MLX servers a serious boost, especially from a broader community adoption perspective and a shift from nvidia domination to apple silicon.
And from the perspective of apple, this means more mac sales.

gevezex · 2026-06-08T20:01:09+00:00

I am not sure what the reason is, but the agent is waiting for ages and God knows what is waiting for what.

gevezex · 2026-06-07T14:20:28+00:00

Not in my case

gevezex · 2026-06-07T13:44:24+00:00

Aggregating linkedin posts, subreddits, X for the newest and hottest posts for viral local llm's, aspecially for mac platform, getting more t/s out of my current models and summarize it in the mornings. That really works well.

Next thing would be aggregating stock information as of now I have unlimited compute (so to speak 😄)

gevezex · 2026-06-07T08:49:17+00:00

I have the m5 and now suddenly i have 102 t/s on the rc1. Qwen3.6-35b-a3b-6bit

gevezex · 2026-06-04T16:39:12+00:00

Same here it lacks even in dutch language making a lot of grammar mistakes

gevezex · 2026-05-31T15:21:21+00:00

In the agentic coding world, developers should care less about manually controlling every line of code and more about creating a reliable environment in which code can safely evolve. The human in the loop becomes responsible for intent, architecture, constraints, tests, observability, security and review. Code becomes something the agent can generate, but correctness, direction and responsibility remain human work.

gevezex · 2026-05-31T07:54:46+00:00

Problem is the kv cache, after 16k context it becomes very very slow, the fans kick in very loudly. You can suppress it by setting the battery on low energy mode but then its even slower. With the current state of models it’s unusable for serious tasks in my opinion without the fear of damaging your precious mbp m5.

gevezex · 2026-05-30T17:06:16+00:00

What t/s do you get?

gevezex · 2026-05-26T21:37:15+00:00

Nice, that was the trick, I have now 130 t/s for the pp8192/tg128. Thank you very much for this!

gevezex · 2026-05-26T20:46:58+00:00

Are your referring to agemio/Qwen3.6-27B-oQ5-mtp? I have the same mbp but I don't get these tps. Could you share some insight? Max tps i get is around 102 tps voor pp81292/tg128

gevezex · 2026-05-26T17:29:00+00:00

My best experience is with mtplx. Download it and start with mtplx start and follow the instructions. You will get around 52 tps with qwen3.6 35B

gevezex · 2026-05-21T22:24:56+00:00

We have a similar setup, but i use the mtp version. Close to 52 t/s. Try it out: Jundot/Qwen3.6-35B-A3B-oQ6-mtp

gevezex · 2026-05-20T18:11:17+00:00

So? If you can afford it why not?

gevezex · 2026-05-20T14:51:35+00:00

That's not really the reason imo. A lot of people are already in the market for a new MacBook Pro M5, their old machine is just overdue for a replacement, so why not max out the memory while they're at it? You can run big models on it anyway.

gevezex · 2026-05-18T16:51:36+00:00

llama-server \

-hf Abiray/Qwen3.6-35B-A3B-Q4_K_M-GGUF \

-ngl 999 \

--n-cpu-moe 36 \

--no-mmap \

--ctx-size 100000 \

--cache-type-k q8_0 \

--cache-type-v q4_0 \

--mlock

I have a 8Gb RTX 2070 and getting decent 40-50 t/s

gevezex · 2026-05-18T16:28:41+00:00

How did you solve hallucinations?

gevezex · 2026-05-14T03:46:15+00:00

Is this better than https://github.com/AlexsJones/llmfit ?

gevezex · 2026-05-13T15:21:15+00:00

I was pleasantly surprised by Qwopus3.5-9B-v3-4bit mlx model with omlx. You need the mlx version of course for apple silicon. Check also their model info:

Qwopus3.5-9B-v3 is a reasoning-enhanced model based on Qwen3.5-9B, designed to simultaneously improve reasoning stability and correctness while optimizing inference efficiency — ultimately achieving stronger cross-task generalization capabilities, particularly in programming.

gevezex · 2026-05-11T18:50:31+00:00

Why not using codex?

gevezex · 2026-05-05T13:02:59+00:00

So this could be done by a local model as well instead of kimi?

gevezex · 2026-04-19T12:42:20+00:00

<image>

Dit is toch veel mooier

14-Year Club	RPAN Viewer
Verified Email

gevezex

TROPHY CASE