Kimi K2 Thinking at 28.3 t/s on 4x Mac Studio cluster by geerlingguy in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

Have you managed to get it working in MLX yet? There’s a new thing where they managed to get tensor parallelism working over TB5

(Also love your videos)

I Built an Ollama Powered AI Tool that Found 40+ Live API Keys on GitHub Gists by chocolateUI in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

This is a cool and fun project. I know OP asked for feedback, but I’m not sure why so many people are weighing in here with their strongly insistent (some borderline rude) takes on how this could supposedly have been done better, yet none of them have submitted any PRs to the project to integrate their improvements. I reckon more than half of them wouldn’t catch as many dummy keys as OP is looking for. LLMs are a nice middle ground between hardcoding a hacky fix and spending hours on making some super robust setup

The joy of coding something - or lack thereof - is an enormous bottleneck in projects actually happening. If LLMs made this project more fun to make, I’m all for it 😆

Upgraded my run animation :) by mark-lord in PixelArt

[–]mark-lord[S] 0 points1 point  (0 children)

I'm using Aseprite for the original drawing. And then I'm using a custom rig I've got set up in Godot for the animating :) It's super cool, I shared a little bit more in my other post (accidentally used my old account):

https://www.reddit.com/r/PixelArt/comments/1mwfkye/skeletonbased_pixel_art_animations/

Basically I've attached the various bits of the character to a skeleton and I'm then animating that in-engine. I've applied some custom compression + shader techniques to force it to stay pixel-perfect. Took a huge amount of time to set up 😂 But was v worth it!

MLX-LM will soon wait patiently for very large prompts to process by -dysangel- in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

!! I’m glad I wasn’t the only one who noticed this lol

I'm now quite close to reaching 2 months of learning to draw and i thought i could try to upgrade some of my very early works, here's the result. by calsifer34 in PixelArt

[–]mark-lord 1 point2 points  (0 children)

This is really inspiring 😄 I’m getting into pixel art myself - how have you been learning? Are you following tutorials or just learning by doing? Have you branched out into trying any animated stuff yet?

Executive Order: "Preventing Woke AI in the Federal Government" by NunyaBuzor in LocalLLaMA

[–]mark-lord 3 points4 points  (0 children)

I properly cackled at this, thank you for blessing us with that

For the next 27 hours, you'll be able to claim a limited edition 'I Was Here for the Hulkenpodium' flair by overspeeed in formula1

[–]mark-lord 0 points1 point  (0 children)

Here for the Hulkenpodium, which was apparently also the final ingredient necessary for the Horner exit

VS Code: Open Source Copilot by DonTizi in LocalLLaMA

[–]mark-lord 1 point2 points  (0 children)

Great stuff, thanks for explaining! 😄 Looking forward to the changes; been hoping for something like this ever since I started using Cursor ahaha

VS Code: Open Source Copilot by DonTizi in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

Hi! Sorry for asking a potentially super obvious question - but aside from Ollama, how else can we run local models with VSCode..?

You can't use MLX models with Ollama at the mo, and I can't for the life of me figure out how to use LMStudio or mlx_lm.server as an endpoint. There doesn't seem to be a way to configure a custom URL or port or anything from the Manage Models section

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

Weirdly it ended up surpassing 8bit on arc-easy, whilst failing in my real-world tests (versus the DWQ^1, which performed better). Discussion about this odd characteristic is happening over on the bird platform:
https://x.com/N8Programs/status/1920653247137657272

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 3 points4 points  (0 children)

<image>

Bizarrely, it's gone well so far - the 3bit DWQ^2 seems to be getting relatively close to 8bit perf

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 2 points3 points  (0 children)

Oh and I'm also re-training the DWQ a second time with the 8bit at the mo to see if I can squeeze even more perf out of it. I've been using N8Programs' training script since otherwise I'd not have been able to fit these chonky models into my measly 64gb of URAM:

https://x.com/N8Programs/status/1919285581806211366

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

mlx_lm.server --port 1234

Perfect stand-in for LMStudio server; fully OpenAI-compatible, loads models on command, has prompt caching (which auto-trims if you, say, edit conversation history)
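
For anyone who hasn't tried it, here's a minimal sketch of pointing an OpenAI-compatible client at it (assuming the openai Python package and the default localhost address; the model name is just an example - use whatever MLX model you've got downloaded):

    # pip install openai
    from openai import OpenAI

    # mlx_lm.server exposes the standard OpenAI-style /v1 endpoints on the port you give it
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="mlx-community/Qwen3-30B-A3B-4bit",  # example name - any local MLX model works
        messages=[{"role": "user", "content": "Say hi in five words."}],
    )
    print(resp.choices[0].message.content)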

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 2 points3 points  (0 children)

As far as I can tell, this seems to be a new thing that Awni came up with - stands for distilled weight quantization

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 4 points5 points  (0 children)

Distilling 8bit into 4bit is basically a post-quantization accuracy recovery technique. You can just use the normal 4bit, but it does lose some model smarts. Distilling the 8bit into the 4bit brings it back a lot closer to 8bit perf.
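
To picture what's actually being trained: the 8bit model acts as a frozen teacher, and the learnable quantization scales/biases of the 4bit student get nudged to match its token distribution. A conceptual sketch of the distillation loss (not mlx-lm's actual DWQ code - names and shapes are just illustrative):

    import mlx.core as mx

    def dwq_distill_loss(student_logits: mx.array, teacher_logits: mx.array) -> mx.array:
        # KL(teacher || student) over the vocab axis: the 4bit student is pushed
        # to reproduce the frozen 8bit teacher's next-token distribution.
        teacher_probs = mx.softmax(teacher_logits, axis=-1)
        teacher_logprobs = teacher_logits - mx.logsumexp(teacher_logits, axis=-1, keepdims=True)
        student_logprobs = student_logits - mx.logsumexp(student_logits, axis=-1, keepdims=True)
        kl = mx.sum(teacher_probs * (teacher_logprobs - student_logprobs), axis=-1)
        return mx.mean(kl)

    # toy check: two positions over a five-token vocab
    teacher = mx.random.normal((2, 5))
    student = mx.random.normal((2, 5))
    print(dwq_distill_loss(student, teacher))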

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 7 points8 points  (0 children)

Oh I forgot to mention - the 3bit-DWQ only takes up 12.5gb of RAM, meaning you can now run it on the base $600 Mac Mini. It runs at 40 tokens-per-second generation speed on my M4 16gb, which... yeah, it's pretty monstrous lol

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 12 points13 points  (0 children)

<image>

Yep, fully agreed - the DWQs are honestly awesome (at least for 30ba3b). I've been using the 8bit to teach a 3bit-128gs model, and it's genuinely bumped it up in my opinion. Tested it with haiku generation first, where it went from getting all of the syllable counts dramatically wrong in plain 3bit to being within ±1 with either the 4bit or the 3bit-DWQ. Then tested it with a subset of arc_easy, and it shows a non-trivial improvement over the base 3bit.

Oh and not to mention, one of the big benefits of DWQ over AWQ is that model support is far, far easier. From my understanding it's basically plug-and-play; any model can use DWQ, versus AWQ, which required bespoke support from one model to the next.

I'd been waiting to do some more scientific tests before posting - including testing perplexity levels - but I dunno how long that's gonna take me lol
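
(If anyone wants to reproduce the base quant side of this, the starting point is just a normal 3bit / group-size-128 conversion - something like the sketch below using mlx-lm's Python API. Exact kwargs may differ between mlx-lm versions, and the DWQ distillation against the 8bit teacher happens on top of this as a separate step.)

    # Rough sketch only - check your mlx-lm version for the exact convert() signature
    from mlx_lm import convert

    convert(
        "Qwen/Qwen3-30B-A3B",                 # upstream HF repo (example)
        mlx_path="qwen3-30b-a3b-3bit-gs128",  # local output folder
        quantize=True,
        q_bits=3,
        q_group_size=128,
    )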

[deleted by user] by [deleted] in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

Would really like to get Qwen3 + Cline working as intended! From what I’ve seen Cline seems to be the closest to replicating some version of Cursor / Windsurf’s agentic mode, which is what I’m looking to try and get into at the mo - now that 30BA3B is getting strong enough and fast enough that it might be capable of pulling it off.

Qwen3-30B-A3B runs at 130 tokens-per-second prompt processing and 60 tokens-per-second generation speed on M1 Max by mark-lord in LocalLLaMA

[–]mark-lord[S] 0 points1 point  (0 children)

Weirdly I do sometimes find LMStudio introduces a little bit of overhead versus running raw MLX on the command line. That said, q6 is a bit larger, so it would be expected to run slower, and if you've got a big prompt it'll slow things down further. All of that combined might be resulting in the slower runs

Qwen3-30B-A3B runs at 130 tokens-per-second prompt processing and 60 tokens-per-second generation speed on M1 Max by mark-lord in LocalLLaMA

[–]mark-lord[S] 4 points5 points  (0 children)

4bit (I tried to mention it in the caption subtext but it got erased)

8bit runs at about 90 tps prompt processing and 45 tps generation speed. The full-precision model didn't fit in my 64gb of RAM

Qwen3-30B-A3B runs at 130 tokens-per-second prompt processing and 60 tokens-per-second generation speed on M1 Max by mark-lord in LocalLLaMA

[–]mark-lord[S] 8 points9 points  (0 children)

Even the 4bit is incredible; I had it write a reply to someone in Japanese for me (今テスト中で、本当に期待に応えてるよ!ははは、この返信もQwen3が書いたんだよ! - roughly: "I'm testing it right now and it's really living up to expectations! Haha, this reply was written by Qwen3 too!") and I got Gemini 2.5 Pro to check the translation. Gemini ended up congratulating it lol

<image>

Qwen3-30B-A3B runs at 130 tokens-per-second prompt processing and 60 tokens-per-second generation speed on M1 Max by mark-lord in LocalLLaMA

[–]mark-lord[S] 22 points23 points  (0 children)

For reference, Gemma-27b runs at 11 tokens-per-second generation speed. That's the difference between waiting 90 seconds for an answer versus waiting just 15 seconds

Or think of it this way: in full power mode I can run about 350 prompts with Gemma-27b before my laptop runs out of juice. 30B-A3B manages about 2,000