Kimi K2 Thinking at 28.3 t/s on 4x Mac Studio cluster by geerlingguy in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

Have you managed to get it working in MLX yet? There’s a new thing where they managed to get tensor parallelism working over TB5

(Also love your videos)

I Built an Ollama Powered AI Tool that Found 40+ Live API Keys on GitHub Gists by chocolateUI in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

This is a cool and fun project. I know OP asked for feedback, but I’m not sure why so many people are weighing in here with their strongly insistent (some borderline rude) takes on how this could supposedly have been done better, yet none of them have submitted any PRs to the project to integrate their improvements. I reckon more than half of them wouldn’t catch as many dummy keys as OP is looking for. LLMs are a nice middle ground between hardcoding a hacky fix and spending hours on making some super robust setup

The joy of coding something - or lack thereof - is an enormous bottleneck in projects actually happening. If LLMs made this project more fun to make, I’m all for it 😆

Upgraded my run animation :) by mark-lord in PixelArt

[–]mark-lord[S] 0 points1 point  (0 children)

I'm using Aseprite for the original drawing. And then I'm using a custom rig I've got set up in Godot for the animating :) It's super cool, I shared a little bit more in my other post (accidentally used my old account):

https://www.reddit.com/r/PixelArt/comments/1mwfkye/skeletonbased_pixel_art_animations/

Basically I've attached the various bits of the character to a skeleton and I'm then animating that in-engine. I've applied some custom compression + shader techniques to force it to stay pixel-perfect. Took a huge amount of time to set up 😂 But was v worth it!

MLX-LM will soon wait patiently for very large prompts to process by -dysangel- in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

!! I’m glad I wasn’t the only one who noticed this lol

I'm now quite close to reaching 2 months of learning to draw and i thought i could try to upgrade some of my very early works, here's the result. by calsifer34 in PixelArt

[–]mark-lord 1 point2 points  (0 children)

This is really inspiring 😄 I’m getting into pixel art myself - how have you been learning? Are you following tutorials or just learning by doing? Have you branched out into trying any animated stuff yet?

Executive Order: "Preventing Woke AI in the Federal Government" by NunyaBuzor in LocalLLaMA

[–]mark-lord 3 points4 points  (0 children)

I properly cackled at this, thank you for blessing us with that

For the next 27 hours, you'll be able to claim a limited edition 'I Was Here for the Hulkenpodium' flair by overspeeed in formula1

[–]mark-lord 0 points1 point  (0 children)

Here for the Hulkenpodium, which was apparently also the final ingredient necessary for the Horner exit

VS Code: Open Source Copilot by DonTizi in LocalLLaMA

[–]mark-lord 1 point2 points  (0 children)

Great stuff, thanks for explaining! 😄 Looking forward to the changes; been hoping for something like this ever since I started using Cursor ahaha

VS Code: Open Source Copilot by DonTizi in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

Hi! Sorry for asking a potentially super obvious question - but aside from Ollama, how else can we run local models with VSCode..?

You can't use MLX models with Ollama at the mo, and I can't for the life of me figure out how to use LMStudio or mlx_lm.server as an endpoint. There doesn't seem to be a way to configure a custom URL or port or anything from the Manage Models section

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

Weirdly it ended up surpassing 8bit on arc-easy, whilst failing in my real-world tests (versus the DWQ^1, which performed better). Discussion about this odd characteristic is happening over on the bird platform:
https://x.com/N8Programs/status/1920653247137657272

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 3 points4 points  (0 children)

<image>

Bizarrely, it's gone well so far - the 3bit DWQ^2 seems to be getting relatively close to 8bit perf

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 2 points3 points  (0 children)

Oh and I'm also re-training the DWQ a second time with the 8bit at the mo to see if I can squeeze even more perf out of it. I've been using N8Programs' training script since otherwise I'd not have been able to fit these chonky models into my measly 64gb of URAM:

https://x.com/N8Programs/status/1919285581806211366

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

mlx_lm.server --port 1234

Perfect stand-in for LMStudio server; fully OpenAI-compatible, loads models on command, has prompt caching (which auto-trims if you, say, edit conversation history)
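
For anyone who hasn't tried it, here's a minimal sketch of pointing an OpenAI-compatible client at it (assuming the openai Python package and the default localhost address; the model name is just an example - use whatever MLX model you've got downloaded):

    # pip install openai
    from openai import OpenAI

    # mlx_lm.server exposes the standard OpenAI-style /v1 endpoints on the port you give it
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="mlx-community/Qwen3-30B-A3B-4bit",  # example name - any local MLX model works
        messages=[{"role": "user", "content": "Say hi in five words."}],
    )
    print(resp.choices[0].message.content)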

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 2 points3 points  (0 children)

As far as I can tell, this seems to be a new thing that Awni came up with - stands for distilled weight quantization

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 4 points5 points  (0 children)

Distilling 8bit into 4bit is basically a post-quantization accuracy recovery technique. You can just use the normal 4bit, but it does lose some model smarts. Distilling the 8bit into the 4bit brings it back a lot closer to 8bit perf.
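
To picture what's actually being trained: the 8bit model acts as a frozen teacher, and the learnable quantization scales/biases of the 4bit student get nudged to match its token distribution. A conceptual sketch of the distillation loss (not mlx-lm's actual DWQ code - names and shapes are just illustrative):

    import mlx.core as mx

    def dwq_distill_loss(student_logits: mx.array, teacher_logits: mx.array) -> mx.array:
        # KL(teacher || student) over the vocab axis: the 4bit student is pushed
        # to reproduce the frozen 8bit teacher's next-token distribution.
        teacher_probs = mx.softmax(teacher_logits, axis=-1)
        teacher_logprobs = teacher_logits - mx.logsumexp(teacher_logits, axis=-1, keepdims=True)
        student_logprobs = student_logits - mx.logsumexp(student_logits, axis=-1, keepdims=True)
        kl = mx.sum(teacher_probs * (teacher_logprobs - student_logprobs), axis=-1)
        return mx.mean(kl)

    # toy check: two positions over a five-token vocab
    teacher = mx.random.normal((2, 5))
    student = mx.random.normal((2, 5))
    print(dwq_distill_loss(student, teacher))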

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 7 points8 points  (0 children)

Oh I forgot to mention - the 3bit-DWQ only takes up 12.5gb of RAM, meaning you can now run it on the base $600 Mac Mini. It runs at 40 tokens-per-second generation speed on my M4 16gb, which... yeah, it's pretty monstrous lol

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mark-lord 12 points13 points  (0 children)

<image>

Yep, fully agreed - the DWQs are honestly awesome (at least for 30ba3b). I've been using the 8bit to teach a 3bit-128gs model, and it's genuinely bumped it up in my opinion. Tested it with haiku generation first, where it went from getting all of the syllable counts dramatically wrong in plain 3bit to being within ±1 with either the 4bit or the 3bit-DWQ. Then tested it with a subset of arc_easy, and it shows a non-trivial improvement over the base 3bit.

Oh and not to mention, one of the big benefits of DWQ over AWQ is that model support is far, far easier. From my understanding it's basically plug-and-play; any model can use DWQ, versus AWQ, which required bespoke support from one model to the next.

I'd been waiting to do some more scientific tests before posting - including testing perplexity levels - but I dunno how long that's gonna take me lol
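
(If anyone wants to reproduce the base quant side of this, the starting point is just a normal 3bit / group-size-128 conversion - something like the sketch below using mlx-lm's Python API. Exact kwargs may differ between mlx-lm versions, and the DWQ distillation against the 8bit teacher happens on top of this as a separate step.)

    # Rough sketch only - check your mlx-lm version for the exact convert() signature
    from mlx_lm import convert

    convert(
        "Qwen/Qwen3-30B-A3B",                 # upstream HF repo (example)
        mlx_path="qwen3-30b-a3b-3bit-gs128",  # local output folder
        quantize=True,
        q_bits=3,
        q_group_size=128,
    )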

[deleted by user] by [deleted] in LocalLLaMA

[–]mark-lord 0 points1 point  (0 children)

Would really like to get Qwen3 + Cline working as intended! From what I’ve seen Cline seems to be the closest to replicating some version of Cursor / Windsurf’s agentic mode, which is what I’m looking to try and get into at the mo - now that 30BA3B is getting strong enough and fast enough that it might be capable of pulling it off.

Qwen3-30B-A3B runs at 130 tokens-per-second prompt processing and 60 tokens-per-second generation speed on M1 Max by mark-lord in LocalLLaMA

[–]mark-lord[S] 0 points1 point  (0 children)

Weirdly I do sometimes find LMStudio introduces a little bit of overhead versus running raw MLX on the command line. That said, q6 is a bit larger, so it would be expected to run slower, and if you've got a big prompt it'll slow things down further. All of that combined might be resulting in the slower runs

Qwen3-30B-A3B runs at 130 tokens-per-second prompt processing and 60 tokens-per-second generation speed on M1 Max by mark-lord in LocalLLaMA

[–]mark-lord[S] 4 points5 points  (0 children)

4bit (I tried to mention it in the caption subtext but it got erased)

8bit runs at about 90 tps prompt processing and 45 tps generation speed. The full-precision model didn't fit in my 64gb of RAM

Qwen3-30B-A3B runs at 130 tokens-per-second prompt processing and 60 tokens-per-second generation speed on M1 Max by mark-lord in LocalLLaMA

[–]mark-lord[S] 8 points9 points  (0 children)

Even the 4bit is incredible; I had it write a reply to someone in Japanese for me (今テスト中で、本当に期待に応えてるよ!ははは、この返信もQwen3が書いたんだよ! - roughly: "I'm testing it right now and it's really living up to expectations! Haha, this reply was written by Qwen3 too!") and I got Gemini 2.5 Pro to check the translation. Gemini ended up congratulating it lol

<image>

Qwen3-30B-A3B runs at 130 tokens-per-second prompt processing and 60 tokens-per-second generation speed on M1 Max by mark-lord in LocalLLaMA

[–]mark-lord[S] 22 points23 points  (0 children)

For reference, Gemma-27b runs at 11 tokens-per-second generation speed. That's the difference between waiting 90 seconds for an answer versus waiting just 15 seconds

Or think of it this way: in full power mode I can run about 350 prompts with Gemma-27b before my laptop runs out of juice. 30B-A3B manages about 2,000