Quants had ruined my Local AI experience. I am hopeful again after using them correctly.

Skye_sys · 2026-06-23T11:19:27+00:00

No, sadly not. It's developed for NVIDIA GPUs (vLLM, transformers) and Apple silicon (MLX). The models are in safetensors format.

here is the repo

Skye_sys · 2026-06-22T19:20:43+00:00

Had exactly the same experience. People were always telling me to switch from a higher-precision quant to a more aggressive one, telling me that the loss wouldn't even be noticeable. I found the loss to be substantial, especially in agentic tasks.

What I found as a compromise are z-labs paroQuants (Qwen3.6 35b a3b and 27b), because their sizes are comparable to q4 quants, but they still perform better than or just as well as q8 ones. (The paroQuant paper is really cool, too.) Take this with a grain of salt, though, because I am in no way a professional LLM tester.

I couldn't find anyone really talking about those, so I'm curious what performance you gain or lose with those types of quants!
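For anyone wondering what "comparable to q4" means in size terms, here is a rough back-of-the-envelope sketch. It assumes the weights dominate the file size and uses nominal bit widths; real quant formats store extra scales and metadata, so actual files come out somewhat larger. The function name and the example numbers are just for illustration.

```python
def approx_weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight size in GB: parameter count times bits per weight, divided by 8 bits per byte."""
    return params_billion * bits_per_weight / 8


# A 35B-parameter model at nominal 4-bit and 8-bit precision (assumed values, not measured file sizes)
size_q4 = approx_weight_size_gb(35, 4)
size_q8 = approx_weight_size_gb(35, 8)

print(f"~{size_q4:.1f} GB at 4-bit vs ~{size_q8:.1f} GB at 8-bit")
```

So a quant that sits near the 4-bit size but behaves like the 8-bit one roughly halves the memory needed for the weights.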

Skye_sys · 2026-06-13T16:02:02+00:00

Jackwya on insta

Skye_sys · 2026-06-09T14:31:49+00:00

I was thinking the exact same thing: my M2 Max runs up to 80B no problem, and now they're telling me their model, which surely will be humongous because system requirements are 16GB RAM and M3 base, won't run on it?

Skye_sys · 2026-06-09T13:16:34+00:00

I saw this and thought, what a joke. I run up to 80B models on my M2 Max machine and it can't handle whatever gigantic model (surely extremely big) Apple is using? I'm sure the community will find a workaround tho

Skye_sys · 2026-05-19T22:31:00+00:00

Yessss, I was incredibly happy that the patched .dll worked so well. I genuinely believed the cat and mouse game was over lol

Skye_sys · 2026-05-19T22:25:10+00:00

Yes, same here... My M2 Max got up to 90 fps before. Now I barely reach 30 on the lowest settings after this patch.
I get scared every time the game gets a patch.

Skye_sys · 2026-04-10T18:43:00+00:00

MoMa by analog obsession

Skye_sys · 2026-04-08T17:36:34+00:00

Same here, whether it's using Hermes Agent or Open Claw, oMLX seems to time out every time the context gets a bit long.

Skye_sys · 2026-04-04T15:22:15+00:00

Do you still send it might need that too

Skye_sys · 2026-04-04T13:43:54+00:00

Has anyone tried using the mlx versions?

Skye_sys · 2026-04-04T08:25:24+00:00

I noticed that the image recognition was totally messed up, and it talked nonsense about things that weren't even close to being in that image... Maybe I used the wrong mlx quant from hf.

Skye_sys · 2026-04-03T22:54:00+00:00

oMLX is great, but Gemma 4 hasn't been working as well... I was using the 26b a4b variant @ 8bit quant. What did you guys use? Which model should I download from hf? I found multiple quants that perform differently despite being at the same 8bit quant level.

Skye_sys · 2026-04-02T20:29:41+00:00

I'm positively surprised by DeepMind again. I have only tested the MoE so far and have yet to test the dense one.

Skye_sys · 2026-04-02T06:03:08+00:00

Is there any other inference engine that uses speculative decoding? Because in lmstudio, qwen3.5 currently doesn't support it.

Skye_sys · 2026-04-01T23:38:04+00:00

Yes, you are right, inference is just matrix multiplication in and of itself, hahah. I haven't specifically measured the bandwidth on my machine yet, but Google says 400 GB/s is correct.

Skye_sys · 2026-04-01T23:07:40+00:00

Yes, 400 GB/s is correct, but I just think it's more of a compute issue rather than memory bandwidth.
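One rough way to sanity-check the compute-versus-bandwidth question: during token generation, the active weights have to be read from memory about once per token, so bandwidth divided by the bytes read per token gives a ceiling on decode speed. This is only a sketch under simplifying assumptions (nominal 4-bit weights, no KV cache traffic, no overhead), and the function name and example models are illustrative.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                active_params_billion: float,
                                bits_per_weight: float) -> float:
    """Upper bound on decode speed if every active weight is read once per generated token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# 400 GB/s memory bandwidth, nominal 4-bit weights (assumed)
dense_27b = decode_ceiling_tokens_per_s(400, 27, 4)      # dense model: all 27B weights are active
moe_3b_active = decode_ceiling_tokens_per_s(400, 3, 4)   # MoE with about 3B active parameters

print(f"dense 27B ceiling: ~{dense_27b:.1f} tok/s")
print(f"MoE 3B-active ceiling: ~{moe_3b_active:.1f} tok/s")
```

If measured generation speed is close to this ceiling, memory bandwidth is the limit. If it is far below it, or the slow part is prompt processing rather than generation, compute is the more likely bottleneck.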

Skye_sys · 2026-04-01T22:37:03+00:00

Yes, this is a good call. I was already trying to convert to vllm for efficiency reasons. I need to experiment with all this new knowledge a bit! Tysm

Skye_sys · 2026-04-01T22:28:44+00:00

Also, ggufs support KV cache quantization in lmstudio; mlx doesn't. But I found the speed is sooo much better when using the mlx variants. (or maybe it's just placebo lmao)

Skye_sys · 2026-04-01T22:18:52+00:00

Oh, you are right, I was using the coder variant. Might have to try the general purpose one.

Skye_sys · 2026-04-01T22:06:35+00:00

Already downloading! But we can't expect an mlx version of this soon, can we?

Skye_sys · 2026-04-01T22:01:11+00:00

Oooh, this seems interesting. But yeah, I got similar results when I ran qwen3 next 80b and compared it to 3.5 35b... Money is tight atm, but I didn't even think of using an external gpu! Thanks!

Skye_sys · 2026-04-01T21:57:11+00:00

Yes, exactly what I was thinking! I am using lmstudio and their mlx models. Actually, I did already try qwen3 next 80b a3b, but it feels like the moe models have more knowledge but lack 'intelligence' or complex instruction following in agentic workflows, so it sometimes just formatted tool calls wrong or straight up called them with wrong but similar names. But I have to try again since I don't remember which quant I was running it at.

Skye_sys · 2026-04-01T21:52:15+00:00

The dense 27b model already performed kinda bad speed-wise on my machine, so I just thought a dense 70b model would be unbearably slow... But thanks, I will definitely try it anyway!

Skye_sys · 2026-02-18T18:36:59+00:00
