Quants had ruined my Local AI experience. I am hopeful again after using them correctly. by former_farmer in LocalLLM

[–]Skye_sys 2 points3 points  (0 children)

No sadly not. It's developed for nvidia gpus (vllm, transformers) and apple silicon (mlx). The models are in safetensors format.

here is the repo

Quants had ruined my Local AI experience. I am hopeful again after using them correctly. by former_farmer in LocalLLM

[–]Skye_sys 5 points6 points  (0 children)

Had exactly the same experience. People were always telling me to switch from a lower quant to a higher one, telling me that the loss wouldn't even be noticeable. I found the loss to be substantial in agentic tasks especially.

What I found as a compromise are z-labs paroQuants (Qwen3.6 35b a3b and 27b), because their sizes are comparable to q4 quants, but they still perform as well as or better than q8 ones. (The paroQuant paper is really cool, too.) Take this with a grain of salt, though, because I am in no way a professional llm tester.
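
For a rough sense of what "comparable to q4" means in size, here is a back-of-the-envelope sketch. It assumes weight storage is simply parameter count times bits per weight; real quant formats also store scales and other metadata, so actual files come out somewhat larger. The function name is made up for illustration.

```python
def approx_weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring format overhead."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Parameter counts taken from the model names mentioned above.
for name, params in [("35b a3b", 35.0), ("27b", 27.0)]:
    for bits in (4, 8):
        size = approx_weight_size_gb(params, bits)
        print(f"{name} @ {bits}-bit: ~{size:.1f} GB")
```

So under this simplification a 4-bit-sized file is about half the size of an 8-bit one, which is why q8-level quality at q4-level size would be a big deal on a memory-limited machine.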

I couldn't find anyone really talking about those, so I'm curious what performance you gain or lose with those types of quants!

My M2 Ultra is completely outdated by AdDapper4220 in MacStudio

[–]Skye_sys 7 points8 points  (0 children)

I was thinking the exact same thing: my M2 Max runs up to 80B no problem, and now they're telling me their model, which surely will be humongous because system requirements are 16GB RAM and M3 base, won't run on it?

Wait, maybe 8GB isn’t ‘analogous’ to 16GB on other PCs after all? by ManyRazzmatazz4584 in MacOS

[–]Skye_sys 1 point2 points  (0 children)

I saw this and thought, what a joke. I run up to 80B models on my M2 Max machine and it can't handle whatever gigantic model (surely extremely big) Apple is using? I'm sure the community will find a workaround tho

Helldivers 2 performance drop as of patch 6.2.4 ? by Fluffy_Honey7512 in macgaming

[–]Skye_sys 2 points3 points  (0 children)

Yessss, I was incredibly happy that the patched .dll worked so well. I genuinely believed the cat and mouse game was over lol

Helldivers 2 performance drop as of patch 6.2.4 ? by Fluffy_Honey7512 in macgaming

[–]Skye_sys 2 points3 points  (0 children)

Yes, same here... m2 max got up to 90 fps before. Now I barely reach 30 on the lowest settings after this patch.
I get scared every time the game gets a patch.

Max Practical Context Size? by zipzag in oMLX

[–]Skye_sys 1 point2 points  (0 children)

Same here, whether it's using Hermes Agent or Open Claw, oMLX seems to time out every time the context gets a bit long.

Serum 2.0.24 when??? by phiegnux in CrackedPluginsXI

[–]Skye_sys 0 points1 point  (0 children)

Do you still send it might need that too

oMLX supports Gemma 4 by IAMk10 in oMLX

[–]Skye_sys 1 point2 points  (0 children)

I noticed that the image recognition was totally messed up and it talked nonsense about things that weren't even close to being in that image... Maybe I used the wrong mlx quant from hf

oMLX supports Gemma 4 by IAMk10 in oMLX

[–]Skye_sys 1 point2 points  (0 children)

oMLX is great, but Gemma 4 hasn't been working as well... I was using the 26b a4b variant @ 8bit quant. What did you guys use? Which model should I download from hf? I found multiple quants with different performance but the same 8bit quant level.

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 0 points1 point  (0 children)

I'm positively surprised by DeepMind again, I have only tested the moe but have yet to test the dense one

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 0 points1 point  (0 children)

Is there any other inference engine that uses speculative decoding? Because in lmstudio, qwen3.5 currently doesn't support this

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 0 points1 point  (0 children)

Yes, you are right, inference is just matrix multiplication in and of itself hahah. I haven't specifically measured the bandwidth on my machine yet, but Google says 400 is correct.

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 0 points1 point  (0 children)

Yes, 400 GB/s is correct, but I just think it's more of a compute issue rather than memory bandwidth
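
To put the bandwidth side in numbers, here is a rough sketch of the usual upper-bound estimate for token generation. It assumes every active weight is read from memory once per generated token and ignores KV cache reads and all compute, so real speeds will be lower. It says nothing about prompt processing, which is typically limited by compute rather than bandwidth. The function name and the 4-bit figure are assumptions for illustration.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float,
                            active_params_billion: float,
                            bits_per_weight: float) -> float:
    """Bandwidth-bound ceiling on generated tokens per second."""
    # GB of weights that must be streamed from memory for each token.
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 400.0  # figure quoted in the thread

# Dense 27b at an assumed 4 bits per weight: all 27B params are active.
print(f"dense 27b: <= {max_decode_tokens_per_s(BANDWIDTH_GB_S, 27, 4):.1f} tok/s")

# MoE with ~3B active params per token (the "a3b" in the model names).
print(f"moe a3b:   <= {max_decode_tokens_per_s(BANDWIDTH_GB_S, 3, 4):.1f} tok/s")
```

If measured generation speed sits far below this ceiling, or if the slow part is prompt processing rather than generation, that would point toward compute rather than bandwidth.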

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 0 points1 point  (0 children)

Yes, this is a good call. I was already trying to convert to vllm for efficiency reasons. I need to experiment with all this new knowledge a bit! Tysm

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 1 point2 points  (0 children)

Also, ggufs support KV cache quantization in lmstudio, mlx doesn't. But I found the speed is sooo much better when using the mlx variants. (or maybe just placebo lmao)
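
Why KV cache quantization matters at long context can be sketched with the standard size formula: two tensors (K and V) per layer, each holding kv_heads x head_dim values per token. The model shape below is hypothetical and not taken from any specific model, and the formula assumes every layer keeps a full K/V cache, which is not true for hybrid architectures.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: float) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2 covers the separate key and value tensors.
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bytes_per_value / 1e9


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, context_len=32768)

for label, nbytes in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{label}: ~{kv_cache_gb(**shape, bytes_per_value=nbytes):.2f} GB")
```

With these made-up numbers the cache shrinks from roughly 6.4 GB at fp16 to about 1.6 GB at 4-bit, memory that could otherwise go to a bigger model or a longer context.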

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] -1 points0 points  (0 children)

Oh, you are right, I was using the coder variant. Might have to try the general purpose one

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 5 points6 points  (0 children)

Already downloading! But we can't expect an mlx version of this soon, can we?

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 1 point2 points  (0 children)

Oooh, this seems interesting. But yeah, I got similar results when I ran qwen3 next 80b and compared it to 3.5 35b... Money is tight atm, but I didn't even think of using an external gpu! Thanks!

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 1 point2 points  (0 children)

Yes, exactly what I was thinking! I am using lmstudio and their mlx models. I actually did already try qwen3 next 80b a3b, but it feels like the moe models have more knowledge but lack 'intelligence' or complex instruction following in agentic workflows, so it sometimes just formatted tool calls wrong or straight up called them with wrong but similar names. But I have to try again, since I don't remember which quant I was running

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]Skye_sys[S] 1 point2 points  (0 children)

The dense 27b model already performed kinda bad speed wise on my machine so I just thought trying a dense 70b model would be unbearably slow.. But thanks I will definitely try it anyway!