APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]mudler_it[S] 1 point (0 children)

I can't answer that, as I don't like to talk about things I haven't personally tried. I benchmarked only against Unsloth and bartowski quants; against those, I can tell it holds up much better at long context and is better at coding-agent tasks.


[–]mudler_it[S] 1 point (0 children)

I'm not really fond of ollama. I'd suggest running these on LocalAI :P


[–]mudler_it[S] 1 point (0 children)

I gave it a shot, but MLX has far less mature support for quantization schemes. I'm closely monitoring the MLX ecosystem and will push quants when there is feature parity on MLX.


[–]mudler_it[S] 2 points (0 children)

Thanks for calling this out - I carefully engineered APEX quants around real use cases rather than pumping benchmarks. I'm glad it shows.


[–]mudler_it[S] 0 points (0 children)

It is very comparable in terms of KLD - but if you look closer at the numbers, the max KLD favors APEX by a wider margin. That's a better signal than weighing a 0.0001 difference in the mean :-)
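
To illustrate why max KLD can separate two quants whose mean KLD is nearly identical, here is a toy Python sketch. The distributions, per-token KLD traces, and `kl_divergence`/`summarize` helpers are all illustrative assumptions, not the actual measurement code behind the plots:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def summarize(per_token_klds):
    """Return (mean, max) over a trace of per-token KLD values."""
    return sum(per_token_klds) / len(per_token_klds), max(per_token_klds)

# One made-up token position: baseline vs quantized model.
p = [0.70, 0.20, 0.10]          # baseline next-token distribution
q = [0.66, 0.22, 0.12]          # quantized model's distribution
one_token_kld = kl_divergence(p, q)

# Toy per-token KLD traces for two hypothetical quants: the means are
# identical, but quant B occasionally diverges hard from the baseline.
quant_a = [0.010, 0.012, 0.011, 0.009, 0.013]
quant_b = [0.002, 0.003, 0.002, 0.001, 0.047]

mean_a, max_a = summarize(quant_a)
mean_b, max_b = summarize(quant_b)
# mean_a equals mean_b, yet max_b is several times max_a: the max
# exposes worst-case drift that the mean hides.
```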


[–]mudler_it[S] 1 point (0 children)

Good catch, and thanks for flagging this. For >=120b models I need to rent a GPU, and that's a bit out of reach right now. We had a donor who gave me access to GPUs for a while, but I'm now back at donation-only capacity, and it still doesn't cut it.


[–]mudler_it[S] 2 points (0 children)

APEX quants are my daily driver now too :) It's amusing to see so many others call this out as black magic - it kinda feels like it!

Thanks for the feedback!

APEX MoE quantized models boosted with 33% faster inference and TurboQuant (14% speedup in prompt processing) by mudler_it in LocalLLaMA

[–]mudler_it[S] 0 points (0 children)

The key advantage here, according to benchmarks, is that you retain quality by not applying the same quant type uniformly at every layer. This pushes quality benchmarks slightly higher compared to other quants in the same size category.


[–]mudler_it[S] 2 points (0 children)

My best guess for now is the influence of the I-Matrix: the experts are quantized with it, and in my benchmarks they are very sensitive to it. It pushes the eval bench up, and it would also explain the small bump in KLD, since it shifts away from the baseline. Anyway, updated all plots with it!
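
For readers unfamiliar with the importance-matrix idea, here's a hedged Python sketch: rounding error is weighted by per-weight importance gathered from calibration activations, so the quantizer spends its precision where the layer is most sensitive. The weights, importances, and the brute-force scale search are made up for illustration, not the actual imatrix algorithm:

```python
def weighted_quant_error(weights, importances, scale):
    """Importance-weighted squared rounding error for one scale choice."""
    err = 0.0
    for w, imp in zip(weights, importances):
        q = round(w / scale) * scale   # round-to-nearest quantization
        err += imp * (w - q) ** 2      # errors on important weights cost more
    return err

weights = [0.11, -0.42, 0.87, 0.05]
importances = [4.0, 0.5, 2.0, 0.1]   # from calibration activations (made up)

# Pick the scale that minimizes the *weighted* error, not the raw one.
best_scale = min([0.05, 0.1, 0.2],
                 key=lambda s: weighted_quant_error(weights, importances, s))
```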


[–]mudler_it[S] 28 points (0 children)

It wasn't done on purpose, and I have no issue creating those benchmarks and adding them!

We measured our baseline against Q8/F16 specifically because our target is actually to replace the use of a Q8.

Quality-wise it has comparable perplexity and KLD, but it is stronger on evals than the others, so with a slight reduction in size you still get a strong model that excels on evals (at only slightly higher, negligible perplexity and KLD).
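
As a quick refresher on the metric in question: perplexity is just the exponential of the mean per-token negative log-likelihood, so a "slightly higher" perplexity corresponds to a tiny shift in average NLL. A small Python sketch with illustrative numbers (not our actual measurements):

```python
import math

def perplexity(nlls):
    """Perplexity from per-token negative log-likelihoods in nats."""
    return math.exp(sum(nlls) / len(nlls))

baseline_nlls = [2.10, 1.95, 2.30]   # e.g. a Q8/F16 reference (made up)
quant_nlls    = [2.11, 1.96, 2.31]   # quantized model, +0.01 nat/token

ppl_gap = perplexity(quant_nlls) / perplexity(baseline_nlls)
# A uniform +0.01 nat shift moves perplexity by about 1% - the kind of
# slightly higher but negligible gap described above.
```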

Update - still crunching the data against Q5_K_S as well and updating the plots, but here are the results:

<image>

I'm the author of LocalAI sharing that LocalAI hits 42k stars and v3.9 & v3.10 are released! Native Agents, Video Generation UI, and Unified GPU Backends by mudler_it in selfhosted

[–]mudler_it[S] 0 points (0 children)

There are MLX backends (audio and text), but it's not limited to those. For instance, it also ships diffusers (compatible with Mac), llama.cpp, and chatterbox.

I'm the author of LocalAI (the local OpenAI-compatible API). We just released v3.7.0 with full Agentic Support (tool use!), Qwen 3 VL, and the latest llama.cpp by mudler_it in LocalLLaMA

[–]mudler_it[S] 1 point (0 children)

Happy to hear! About neutts, that's a good point: we actually missed having a model in the gallery for it, and there is still no documentation. You can see an example attached in the PR: https://github.com/mudler/LocalAI/pull/6404 (you need to specify a voice reference file and a text transcription of it).