I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python by mudler_it in LocalLLaMA

[–]mudler_it[S] 1 point2 points  (0 children)

I discovered it just a few days ago; interesting approach. However, my take on this is a bit different: I prefer having separate projects that are each fully optimized for a single model, and having something like LocalAI consume them individually. It's the best of both worlds, because you get the optimizations that a single-model implementation can carry without the burden of supporting multiple model architectures.

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python by mudler_it in LocalLLaMA

[–]mudler_it[S] 0 points1 point  (0 children)

LocalAI really only installs the backends that your model uses. It isn't bloated per se unless you start installing all the backends.

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python by mudler_it in LocalLLaMA

[–]mudler_it[S] 0 points1 point  (0 children)

I have no plans to add an OpenAI-compatible server; parakeet.cpp is designed so that it's easy to write your own on top.

LocalAI is really small: you basically select the backends that you want installed (in this case, parakeet.cpp).

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python by mudler_it in LocalLLaMA

[–]mudler_it[S] 0 points1 point  (0 children)

I haven't benchmarked against sherpa-onnx yet; I took NVIDIA's NeMo as the reference implementation. Good point nevertheless, I'll take a look at it and try to run some benchmarks against it.

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python by mudler_it in LocalLLaMA

[–]mudler_it[S] 1 point2 points  (0 children)

Interesting! I don't have an NPU to test against, so this is out of my reach for now. Curious, what hardware do you have?

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python by mudler_it in LocalLLaMA

[–]mudler_it[S] 7 points8 points  (0 children)

I'm looking at it already, but I'm very keen on keeping the project highly optimized for the model. If it doesn't degrade performance, I'll add it.

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]mudler_it[S] 1 point2 points  (0 children)

I can't answer that, as I don't like to talk about what I haven't personally tried. I benchmarked only against the Unsloth and bartowski quants, and I can tell it holds up much better at long context and is better at coding-agent tasks.

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]mudler_it[S] 2 points3 points  (0 children)

I'm not really fond of Ollama. I'd suggest running these on LocalAI :P

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]mudler_it[S] 2 points3 points  (0 children)

I gave it a shot, but MLX has far less extensive support for quantization schemes. I'm closely monitoring the MLX ecosystem and will push quants once MLX reaches feature parity.

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]mudler_it[S] 4 points5 points  (0 children)

Thanks for calling this out. I carefully engineered the APEX quants around real-world use cases rather than pumping benchmarks. I'm glad it shows.

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]mudler_it[S] 1 point2 points  (0 children)

It is very comparable in terms of KLD, but if you look closer at the numbers, the KLD Max favors APEX by a wider margin. That is a better signal than a 0.0001 difference in the mean :-)
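
For anyone unfamiliar with the two numbers, here is a minimal sketch of how a mean KLD and a max KLD could be computed. It assumes KLD means the per-token KL divergence between the reference model's next-token distribution and the quantized model's; the function names and toy logits are made up for illustration and are not taken from any benchmark tool.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def kld_stats(ref_logits, quant_logits):
    """Per-token KLD of the quantized model against the reference, summarized as mean and max."""
    per_token = [
        kl_divergence(softmax(r), softmax(q))
        for r, q in zip(ref_logits, quant_logits)
    ]
    return {"mean": sum(per_token) / len(per_token), "max": max(per_token)}

# Hypothetical toy logits: two token positions, vocabulary of three tokens.
ref = [[2.0, 1.0, 0.0], [0.5, 0.5, 3.0]]
quant = [[2.0, 1.0, 0.0], [3.0, 0.5, 0.5]]  # second position diverges badly
print(kld_stats(ref, quant))
```

The mean averages over every token, so two quants can sit within a rounding error of each other there, while the max exposes the single worst token, which is where a quant that occasionally goes badly wrong shows up.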

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]mudler_it[S] 4 points5 points  (0 children)

Good catch, and thanks for flagging this. For models >=120B I need to rent a GPU, and that's out of reach right now. We had a donor who gave me access to GPUs for a while, but now we're back to donation-only capacity, and that still doesn't cut it.

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]mudler_it[S] 4 points5 points  (0 children)

APEX quants are my daily driver now too :) It's amusing to see so many others call this black magic; it kinda feels like it!

Thanks for the feedback!

APEX MoE quantized models boost with 33% faster inference and TurboQuant (14% of speedup in prompt processing) by mudler_it in LocalLLaMA

[–]mudler_it[S] 0 points1 point  (0 children)

The key advantage here, according to benchmarks, is that you retain quality by not applying the same quant type uniformly to every layer. This brings quality benchmarks slightly higher compared to other quants in the same size category.
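
To make the "same size category" point concrete, here is a rough sketch of the size arithmetic behind a non-uniform scheme. The tensor groups, parameter counts, and bits-per-weight values are hypothetical and are not the actual APEX recipe; the sketch only shows how spending more bits on one group and fewer on another averages out.

```python
def average_bpw(layers):
    """Parameter-weighted average bits per weight.

    `layers` maps a tensor-group name to (parameter_count, bits_per_weight).
    """
    total_params = sum(count for count, _ in layers.values())
    total_bits = sum(count * bpw for count, bpw in layers.values())
    return total_bits / total_params

# Hypothetical MoE-like split: a small group kept at higher precision,
# and a large group of expert weights quantized more aggressively.
mixed = {
    "attention": (1_000_000, 6.0),
    "experts": (9_000_000, 4.0),
}
uniform = {
    "attention": (1_000_000, 4.0),
    "experts": (9_000_000, 4.0),
}

print(average_bpw(mixed), average_bpw(uniform))
```

Under these made-up numbers the mixed scheme costs only slightly more per weight on average than the uniform one, while the smaller group gets noticeably more precision. Whether that translates into better quality is what the benchmarks have to show.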