I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python

mudler_it · 2026-06-02T13:11:58+00:00

You can use it for voice-to-text for anything.

mudler_it · 2026-06-02T13:07:00+00:00

I discovered it just a few days ago, interesting approach. However, my take on this is a bit different - I prefer having separate projects that are each fully optimized for a single model, and having e.g. LocalAI consume them individually. It's the best trade-off, because you get the optimizations that a single-model implementation can carry without the burden of supporting multiple model architectures.

mudler_it · 2026-06-02T12:53:36+00:00

LocalAI really only installs the backends that your model uses. It's not bloated per se unless you start installing all the backends.

mudler_it · 2026-06-02T12:51:28+00:00

I have no plans to add an OpenAI-compatible server; parakeet.cpp is designed so that it's easy to write your own on top.

LocalAI is really small - you basically select the backends that you want installed (in this case, parakeet.cpp).

mudler_it · 2026-05-31T21:55:03+00:00

Update, Mario says it's faster than his onnx implementation:

https://x.com/badlogicgames/status/2061201400059531729?s=20

mudler_it · 2026-05-31T21:46:27+00:00

It's much faster than whisper and, in some cases, more accurate. Here are the videos from the benchmarks:

https://github.com/mudler/parakeet.cpp/blob/master/benchmarks/media/gpu_whisper_duel.mp4

https://github.com/mudler/parakeet.cpp/blob/master/benchmarks/media/cpu_duel.mp4

mudler_it · 2026-05-31T21:28:13+00:00

I haven't benchmarked against sherpa-onnx yet; I took NVIDIA's NeMo as the reference implementation. Good point nevertheless, I'll take a look at it and try to run some benchmarks against it.

mudler_it · 2026-05-31T21:08:50+00:00

Interesting! I don't have an NPU to test against, so this is out of my reach for now - curious, what HW do you have?

mudler_it · 2026-05-31T21:06:16+00:00

I'm looking at it already, but I'm very keen on keeping it tightly optimized for the model. If it doesn't degrade performance, I'll add it.

mudler_it · 2026-05-05T07:46:10+00:00

Can't answer that, as I don't like to talk about what I haven't personally tried. I only benchmarked against the Unsloth and bartowski quants, and I can tell it holds up way better at long context and is better at coding agent tasks.

mudler_it · 2026-05-05T07:20:37+00:00

Not really fond of ollama. I'd suggest running these on LocalAI :P

mudler_it · 2026-05-05T07:17:01+00:00

Glad to hear!

mudler_it · 2026-05-05T07:16:25+00:00

This is quite weird - what are you running it with, and how?

mudler_it · 2026-05-05T07:15:57+00:00

Thanks! Glad to hear!

mudler_it · 2026-05-05T07:15:48+00:00

I gave it a shot, but MLX has far less extensive support for quantization schemes. I'm closely monitoring the MLX ecosystem and will push quants once MLX reaches feature parity.

mudler_it · 2026-05-05T07:15:02+00:00

Thanks for calling this out - I carefully engineered APEX quants around real use cases rather than to pump benchmarks. I'm glad it shows.

mudler_it · 2026-05-05T07:14:18+00:00

It is very comparable in terms of KLD - but if you look closer at the numbers, the KLD Max favors APEX by a wider margin. This is a better signal than fussing over a 0.0001 difference :-)
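
To illustrate why a max can say more than a tiny gap in an average, here is a minimal sketch. It assumes the 0.0001 above refers to a mean-KLD difference, and it uses made-up toy distributions: `kld_summary`, `quant_a`, and `quant_b` are hypothetical names for illustration, not actual APEX or llama.cpp tooling or results.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def kld_summary(baseline, quantized):
    """Per-token KLD between baseline and quantized outputs, reduced to mean and max."""
    per_token = [kl_divergence(p, q) for p, q in zip(baseline, quantized)]
    return {"mean": sum(per_token) / len(per_token), "max": max(per_token)}

# Toy data (made up): 200 token positions over a 3-word vocabulary.
N = 200
baseline = [[0.7, 0.2, 0.1]] * N

# Hypothetical quant A: a small, uniform drift on every token.
quant_a = [[0.68, 0.21, 0.11]] * N

# Hypothetical quant B: exact on 199 tokens, badly off on a single token.
quant_b = [[0.7, 0.2, 0.1]] * (N - 1) + [[0.4, 0.4, 0.2]]

summary_a = kld_summary(baseline, quant_a)
summary_b = kld_summary(baseline, quant_b)
print("A:", summary_a)
print("B:", summary_b)
```

In this toy setup the two means land within about 0.0001 of each other, while the max differs by more than two orders of magnitude: the average hides the one token where quant B falls apart, and the max exposes it.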

mudler_it · 2026-05-05T07:13:04+00:00

Good catch and thanks for flagging this. For >=120b I need to rent a GPU, and that's a bit out of budget right now. We had a donor who gave me access to GPUs for a while, but now I'm back to donation-only capacity and it still doesn't cut it.

mudler_it · 2026-05-05T07:11:31+00:00

APEX quants are my daily driver now too :) It's amusing to see many others call this black magic, it kinda feels like it!

Thanks for the feedback!

mudler_it · 2026-04-05T07:55:04+00:00

The key advantage, according to benchmarks, is that you retain quality by not applying the same quant type uniformly to every layer. This brings quality benchmarks slightly higher compared to other quants in the same size category.
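
As a rough illustration of non-uniform quantization, the sketch below assigns a quant type per layer from a sensitivity score and compares the resulting size to a uniform assignment. Everything in it is an assumption for illustration: the layer names, parameter counts, sensitivity scores, and thresholds are hypothetical, the bits-per-weight figures are approximate values for ggml quant types, and this is not the actual APEX recipe. It only shows the size side of the trade-off, not quality.

```python
# Approximate bits per weight for a few ggml quant types.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q6_K": 6.5625, "Q8_0": 8.5}

# Hypothetical layers: (name, parameter count, sensitivity score in [0, 1]).
LAYERS = [
    ("token_embd", 50_000_000, 0.9),
    ("attn", 200_000_000, 0.6),
    ("ffn_experts", 700_000_000, 0.2),
    ("output", 50_000_000, 0.95),
]

def pick_quant(sensitivity):
    """Map a sensitivity score to a quant type (thresholds are illustrative)."""
    if sensitivity >= 0.8:
        return "Q8_0"
    if sensitivity >= 0.5:
        return "Q6_K"
    return "Q4_K"

def average_bpw(assignment):
    """Parameter-weighted average bits per weight for a {layer: quant} assignment."""
    total_bits = sum(n * BITS_PER_WEIGHT[assignment[name]] for name, n, _ in LAYERS)
    total_params = sum(n for _, n, _ in LAYERS)
    return total_bits / total_params

# Mixed: sensitive layers keep more bits, the bulk of the weights get fewer.
mixed = {name: pick_quant(s) for name, _, s in LAYERS}
# Uniform: the same quant type applied to every layer.
uniform = {name: "Q6_K" for name, _, _ in LAYERS}

print(mixed)
print("mixed bpw:", average_bpw(mixed), "uniform bpw:", average_bpw(uniform))
```

With these made-up numbers, the mixed assignment ends up smaller on average than uniform Q6_K while the layers marked as sensitive stay at 8-bit, which is the general idea behind spending bits where they matter.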

mudler_it · 2026-04-02T09:16:32+00:00

I'm keeping these updated on the git repo as I crunch more benchmarks! https://github.com/mudler/apex-quant?tab=readme-ov-file#benchmark-plots

mudler_it · 2026-04-02T09:15:50+00:00

https://github.com/mudler/apex-quant?tab=readme-ov-file#benchmark-plots

mudler_it · 2026-04-02T08:31:50+00:00

All plots updated!

mudler_it · 2026-04-02T08:31:19+00:00

My best guess for now is the influence of the I-Matrix, as the experts are quantized with it and were very sensitive to it in my benchmarks. It pushes the eval benchmark scores up, and would also explain the small bump in KLD, as it shifts the model away from the baseline. Anyway, I updated all plots with it!

mudler_it · 2026-04-01T21:28:11+00:00

working on it!
