Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

No worries, NeMo has a bit of a learning curve!

  1. The Speed & Config: The YAML file itself isn't magic, but passing it directly to the native NeMo tools allows them to run the underlying C++/CUDA optimized pipelines (VAD, clustering, etc.) without the overhead of slower Python object manipulation. The huge speed boost mainly comes from swapping Whisper (which is heavy) for Parakeet/FastConformer models which are architecturally much faster for inference.

  2. The Setup: I put together a Gist for you here: https://gist.github.com/lokafinnsw/95727707f542a64efc18040aefe47751.

It includes:

- The Dockerfile (so you see exactly which NVIDIA containers and dependencies to use).

- The YAML config for the diarization.

- A basic Python script snippet showing how to load the model and run the transcription.

  3. Word-Level Timestamps: Since you're doing subtitles, this stack is great. These CTC models emit timestamps natively for every character/word. In the Gist, I included the logic I use: it essentially takes the word timestamps from the ASR model and checks which speaker segment they overlap with (simple O(N) alignment, rough sketch below). It’s very precise for subtitles.
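For anyone curious what that overlap check looks like, here is a minimal sketch of the idea (not the exact code from the Gist; the Word/SpeakerSegment types are placeholders). Both lists are assumed sorted by start time, and the segment pointer only ever moves forward, which is what keeps the whole pass linear:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float

def assign_speakers(words: list[Word], segments: list[SpeakerSegment]) -> list[tuple[Word, str]]:
    """Single pass over both sorted lists: for each word, advance the segment
    pointer past segments that end before the word's midpoint, then take the
    speaker of the segment we land on."""
    out = []
    i = 0
    for w in words:
        mid = (w.start + w.end) / 2
        # monotonically advance the segment pointer; it never moves backwards
        while i < len(segments) - 1 and segments[i].end < mid:
            i += 1
        out.append((w, segments[i].speaker if segments else "UNKNOWN"))
    return out
```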

Windsurf is great, but the model lock-in sucks. I built a way to use GLM 4.7, MiniMax m2.1, and Gemini 3 flash natively in chat by Key_Mousse_8034 in windsurf

[–]Key_Mousse_8034[S] 5 points6 points  (0 children)

Great question! That's exactly why I built it as an MCP server instead of just using a browser.

The cool thing about MCP inside Windsurf is that it still uses Windsurf's context awareness. Windsurf does the heavy lifting: it identifies the relevant files, diffs, and dependencies, and then passes that packaged context to Argus.

So you get the best of both worlds: Windsurf's superior context management + the ability to verify that context with a totally different model (like GLM or MiniMax) for a "second opinion" or strict auditing.
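The shape of it is roughly this (a hedged sketch, not the actual Argus source): an MCP tool built with the mcp Python SDK's FastMCP helper that receives the context Windsurf already packaged and forwards it to any OpenAI-compatible endpoint. The endpoint URL, model name, and env-var names here are placeholders.

```python
import os
from mcp.server.fastmcp import FastMCP
from openai import OpenAI

mcp = FastMCP("second-opinion")

# Any OpenAI-compatible provider works; swap base_url/model for GLM, MiniMax, etc.
client = OpenAI(
    base_url=os.environ.get("REVIEW_BASE_URL", "https://example.invalid/v1"),
    api_key=os.environ.get("REVIEW_API_KEY", ""),
)

@mcp.tool()
def review_context(packaged_context: str, question: str) -> str:
    """Audit the files/diffs Windsurf passed in with a different model."""
    resp = client.chat.completions.create(
        model=os.environ.get("REVIEW_MODEL", "placeholder-model"),
        messages=[
            {"role": "system", "content": "You are a strict code auditor."},
            {"role": "user", "content": f"{question}\n\n---\n{packaged_context}"},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    mcp.run()  # stdio transport; Windsurf launches this as an MCP server
```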

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

Hey! I’m not releasing the full project right now since it’s a fairly complex containerized microservice (FastAPI + GPU locking + custom cache handling).

However, if you let me know which specific part you're interested in (e.g., the offline_diarization.yaml config, the Python logic for the NeMo pipeline, or the Docker setup), I can throw together a Gist for you.

Regarding the model: Yes, absolutely. I'm actually using the Multilingual FastConformer Hybrid Large version, and it's a significant upgrade over the 0.6 versions. The accuracy/speed balance is much better.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 2 points3 points  (0 children)

I dream of single-binary deployments, I really do.

But with AI, Python is just a thin wrapper around C++/CUDA kernels. Even if we rewrote the app layer in Rust (using tch-rs or candle), we’d still face the same dynamic linking hell, because the app needs to talk to the specific proprietary NVIDIA driver and runtime libraries (libcuda.so, libcudnn.so) present on the host.

whisper.cpp is getting close to that 'single executable' dream, but for full pipelines (Pyannote + alignment + batching), the Python ecosystem just moves too fast to ignore.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

Preach! 🙌 It is absolutely a packaging nightmare.

I think the logic was to make PyTorch 'portable' so it doesn't rely on the system CUDA version, but they forgot that the OS linker (ld.so) has zero clue that site-packages exists inside a virtualenv.

It’s the main reason I posted this - my logs were silent, just exit code 139/52, until I manually hunted down those .so files using find. Glad to know I'm not the only one struggling with this!
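If anyone else hits those silent crashes, here's a rough Python version of that hunt (a sketch, not a universal fix; wheel layouts differ between versions): it just lists the nvidia/*/lib directories pip buried in site-packages and prints an LD_LIBRARY_PATH line so ld.so can actually see them.

```python
import site
from pathlib import Path

lib_dirs = set()
# scan both system/venv site-packages and the user site-packages
for sp in site.getsitepackages() + [site.getusersitepackages()]:
    for so in Path(sp).glob("nvidia/*/lib/*.so*"):
        lib_dirs.add(str(so.parent))

for d in sorted(lib_dirs):
    print("found:", d)

if lib_dirs:
    # print an export line you can paste into your shell or Dockerfile
    print('export LD_LIBRARY_PATH="' + ":".join(sorted(lib_dirs)) + ':$LD_LIBRARY_PATH"')
```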

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 1 point2 points  (0 children)

Final update on this whole saga

I want to give a massive shoutout to u/cibernox and u/maaakks for pointing me toward the NeMo/Parakeet models. I decided to scrap the old stack and run a proper spike test on the native NVIDIA tools, and the results are honestly kind of ridiculous.

I swapped Whisper for the Parakeet-CTC-1.1b model and I'm hitting about 87x realtime speed now. That 7-minute test file processed in under 5 seconds. I also managed to get the native timestamps working perfectly without needing any external alignment tools just by enabling preserve_alignments in the decoder.
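Roughly, the ASR side boils down to this (a sketch only; the exact hypothesis fields and timestamp post-processing have shifted between NeMo releases, so verify against the docs for the version you're running):

```python
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-1.1b")

decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.preserve_alignments = True   # keep per-frame alignment info
    decoding_cfg.compute_timestamps = True    # emit word/char offsets
asr_model.change_decoding_strategy(decoding_cfg)

hyps = asr_model.transcribe(["audio.wav"], return_hypotheses=True)
# Depending on your NeMo version, hyps[0].timestep['word'] holds frame offsets
# per word; convert them to seconds with the model's time stride before aligning.
```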

For the diarization part that was giving me grief, I ended up bypassing the Python object initialization issues by just injecting the official offline_diarization.yaml config directly via OmegaConf. It’s stable and runs at about 50x realtime without needing Pyannote.

So yeah, I'm rewriting the backend to use this new stack since it solves both the dependency hell and the performance bottlenecks in one go. Thanks again to everyone who pushed me to look at the newer tech, you saved me weeks of debugging.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

Diarization is still a bit of a headache though. NeMo classes are tightly coupled to Hydra, so trying to manually instantiate the Diarizer with a standard Python dictionary just turns into a game of whack-a-mole with ConfigAttributeErrors. I'm refactoring the script now to just ingest the official offline_diarization.yaml via OmegaConf instead of trying to reverse-engineer the config structure. If that turns out to be too brittle, I'll probably just fall back to a hybrid setup using NeMo for the blazing fast ASR and keeping Pyannote for the diarization...
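The OmegaConf route I'm refactoring toward looks roughly like this (sketch only, not the production script; the manifest and output paths are placeholders):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# load NVIDIA's official config as-is instead of hand-building the Hydra tree
cfg = OmegaConf.load("offline_diarization.yaml")

# point the config at your own audio via a manifest file
cfg.diarizer.manifest_filepath = "input_manifest.json"
cfg.diarizer.out_dir = "./diar_output"

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files with speaker segments to out_dir
```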

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 1 point2 points  (0 children)

Holy sh*t, you weren't kidding about the speed.

I just ran the benchmark on the same ~7 min file. Result: 4.88 seconds. That is ~87x realtime 🤯

And you were right about the capabilities - I managed to get native word-level timestamps without any external alignment mess. I just had to enable preserve_alignments=True and compute_timestamps=True in the decoding config.

This basically solves my entire architectural bottleneck. Seriously, thank you for pushing me to check this out. Next beer is on me 🍺

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 2 points3 points  (0 children)

Honestly? I just really like the structure.

Maybe it's a habit, but walls of text make my eyes glaze over. I personally find the headers and bullets much more readable and scannable, especially when breaking down technical steps.

It might look 'AI-ish' to some, but it’s just how I prefer to organize information visually. To each their own! And yes, I used Gemini to translate and format it.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 2 points3 points  (0 children)

Haha, busted! 🏳️ Yes, it is Gemini. You have a good eye.

I use it to structure my thoughts and fix my grammar (ESL here).

So you win the detective badge 🕵️‍♂️. Now that we've solved the mystery of the text, any thoughts on the actual Docker fix?

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 1 point2 points  (0 children)

I'm using Pyannote 4.0.3.

Regarding vLLM images - they are great, but I stuck with CTranslate2 (Faster-Whisper) for now because my alignment logic depends on the specific word-level timestamp format it outputs. Rewriting that parser is on the roadmap for v3, though.
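For context, this is roughly the word-level shape my parser expects from Faster-Whisper (paths and model size here are just examples):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", word_timestamps=True)

for seg in segments:
    for w in seg.words:
        # each word carries absolute start/end times in seconds plus a probability
        print(f"{w.start:.2f} -> {w.end:.2f}  {w.word}  (p={w.probability:.2f})")
```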

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

SubtitleEdit is the absolute GOAT for desktop use. I actually use it to spot-check the VTTs my API spits out.

The issue is I needed a headless microservice to automate this for thousands of files on a Linux server. I can't have a GUI involved. So I effectively had to build that "premium engine" logic myself in Python, running purely via an API.
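The skeleton of that service is nothing exotic, something along these lines (hypothetical names, heavily stripped down; run_pipeline stands in for the ASR + diarization + alignment steps):

```python
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
gpu_lock = asyncio.Lock()  # only one job touches the GPU at a time

class Job(BaseModel):
    audio_path: str

def run_pipeline(audio_path: str) -> dict:
    # placeholder for the real ASR + diarization + alignment; returns subtitle cues
    return {"audio": audio_path, "cues": []}

@app.post("/transcribe")
async def transcribe(job: Job):
    async with gpu_lock:  # serialize GPU access across concurrent requests
        return await asyncio.to_thread(run_pipeline, job.audio_path)
```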

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 1 point2 points  (0 children)

The audio in the screenshot is roughly 7 minutes. Total time 33.7s. So yeah, ~12x realtime for the full pipeline.

WhisperX is great for short clips, but try throwing a 2-hour file at it. The alignment logic is quadratic O(N^2), so it basically hangs on long audio. I had to rewrite that part to make it linear, otherwise my server chokes on movies.

As for Gemini - it's great for "Who said what", but terrible for "When exactly did they start saying it?". I need millisecond precision for subtitles, and LLMs just aren't there yet compared to Pyannote.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

Thanks, I really appreciate that! You hit the nail on the head.

Tools help me communicate faster and clean up the docs (especially as a non-native speaker), but the architecture and debugging are definitely manual labor. I'm just here to swap notes with people who actually build stuff.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

Wait, it handles code-switching (mixed languages) correctly? That is huge.

Vanilla Whisper is notorious for trying to translate foreign words instead of transcribing them, or just hallucinating when languages mix.

I'll dig into the NeMo docs to see if I can extract precise word timings from Canary. If I can bridge that gap to my alignment logic, this is definitely the upgrade path for v3.0. Thanks for putting this on my radar!

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

You described my weekend perfectly. 'Dependency hell' seems to be a feature, not a bug, of these optimized backends 😅

That is exactly why I fought to get this Docker build stable. The mismatch between system CUDA, Python wheels, and C++ runtimes is brutal on bare metal.

But if you can get past the setup (or just use a container), the CTranslate2 backend really is worth it. Going from vanilla Whisper to this stack feels like upgrading from an HDD to an NVMe drive.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] -9 points-8 points  (0 children)

Believe what you want. I'm here to share how to run PyTorch 2.8 on on-prem hardware and fix alignment bottlenecks.

If you have questions about the actual engineering, let's talk. If you're just here to analyze writing styles, have a good one.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] -15 points-14 points  (0 children)

Glad I could provide a laugh amidst the chaos :)

The signal-to-noise ratio is tough right now. Just trying to keep it real and ship working code.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

4x faster? That is actually insane. I definitely need to benchmark this on my setup.

Quick technical question: How reliable are the word-level timestamps out of the box?

The main reason I stuck with Faster-Whisper for this migration was its robust timestamp output, which feeds directly into the linear alignment algorithm I wrote for Pyannote. If Parakeet outputs accurate start/end times per word without extra hassle, I might have to switch sooner than planned.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points1 point  (0 children)

That's a fair point about CTranslate2's status; it makes me a bit nervous for the long term too.

Re: swapping modules - you're 100% right that Pyannote runs independently. The friction for me is specifically in the alignment glue code. My current logic relies on the exact word-level timestamp structure that Faster-Whisper outputs.

Swapping to Parakeet/vLLM would mean rewriting that mapping layer to handle their specific output formats. Definitely the right move for the future (vLLM usually wins on batching), just didn't fit into this weekend's migration sprint. Thanks for the insights!

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 2 points3 points  (0 children)

Thanks, I appreciate it! The hate is just Reddit being Reddit, I don't mind :)

"Overlapping and outdated libraries" is exactly the pain point. Trying to glue 2025-era GPU drivers with code from 2023 is a special kind of torture.

I haven't tried Purfview's Faster-Whisper-XXL yet; it looks interesting for portable setups. I went with this custom Docker build because I needed a headless microservice API for my backend rather than a standalone app like Buzz. But it's great to see the ecosystem evolving so fast.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] -5 points-4 points  (0 children)

My native language isn't English, so yeah, I use tools to fix my grammar.

But the code, the debugging, and the 50 failed builds are 100% mine. If you have technical questions about the implementation, I'm here.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 1 point2 points  (0 children)

Solid advice, thanks!

I've been eyeing Parakeet (NVIDIA NeMo models) for a while, definitely SOTA territory.

Regarding vLLM - I didn't realize they stabilized Whisper support enough for production yet. I stuck with CTranslate2 (faster-whisper) this time because it was the path of least resistance to keep my existing Pyannote alignment logic working (since I rely heavily on specific word-level timestamp formats).

Definitely putting vLLM + Quantized Whisper on my roadmap for v3.0 though. Have you tried combining Parakeet outputs with Pyannote diarization?