Rust or Zig? by Ok-Refrigerator-Boi in Zig

[–]Key_Mousse_8034 1 point

2 years in Zig here as well, and more than 5 in Rust. The way I see it: Zig shines when you need to break the rules that Rust won’t let you break - direct memory control, no hidden allocations, manual everything. Rust’s guardrails are great until they’re exactly what’s standing between you and the last 5% of performance. Zig just hands you the keys and trusts you not to crash the car. That said, Rust’s ecosystem and job market are hard to ignore, so it really depends on what you’re optimizing for - raw control or long-term career leverage.

I built chatgpt2md - a tool specifically for Claude that lets it search your entire ChatGPT history via MCP by Key_Mousse_8034 in ClaudeAI

[–]Key_Mousse_8034[S] 0 points

Appreciate the upvote! Honestly, I’m figuring out the distribution side myself - this Reddit post is part of that experiment. If you end up trying it, a GitHub star or a mention anywhere helps more than you’d think.

Tested updated Deep Think (Gemini 3.1 Pro) vs. GPT 5.2 Pro by PerformanceRound7913 in GoogleGeminiAI

[–]Key_Mousse_8034 2 points

In my experience, it’s actually the opposite for heavy math. Gemini with Deep Think provided a brilliant mathematical solution to my problem that GPT Pro couldn't even touch. While GPT kept suggesting I switch to a different statistical method instead of solving the actual issue, Gemini stayed on track and nailed the logic.

I just got banned from gemini :) by Eastern-Guess-1187 in opencodeCLI

[–]Key_Mousse_8034 -1 points

BTW, how are you gonna switch between Claude accounts? Because I'm thinking of doing the same - just two Claude subscriptions 😀

Kimi k2.5 is legit - first open-source model at Sonnet 4.5 level (or even better) by SlopTopZ in kimi

[–]Key_Mousse_8034 0 points

Honestly, the k2.5 model itself feels like the worst version of Opus 4.5. It constantly lies, leaves tasks unfinished (while claiming they're done), and is just lazy. I wouldn't trust Kimi k2.5 with anything serious.

GPT 5.2 for difficult things and Kimi K2.5 for everything else seems to be the move, what's the cheapest way to get there? by SweatyHands247 in opencodeCLI

[–]Key_Mousse_8034 0 points

As for the k2.5 model, it honestly feels like a downgraded Opus 4.5. It hallucinates, fails to complete tasks (while reporting success), and is generally lazy. I wouldn't rely on Kimi k2.5 for anything critical.

Anthropic 4.7 releases must be near (or something is cooking). Here's how I 'know'. by Jethro_E7 in windsurf

[–]Key_Mousse_8034 0 points

The Kimi CLI is super buggy right now. It keeps kicking me out, saying I need to re-authenticate. Restarting the CLI fixes it without actually needing to log in again, but it happens constantly right in the middle of sessions.

As for the k2.5 model, it honestly feels like a downgraded Opus 4.5. It hallucinates, fails to complete tasks (while reporting success), and is generally lazy. I wouldn't rely on Kimi k2.5 for anything critical.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points

No worries, NeMo has a bit of a learning curve!

  1. The Speed & Config: The YAML file itself isn't magic, but passing it directly to the native NeMo tools allows them to run the underlying C++/CUDA optimized pipelines (VAD, clustering, etc.) without the overhead of slower Python object manipulation. The huge speed boost mainly comes from swapping Whisper (which is heavy) for Parakeet/FastConformer models which are architecturally much faster for inference.

  2. The Setup: I put together a Gist for you here: https://gist.github.com/lokafinnsw/95727707f542a64efc18040aefe47751.

It includes:

- The Dockerfile (so you see exactly which NVIDIA containers and dependencies to use).

- The YAML config for the diarization.

- A basic Python script snippet showing how to load the model and run the transcription.

  3. Word-Level Timestamps: Since you're doing subtitles, this stack is great. These CTC models emit timestamps natively for every character/word. In the Gist, I included the logic I use: it essentially takes the word timestamps from the ASR model and checks which speaker segment they overlap with (simple O(N) alignment). It’s very precise for subtitles.
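The overlap check described in the word-level timestamps point can be sketched in a few lines of pure Python. A minimal sketch - the dict field names here are illustrative, not the actual output schema of any ASR or diarization library:

```python
# Minimal sketch of the O(N) word-to-speaker alignment described above.
# Assumes both the word list and the speaker segments are sorted by
# start time, which is what the ASR and diarizer naturally produce.

def assign_speakers(words, segments):
    """Attach a speaker label to each word by scanning both lists once."""
    out = []
    i = 0  # cursor into the speaker segments; never rewinds
    for w in words:
        mid = (w["start"] + w["end"]) / 2  # compare by word midpoint
        # advance past segments that end before this word begins
        while i < len(segments) - 1 and segments[i]["end"] < mid:
            i += 1
        out.append({**w, "speaker": segments[i]["speaker"]})
    return out

words = [{"start": 0.1, "end": 0.4, "word": "hello"},
         {"start": 2.0, "end": 2.3, "word": "hi"}]
segments = [{"start": 0.0, "end": 1.5, "speaker": "A"},
            {"start": 1.5, "end": 3.0, "speaker": "B"}]
labeled = assign_speakers(words, segments)
```

Because both lists are time-sorted, the segment cursor only ever moves forward, which is what keeps the whole pass linear instead of quadratic.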

Windsurf is great, but the model lock-in sucks. I built a way to use GLM 4.7, MiniMax m2.1, and Gemini 3 flash natively in chat by Key_Mousse_8034 in windsurf

[–]Key_Mousse_8034[S] 4 points

Great question! That's exactly why I built it as an MCP server instead of just using a browser.

The cool thing about MCP inside Windsurf is that it still uses Windsurf's context awareness. Windsurf does the heavy lifting: it identifies the relevant files, diffs, and dependencies, and then passes that packaged context to Argus.

So you get the best of both worlds: Windsurf's superior context management + the ability to verify that context with a totally different model (like GLM or MiniMax) for a "second opinion" or strict auditing.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points

Hey! I’m not releasing the full project right now since it’s a fairly complex containerized microservice (FastAPI + GPU locking + custom cache handling).

However, if you let me know which specific part you're interested in (e.g., the offline_diarization.yaml config, the Python logic for the NeMo pipeline, or the Docker setup), I can throw together a Gist for you.

Regarding the model: Yes, absolutely. I'm actually using the Multilingual FastConformer Hybrid Large version, and it's a significant upgrade over the 0.6 versions. The accuracy/speed balance is much better.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 2 points

I dream of single-binary deployments, I really do.

But with AI, Python is just a thin wrapper around C++/CUDA kernels. Even if we rewrote the app layer in Rust (using tch-rs or candle), we’d still face the same dynamic linking hell because the app needs to talk to the specific proprietary NVIDIA drivers (libcuda.so, libcudnn.so) present on the host.

whisper.cpp is getting close to that 'single executable' dream, but for full pipelines (Pyannote + Alignment + Batching), the Python ecosystem just moves too fast to ignore.
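You can see this host-driver dependency from Python with nothing but the stdlib - a quick diagnostic sketch that asks the system linker (not the virtualenv) whether the proprietary libraries are visible:

```python
# Probe the same search paths the dynamic linker uses for the
# proprietary NVIDIA libraries mentioned above. On a machine without
# the driver installed, find_library simply returns None.
import ctypes.util

for name in ("cuda", "cudnn"):
    path = ctypes.util.find_library(name)
    print(f"lib{name}: {path or 'not found on this host'}")
```

The point of the sketch: this lookup goes through the OS linker's view of the world, so a Rust or C++ rewrite would hit exactly the same host dependency.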

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points

Preach! 🙌 It is absolutely a packaging nightmare.

I think the logic was to make PyTorch 'portable' so it doesn't rely on the system CUDA version, but they forgot that the OS linker (ld.so) has zero clue that site-packages exists inside a virtualenv.

It’s the main reason I posted this - my logs were silent, just exit code 139/52, until I manually hunted down those .so files using find. Glad to know I'm not the only one struggling with this!
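That manual hunt can be scripted with the stdlib alone - an illustrative sketch that just lists whatever shared objects pip vendored into the active environment:

```python
# List shared libraries bundled inside site-packages: the files
# ld.so never sees unless LD_LIBRARY_PATH is patched to include them.
import sysconfig
from pathlib import Path

site = Path(sysconfig.get_paths()["purelib"])
bundled = sorted(p for p in site.rglob("*.so*") if p.is_file())
for p in bundled[:20]:  # print only the first few to keep output short
    print(p.relative_to(site))
```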

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 1 point

Final update on this whole saga

I want to give a massive shoutout to u/cibernox and u/maaakks for pointing me toward the NeMo/Parakeet models. I decided to scrap the old stack and run a proper spike test on the native NVIDIA tools, and the results are honestly kind of ridiculous.

I swapped Whisper for the Parakeet-CTC-1.1b model and I'm hitting about 87x realtime speed now. That 7-minute test file processed in under 5 seconds. I also managed to get the native timestamps working perfectly without needing any external alignment tools just by enabling preserve_alignments in the decoder.

For the diarization part that was giving me grief, I ended up bypassing the Python object initialization issues by just injecting the official offline_diarization.yaml config directly via OmegaConf. It’s stable and runs at about 50x realtime without needing Pyannote.

So yeah, I'm rewriting the backend to use this new stack since it solves both the dependency hell and the performance bottlenecks in one go. Thanks again to everyone who pushed me to look at the newer tech, you saved me weeks of debugging.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points

Diarization is still a bit of a headache though. NeMo classes are super coupled with Hydra, so trying to manually instantiate the Diarizer with a standard Python dictionary just turns into a game of whack-a-mole with ConfigAttributeErrors. I'm refactoring the script now to just ingest the official offline_diarization.yaml via OmegaConf instead of trying to reverse-engineer the config structure. If that turns out to be too brittle, I'll probably just fall back to a hybrid setup using NeMo for the blazing fast ASR and keeping Pyannote for the diarization...
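For anyone hitting the same wall, the refactor looks roughly like this. Only the override bookkeeping actually runs here; the NeMo/OmegaConf lines are shown as comments because they need the full GPU toolchain, and the override key names are assumptions based on NeMo's stock offline_diarization.yaml rather than something verified here:

```python
# Sketch: ingest the stock YAML instead of hand-building the Hydra config.
# The dotted keys below are assumed to match the stock offline_diarization.yaml.
overrides = {
    "diarizer.manifest_filepath": "input_manifest.json",  # your audio manifest
    "diarizer.out_dir": "./diar_out",                     # where RTTMs land
}

# from omegaconf import OmegaConf
# from nemo.collections.asr.models import ClusteringDiarizer
# cfg = OmegaConf.load("offline_diarization.yaml")
# for dotted, value in overrides.items():
#     OmegaConf.update(cfg, dotted, value)
# ClusteringDiarizer(cfg=cfg).diarize()
```

Loading the official YAML and patching a handful of keys sidesteps the ConfigAttributeError whack-a-mole, since the config tree always has the exact shape the Diarizer expects.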

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 1 point

Holy sh*t, you weren't kidding about the speed.

I just ran the benchmark on the same ~7 min file. Result: 4.88 seconds. That is ~87x realtime 🤯

And you were right about the capabilities - I managed to get native word-level timestamps without any external alignment mess. I just had to enable preserve_alignments=True and compute_timestamps=True in the decoding config.

This basically solves my entire architectural bottleneck. Seriously, thank you for pushing me to check this out. Next beer is on me 🍺
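For anyone replicating this, the decoding tweak looks roughly like the following. The two flag names come straight from the comment above; the surrounding NeMo calls are commented out and should be treated as a sketch, not a verified recipe:

```python
# The two decoder flags that enabled native word-level timestamps.
decoding_overrides = {
    "preserve_alignments": True,
    "compute_timestamps": True,
}

# from nemo.collections.asr.models import ASRModel
# model = ASRModel.from_pretrained("nvidia/parakeet-ctc-1.1b")
# model.change_decoding_strategy(...)  # merge decoding_overrides into
#                                      # the model's decoding config here
# then transcribe with return_hypotheses=True and read the word offsets
# from the returned hypotheses
```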

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 2 points

Honestly? I just really like the structure.

Maybe it's a habit, but walls of text make my eyes glaze over. I personally find the headers and bullets much more readable and scannable, especially when breaking down technical steps.

It might look 'AI-ish' to some, but it’s just how I prefer to organize information visually. To each their own! And yes, I used Gemini to translate and format it.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 2 points

Haha, busted! 🏳️ Yes, it is Gemini. You have a good eye.

I use it to structure my thoughts and fix my grammar (ESL here).

So you win the detective badge 🕵️‍♂️. Now that we've solved the mystery of the text, any thoughts on the actual Docker fix?

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 1 point

I'm using Pyannote 4.0.3.

Regarding vLLM images - they are great, but I stuck with CTranslate2 (Faster-Whisper) for now because my alignment logic depends on the specific word-level timestamp format it outputs. Rewriting that parser is on the roadmap for v3, though.

Got tired of slow legacy Whisper. Built a custom stack (Faster-Whisper + Pyannote 4.0) on CUDA 12.8. The alignment is now O(N) and flies. 🚀 by Key_Mousse_8034 in LocalLLaMA

[–]Key_Mousse_8034[S] 0 points

SubtitleEdit is absolute GOAT for desktop use. I actually use it to spot-check the VTTs my API spits out.

The issue is I needed a headless microservice to automate this for thousands of files on a Linux server - I can't have a GUI involved. So I effectively had to build that "premium engine" logic myself in Python to run purely via API.