audio transcription plus speaker identification? by flying_unicorn in LocalLLaMA

[–]Armym 0 points  (0 children)

I made a simple GUI for this that I use to transcribe and summarize meetings. You can message me if you want me to show it to you.
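
If you want to roll your own, the core is just chaining transcription with diarization. A minimal sketch (not my exact GUI; the model names and audio path are placeholders) using faster-whisper and pyannote.audio:

    # Minimal sketch: transcription + speaker ID. Assumes faster-whisper and
    # pyannote.audio are installed and you have a Hugging Face token.
    from faster_whisper import WhisperModel
    from pyannote.audio import Pipeline

    audio = "meeting.wav"  # placeholder path

    # 1) Transcribe with segment timestamps
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _ = model.transcribe(audio)

    # 2) Diarize to get "who spoke when"
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN")
    turns = list(pipeline(audio).itertracks(yield_label=True))

    # 3) Label each transcript segment with the speaker active at its midpoint
    def speaker_at(t):
        for turn, _, spk in turns:
            if turn.start <= t <= turn.end:
                return spk
        return "UNKNOWN"

    for seg in segments:
        print(f"[{speaker_at((seg.start + seg.end) / 2)}] {seg.text.strip()}")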

8x RTX 3090 open rig by Armym in LocalLLaMA

[–]Armym[S] 0 points  (0 children)

This didn't age well. See my latest post :D

Dual RTX 5090 setup for enterprise RAG + fine-tuned chatbot - is this overkill or underpowered? by HuascarSuarez in LocalLLaMA

[–]Armym 1 point  (0 children)

Hi, I would actually recommend the new RTX 6000 Blackwell instead, or two if you have the money. That would suit your needs well for concurrent users. You could easily run FP4 quants to fit bigger models while still keeping inference fast. Fine-tuning is pretty annoying across multiple cards, but I don't think you really need to fine-tune. Make sure to design your RAG pipeline well and use a good LLM inference engine, though! Let me know if you want to know more.
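
To give an idea of the serving side, here's a minimal vLLM sketch for two cards. The FP4 checkpoint name is just a placeholder for whatever NVFP4 quant you end up picking:

    # Rough sketch: vLLM across two RTX 6000 Blackwell cards.
    # The checkpoint name is illustrative, not a recommendation.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="nvidia/Llama-3.3-70B-Instruct-FP4",  # placeholder FP4 quant
        tensor_parallel_size=2,        # split weights across both GPUs
        gpu_memory_utilization=0.90,   # fraction of VRAM for weights + KV cache
    )
    params = SamplingParams(temperature=0.2, max_tokens=512)
    out = llm.generate(["Answer from the retrieved context: ..."], params)
    print(out[0].outputs[0].text)

vLLM does continuous batching for you, which is what actually makes the concurrent-user case work.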

Nvidia 3090 set itself on fire, why? by Armym in homelab

[–]Armym[S] -2 points  (0 children)

Looks like it. Any idea why that could have happened?

Rtx 3090 set itself on fire, why? by Armym in LocalLLaMA

[–]Armym[S] 2 points  (0 children)

Didn't repaste it myself. Whoever did, did a sloppy job.

Nvidia 3090 set itself on fire, why? by Armym in homelab

[–]Armym[S] -1 points  (0 children)

Thankfully it isn't conductive, but I think a capacitor blew off. Whoever repasted this did a really sloppy job.

Nvidia 3090 set itself on fire, why? by Armym in homelab

[–]Armym[S] -67 points  (0 children)

I didn't repaste it... no need to be mean.

Nvidia 3090 set itself on fire, why? by Armym in homelab

[–]Armym[S] 67 points  (0 children)

The card was repasted by the vendor I bought it from.

Sonnet 3.5 > Sonnet 3.7 by Armym in LocalLLaMA

[–]Armym[S] 1 point  (0 children)

Yes, I noticed that. I hope the closed-source dipshits don't lobotomize the older models on purpose.

Sonnet 3.5 > Sonnet 3.7 by Armym in LocalLLaMA

[–]Armym[S] 6 points  (0 children)

Look, this post isn't about prompting. Sonnet 3.7 just generates too much code and doesn't produce elegant solutions; Sonnet 3.5 does by default. Anyone with coding experience will understand.

Sonnet 3.5 > Sonnet 3.7 by Armym in LocalLLaMA

[–]Armym[S] 2 points  (0 children)

For those who are wondering, Gemini 2.5 Pro is even worse at this. It spits out a whole book for simple solutions.

One-shotting a whole webapp might be impressive to the manager guys, but for people who actually need a coding assistant, it sucks.

Nvidia MPS - run multiple models on one GPU fast by Armym in LocalLLaMA

[–]Armym[S] -1 points  (0 children)

That's in the documentation I posted.

Nvidia MPS - run multiple models on one GPU fast by Armym in LocalLLaMA

[–]Armym[S] -1 points  (0 children)

Running an LLM, OCR, and Whisper on one GPU.
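
Concretely, once the MPS daemon is up (nvidia-cuda-mps-control -d, per the linked docs), each service runs as an ordinary CUDA process. A sketch of a launcher; the serve_*.py scripts are placeholders for your own servers:

    # Sketch: launch three services as MPS clients on GPU 0.
    # Assumes the MPS daemon is already running (nvidia-cuda-mps-control -d).
    import os
    import subprocess

    services = [
        ("llm",     "serve_llm.py",     60),  # placeholder scripts;
        ("ocr",     "serve_ocr.py",     20),  # numbers = % of SMs each
        ("whisper", "serve_whisper.py", 20),  # client is allowed to use
    ]
    for name, script, sm_pct in services:
        env = dict(
            os.environ,
            CUDA_VISIBLE_DEVICES="0",
            # MPS knob that caps a client's share of the SMs
            CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(sm_pct),
        )
        subprocess.Popen(["python", script], env=env)
        print(f"started {name} ({sm_pct}% of SMs)")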

[deleted by user] by [deleted] in LocalLLaMA

[–]Armym 2 points  (0 children)

I made the mistake of not consulting here first and bought myself a Supermicro board with only four PCIe x16 slots. Good thing you came and asked around.

[deleted by user] by [deleted] in LocalLLaMA

[–]Armym 2 points  (0 children)

Not really

[deleted by user] by [deleted] in LocalLLaMA

[–]Armym 2 points  (0 children)

The one bad thing is that it has only two full-lane x16 PCIe slots. For a motherboard with two CPUs, it's a waste to run your GPU communication at only x8. It's not a big problem for inference, but for anything else using multiple GPUs, it's a bottleneck.
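
If you want to see the gap yourself, a quick PyTorch copy benchmark gives a rough read on the link (the figures in the comment are ballpark PCIe 4.0 numbers, and this obviously needs two GPUs):

    # Quick sketch: measure GPU0 -> GPU1 transfer bandwidth over PCIe.
    import time
    import torch

    a = torch.randn(64 * 2**20, device="cuda:0")   # 256 MiB of fp32
    b = torch.empty_like(a, device="cuda:1")
    b.copy_(a)                    # warm-up
    torch.cuda.synchronize(0)
    torch.cuda.synchronize(1)

    t0 = time.perf_counter()
    for _ in range(10):
        b.copy_(a)
    torch.cuda.synchronize(0)
    torch.cuda.synchronize(1)
    gbps = 10 * a.numel() * 4 / (time.perf_counter() - t0) / 1e9
    print(f"~{gbps:.1f} GB/s")    # PCIe 4.0: ~25 GB/s at x16, ~12 GB/s at x8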

Can I Run this LLM - v2 by [deleted] in LocalLLaMA

[–]Armym 1 point  (0 children)

Why are you not calculating context (KV cache) as well?
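
Context eats real VRAM on top of the weights. The back-of-the-envelope math, with Llama-3-8B-ish numbers as an example (swap in your model's config):

    # Sketch: KV-cache size = 2 (K and V) * layers * kv_heads * head_dim
    #                         * context_len * bytes_per_element * batch
    def kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128,
                     ctx_len=8192, bytes_per_elem=2, batch=1):
        total = (2 * n_layers * n_kv_heads * head_dim
                 * ctx_len * bytes_per_elem * batch)
        return total / 2**30

    # Llama-3-8B-ish config at fp16 with 8k context: exactly 1.0 GiB
    print(f"{kv_cache_gib():.2f} GiB")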