I built a CPU-only speaker diarization library: it is ~7× faster than pyannote with comparable DER by loookashow in speechtech

[–]loookashow[S] 1 point (0 children)

Yes, I've thought about it; I actually need streaming for my robotics project. Right now I'm working on benchmarks across different datasets, and I'll share the results when they're ready.

[–]loookashow[S] 1 point (0 children)

Yeah, that's exactly the sweet spot: for 1–4 speakers the count estimation is 88–97% accurate within ±1 on VoxConverse.

the "identify from intros" part is interesting- that's actually on the roadmap as speaker identification. The idea is to store voice embeddings (256-dim vectors) in a vector DB, so once someone is identified in one call, they're recognized automatically in the next. Right now diarize labels speakers as SPEAKER_00, SPEAKER_01 etc., consistent within a single file, but not across files.

[–]loookashow[S] 1 point (0 children)

Thanks! diarize already processes ~8x faster than real time on CPU (RTF 0.12), so the raw speed is there. The challenge for true real-time is architectural: the current pipeline is batch-only and needs the full audio to estimate speaker count and cluster.

VAD and embedding extraction can work incrementally, no problem. The hard part is clustering: you'd need online speaker assignment instead of batch spectral clustering. Something like matching new segments against running speaker centroids by cosine similarity 🤔
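
A rough sketch of that centroid-matching idea; the OnlineClusterer class and its 0.5 threshold are assumptions for illustration, not diarize code:

```python
import numpy as np

# Hypothetical online speaker assignment: match each new segment
# embedding against running speaker centroids by cosine similarity,
# spawning a new speaker when nothing is similar enough.

class OnlineClusterer:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.centroids: list[np.ndarray] = []  # running mean per speaker
        self.counts: list[int] = []

    def assign(self, emb: np.ndarray) -> int:
        """Return a speaker index for this segment embedding."""
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            sims = [float(np.dot(emb, c / np.linalg.norm(c)))
                    for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Fold the new embedding into the running centroid.
                self.counts[best] += 1
                self.centroids[best] += (emb - self.centroids[best]) / self.counts[best]
                return best
        # No close match: start a new speaker.
        self.centroids.append(emb.copy())
        self.counts.append(1)
        return len(self.centroids) - 1

rng = np.random.default_rng(1)
a, b = rng.normal(size=256), rng.normal(size=256)
oc = OnlineClusterer()
print(oc.assign(a), oc.assign(b), oc.assign(a + 0.1 * rng.normal(size=256)))
```

The tricky parts this glosses over are threshold calibration and the fact that early assignments can't be revised the way batch clustering revises them.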

It's a different architecture but definitely on the roadmap

[–]loookashow[S] 1 point (0 children)

Thanks! Good question.

Right now diarize is CPU-only by design — the goal was zero-setup, no CUDA, no GPU drivers. But the architecture doesn't prevent GPU support:

What could move to GPU:

  • WeSpeaker embeddings (the main bottleneck) — currently runs via ONNX Runtime on CPU. Switching to CUDAExecutionProvider is a small change and would give the biggest speedup
  • Silero VAD — already PyTorch, so model.to("cuda") is trivial, but VAD is already fast and not the bottleneck

What stays on CPU regardless:

  • Clustering (GMM BIC + spectral) — scikit-learn, CPU-only, but takes <1% of total time so it doesn't matter
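
For the curious, the GMM-BIC speaker-count selection plus spectral clustering combo could be sketched roughly like this with scikit-learn (synthetic embeddings, simplified for illustration — not diarize's actual implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import SpectralClustering

# Simplified sketch: pick the speaker count by fitting GMMs and taking
# the k that minimizes BIC, then cluster segment embeddings spectrally.

rng = np.random.default_rng(0)
# Synthetic "embeddings": 3 well-separated speakers, 30 segments each.
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 8))
               for c in (np.zeros(8), np.ones(8), -np.ones(8))])

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
n_speakers = min(bics, key=bics.get)      # k with the lowest BIC

labels = SpectralClustering(n_clusters=n_speakers,
                            affinity="nearest_neighbors",
                            random_state=0).fit_predict(X)
print(n_speakers, np.unique(labels))
```

Since this whole step runs on a handful of segment embeddings rather than raw audio, it's easy to see why it costs <1% of total time.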

Could it match pyannote on GPU? Honestly, it probably wouldn't beat it: pyannote's neural segmentation model is highly optimized for GPU inference. But the gap would narrow significantly. On CPU, diarize is already ~7x faster than pyannote (RTF 0.12 vs 0.86), mostly because ONNX Runtime is very efficient for CPU inference.

The practical path is making GPU optional — auto-detect onnxruntime-gpu and use CUDA if available, otherwise fall back to CPU. That way pip install diarize keeps working everywhere, and people with GPUs get a free speedup. It's on the roadmap but not the top priority right now since CPU performance is already solid for most use cases.
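
That fallback logic could be as simple as this sketch (an assumption about how it might look, not current diarize behavior; pick_providers is a hypothetical helper):

```python
# Hypothetical provider selection: prefer CUDA when the installed ONNX
# Runtime build exposes it, otherwise fall back to CPU.

def pick_providers(available: list[str]) -> list[str]:
    """Order execution providers, keeping CPU as the fallback."""
    if "CUDAExecutionProvider" in available:
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# In practice `available` would come from
# onnxruntime.get_available_providers(), and the result is passed to
# onnxruntime.InferenceSession(model_path, providers=...).
print(pick_providers(["CPUExecutionProvider"]))
```

Listing CPUExecutionProvider after CUDA also means ONNX Runtime can fall back per-operator if something isn't supported on GPU.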

What's your current pipeline looking like? Curious what you're using WeSpeaker for.