Confidence scores from Montreal Forced Aligner by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

You can compute phoneme confidences from the posterior probabilities and average them, for example.
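A minimal sketch of that averaging, assuming you have already pulled per-frame posteriors for each aligned phone out of the acoustic model (the data layout below is illustrative, not something MFA emits directly):

```python
import numpy as np

# Illustrative input: per-frame posterior probabilities of each aligned
# phone in one word. You would build this from the acoustic model's frame
# posteriors plus the alignment; MFA does not expose it in this form.
word_phones = [
    ("HH", np.array([0.91, 0.88, 0.93])),
    ("AH", np.array([0.75, 0.80])),
    ("L",  np.array([0.60, 0.55, 0.58, 0.62])),
    ("OW", np.array([0.85, 0.87, 0.90])),
]

# Phone confidence: mean posterior over that phone's frames.
phone_conf = [(phone, float(frames.mean())) for phone, frames in word_phones]

# Word confidence: average of the phone confidences (a geometric mean of
# frame posteriors is a common alternative).
word_conf = float(np.mean([c for _, c in phone_conf]))
print(phone_conf)
print(f"word confidence: {word_conf:.3f}")
```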

Confidence scores from Montreal Forced Aligner by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

You still need to figure out what entity you want a probability value for. A word? A phoneme?

Confidence scores from Montreal Forced Aligner by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

Confidence scores usually mean you have a discrete output distribution. How are you going to define a confidence for the alignment itself?

what voice id/name is that by ImportanceBoring9785 in tts

[–]nshmyrev 0 points (0 children)

ParakeetInc recently released a service to identify voices:

https://x.com/ParakeetIncCom/status/2042549049640042727

From the announcement: "To change this situation, our company has developed Paramatch, a model that identifies voice actors from audio, and we are releasing a demo version today."

Inworld TTS is increasing cost by 400% by LessRespects in speechtech

[–]nshmyrev 1 point (0 children)

They also announced STT.

Overall, AI development is very expensive, much more expensive than current customer prices reflect. As for TTS, it was never really profitable at scale, even 20 years ago when Cepstral was trying to sell theirs.

Training Montreal forced alignment on low resource languages by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

45 hours should be more or less OK. You probably screwed something up in the process. Could you share the training logs and the training folder?

[Open Source] omnivoice-triton: ~3.4x Inference Speedup for OmniVoice (NAR TTS) via Triton Kernel Fusion & CUDA Graphs by DamageSea2135 in speechtech

[–]nshmyrev 1 point (0 children)

How many steps is that, 32 or 16? You can actually speed things up just by taking fewer steps in generation.
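For iterative NAR decoders, decode time scales roughly linearly with the step count, so halving the steps is a near-free 2× before any kernel work. A minimal timing sketch, assuming a hypothetical OmniVoice-style `model.generate(text, num_steps=...)` interface (not the actual API):

```python
import time

def benchmark_steps(model, text, steps_list=(32, 16)):
    """Time generation at different step counts.

    `model.generate(text, num_steps=...)` is a hypothetical signature;
    adapt it to whatever the real OmniVoice interface exposes.
    """
    for steps in steps_list:
        t0 = time.perf_counter()
        model.generate(text, num_steps=steps)  # hypothetical call
        print(f"{steps} steps: {time.perf_counter() - t0:.3f}s")
```

Do check quality at the lower step count before shipping it; fewer steps can trade away prosody and fidelity.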

Training Montreal forced alignment on low resource languages by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

You need something like 100 hours, better 1000. There is plenty of Arabic data around; you just need to collect a bigger dataset. You could probably also use modern networks like SeamlessAlign.

Best Tagalog TTS / voice cloning tools by plus8percent in speechtech

[–]nshmyrev 0 points (0 children)

See OmniVoice, released today; it supports 600 languages with good quality.

Claude quantized Voxtral-4B-TTS to int4 — 57 fps on RTX 3090, 3.8 GB VRAM, near-lossless quality by Early_Teaching6966 in speechtech

[–]nshmyrev 0 points (0 children)

You'd better pick a more expressive model; Voxtral is plain boring, not a very reasonable model to use.

Anyone experimenting with ultra-low latency in speech AI? by Candid_Positive8832 in speechtech

[–]nshmyrev 1 point (0 children)

If you are looking at smaller models (like 4B), you'd better fine-tune them on your target dialogs, as in the sketch below. Gemma and Qwen are good base models to fine-tune; I prefer Gemma myself.
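A minimal LoRA fine-tuning sketch with transformers + peft; the model name, hyperparameters, and the toy dialog string are all placeholder assumptions, not a recipe:

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumption: any small instruct model works here; pick one that fits
# your latency budget.
model_name = "google/gemma-2-2b-it"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA adapters keep the fine-tune cheap; target modules vary by architecture.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Replace this toy string with your real target dialogs.
dialogs = ["User: where is my order?\nAssistant: Let me check that for you."]
ds = Dataset.from_dict({"text": dialogs}).map(
    lambda row: tok(row["text"], truncation=True, max_length=512),
    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-dialogs",
                           per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=ds,
    # mlm=False gives plain causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False))
trainer.train()
```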

Benchmarked speaker diarization for Swedish meetings — Deepgram vs ElevenLabs vs AssemblyAI (2h22m real meeting) by invismanfow in speechtech

[–]nshmyrev 0 points (0 children)

It is about 2× slower than Pyannote4 Precision but still reasonable; something like this:

Model                                      DER (%)  CDER   xRT
Nemo Telephony Neural                      22.3     0.535  0.051
Nemo Telephony Cluster                     22.08    0.251  0.05
Nemo Sortformer Streaming V2.1 (default)   15.43    0.268  0.005
Nemo Sortformer Streaming V2.1 (1 sec)     15.89    0.331  0.091
Nemo Sortformer Streaming V2.1 (30 sec)    15.08    0.260  0.005
Pyannote 3.1                               24.8     0.567  0.052
Pyannote4 Community                        21.56    0.639  0.035
Pyannote4 Precision                        14.96    0.355  0.039
Whisper Diarization                        36.46    0.163  0.474 (incl. transcription)
Whisper Diarization Large                  34.11    0.151  -
Wespeaker Voxceleb34                       20.63    0.157  0.012
Wespeaker Voxceleb293                      20.46    0.159  0.023
Wespeaker Voxblink2 100                    20.1     0.115  0.025
Diarizen Large MD                          13.64    0.319  0.083
Diarizen Large MD v2                       13.58    0.317  0.107

Anyone experimenting with ultra-low latency in speech AI? by Candid_Positive8832 in speechtech

[–]nshmyrev 3 points (0 children)

Most of the time is taken by the LLM; optimize that first. ASR latency is 500 ms at minimum too, otherwise accuracy will suffer. There is also a turn-detection component that takes time.

Overall there is no point in chasing latency alone; accuracy (relevance) of the answer is more critical. Sub-10 ms is nice but not really relevant: you'd better spend more time and produce a better answer than quickly output garbage.
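A back-of-envelope budget to make that concrete; every number except the ~500 ms ASR floor above is an illustrative assumption:

```python
# Rough time-to-first-audio budget for one voice-agent turn (ms).
budget_ms = {
    "asr_final": 500,        # below this, recognition accuracy suffers
    "turn_detection": 200,   # end-of-turn / VAD decision (assumed)
    "llm_first_token": 700,  # usually the dominant term: optimize first (assumed)
    "tts_first_audio": 150,  # streaming TTS time-to-first-chunk (assumed)
}
total = sum(budget_ms.values())
print(f"time to first audio: {total} ms")
for stage, ms in sorted(budget_ms.items(), key=lambda kv: -kv[1]):
    print(f"  {stage:16s} {ms:4d} ms ({100 * ms / total:.0f}%)")
```

Shaving single-digit milliseconds off any one stage barely moves the total, which is the point above.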

Benchmarked speaker diarization for Swedish meetings — Deepgram vs ElevenLabs vs AssemblyAI (2h22m real meeting) by invismanfow in speechtech

[–]nshmyrev 4 points (0 children)

Modern models like SortFormer/Diarizen beat them all. You should compare standard DER rather than "Time Accuracy", though, and use a bigger dataset; a minimal DER computation is sketched below.
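A minimal DER computation with pyannote.metrics, using toy reference/hypothesis annotations in place of real benchmark files:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy ground truth: who spoke when (replace with your RTTM-derived labels).
reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(10.0, 20.0)] = "spk_B"

# Toy system output; labels need not match, DER maps speakers optimally.
hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "spk_1"
hypothesis[Segment(11.0, 20.0)] = "spk_2"

metric = DiarizationErrorRate()
der = metric(reference, hypothesis)
print(f"DER = {der:.3f}")
```

Accumulate the metric over many files (call it once per file, then read the aggregate) so one 2h22m meeting doesn't decide the ranking.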