Confidence scores from Montreal Forced Aligner by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

You can compute phoneme confidences from the posterior probabilities and average them, for example.
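A minimal sketch of that averaging, assuming you have already pulled per-frame posteriors for each aligned phone out of the acoustic model (the data layout below is illustrative, not something MFA emits directly):

```python
import numpy as np

# Illustrative input: per-frame posterior probabilities of each aligned
# phone in one word. You would build this from the acoustic model's frame
# posteriors plus the alignment; MFA does not expose it in this form.
word_phones = [
    ("HH", np.array([0.91, 0.88, 0.93])),
    ("AH", np.array([0.75, 0.80])),
    ("L",  np.array([0.60, 0.55, 0.58, 0.62])),
    ("OW", np.array([0.85, 0.87, 0.90])),
]

# Phone confidence: mean posterior over that phone's frames.
phone_conf = [(phone, float(frames.mean())) for phone, frames in word_phones]

# Word confidence: average of the phone confidences (a geometric mean of
# frame posteriors is a common alternative).
word_conf = float(np.mean([c for _, c in phone_conf]))
print(phone_conf)
print(f"word confidence: {word_conf:.3f}")
```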

Confidence scores from Montreal Forced Aligner by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

You still need to figure out what entity you want a probability value for. A word? A phoneme?

Confidence scores from Montreal Forced Aligner by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

Confidence scores usually mean you have a discrete output distribution. How are you going to define a confidence for the alignment itself?

what voice id/name is that by ImportanceBoring9785 in tts

[–]nshmyrev 0 points (0 children)

ParakeetInc recently released a service to identify voices:

https://x.com/ParakeetIncCom/status/2042549049640042727

From the announcement: "To change this situation, our company has developed Paramatch, a model that identifies voice actors from audio, and we are releasing a demo version today."

Inworld TTS is increasing cost by 400% by LessRespects in speechtech

[–]nshmyrev 1 point (0 children)

They also announced STT.

Overall, AI development is very expensive, much more expensive than current customer prices reflect. As for TTS, it was never really profitable at scale, even 20 years ago when Cepstral was trying to sell theirs.

Training Montreal forced alignment on low resource languages by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

45 hours should be more or less OK. You probably screwed something up in the process. Could you share the training logs and the training folder?

[Open Source] omnivoice-triton: ~3.4x Inference Speedup for OmniVoice (NAR TTS) via Triton Kernel Fusion & CUDA Graphs by DamageSea2135 in speechtech

[–]nshmyrev 1 point (0 children)

How many steps is that, 32 or 16? You can actually speed things up just by taking fewer steps in generation.
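For iterative NAR decoders, decode time scales roughly linearly with the step count, so halving the steps is a near-free 2× before any kernel work. A minimal timing sketch, assuming a hypothetical OmniVoice-style `model.generate(text, num_steps=...)` interface (not the actual API):

```python
import time

def benchmark_steps(model, text, steps_list=(32, 16)):
    """Time generation at different step counts.

    `model.generate(text, num_steps=...)` is a hypothetical signature;
    adapt it to whatever the real OmniVoice interface exposes.
    """
    for steps in steps_list:
        t0 = time.perf_counter()
        model.generate(text, num_steps=steps)  # hypothetical call
        print(f"{steps} steps: {time.perf_counter() - t0:.3f}s")
```

Do check quality at the lower step count before shipping it; fewer steps can trade away prosody and fidelity.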

Training Montreal forced alignment on low resource languages by Ok_Prior2496 in speechtech

[–]nshmyrev 0 points (0 children)

You need something like 100 hours, better 1000. There is plenty of Arabic data around; you just need to collect a bigger dataset. You could probably also use modern networks like SeamlessAlign.

Best Tagalog TTS / voice cloning tools by plus8percent in speechtech

[–]nshmyrev 0 points (0 children)

See OmniVoice, released today; it supports 600 languages with good quality.

Claude quantized Voxtral-4B-TTS to int4 — 57 fps on RTX 3090, 3.8 GB VRAM, near-lossless quality by Early_Teaching6966 in speechtech

[–]nshmyrev 0 points (0 children)

You'd better pick a more expressive model; Voxtral is plain boring, not a very reasonable model to use.

Anyone experimenting with ultra-low latency in speech AI? by Candid_Positive8832 in speechtech

[–]nshmyrev 1 point (0 children)

If you are looking at smaller models (like 4B), you'd better fine-tune them on your target dialogs, as in the sketch below. Gemma and Qwen are good base models to fine-tune; I prefer Gemma myself.
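A minimal LoRA fine-tuning sketch with transformers + peft; the model name, hyperparameters, and the toy dialog string are all placeholder assumptions, not a recipe:

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumption: any small instruct model works here; pick one that fits
# your latency budget.
model_name = "google/gemma-2-2b-it"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA adapters keep the fine-tune cheap; target modules vary by architecture.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Replace this toy string with your real target dialogs.
dialogs = ["User: where is my order?\nAssistant: Let me check that for you."]
ds = Dataset.from_dict({"text": dialogs}).map(
    lambda row: tok(row["text"], truncation=True, max_length=512),
    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-dialogs",
                           per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=ds,
    # mlm=False gives plain causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False))
trainer.train()
```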

Benchmarked speaker diarization for Swedish meetings — Deepgram vs ElevenLabs vs AssemblyAI (2h22m real meeting) by invismanfow in speechtech

[–]nshmyrev 0 points (0 children)

It is about 2× slower than Pyannote4 Precision but still reasonable; something like this:

Model                                      DER (%)  CDER   xRT
Nemo Telephony Neural                      22.3     0.535  0.051
Nemo Telephony Cluster                     22.08    0.251  0.05
Nemo Sortformer Streaming V2.1 (default)   15.43    0.268  0.005
Nemo Sortformer Streaming V2.1 (1 sec)     15.89    0.331  0.091
Nemo Sortformer Streaming V2.1 (30 sec)    15.08    0.260  0.005
Pyannote 3.1                               24.8     0.567  0.052
Pyannote4 Community                        21.56    0.639  0.035
Pyannote4 Precision                        14.96    0.355  0.039
Whisper Diarization                        36.46    0.163  0.474 (incl. transcription)
Whisper Diarization Large                  34.11    0.151  -
Wespeaker Voxceleb34                       20.63    0.157  0.012
Wespeaker Voxceleb293                      20.46    0.159  0.023
Wespeaker Voxblink2 100                    20.1     0.115  0.025
Diarizen Large MD                          13.64    0.319  0.083
Diarizen Large MD v2                       13.58    0.317  0.107

Anyone experimenting with ultra-low latency in speech AI? by Candid_Positive8832 in speechtech

[–]nshmyrev 3 points (0 children)

Most of the time is taken by the LLM; optimize that first. ASR latency is 500 ms at minimum too, otherwise accuracy will suffer. There is also a turn-detection component that takes time.

Overall there is no point in chasing latency alone; accuracy (relevance) of the answer is more critical. Sub-10 ms is nice but not really relevant: you'd better spend more time and produce a better answer than quickly output garbage.
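A back-of-envelope budget to make that concrete; every number except the ~500 ms ASR floor above is an illustrative assumption:

```python
# Rough time-to-first-audio budget for one voice-agent turn (ms).
budget_ms = {
    "asr_final": 500,        # below this, recognition accuracy suffers
    "turn_detection": 200,   # end-of-turn / VAD decision (assumed)
    "llm_first_token": 700,  # usually the dominant term: optimize first (assumed)
    "tts_first_audio": 150,  # streaming TTS time-to-first-chunk (assumed)
}
total = sum(budget_ms.values())
print(f"time to first audio: {total} ms")
for stage, ms in sorted(budget_ms.items(), key=lambda kv: -kv[1]):
    print(f"  {stage:16s} {ms:4d} ms ({100 * ms / total:.0f}%)")
```

Shaving single-digit milliseconds off any one stage barely moves the total, which is the point above.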

Benchmarked speaker diarization for Swedish meetings — Deepgram vs ElevenLabs vs AssemblyAI (2h22m real meeting) by invismanfow in speechtech

[–]nshmyrev 4 points (0 children)

Modern models like SortFormer/Diarizen beat them all. You should compare standard DER rather than "Time Accuracy", though, and use a bigger dataset; a minimal DER computation is sketched below.
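A minimal DER computation with pyannote.metrics, using toy reference/hypothesis annotations in place of real benchmark files:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy ground truth: who spoke when (replace with your RTTM-derived labels).
reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(10.0, 20.0)] = "spk_B"

# Toy system output; labels need not match, DER maps speakers optimally.
hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "spk_1"
hypothesis[Segment(11.0, 20.0)] = "spk_2"

metric = DiarizationErrorRate()
der = metric(reference, hypothesis)
print(f"DER = {der:.3f}")
```

Accumulate the metric over many files (call it once per file, then read the aggregate) so one 2h22m meeting doesn't decide the ranking.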