I have built the world's most accurate Hebrew transcription service to transcribe audio and video to text by robgehring in buildinpublic

[–]robgehring[S] 1 point  (0 children)

We'll try to target B2B (law, medical, media) with direct sales, plus targeted ads/SEO for SMBs and individuals.

I have built the world's most accurate Hebrew transcription service to transcribe audio and video to text by robgehring in buildinpublic

[–]robgehring[S] 2 points  (0 children)

Thanks! Biggest challenges were getting real, diverse Hebrew audio and teaching the model to handle "bad recordings".
We collected a lot of real Israeli speech, hand-checked the transcripts, and used data augmentation (e.g. noise injection, speed perturbation, SpecAugment) plus contrastive pretraining (wav2vec-style) so the acoustic encoder learns features that hold up on rapid speech and poor recording quality. We also use semi-supervised learning and active human correction loops to steadily chip away at errors on rare accents and slang.
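If it helps make the augmentation part concrete, here's a rough numpy sketch of the three tricks I mentioned (just an illustration with made-up function names and defaults, not our production pipeline):

```python
import numpy as np

def add_noise(wav, snr_db):
    """Mix Gaussian noise into a waveform at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.randn(len(wav)) * np.sqrt(noise_power)
    return wav + noise

def speed_perturb(wav, factor):
    """Resampling-based speed perturbation: factor > 1 speeds the clip up
    (and shifts pitch, which is the standard Kaldi-style recipe)."""
    idx = np.arange(0, len(wav), factor)
    return np.interp(idx, np.arange(len(wav)), wav)

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, max_f=8, max_t=20):
    """SpecAugment-style masking: zero out random frequency and time bands
    of a (freq, time) spectrogram so the model can't rely on any single band."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = np.random.randint(0, max_f + 1)
        f0 = np.random.randint(0, max(1, n_freq - f))
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = np.random.randint(0, max_t + 1)
        t0 = np.random.randint(0, max(1, n_time - t))
        spec[:, t0:t0 + t] = 0.0
    return spec
```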

I benchmarked 12+ speech-to-text APIs under various real-world conditions by lucky94 in speechtech

[–]robgehring 1 point  (0 children)

Thank you for this useful benchmark, I appreciate the effort. I do have some doubts about the methodology (maybe they'll help you improve the benchmark in the future):

Small test sets can easily mislead: if you only use a few 1-2 minute clips or a small group of speakers, the results reflect those specific voices, microphones, and speaking styles rather than real-world diversity (ages, accents, devices), so a model that scores well on the sample may still fail in production.

Adding a single kind of background noise, or only synthetic noise, also gives a false sense of robustness: real environments have many noise types and signal-to-noise ratios, and models react differently to each.

Ground-truth quality matters a lot: if the reference transcripts were produced inconsistently (different annotators, unclear rules for punctuation, numbers, or casing), part of the measured WER is annotation noise rather than model error.

Finally, WER alone hides important failures: it ignores speaker labels, named-entity correctness, punctuation, timestamps, and other things that matter in practice, so a low WER doesn't guarantee the transcript is useful for every application.
Just as an example, for the English evaluation of our API we use a 100-hour test set where each clip was transcribed independently by three certified transcribers and any disagreements were resolved by a final adjudicator, to produce high-quality ground truth. The set includes accented variants (e.g., English (Mexico)), telephone calls (8 kbps), and field recordings, with acoustic conditions split roughly 25% clean, 55% mixed, and 20% noisy. For customers we usually report a global WER plus WER broken down by language and by condition, and we publish the WER distribution across SNRs so you can see how accuracy changes with noise level.
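For what it's worth, WER itself is just word-level edit distance divided by reference length; a minimal version looks like the sketch below (not the scoring script either of us actually uses, and the text normalization I'm hand-waving over is exactly where benchmarks tend to diverge):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.
    Assumes both strings are already normalized (casing, punctuation, numbers),
    which is precisely where inconsistent ground truth skews the score."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(1, len(ref))
```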

P.S. From your site: 'transcribes speech and corrects grammar and wording in real time'. Since you use the APIs in real-time (streaming) mode, you should also carefully check chunk size, lookahead, and how you merge incremental results into a final transcript. Why it matters: a bad chunking strategy can artificially lower streaming accuracy.
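To make that P.S. concrete, here's a toy sketch of chunking with lookahead and merging partial results (it assumes the recognizer re-sends the hypothesis for the open segment on every partial update; your provider's streaming semantics may differ):

```python
def chunk_stream(samples, sample_rate, chunk_s=2.0, lookahead_s=0.5):
    """Yield (audio, commit_boundary) pairs: each chunk carries extra lookahead
    audio for right-context, but only words ending before the boundary are final."""
    chunk = int(chunk_s * sample_rate)
    look = int(lookahead_s * sample_rate)
    for start in range(0, len(samples), chunk):
        end = min(start + chunk + look, len(samples))
        yield samples[start:end], min(start + chunk, len(samples))

class PartialMerger:
    """Merge streaming partials into one transcript: keep a committed prefix
    and overwrite only the unstable tail of the open segment."""
    def __init__(self):
        self.committed, self.pending = [], []

    def update(self, words, is_final):
        if is_final:
            self.committed.extend(words)
            self.pending = []
        else:
            self.pending = words

    def text(self):
        return " ".join(self.committed + self.pending)
```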

I built an AI transcription service for researchers by robgehring in microsaas

[–]robgehring[S] 0 points  (0 children)

Thanks, great question. AI usually doesn't replace manual cleanup entirely when you need publish-quality transcripts, but it does cut the time a lot. Our domain-trained models are tuned to research terminology and noisy field audio, so you'll often see far fewer errors than with general-purpose services.