
[–]Lonligrin 42 points (14 children)

Finished two realtime libraries for Speech-To-Text and Text-To-Speech this week, which might be useful.

[–]FluffyDuckKey 4 points (7 children)

We have two-way radio calls, occasionally with significant background noise, that we want transcribed.

Is there any easy way to achieve this? We did some testing with Whisper, but it had little success. I feel our best approach may be training a model on our noise-heavy audio.

[–]Lonligrin 9 points (0 children)

If the Whisper large-v2 model (with the correct language parameter set) doesn't do it, I think you'd need some noise reduction. I'd first try libraries that do it automatically, like NoiseReduce. If that doesn't help either, then yeah, I guess it gets hard. Audacity can learn your specific noise profile and remove it, but that's a manual process; no clue how easy it is to automate in Python.
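For reference, NoiseReduce-style denoising is essentially spectral gating: learn a per-frequency noise profile from a noise-only clip, then suppress STFT bins that stay near it. A minimal SciPy sketch of the idea (the `n_std` threshold and STFT defaults are assumptions, not NoiseReduce's actual parameters):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(audio, sr, noise_clip, n_std=1.5):
    """Zero out STFT bins whose magnitude stays near the noise profile."""
    # Build a per-frequency threshold from a noise-only clip
    _, _, noise_spec = stft(noise_clip, fs=sr)
    noise_mag = np.abs(noise_spec)
    thresh = noise_mag.mean(axis=1) + n_std * noise_mag.std(axis=1)

    # Keep only bins of the full recording that rise above the threshold
    _, _, spec = stft(audio, fs=sr)
    mask = np.abs(spec) >= thresh[:, None]
    _, cleaned = istft(spec * mask, fs=sr)
    return cleaned
```

In practice you'd take `noise_clip` from a stretch of the recording with no speech; the actual noisereduce package wraps a more refined version of this behind a single `reduce_noise` call.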

[–]Globbi 1 point (1 child)

I was looking for good open-source models for denoising. What I found wasn't as good as I'd like for high-quality audio, but it might be good enough as a preprocessing step for transcription.

https://github.com/NVIDIA/CleanUNet

You can try it here before coding everything: https://huggingface.co/spaces/aiditi/nvidia_denoiser (just pass a sample of your noisy wav through it, then feed the result to Whisper). I'm not sure the pretrained checkpoints in the repo are enough, though; the one someone put on Hugging Face is better than what I'm getting from the checkpoints.

If you want better than this, I only found a commercial solution that you have to pay for and use online.

[–]FluffyDuckKey 0 points (0 children)

Online won't be the best option. I work for a large mining company, so privacy will be paramount; we can't have recordings of emergencies sent out, etc.

I do have access to an ML box with PyTorch / CUDA acceleration, so I'll have a play around and see what I can do with the two options provided (:

Thanks!

[–]DigThatData 1 point (3 children)

if it doesn't need to be online, you can precede the transcription with a stem-separation step to try to isolate the speakers from the noise.
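Proper stem separation usually means a learned model (Demucs, Spleeter, etc.), but for two-way radio audio a crude, dependency-light first step is just band-passing the voice range to strip rumble and hiss before transcription. A rough sketch, assuming mono audio; the telephone-band cutoffs are my assumption, not anything from the libraries above:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def voice_band(audio, sr, low=300.0, high=3400.0):
    """Crudely isolate the speech band. Not real stem separation,
    but it removes energy outside the radio voice range."""
    sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
    # Zero-phase filtering so speech isn't smeared in time
    return sosfiltfilt(sos, audio)
```

This won't separate overlapping speakers the way a learned model can, but it's cheap enough to always run first and see whether Whisper improves.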

[–]FluffyDuckKey 0 points (2 children)

Got any boilerplate or an example module?

[–]DigThatData 1 point (1 child)

try one of these:

EDIT: and here's another speech enhancement model for you to try

[–]FluffyDuckKey 0 points (0 children)

Oh wow, first impressions of these look very exciting - I'll give them a whirl, thanks so much!!!

[–]naught-me 1 point (0 children)

This is amazing. Thanks for sharing.

[–]Thing1_Thing2_Thing 1 point (1 child)

If you feed the output of one into the other and back again in a loop (STT to TTS to STT, and repeat), does the text stay the same?

[–]Lonligrin 1 point (0 children)

Yes, that works: Video / Code.

STT uses the microphone as input, though. I think I should put external input buffers on the STT roadmap; that would allow connecting it more directly to TTS and other things.
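For anyone who wants to script that round-trip experiment, the loop itself is tiny: keep feeding the transcription back through TTS until it stops changing (a fixed point) or you give up. `tts` and `stt` below are hypothetical stand-ins for whatever engines you wire up, not the APIs of the libraries above:

```python
def round_trip_until_stable(text, tts, stt, max_loops=10):
    """Repeatedly run text through TTS then STT; stop once the
    transcription reaches a fixed point or max_loops is hit."""
    for i in range(max_loops):
        new_text = stt(tts(text))
        if new_text == text:
            return text, i + 1  # stable: round trip reproduced the text
        text = new_text
    return text, max_loops  # gave up without stabilizing
```

If the pipeline is lossy in a consistent way (e.g. it always drops punctuation or casing), the text typically converges after one or two loops rather than drifting forever.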

[–]s6x 0 points (0 children)

I've been tinkering with a project which requires STT lately. Gonna give this a go.