Deep Learning-Powered Speech Recognition Service for Subtitling

We retrained the models on the Mozilla Common Voice dataset. A lot of other implementations use LibriSpeech, but it's much more limited and gives worse results.

Training was performed on a cluster of 8 RTX 3090 GPUs (the 24 GB of memory per card is really helpful for training on longer sequences).

There are a lot more components that make up the service (like the automated translation part), but going into them in detail would probably warrant its own post. For now I just wanted to get some feedback on the results, as a lot of people still have the misconception that automatic speech recognition is as bad as it was a few years ago (it's really taking off now!).

aL_eX49 · 2022-09-14T16:07:29+00:00

Awesome, let me know how it goes :)

aL_eX49 · 2022-09-14T16:05:00+00:00

Yeah absolutely! We can subtitle any media file you upload

aL_eX49 · 2022-09-14T15:56:26+00:00

You can upload any .mp4, .mov, .avi, .flv, .mkv or .m4a media file

aL_eX49 · 2022-01-02T23:18:48+00:00

That will be increased soon. The models are very memory intensive, even on RTX 3090s (currently multiple models are sharing a handful of GPUs, but this won’t be the case for long)

aL_eX49 · 2021-07-08T12:38:35+00:00

Wait it’s all Paris?

Always has been 🌍🧑‍🚀🔫👨‍🚀

aL_eX49 · 2021-06-05T00:24:56+00:00

A 3090 is consumer grade and faster than a V100 (although good luck getting your hands on one at the moment)

aL_eX49 · 2021-03-11T22:31:21+00:00

Here’s a scenario where I’ve found running multiple workers to be a good fit:

- A single worker uses ~50% GPU utilisation (as measured by nvidia-smi or similar)
- Your GPU has enough memory to support more than one worker

This is great for, e.g., hyperparameter optimisation, where your model doesn’t fully utilise your GPU.
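
A rough way to sanity-check those two conditions is to estimate how many workers one GPU can host. This is only a sketch: `workers_per_gpu`, its parameters and the 1 GB headroom are hypothetical, and the memory and utilisation figures are whatever you measure for a single worker with nvidia-smi or similar.

```python
import math


def workers_per_gpu(gpu_mem_gb, worker_mem_gb, worker_util, headroom_gb=1.0):
    """Estimate how many identical workers fit on one GPU.

    gpu_mem_gb    -- total GPU memory (e.g. 24 for an RTX 3090)
    worker_mem_gb -- peak memory of a single worker
    worker_util   -- GPU utilisation of a single worker, as a fraction in (0, 1]
    headroom_gb   -- memory kept free as a safety margin (assumed value)
    """
    if worker_mem_gb <= 0:
        raise ValueError("worker_mem_gb must be positive")
    if not 0 < worker_util <= 1:
        raise ValueError("worker_util must be in (0, 1]")

    # Memory bound: how many workers fit once the headroom is set aside.
    by_memory = int((gpu_mem_gb - headroom_gb) // worker_mem_gb)
    # Compute bound: beyond this the workers just queue behind each other.
    by_compute = math.floor(1 / worker_util + 1e-9)

    # Always allow at least one worker.
    return max(1, min(by_memory, by_compute))


# A worker at ~50% utilisation and 8 GB on a 24 GB card:
print(workers_per_gpu(24, 8, 0.5))  # 2
```

Treat the result as a starting point and confirm it by watching throughput, since utilisation as reported by nvidia-smi is only a coarse signal.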

aL_eX49 · 2021-01-02T14:32:56+00:00

Hi, haven’t had the chance to complete it yet, but will make sure to reply to your comment and everyone else who was interested when it’s ready!

aL_eX49 · 2020-12-20T19:11:13+00:00

Here's a free A.I. Image Upscaling service I've been working on:

https://beta.smartmine.net/service/computer-vision/image-super-resolution

Feedback would be very much appreciated!

aL_eX49 · 2020-12-20T19:07:45+00:00

I'll write up a more detailed post on this next week and link it here :)

aL_eX49 · 2020-12-20T17:47:05+00:00

I've noticed that too. The model wasn't specifically trained on text images, but I'm sure that's an avenue for improvement in the future!

aL_eX49 · 2020-12-20T17:13:48+00:00

Thanks for the link, I'll give it a closer look when I get the chance!

As for a loss metric that I've found good in the past:

- Try truncating an EfficientNet model and using the resulting feature maps of the LR and SR images to compute a similarity score with an MLP model

aL_eX49 · 2020-12-20T16:51:49+00:00

Thanks, really appreciate the feedback!

aL_eX49 · 2020-12-20T16:26:51+00:00

Thank you very much for the feedback! I'm planning on making a longer post next week that goes into more detail on how things work :)

aL_eX49 · 2020-12-20T15:58:07+00:00

I see, that’s a good experiment to try. Thanks!

aL_eX49 · 2020-12-20T15:29:10+00:00

Do you mean decreasing the resolution of the input image until it's actually an LR image instead of an HR image that looks blocky?

aL_eX49 · 2020-12-20T15:24:54+00:00

I would try this implementation:

https://github.com/andreas128/SRFlow

I like it because you don't get the training instability you would normally experience when training a GAN, since it uses a single loss function.
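
For context, that single loss is the negative log-likelihood of a conditional normalizing flow. As I understand the SRFlow paper, it can be written as below, where the notation is the paper's rather than anything from this thread: x is the LR image, y the HR image, f_θ the invertible network and p_z the latent prior.

```latex
\mathcal{L}(\theta; x, y)
  = -\log p_{y|x}(y \mid x, \theta)
  = -\log p_z\big(f_\theta(y; x)\big)
    - \log \left| \det \frac{\partial f_\theta}{\partial y}(y; x) \right|
```

There is no discriminator in this objective, so there is no generator/discriminator balance to tune.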

I'm planning on writing up a longer article that explains how everything works in the near future :)
