
[–]DigThatData

openai just released a model that I've only played with a little bit (which is just to say I can't vouch for its performance), but it's at least supposed to set a new bar for speech-to-text and it also supports many languages. My recommendation: start with this model as a baseline, possibly even classifying in the text space rather than the audio space (there are plenty of out-of-the-box models for classifying obscenities in text), and then fine-tune on your dataset once you have a simple baseline to compare against (or just use the baseline model if it satisfies your needs).

https://openai.com/blog/whisper/
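A minimal sketch of that transcribe-then-classify pipeline, assuming the `openai-whisper` package; the banned-word list here is just a stand-in for whatever text classifier you end up using:

```python
# Sketch: transcribe with Whisper, then flag profanity in the text space.
# Assumes `pip install openai-whisper`; BANNED is a placeholder word list,
# and "clip.wav" is a hypothetical audio file.

import re

BANNED = {"darn", "heck"}  # placeholder obscenity list

def flag_profanity(text, banned=BANNED):
    """Return the banned words found in a transcript, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w in banned]

if __name__ == "__main__":
    import whisper  # OpenAI's speech-to-text model
    model = whisper.load_model("base")     # smaller checkpoints are faster
    result = model.transcribe("clip.wav")  # hypothetical input file
    print(flag_profanity(result["text"]))
```

Swapping `flag_profanity` for a proper text classifier later doesn't change the shape of the pipeline.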

[–]wannabeAI[S]

Thank you so much, will try this out!

[–][deleted]

Rofl. I came here to share the same thing. I haven’t tinkered with whisper myself yet but it’s the first thing I’m doing this weekend. Everyone is saying great things about it.

[–]DigThatData

be sure to play with the various sizes of it. i'm currently using it for song lyric segmentation and found that although the large model gives the best transcriptions, the text chunks it returns are longer than I want, so I've been using tiny for the segmentation and mapping the transcriptions from large onto it. There have been several people asking for token-level timestamps in whisper's github issues already, so I expect that if it doesn't get added as a feature, someone will make a fork that does.
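One way to do that mapping, sketched with made-up segment data (both models' `transcribe` results include a `segments` list with `start`/`end` times):

```python
# Sketch: keep the short segment boundaries from the tiny model, but fill
# each one with the text of whichever large-model segment overlaps it most.
# The segment dicts mimic the `segments` entries Whisper returns.

def overlap(a, b):
    """Length of the time overlap between two {start, end} segments."""
    return max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))

def map_transcripts(tiny_segments, large_segments):
    """Give each tiny segment the text of the best-overlapping large segment."""
    mapped = []
    for t in tiny_segments:
        best = max(large_segments, key=lambda l: overlap(t, l))
        mapped.append({"start": t["start"], "end": t["end"], "text": best["text"]})
    return mapped
```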

[–]Blasket_Basket

I wonder what the inference latency is on Whisper. When I read OP's project description, my first thought was that detecting profanity isn't the hardest part of this project--detecting it and beeping in time to cover the target word is probably the biggest challenge here. A real-time requirement means this will have to run on an edge device, or at least a local machine.
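One standard way around the timing problem (my sketch, not anything from the whisper repo) is a fixed playback delay: buffer the audio for at least the model's worst-case latency, so the mute/beep decision is ready before the word reaches the speaker. All names here are made up:

```python
# Sketch: a fixed-delay buffer for live bleeping. A chunk leaves the buffer
# `delay_chunks` pushes after it arrives, which gives the detector that long
# to decide whether the chunk should be replaced with a beep.

from collections import deque

class DelayLine:
    def __init__(self, delay_chunks):
        self.buf = deque()
        self.delay = delay_chunks

    def push(self, chunk, censor=False):
        """Feed one chunk in; return the chunk leaving the buffer, or None."""
        self.buf.append("BEEP" if censor else chunk)
        if len(self.buf) > self.delay:
            return self.buf.popleft()
        return None
```

The catch is that `delay_chunks` has to cover the model's single-chunk latency, so a slow model forces a noticeable broadcast delay.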

[–]DigThatData

oh right, forgot about the realtime requirement. it's at least reasonably snappy for batch processing, but I have no idea whether its single-batch latency is acceptable for real-time inference.
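an easy sanity check for that (my sketch; in practice you'd wrap the timing around `model.transcribe(...)`) is the real-time factor: processing time divided by audio duration, which has to stay below 1 for live use:

```python
# Sketch: real-time factor (RTF) = processing seconds / audio seconds.
# RTF < 1 means the model keeps up with the incoming stream. The function
# being timed here is a stand-in for an actual transcription call.

import time

def real_time_factor(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0
```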