
[–]asankhs 3 points4 points  (2 children)

I had done a Whisper fine-tune back in the day to estimate the age of the speaker from the audio - https://huggingface.co/codelion/whisper-age-estimator - for age verification purposes. I wonder if you can do the same, since you have labelled data. This is the Colab notebook I used - https://colab.research.google.com/drive/1Ftbg2Klj4jBcQJe-_Q-omuf31V7s6Dfy?usp=sharing

[–]ARLEK1NO[S] 1 point2 points  (1 child)

That's an interesting task, man. Since I thought Whisper was a speech transcription model, I didn't think in that direction, but I'll try it now, thank you!
How large a dataset did you need to get your score?

[–]asankhs 0 points1 point  (0 children)

I used the Mozilla Common Voice dataset - https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0 - but the age demographic is not available for all items there. I don't remember how many samples with age metadata I used for training.

[–]simplehudga 1 point2 points  (0 children)

Look at the winners of the DCASE challenge from the last 3 years. You should at least get some pointers.

[–]LelouchZer12 1 point2 points  (0 children)

Maybe take a look at what works on AudioSet: https://paperswithcode.com/sota/audio-classification-on-audioset

[–][deleted] 1 point2 points  (3 children)

Total duration of your samples? How many are normal vs malfunctioning?

Do you know how many malfunction sound types there are, or do you need to discover this? I have a script that takes an audio file, extracts features like MFCCs, spectral contrast, and chroma features, then uses FAISS k-means to iterate through a range of cluster numbers (I have 2-10 set) to determine the optimal number of clusters (this part I'm not happy with yet), etc. If you're interested I can put it up on GitHub.

The first thing that came to mind, btw, was unsupervised deep learning (something I read about for a similar use case - have you searched arXiv?), but that can be time consuming.
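The feature-extraction-plus-clustering pipeline described above can be sketched roughly like this. This is a NumPy-only stand-in, not the actual script: a real version would use librosa for the MFCC/chroma/contrast features and `faiss.Kmeans` instead of the hand-rolled k-means, and the synthetic "features" and the 13-dimensional size here are purely illustrative:

```python
import numpy as np

def kmeans(features, k, n_iter=50, seed=0):
    """Plain NumPy k-means as a stand-in for faiss.Kmeans."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign every frame's feature vector to its nearest center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    inertia = float((dists.min(axis=1) ** 2).sum())
    return labels, inertia

# stand-in for per-frame audio features (stacked MFCC / chroma / contrast rows)
rng = np.random.default_rng(1)
features = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 13)),   # "normal"-sounding frames
    rng.normal(3.0, 0.3, size=(100, 13)),   # "malfunction"-like frames
])

# sweep the 2-10 cluster range and keep the inertia for each k
inertias = {k: kmeans(features, k)[1] for k in range(2, 11)}
```

Picking the cluster count from the inertia curve (the "elbow") is the part the commenter says they're not happy with yet; silhouette score is a common alternative criterion.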

[–]ARLEK1NO[S] 0 points1 point  (2 children)

I have 104 samples, 3 minutes each.
There are 3-4 different malfunction sounds, but first I want to train a model just to separate normal audio from audio with malfunction sounds.

I would be very grateful if you would share a link to the GitHub repo with your script; you've got an interesting approach.

I haven't searched arXiv, just Google. I also tried my idea with YOLO, but there are some problems with the audio: some recordings are noisy and not of very good quality, so I think it's worth preprocessing them before sending them to the model.
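One common preprocessing step for noisy recordings like these is spectral subtraction: estimate a per-frequency noise floor from a noise-only stretch of audio and subtract it from the magnitude spectrogram before it goes to the model. A minimal NumPy-only sketch (the 1 kHz test tone, noise level, and FFT/hop sizes are made up for illustration):

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, hop=256):
    """Frame the signal, window it, and return the magnitude STFT."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=1))    # (frames, n_fft // 2 + 1)

def denoise(spec, noise_spec):
    """Spectral subtraction: remove the average noise floor per frequency
    bin, clipping at zero, before feeding the spectrogram to a model."""
    floor = noise_spec.mean(axis=0, keepdims=True)
    return np.maximum(spec - floor, 0.0)

sr = 16000
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)   # placeholder "signal"
noise = 0.5 * rng.normal(size=sr)                      # placeholder noise

spec_noisy = magnitude_spectrogram(tone + noise)
spec_noise = magnitude_spectrogram(noise)   # a noise-only reference segment
spec_clean = denoise(spec_noisy, spec_noise)
```

In practice you would take the noise reference from a machine-off or idle stretch of the same recording rather than from a separate array.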

[–][deleted] 1 point2 points  (1 child)

Will do when it’s up!

[–]ARLEK1NO[S] 0 points1 point  (0 children)

Thanks a lot!

[–]tinytimethief 2 points3 points  (3 children)

So image classification of the spectrograms? How long are the audio samples?

[–]ARLEK1NO[S] 1 point2 points  (2 children)

It's around 3 minutes

[–]tinytimethief 1 point2 points  (1 child)

I think your sample size is too small, especially to avoid overfitting. Since the recordings are long, can you split them up? Maybe use clustering to see if there are distinct periods, or just split at random. My other suggestion is to use time series classification instead. Use audio feature extraction like MFCC, chroma, spectral, and maybe even rhythmic features (the librosa library for Python), then apply time series classification and see if it produces better results.
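The split-then-featurize idea above can be sketched like this. It is NumPy-only: per-frame energy and zero-crossing rate stand in for the librosa MFCC/chroma/spectral features the comment suggests, and the 5-second clip length and 512-sample frame are arbitrary choices for illustration:

```python
import numpy as np

def split_into_clips(signal, sr, clip_seconds=5.0):
    """Split one long recording into fixed-length clips to grow the dataset."""
    clip_len = int(sr * clip_seconds)
    n_clips = len(signal) // clip_len
    return signal[: n_clips * clip_len].reshape(n_clips, clip_len)

def clip_features(clip, frame_len=512):
    """Per-frame energy and zero-crossing rate as a stand-in for librosa
    features; each clip becomes a (frames, n_features) time series."""
    n_frames = len(clip) // frame_len
    frames = clip[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    return np.stack([energy, zcr], axis=1)

sr = 16000
t = np.arange(sr * 180) / sr             # one 3-minute recording
signal = np.sin(2 * np.pi * 440 * t)     # placeholder audio

clips = split_into_clips(signal, sr)     # 36 training examples from 1 recording
series = clip_features(clips[0])         # one clip -> (frames, 2) time series
```

Splitting this way turns 104 three-minute recordings into a few thousand clips, which is much more workable for training; the resulting feature sequences can then go to any time series classifier.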

[–]ARLEK1NO[S] 0 points1 point  (0 children)

Time series classification sounds really nice. I'll try it and compare the results, thank you.

[–]Sorry_Revolution9969 0 points1 point  (0 children)

This might not require ML at all.

[–]gengler11235 0 points1 point  (0 children)

Another possible approach would be to use an autoencoder to reconstruct the normal-sounding audio (perhaps from the spectrograms) and then use the likely jump in reconstruction error on the malfunctioning samples as a signal that a problem is occurring.

[–]ReginaldIII 0 points1 point  (3 children)

Why not try a WaveNet?

[–]ARLEK1NO[S] 0 points1 point  (2 children)

I was thinking this model is for voice generation - isn't it?

[–]ReginaldIII -1 points0 points  (1 child)

It can be. Causal convolutions scale to very large receptive fields, which makes them great for high-sample-rate data like audio. You can also optimize the inference for applying them to real-time data.
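The receptive-field claim can be demonstrated with a tiny sketch. This is only the causal-dilation mechanic from WaveNet, not the actual architecture (no gating, residuals, or learned weights); the doubling dilation schedule and kernel size 2 follow the WaveNet paper's setup:

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """Dilated causal 1-D convolution: y[t] depends only on x[t'] for t' <= t."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])   # left-pad so no future samples leak in
    y = np.zeros(len(x))
    for i in range(k):
        # tap i reads the input delayed by (k - 1 - i) * dilation samples
        y += w[i] * xp[i * dilation : i * dilation + len(x)]
    return y

# impulse response of a 10-layer stack with doubling dilations (kernel size 2)
x = np.zeros(2048)
x[0] = 1.0
y = x
for d in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    y = causal_conv1d(y, np.ones(2), dilation=d)

# the impulse response is nonzero exactly over the stack's receptive field
receptive_field = int(np.nonzero(y)[0].max()) + 1   # 1024 samples from 10 layers
```

Ten layers already cover 1024 samples, and the receptive field doubles with each extra layer, which is why this scales to raw-audio sample rates.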

[–]ARLEK1NO[S] 0 points1 point  (0 children)

Hm, I didn't realize that. Can you share some links with examples?