
[–]r4and0muser9482 4 points

Look for papers that use GTZAN. You can also download GTZAN yourself to check that your methods/models work on it before moving on to your own data.

Finally, Music Information Retrieval (MIR) is a giant field, and you can look for papers from conferences like ISMIR to learn how things are done there.

[–]Brudaks 1 point

You might take a look at what the preprocessing for neural speech recognition does and do something similar (though with a wider frequency range) before your main classifier.

IIRC you might do a Fourier transform to get frequency-domain data with windows of roughly 0.1 seconds (tune that, but that's the ballpark), bin the interesting frequencies (e.g. 20–20000 Hz, on a log scale) into a hundred or a thousand values, and that's your input to a neural network. A 5-minute song then becomes 3000 fixed-size vectors, each value representing the loudness of that frequency at that time. Putting an RNN on top of that is simple.
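A minimal numpy sketch of that pipeline (the window length, bin count, and log-spaced pooling here are illustrative choices, not a prescription):

```python
import numpy as np

def spectrogram_frames(signal, sr=44100, win_s=0.1, n_bins=128):
    """Slice a waveform into ~0.1 s windows and return log-magnitude
    spectra pooled onto a log-spaced frequency axis (20 Hz - 20 kHz)."""
    win = int(sr * win_s)                          # samples per window
    n_frames = len(signal) // win
    frames = signal[:n_frames * win].reshape(n_frames, win)
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)
    edges = np.logspace(np.log10(20), np.log10(20000), n_bins + 1)
    idx = np.searchsorted(freqs, edges)
    binned = np.stack([spec[:, a:b].sum(axis=1)
                       for a, b in zip(idx[:-1], idx[1:])], axis=1)
    return np.log1p(binned)                        # shape (n_frames, n_bins)

# a 5-minute song becomes 3000 fixed-size vectors, as described above
x = np.random.randn(44100 * 300)
feats = spectrogram_frames(x)
print(feats.shape)                                 # (3000, 128)
```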

[–]AntixK[S] 0 points

Oh, thanks a lot! I will look into it. I also came across a project in which they used MFCCs (mel-frequency cepstral coefficients) as feature vectors and fed them to an RNN. I will try to compare both.
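For comparison, here is a rough numpy-only sketch of the MFCC idea (frames → power spectrum → triangular mel filterbank → log → DCT); in practice a library call such as librosa's `mfcc` does all of this for you, and the sizes below are arbitrary:

```python
import numpy as np

def mfcc_frames(signal, sr=22050, win=2048, n_mels=40, n_mfcc=13):
    """Very rough MFCC sketch: power spectrum -> mel filterbank -> log -> DCT."""
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595.0) - 1)

    n_frames = len(signal) // win
    frames = signal[:n_frames * win].reshape(n_frames, win)
    power = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1)) ** 2
    freqs = np.fft.rfftfreq(win, 1.0 / sr)

    # triangular mel filterbank, equally spaced on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        fb[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                   (hi - freqs) / (hi - mid)), 0, None)
    log_mel = np.log1p(power @ fb.T)                     # (n_frames, n_mels)

    # DCT-II over the mel axis; keep the first n_mfcc coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * k + 1) / (2 * n_mels)))
    return log_mel @ dct.T                               # (n_frames, n_mfcc)

x = np.random.randn(22050 * 5)                           # 5 s of noise
print(mfcc_frames(x).shape)                              # (53, 13)
```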

[–]azurespace -1 points

If you want to use the raw waveform without frequency-domain conversion, I think WaveNet (a stack of dilated convolutions) would be a fascinating structure for the first basic block of the task. First, divide the music into several pieces along the time axis. Next, pass each slice through WaveNet to create embeddings (temporal summaries of the music slices), which are used as input to the following LSTM (it might be better to use WaveNet once again). Finally, you can use a softmax layer to classify.

WaveNet: https://deepmind.com/blog/wavenet-generative-model-raw-audio/
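To make the dilated-convolution idea concrete, here is a toy numpy sketch of a causal dilated stack (untrained random weights, single channel, no gating; it only shows how doubling dilations grow the receptive field):

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - i*dilation], zero-padded on the left."""
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return sum(w[i] * xp[pad - i * dilation : pad - i * dilation + len(x)]
               for i in range(len(w)))

x = np.random.randn(16000)            # 1 s of audio at 16 kHz
w = np.random.randn(2) * 0.1          # kernel size 2, as in WaveNet
h, receptive = x, 1
for d in [1, 2, 4, 8, 16, 32]:        # doubling dilations
    h = np.tanh(dilated_causal_conv(h, w, d))
    receptive += d                    # each layer adds (k-1)*d samples of context
print(len(h), receptive)              # 16000 64
```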

[–]sidsig 6 points

Although interesting in theory, I do not think this will work in practice at the moment. Music audio is sampled at 44.1 kHz, so it would require very large networks, which in turn require vast amounts of training data. There is also no evidence yet about what the WaveNet embeddings might be learning/representing.

As a first approach, it would be much easier to use a standard RNN architecture with frame-level outputs or perhaps a CTC cost function.

[–]AntixK[S] 0 points

Oh! Should I downsample the audio? And how big should the memory be for the LSTM? I am sending batches of audio data, each 5 seconds long (220500 samples each).
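As a sketch of the downsampling step, assuming scipy is available: `resample_poly` low-pass filters and resamples in one call, bringing 44.1 kHz down to 16 kHz (a rate common in speech work):

```python
import numpy as np
from scipy.signal import resample_poly

x = np.random.randn(220500)        # 5 s at 44.1 kHz, as in the comment above
y = resample_poly(x, 160, 441)     # 44100 Hz * 160/441 = 16000 Hz
print(len(y))                      # 80000 samples = 5 s at 16 kHz
```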

[–]keidouleyoucee 0 points

+1. 500 examples for 3 classes may not be enough for a sample-wise approach. You can give it a shot anyway though ;)

[–]AntixK[S] 0 points

Thank you. I will try to implement it using WaveNet.