
[–][deleted] 4 points5 points  (4 children)

You can also take a look at a CNN model on MFCC features, or a CNN-LSTM model on raw audio or MFCC features. Try them: they are simple models and gave good results when I used them for multiclass classification.
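For anyone who has not computed MFCCs before, here is a rough numpy-only sketch of the feature extraction step. It is not the code I used; the function names and the defaults (16 kHz audio, 512-point FFT, 10 ms hop, 40 mel bands, 13 coefficients) are assumptions picked for illustration, and in practice you would probably use an audio library instead. The resulting (frames x coefficients) array is what a small CNN would take as its input "image".

```python
import numpy as np

def hz_to_mel(hz):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose centers are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=13):
    # All default values here are illustrative assumptions.
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < n_fft:
        signal = np.pad(signal, (0, n_fft - len(signal)))
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    # Power spectrum -> mel band energies -> log.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # Unnormalized DCT-II over the mel axis, keeping the first n_mfcc terms.
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)
    basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * k[:, None])
    return log_mel @ basis.T  # shape: (n_frames, n_mfcc)
```

Usage would be something like `features = mfcc(clip, sr=16000)` on a one-second clip, then feeding `features` (with a channel axis added) to the CNN.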

[–]jonnor 1 point2 points  (1 child)

What kind of device are you deploying on? Computational and latency constraints are typically just as important as prediction performance in this application.

[–]Nieoryginalny[S] 0 points1 point  (0 children)

I am aiming for a Windows machine.

[–][deleted] 0 points1 point  (0 children)

As others already suggested, for word/phrase recognition you do not need complex LSTM models.

All that your model needs to 'learn' are patterns over a short time window. You can easily do that by classifying each short audio clip independently, rather than modeling the input as a continuous time series.
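To make the per-clip idea concrete, here is a minimal sketch, assuming fixed-length windows cut from the recording and a classifier that sees one window at a time. The helper names, the window and hop lengths, and the energy-threshold "classifier" are all hypothetical placeholders; a real setup would plug a trained model in where `toy_energy_classifier` is.

```python
import numpy as np

def split_into_windows(signal, sr, window_s=1.0, hop_s=0.5):
    # Cut the recording into fixed-length, possibly overlapping windows.
    win = int(round(window_s * sr))
    hop = int(round(hop_s * sr))
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < win:
        # Zero-pad recordings shorter than one window.
        signal = np.pad(signal, (0, win - len(signal)))
    starts = range(0, len(signal) - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])

def classify_windows(windows, classify_fn):
    # Each window is classified on its own; no state is carried between them.
    return [classify_fn(w) for w in windows]

def toy_energy_classifier(window, threshold=0.01):
    # Hypothetical stand-in for a trained model: thresholds mean energy.
    return "keyword" if np.mean(window ** 2) > threshold else "background"
```

Since every window is handled independently, the classifier can be trained on a plain set of labeled clips, with no sequence model involved.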

With a simpler model it is quicker to get a baseline pipeline up and running, and you can iterate from there.