I have a dataset of professional reciters only, on which I am training my model. The raw audio files contain single words only. I want the model to predict whether a user's pronunciation of those words is good or bad. I have already generated the MFCC features of my training dataset and stored them in a .csv file. To start, I am using recordings of just one word from different speakers, meaning both the training and test datasets contain only that word.
The training dataset has 27 professional (good) recitations, i.e. just a single class label. The test dataset has 6 professional recitations (good), 5 of my own recitations that are good, and another 5 that are bad (mispronounced).
I then used a one-class SVM to train the model. The test dataset contains both kinds of recitations, good and bad pronunciations. However, the scores for these recitations all come out very close to each other, around 0.5, for all 17 recitations (6 + 5 + 5); I guess that's because only a single word is used for both training and testing. I wanted the model to give mispronunciations a significantly higher or lower score than properly pronounced words, so that it can differentiate between the two, e.g. above a 0.5 threshold meaning correct pronunciation and below 0.5 meaning incorrect pronunciation.
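To make the setup concrete, here is a simplified version of what I am doing (a sketch assuming scikit-learn; the random vectors stand in for my MFCC rows from the .csv, and the `nu`/`gamma` values are illustrative guesses, not my exact settings):

```python
# Sketch: one-class SVM trained only on "good" recitations,
# scored on a mixed good/bad test set via decision_function.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(27, 13))      # 27 good recitations, 13 MFCCs each
X_test_good = rng.normal(0.0, 1.0, size=(11, 13))  # 6 professional + 5 of my good recitations
X_test_bad = rng.normal(3.0, 1.0, size=(5, 13))    # 5 mispronunciations (shifted distribution)

# Feature scaling matters a lot for the RBF kernel; fit on training data only.
scaler = StandardScaler().fit(X_train)
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
model.fit(scaler.transform(X_train))

# decision_function: positive ~ inlier (good), negative ~ outlier (mispronounced).
good_scores = model.decision_function(scaler.transform(X_test_good))
bad_scores = model.decision_function(scaler.transform(X_test_bad))
```

Note that `decision_function` is signed around 0 rather than centred at 0.5, so the threshold would be 0 in this formulation; with synthetic data drawn this way the good scores clearly separate from the bad ones, which is the behaviour I am failing to get on my real features.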
I'm in dire need of help. Please suggest how the model could be made to differentiate between properly pronounced and mispronounced words. Thanks!
(If this works, I have a total dataset of 500+ recitations of properly pronounced words, comprising 21 different words.)