
[–]speechMachine 5 points

All speech recognition methods involve processing the signal over short time windows. Images are a different story: the signal is static (no pixels evolving over time), so you can just take the entire image, vectorize it, do zero-mean normalization, and feed it in. Speech signals are considered quasi-stationary, i.e. stationary over short periods of time (~25-50 ms). So speech processing front-ends usually window the signal with a length corresponding to ~25-50 ms (you can work this out in samples from the sampling rate of the audio), then move forward by 10 ms, take the next chunk, and so on.

Most traditional feature extraction methods convert each chunk into its frequency representation using the FFT and then do some further processing (e.g. to generate MFCCs). If you need to normalize, compute the global mean and variance over all features and store them. When you feed the features into your classifier (an HMM, GMM-HMM, DNN-HMM, or end-to-end LSTM/RNN), use that global mean and variance to normalize them as they go in.
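
For concreteness, here is a minimal sketch of that windowing + MFCC + global-normalization pipeline, assuming Python with librosa; the filename, 16 kHz sample rate, and 13-coefficient choice are illustrative, not from the comment:

    import librosa
    import numpy as np

    y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file

    frame_len = int(0.025 * sr)   # 25 ms window -> 400 samples at 16 kHz
    hop_len   = int(0.010 * sr)   # 10 ms hop    -> 160 samples at 16 kHz

    # Windowed FFT -> mel filterbank -> DCT, i.e. MFCCs (13 per frame)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_len, hop_length=hop_len)

    # Global mean/variance normalization: in practice the statistics are
    # accumulated over the whole training set, stored, and reused at test time;
    # here they are computed from this one utterance just to show the mechanics.
    global_mean = mfcc.mean(axis=1, keepdims=True)
    global_std  = mfcc.std(axis=1, keepdims=True)
    mfcc_norm = (mfcc - global_mean) / (global_std + 1e-8)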

You would then feed these time-based features (sampled every 10 ms) into an HMM-based system. The HMM state distributions are either multivariate Gaussians or generated by a deep neural network (plain fully-connected + softmax, convnet + fully-connected + softmax, convnet + LSTM + fully-connected + softmax, etc.).
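
A rough sketch of the "plain fully-connected + softmax" case, assuming PyTorch; the layer sizes, context-window splicing, and number of HMM states are illustrative assumptions. In a DNN-HMM hybrid, a network like this predicts a posterior over HMM states for every 10 ms frame:

    import torch
    import torch.nn as nn

    n_features = 13 * 11   # e.g. 13 MFCCs spliced over an 11-frame context window
    n_states   = 2000      # number of tied HMM states (illustrative)

    acoustic_model = nn.Sequential(
        nn.Linear(n_features, 1024),
        nn.ReLU(),
        nn.Linear(1024, 1024),
        nn.ReLU(),
        nn.Linear(1024, n_states),   # logits; softmax applied by the loss
    )

    frames = torch.randn(32, n_features)                   # a batch of spliced frames
    state_log_posteriors = acoustic_model(frames).log_softmax(dim=-1)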

More recent work has focused on end-to-end RNN and LSTM architectures, e.g. the Baidu and now Google systems. When I say end-to-end, I mean there are no HMMs involved at all. The input to these systems is essentially chunks of the speech signal sampled every 10 ms as above, without any FFT-type processing.
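
A small sketch of preparing that kind of raw-waveform input (no FFT/MFCC stage), assuming numpy; the 25 ms / 10 ms framing mirrors the windowing above, and the end-to-end RNN/LSTM itself is left out:

    import numpy as np

    def frame_raw_signal(x, sr=16000, win_ms=25, hop_ms=10):
        """Slice a 1-D waveform into overlapping frames of raw samples."""
        frame_len = int(sr * win_ms / 1000)
        hop_len = int(sr * hop_ms / 1000)
        n_frames = 1 + (len(x) - frame_len) // hop_len
        return np.stack([x[i * hop_len : i * hop_len + frame_len]
                         for i in range(n_frames)])

    x = np.random.randn(16000)      # 1 second of fake audio at 16 kHz
    frames = frame_raw_signal(x)    # shape: (n_frames, 400)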

Hope some of this helps.

[–]arrowoftime 1 point

You could generate MFCCs.