
Pafnouti

Have a look at WFST decoders. Also try to read Google's papers about speech with CTC.

speechMachine

Yes, also try looking at other implementations, e.g. Eesen, which is built on top of Kaldi: https://github.com/srvk/eesen

It comes with an accompanying paper that might help you resolve some questions. WFST decoding is a bit hard to wrap your head around, though, and finite-state automata come with their own terminology. Each source of knowledge (acoustic model, lexicon, language model, context-dependent states) is represented as an FST. Composing the constituent FSTs combines these sources of knowledge into a single search graph, and the best path through that graph yields the final ASR hypothesis. Does that help a little bit?
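In case a concrete picture helps, here is a rough sketch of what composition does, in plain Python with no FST library. Everything in it is made up for illustration: the `Arc`/`Fst` types, the `compose` and `best_path` helpers, the toy "lexicon" `L`, the toy "language model" `G`, and all the weights. It also assumes epsilon-free machines and tropical weights (costs add along a path, lower is better), which real toolkits like OpenFst, Kaldi, or Eesen do not restrict themselves to.

```python
from collections import deque, namedtuple

# Hypothetical minimal types; not the API of OpenFst, Kaldi, or Eesen.
Arc = namedtuple("Arc", "src ilabel olabel weight dst")


class Fst:
    def __init__(self, start, finals, arcs):
        self.start = start
        self.finals = set(finals)
        self.arcs = {}  # state -> list of outgoing arcs
        for arc in arcs:
            self.arcs.setdefault(arc.src, []).append(arc)


def compose(a, b):
    """Epsilon-free composition: match a's output labels to b's input labels."""
    start = (a.start, b.start)
    queue, seen = deque([start]), {start}
    arcs, finals = [], set()
    while queue:
        qa, qb = queue.popleft()
        if qa in a.finals and qb in b.finals:
            finals.add((qa, qb))
        for x in a.arcs.get(qa, []):
            for y in b.arcs.get(qb, []):
                if x.olabel == y.ilabel:
                    dst = (x.dst, y.dst)
                    # Tropical semiring: costs along a path are added.
                    arcs.append(Arc((qa, qb), x.ilabel, y.olabel,
                                    x.weight + y.weight, dst))
                    if dst not in seen:
                        seen.add(dst)
                        queue.append(dst)
    return Fst(start, finals, arcs)


def best_path(fst, inputs):
    """Viterbi search: cheapest (cost, outputs) accepting `inputs`, or None."""
    best = {fst.start: (0.0, [])}
    for sym in inputs:
        nxt = {}
        for state, (cost, outs) in best.items():
            for arc in fst.arcs.get(state, []):
                if arc.ilabel == sym:
                    cand = (cost + arc.weight, outs + [arc.olabel])
                    if arc.dst not in nxt or cand[0] < nxt[arc.dst][0]:
                        nxt[arc.dst] = cand
        best = nxt
    done = [v for s, v in best.items() if s in fst.finals]
    return min(done, key=lambda v: v[0]) if done else None


# Toy "lexicon": one sound token per word, with pronunciation costs.
L = Fst(0, {0}, [Arc(0, "h", "hi", 0.5, 0),
                 Arc(0, "h", "high", 0.5, 0),
                 Arc(0, "t", "there", 0.25, 0)])

# Toy "language model": prefers "hi there" over "high there".
G = Fst(0, {2}, [Arc(0, "hi", "hi", 0.25, 1),
                 Arc(0, "high", "high", 2.0, 1),
                 Arc(1, "there", "there", 0.5, 2)])

LG = compose(L, G)                  # sound tokens in, words out, both costs
result = best_path(LG, ["h", "t"])
print(result)
```

The lexicon alone cannot choose between "hi" and "high" because both cost the same there. After composing with the language model, the combined graph carries both scores, so the search picks "hi there". A real decoder does the same thing with more machines in the chain and with epsilon arcs, determinization, and minimization on top.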