Precise machine understanding of human speech has been a major challenge for many years. Automatic Speech Recognition (ASR) is decades old, and although the technology has not yet reached the point where machines understand all speech, it is used on a regular basis in many applications and services. To advance research, it is therefore important to identify significant research directions, specifically those that have not been pursued or funded in the past. The performance of ASR systems, traditionally built upon a Hidden Markov Model (HMM), has improved due to the application of Deep Neural Networks (DNNs). Despite this progress, building an ASR system remains a challenging task requiring multiple resources and training stages. The use of DNNs for ASR has progressed from serving as a single component in a pipeline to building a system based mainly on such a network.
This paper provides a literature survey of state-of-the-art research on two major models, namely the Deep Neural Network - Hidden Markov Model (DNN-HMM) and Recurrent Neural Networks trained with Connectionist Temporal Classification (RNN-CTC). It also describes the differences between these two models at the architectural level.
This work is licensed under a Creative Commons Attribution 4.0 International License.