Large Vocabulary Continuous Speech Recognition for the Nepali Language Using CNN and Transformer

December 2022 - October 2023



Description

<p>Despite the availability of various algorithms for speech recognition, their performance on low-resource languages like Nepali remains suboptimal. The Transformer is a state-of-the-art deep learning architecture in NLP that uses self-attention to model temporal context. Although it has shown promising results for English ASR systems, its performance on Nepali has not been extensively explored. This work implements an end-to-end CNN-Transformer ASR system to explore the Transformer's potential for the Nepali language. The study used around 159K samples extracted from OpenSLR, complemented with original recordings of sentences covering different tenses, grammatical persons, inflections, direct and indirect speech, levels of honorifics, etc., to address the grammatical structures of the Nepali language. The end-to-end CNN-Transformer architecture was trained with varying dataset sizes, numbers of epochs, and hyperparameter settings. The best resulting model achieved a character error rate (CER) of 11.14%.</p>
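<p>As a rough illustration of the self-attention operation the Transformer relies on to model temporal context, the sketch below implements single-head scaled dot-product attention over a sequence of acoustic frame embeddings in plain NumPy. All names and dimensions here are illustrative assumptions, not the project's actual model code.</p>

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: (T, d) sequence of frame embeddings; w_q, w_k, w_v: (d, d) projections.
    Each output frame is a weighted mix of all frames, so distant temporal
    context can influence every position.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Similarity of every query frame to every key frame, scaled by sqrt(d)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over the key axis: each row becomes a probability distribution
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 6, 8                      # 6 frames, 8-dim embeddings (toy sizes)
x = rng.normal(size=(T, d))
out = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
```

<p>In the full architecture, a CNN front end first downsamples the spectrogram before stacked multi-head attention layers of this form are applied.</p>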
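<p>The reported 11.14% figure is a character error rate: the character-level Levenshtein (edit) distance between the reference transcript and the hypothesis, divided by the reference length. A minimal sketch of that computation, using a standard one-row dynamic-programming edit distance (not the project's actual evaluation code):</p>

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance(ref, hyp) / len(ref)."""
    m, n = len(ref), len(hyp)
    if m == 0:
        return 0.0 if n == 0 else 1.0
    d = list(range(n + 1))           # DP row: distances to prefixes of hyp
    for i in range(1, m + 1):
        prev, d[0] = d[0], i         # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                        # deletion
                       d[j - 1] + 1,                    # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev = cur
    return d[n] / m

print(cer("hello", "hxllo"))  # one substitution over five characters -> 0.2
```

<p>On Devanagari text this operates per Unicode code point, so combining vowel signs count as separate characters unless the strings are normalized first.</p>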