Amharic Speech Recognition using Power Spectrum Density Estimation and Discrete Wavelet Transform

Publisher

ASTU

Abstract

This thesis explores the feasibility of developing an Amharic automatic speech recognition (ASR) system. To this end, the literature on speech recognition, its applications, and Amharic speech recognition was reviewed. To develop and test the system, speech data were recorded, collected, and transcribed to text. Amharic phonetics and phonology differ widely from those of English. Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. Rudimentary speech recognition software has a limited vocabulary of words and phrases, and it may identify these only if they are spoken very clearly. English remains one of the most widely spoken languages in the world, and ASR research for English is extensive. However, there is little research on speech technology for Ethiopian languages in general and Amharic in particular. Existing work on Amharic speech recognition is domain specific, for example speech translation for tourism and word recognition. The aim of this work is to improve continuous speech recognition for Amharic by using a parametric method of power spectral density (PSD) estimation together with the discrete wavelet transform (DWT). A speech dataset for Amharic is a fundamental requirement for any development in Amharic ASR. Power spectral estimation obtains an approximate estimate of the power spectral density of a given real random process. The DWT is applied to extract the approximation coefficients. Power spectral density estimation is then applied to estimate the power spectrum.
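The feature-extraction idea described above (DWT approximation coefficients followed by a parametric PSD estimate) can be illustrated with a minimal sketch. The abstract does not name the wavelet family, the decomposition level, the parametric estimator, or the model order, so everything specific here is an assumption chosen for illustration: a single-level Haar wavelet, a Yule-Walker autoregressive (AR) model of order 2, PSD estimation applied directly to the approximation coefficients, and a synthetic tone in place of real Amharic speech. The function names are hypothetical and are not taken from the thesis.

```python
import cmath
import math
import random


def haar_approximation(signal):
    """One level of the Haar DWT: return only the approximation coefficients."""
    pairs = len(signal) // 2  # a trailing odd sample is dropped
    return [(signal[2 * k] + signal[2 * k + 1]) / math.sqrt(2.0) for k in range(pairs)]


def autocorrelation(x, max_lag):
    """Biased autocorrelation estimate r[0..max_lag] of a zero-mean sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / n for lag in range(max_lag + 1)]


def yule_walker(x, order):
    """Fit an AR(order) model with the Levinson-Durbin recursion.

    Returns (a, noise_var) with a[0] == 1, for the model
    x[n] + a[1] x[n-1] + ... + a[order] x[n-order] = e[n].
    """
    r = autocorrelation(x, order)
    a = [1.0]
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a + [0.0]
        a = [prev[j] + k * prev[i - j] for j in range(i + 1)]
        err *= 1.0 - k * k  # prediction-error variance shrinks at each order
    return a, err


def ar_psd(a, noise_var, n_freqs=257):
    """Evaluate the AR power spectrum on normalised frequencies in [0, 0.5]."""
    freqs = [0.5 * i / (n_freqs - 1) for i in range(n_freqs)]
    psd = []
    for f in freqs:
        denom = sum(c * cmath.exp(-2j * math.pi * f * k) for k, c in enumerate(a))
        psd.append(noise_var / abs(denom) ** 2)
    return freqs, psd


# Synthetic stand-in for a speech frame (assumption): a noisy tone at 0.1 cycles/sample.
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 0.1 * n) + 0.05 * rng.gauss(0.0, 1.0) for n in range(1024)]

approx = haar_approximation(tone)            # half as many samples as the input
mean = sum(approx) / len(approx)
centred = [v - mean for v in approx]         # remove the mean before AR fitting
coeffs, noise_var = yule_walker(centred, order=2)
freqs, psd = ar_psd(coeffs, noise_var)
peak_freq = freqs[psd.index(max(psd))]
print(f"AR(2) spectral peak at {peak_freq:.3f} cycles/sample")
```

Because the Haar step halves the sampling rate, the tone at 0.1 cycles/sample in the input should appear close to 0.2 cycles/sample in the spectrum of the approximation coefficients. In a real pipeline the PSD values per frame would presumably serve as input features for the recogniser, but the abstract does not describe that step.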
The learning process uses a recurrent neural network (RNN), a class of artificial neural networks used in deep learning that processes a sequence of inputs and retains its state while processing the next sequence of inputs. For training and evaluation, we use 9,340 audio files with their transcriptions, amounting to about 20 hours of audio or around 5 GB. At the beginning of training the word error rate (WER) was high (19); by the end of training it had decreased to 7.02. The training loss also decreased sharply from the first epoch to the last, indicating that further training reduced the word error rate.
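For readers unfamiliar with the metric, word error rate is conventionally defined as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words in the recogniser output and N is the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It is a generic illustration and not the evaluation code used in the thesis; the function name and the example transcripts are hypothetical. The abstract does not state the scale of its reported values (19 and 7.02), which may be percentages, whereas this function returns a fraction.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder transcripts (assumption): one substitution and one deletion.
example = word_error_rate("one two three four", "one too three")
print(f"WER = {example:.2f}")
```

Multiplying the result by 100 gives the percentage form that is commonly reported.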
