Text to Speech Synthesizer for Afaan Oromo Using Deep Neural Network

dc.contributor.advisor: Mesfin Abebe (PhD)
dc.contributor.author: Chala, Sembeta
dc.date.accessioned: 2025-12-17T10:54:10Z
dc.date.issued: 2021-09
dc.description.abstract: In a world where technology is advancing at an exponential rate, speech synthesis is already part of everyday life. Text-to-speech (TTS) synthesis systems are concerned with the artificial generation of natural and intelligible human speech from given text transcriptions. A text-to-speech system serves as an assistive tool for people with visual impairments and reading disabilities, allowing them to listen to written content such as web pages, books, newspaper articles, and textbooks on a variety of devices. Despite its potential applications, text-to-speech has long been a language-dependent discipline, and most efforts have concentrated on resource-rich languages, particularly English. Afaan Oromo is an under-resourced language with a shortage of existing language resources for developing a text-to-speech system. In this study, we collected and prepared a speech dataset of 8,076 text-audio pairs from legitimate sources to develop a text-to-speech synthesizer for Afaan Oromo. Apart from standard words and names, the proposed model handles non-standard words including numbers, abbreviations, currency, and acronyms. The study focuses on the deep neural network technique, a machine learning approach built from several layers of neural networks, chosen for this work because it can map complex linguistic features to acoustic feature parameters. Several experiments were conducted to determine the better-performing model between Tacotron 2, a recurrent neural network model, and Deep Voice 3, a fully convolutional neural network model. Both objective and subjective evaluations were carried out to assess the performance of the models: the attention error test for objective evaluation, and the mean opinion score (MOS) test for subjective evaluation.
In the objective evaluation, Tacotron 2 made only 2 attention errors while Deep Voice 3 made 16, out of 148 words in the evaluation sentence list. In addition, we obtained MOS results of 4.32 and 4.21 out of five for Tacotron 2, and 3.28 and 3.02 out of five for Deep Voice 3, in terms of intelligibility and naturalness respectively. From these evaluation results, we conclude that Tacotron 2, based on a recurrent neural network, provides an encouraging result, making our model suitable for applications that need a text-to-speech service, such as recommendation systems, telephone inquiry services, and smart education.
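The MOS figures reported above come from a standard subjective listening test. As a minimal illustration (not the thesis's actual evaluation code or data), a mean opinion score is simply the arithmetic mean of listener ratings on a 1-5 scale; the ratings below are hypothetical examples:

```python
def mean_opinion_score(ratings):
    """Compute the Mean Opinion Score: the average of listener
    ratings, each on the standard 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("no ratings given")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    return sum(ratings) / len(ratings)

# Example: five hypothetical listeners rate one synthesized utterance.
scores = [5, 4, 4, 5, 4]
print(round(mean_opinion_score(scores), 2))  # → 4.4
```

In practice, per-utterance scores like this are averaged again over the whole evaluation sentence list and over all listeners to produce a single system-level MOS, such as the 4.32 intelligibility score reported for Tacotron 2.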
dc.description.sponsorship: ASTU
dc.identifier.uri: http://10.240.1.28:4000/handle/123456789/1548
dc.language.iso: en_US
dc.publisher: ASTU
dc.subject: text-to-speech, Tacotron 2, Deep Voice 3, Afaan Oromo, Mean Opinion Score, Speech Processing, Deep Neural Network
dc.title: Text to Speech Synthesizer for Afaan Oromo Using Deep Neural Network
dc.type: Thesis

Files

Original bundle

Name: Chala Sembeta.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Plain Text

Collections