Text to Speech Synthesizer for Afaan Oromo Using Deep Neural Network

dc.contributor.advisor: Mesfin Abebe (PhD)
dc.contributor.author: Chala, Sembeta
dc.date.accessioned: 2025-12-17T10:54:10Z
dc.date.issued: 2021-09
dc.description.abstract: In a world where technology is advancing at an exponential rate, speech synthesis is already part of everyday life. Text-to-speech (TTS) synthesis systems are concerned with the artificial generation of natural and intelligible human speech from given text transcriptions. A text-to-speech system serves as an assistive tool for people with visual impairments and reading disabilities, allowing them to listen to written content such as web pages, books, newspaper articles, and textbooks on a variety of devices. Despite its potential applications, text-to-speech has long been a language-dependent discipline, and most efforts have concentrated on resource-rich languages, particularly English. Afaan Oromo is an under-resourced language with a shortage of existing language resources for developing a text-to-speech system. In this study, we collected and prepared a speech dataset of 8,076 text-audio pairs from legitimate sources to develop a text-to-speech synthesizer for Afaan Oromo. Apart from standard words and names, the proposed model handles non-standard words including numbers, abbreviations, currency, and acronyms. The study focuses on the deep neural network technique, a machine learning approach built from several layers of neural networks, chosen for this work because it can map complex linguistic features to acoustic feature parameters. Several experiments were conducted to determine the better-performing model between Tacotron 2, a recurrent neural network model, and Deep Voice 3, a fully convolutional neural network model. Both objective and subjective evaluations were carried out to assess the performance of the models: the attention error test for objective evaluation, and the mean opinion score (MOS) test for subjective evaluation.
In the objective evaluation, Tacotron 2 made only 2 attention errors while Deep Voice 3 made 16, out of 148 words in the evaluation sentence list. In addition, we obtained MOS results of 4.32 and 4.21 out of five for Tacotron 2, and 3.28 and 3.02 out of five for Deep Voice 3, in terms of intelligibility and naturalness respectively. From these evaluation results, we conclude that Tacotron 2, based on a recurrent neural network, provides an encouraging result, making our model suitable for applications that need a text-to-speech service, such as recommendation systems, telephone inquiry services, and smart education.
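The MOS figures reported above come from a standard subjective listening test. As a minimal illustration (not the thesis's actual evaluation code or data), a mean opinion score is simply the arithmetic mean of listener ratings on a 1-5 scale; the ratings below are hypothetical examples:

```python
def mean_opinion_score(ratings):
    """Compute the Mean Opinion Score: the average of listener
    ratings, each on the standard 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("no ratings given")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    return sum(ratings) / len(ratings)

# Example: five hypothetical listeners rate one synthesized utterance.
scores = [5, 4, 4, 5, 4]
print(round(mean_opinion_score(scores), 2))  # → 4.4
```

In practice, per-utterance scores like this are averaged again over the whole evaluation sentence list and over all listeners to produce a single system-level MOS, such as the 4.32 intelligibility score reported for Tacotron 2.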
dc.description.sponsorship: ASTU
dc.identifier.uri: http://10.240.1.28:4000/handle/123456789/1548
dc.language.iso: en_US
dc.publisher: ASTU
dc.subject: text-to-speech, Tacotron 2, Deep Voice 3, Afaan Oromo, Mean Opinion Score, Speech Processing, Deep Neural Network
dc.title: Text to Speech Synthesizer for Afaan Oromo Using Deep Neural Network
dc.type: Thesis

Files

Original bundle

Name: Chala Sembeta.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Plain Text

Collections