Text to Speech Synthesizer for Afaan Oromo Using Deep Neural Network
Publisher: ASTU
Abstract
In a world where technology is evolving at an exponential rate, speech synthesis is already part of our everyday lives. Text-to-speech (TTS) synthesis systems are concerned with the artificial generation of a natural and intelligible human voice from given text transcriptions. A text-to-speech system serves as an assistive tool for people with visual impairments and reading disabilities, allowing them to listen to written material such as webpages, books, newspaper articles, and textbooks on a variety of devices. Despite their potential applications, text-to-speech systems have remained a language-dependent discipline, and most efforts have focused on well-resourced languages, particularly English. Afaan Oromo is an under-resourced language with few existing language resources for developing a text-to-speech system. In this study,
we have collected and prepared a speech dataset containing 8076 text and audio pairs from
legitimate sources to develop a text-to-speech synthesizer for Afaan Oromo. Apart from standard words and names, the proposed model handles non-standard words, including numbers, abbreviations, currency, and acronyms. The study focuses on the deep neural network technique, a machine learning approach implemented as several layers of neural networks. Deep neural networks were chosen for this work because they can map complex linguistic features to acoustic feature parameters. Several experiments were conducted to determine the best-performing model between Tacotron 2, a recurrent neural network model, and Deep Voice 3, a fully convolutional neural network model. Objective and subjective evaluations were carried out to assess the performance of the models: an attention error test for the objective evaluation and a mean opinion score (MOS) test for the subjective evaluation. In the objective evaluation, Tacotron 2 made only 2 attention errors while Deep Voice 3 made 16, out of 148 words in the evaluation sentence list. In addition, we obtained MOS results of 4.32 and 4.21 out of five for Tacotron 2, and 3.28 and 3.02 out of five for Deep Voice 3, in terms of intelligibility and naturalness respectively. From these evaluation results, we conclude that Tacotron 2, based on a recurrent neural network, provides an encouraging result, which makes our model appropriate for applications that need a text-to-speech service, such as recommendation systems, telephone inquiry services, and smart education.
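The two evaluation metrics above are simple to compute: a MOS is the arithmetic mean of listener ratings on a 1–5 scale, and the attention error figures can be expressed as a rate over the 148-word evaluation list. A minimal sketch of both computations follows; the listener ratings in the usage lines are hypothetical placeholders, not the study's actual data, while the error counts (2 and 16 out of 148) come from the abstract.

```python
def mean_opinion_score(ratings):
    """Average listener ratings on a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)

def attention_error_rate(errors, total_words):
    """Fraction of evaluation words on which the model made an attention error."""
    return errors / total_words

# Hypothetical intelligibility ratings from five listeners (illustration only).
scores = [5, 4, 5, 4, 4]
print(round(mean_opinion_score(scores), 2))

# Error counts reported in the study, over the 148-word evaluation sentence list.
print(f"Tacotron 2:   {attention_error_rate(2, 148):.1%}")
print(f"Deep Voice 3: {attention_error_rate(16, 148):.1%}")
```

Expressed as rates, the 2 versus 16 attention errors correspond to roughly 1.4% versus 10.8% of the evaluation words, which makes the gap between the two models easier to compare across test sets of different sizes.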
