Text to Speech Synthesizer for Afaan Oromo Using Deep Neural Network
Publisher: ASTU
Abstract
In a world where technology is evolving at an exponential rate, speech synthesis is already part of our everyday lives. Text-to-speech (TTS) synthesis systems are concerned with the artificial generation of a natural and intelligible human voice from given text transcriptions. A text-to-speech system serves as an assistive tool for people with visual impairments and reading disabilities, allowing them to listen to written material such as webpages, books, newspaper articles, and textbooks on a variety of devices. Despite their potential applications, text-to-speech systems have remained a language-dependent discipline, and most efforts have focused on well-resourced languages, particularly English. Afaan Oromo is an under-resourced language with few existing language resources for developing a text-to-speech system. In this study,
we have collected and prepared a speech dataset containing 8076 text and audio pairs from
legitimate sources to develop a text-to-speech synthesizer for Afaan Oromo. Apart from standard words and names, the proposed model handles non-standard words, including numbers, abbreviations, currency, and acronyms. The study focuses on the deep neural network technique, a machine learning approach implemented as several layers of neural networks. Deep neural networks were chosen for this work because they can map complex linguistic features to acoustic feature parameters. Several experiments were conducted to determine the best-performing model between Tacotron 2, a recurrent neural network model, and Deep Voice 3, a fully convolutional neural network model. Objective and subjective evaluations were carried out to assess the performance of the models: an attention error test for the objective evaluation and a mean opinion score (MOS) test for the subjective evaluation. In the objective evaluation, Tacotron 2 made only 2 attention errors while Deep Voice 3 made 16, out of 148 words in the evaluation sentence list. In addition, we obtained MOS results of 4.32 and 4.21 out of five for Tacotron 2, and 3.28 and 3.02 out of five for Deep Voice 3, in terms of intelligibility and naturalness respectively. From these evaluation results, we conclude that Tacotron 2, based on a recurrent neural network, provides an encouraging result, which makes our model appropriate for applications that need a text-to-speech service, such as recommendation systems, telephone inquiry services, and smart education.
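The two evaluation metrics above are simple to compute: a MOS is the arithmetic mean of listener ratings on a 1–5 scale, and the attention error figures can be expressed as a rate over the 148-word evaluation list. A minimal sketch of both computations follows; the listener ratings in the usage lines are hypothetical placeholders, not the study's actual data, while the error counts (2 and 16 out of 148) come from the abstract.

```python
def mean_opinion_score(ratings):
    """Average listener ratings on a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)

def attention_error_rate(errors, total_words):
    """Fraction of evaluation words on which the model made an attention error."""
    return errors / total_words

# Hypothetical intelligibility ratings from five listeners (illustration only).
scores = [5, 4, 5, 4, 4]
print(round(mean_opinion_score(scores), 2))

# Error counts reported in the study, over the 148-word evaluation sentence list.
print(f"Tacotron 2:   {attention_error_rate(2, 148):.1%}")
print(f"Deep Voice 3: {attention_error_rate(16, 148):.1%}")
```

Expressed as rates, the 2 versus 16 attention errors correspond to roughly 1.4% versus 10.8% of the evaluation words, which makes the gap between the two models easier to compare across test sets of different sizes.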
