Transfer Speaker Voice to Amharic Text to Speech for Real-time Applications Using Deep Learning
Publisher: ASTU

Abstract
Achieving balanced and controlled voice variation is crucial in media translation,
particularly in video and audio dubbing. Existing approaches such as single-voice text-to-speech (TTS) systems lack speaker voice awareness, while voice cloning techniques restrict
variation to a single speaker's voice, making them less suitable for this domain. To address this
challenge, we propose a three-way pipeline specifically designed for media translation
applications. Our approach incorporates three distinct levels of voice transfer: gender
similarity, 20-speaker set voice cloning, and complete reference voice cloning. At the gender
similarity level, a Speaker Verification module detects the gender of the reference speaker
and selects a gender-specific TTS model, ensuring natural and appropriate gender-based
variation in the synthesized speech. For the 20-speaker set voice cloning, we employ a
Generative Adversarial Network (GAN) approach that maps any speaker's voice to a
preselected set of 20 voices. By utilizing speaker verification and cross-similarity
calculations, we identify the voice in the set that most closely resembles the reference speaker,
enabling speech synthesis with a balance between maintaining the essence of the reference
voice and introducing diversity. Additionally, our model offers complete reference voice
cloning, re-voicing the synthesized speech to match the reference audio. Leveraging transfer
learning and disentangling speaker and linguistic features, we achieve a high degree of
similarity between the synthesized speech and the reference audio, while preserving the
desired voice style. To ensure optimal voice representation, we introduce a 30-second
embedding technique that extracts vocal features from speaker utterance segments. This
technique enables the selection of the most representative segments, effectively capturing
the speaker's voice characteristics. Experimental evaluations demonstrate the effectiveness
of our proposed method in generating customized and personalized synthesized speech for
media translation purposes. We compare our system with previous methods on multiple
benchmarks, focusing on naturalness and intelligibility. The results show that our approach
achieves more efficient training configurations and is better suited to producing voice
variation for media translation.
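The cross-similarity selection described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes speaker embeddings (e.g. d-vectors from a speaker verification model) are already extracted, and the `select_closest_speaker` helper, the embedding dimensionality, and the toy data are all illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_closest_speaker(reference: np.ndarray, voice_set: dict) -> str:
    """Return the ID of the preselected voice whose embedding is most
    similar to the reference speaker's embedding (illustrative helper)."""
    return max(voice_set, key=lambda sid: cosine_similarity(reference, voice_set[sid]))

# Toy example: 3-dimensional embeddings stand in for the 20-voice set
# (real speaker-verification embeddings are typically 256-dimensional).
rng = np.random.default_rng(0)
voices = {f"spk{i:02d}": rng.normal(size=3) for i in range(20)}
# A reference that is a slightly perturbed copy of one set member.
ref = voices["spk07"] + 0.01 * rng.normal(size=3)
print(select_closest_speaker(ref, voices))
```

In this sketch the reference speaker is mapped to whichever of the 20 preselected voices maximizes cosine similarity, which is the balance point the pipeline aims for: the synthesized voice stays close to the reference while remaining confined to the controlled voice set.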
