Transfer Speaker Voice to Amharic Text to Speech for Real-time Applications Using Deep Learning

Publisher

ASTU

Abstract

Achieving balanced and controlled voice variation is crucial in media translation, particularly in video and audio dubbing. Existing approaches fall short in this domain: single-voice text-to-speech (TTS) systems lack speaker awareness, while voice cloning techniques confine all variation to a single speaker. To address this challenge, we propose a three-way pipeline designed specifically for media translation. Our approach provides three distinct levels of voice transfer: gender similarity, 20-speaker set voice cloning, and complete reference voice cloning. At the gender-similarity level, a speaker verification module detects the gender of the reference speaker and selects a gender-specific TTS model, ensuring natural, gender-appropriate variation in the synthesized speech. For 20-speaker set voice cloning, we employ a Generative Adversarial Network (GAN) that maps any speaker's voice onto a preselected set of 20 voices. Using speaker verification and cross-similarity calculations, we identify the voice within the set that most closely resembles the reference speaker, yielding synthesized speech that balances fidelity to the reference voice with diversity. Finally, our model offers complete reference voice cloning, re-voicing the synthesized speech to match the reference audio. By leveraging transfer learning and disentangling speaker and linguistic features, we achieve a high degree of similarity between the synthesized speech and the reference audio while preserving the desired voice style. To obtain a robust voice representation, we introduce a 30-second embedding technique that extracts vocal features from segments of a speaker's utterances and selects the most representative segments, effectively capturing the speaker's voice characteristics. Experimental evaluations demonstrate the effectiveness of the proposed method in generating customized, personalized synthesized speech for media translation. We compare our system with previous methods on multiple benchmarks, focusing on naturalness and intelligibility; the results show that our approach trains more efficiently and is better suited to controlled voice variation for media translation.
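To make the cross-similarity matching step concrete, the snippet below is a minimal sketch (not the thesis's implementation) of selecting, from a preset 20-voice set, the voice whose speaker embedding is closest to the reference speaker's. The 256-dimensional embeddings, the voice IDs, and the choice of cosine similarity as the cross-similarity measure are illustrative assumptions; in practice the embeddings would come from the speaker verification model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_closest_voice(reference_emb: np.ndarray,
                         voice_set_embs: dict) -> str:
    """Return the ID of the preset voice whose embedding is most
    similar to the reference speaker's embedding."""
    return max(voice_set_embs,
               key=lambda vid: cosine_similarity(reference_emb,
                                                 voice_set_embs[vid]))

# Toy usage: random 256-dim vectors stand in for embeddings produced
# by a speaker verification model (dimension is an assumption).
rng = np.random.default_rng(0)
voice_set = {f"voice_{i:02d}": rng.normal(size=256) for i in range(20)}
reference = rng.normal(size=256)
print(select_closest_voice(reference, voice_set))
```

A similar nearest-neighbor lookup over per-segment embeddings could also underlie the 30-second representative-segment selection described in the abstract, scoring each segment by its average similarity to the others.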
