Transfer Speaker Voice to Amharic Text to Speech for Real-time Applications Using Deep Learning

dc.contributor.advisor: Dr. Bahiru Asfaw
dc.contributor.author: Daniel, Mitiku
dc.date.accessioned: 2025-12-17T10:54:29Z
dc.date.issued: 2023-06
dc.description.abstract: Achieving balanced and controlled voice variation is crucial in media translation, particularly in video and audio dubbing. Existing approaches fall short: single-voice text-to-speech (TTS) systems lack speaker-voice awareness, while voice-cloning techniques restrict variation to a single speaker, making them less suitable for this domain. To address this challenge, we propose a three-way pipeline designed specifically for media-translation applications. Our approach incorporates three distinct levels of voice transfer: gender similarity, 20-speaker-set voice cloning, and complete reference voice cloning. At the gender-similarity level, a speaker-verification module detects the gender of the reference speaker and selects a gender-specific TTS model, ensuring natural and appropriate gender-based variation in the synthesized speech. For 20-speaker-set voice cloning, we employ a Generative Adversarial Network (GAN) that maps any speaker's voice to a preselected set of 20 voices. Using speaker verification and cross-similarity calculations, we identify the voice within the set that most closely resembles the reference speaker, enabling synthesis that balances preserving the essence of the reference voice with introducing diversity. Additionally, our model offers complete reference voice cloning, re-voicing the synthesized speech to match the reference audio. By leveraging transfer learning and disentangling speaker and linguistic features, we achieve a high degree of similarity between the synthesized speech and the reference audio while preserving the desired voice style. To ensure optimal voice representation, we introduce a 30-second embedding technique that extracts vocal features from segments of a speaker's utterances, selecting the most representative segments to effectively capture the speaker's voice characteristics.
Experimental evaluations demonstrate the effectiveness of the proposed method in generating customized and personalized synthesized speech for media-translation purposes. We compare our system with previous methods on multiple benchmarks, focusing on naturalness and intelligibility. The results show that our approach achieves more efficient training configurations and is better suited to voice variation for media translation.
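The abstract describes selecting, via "cross-similarity calculations" over speaker-verification embeddings, the voice in the 20-speaker set that most resembles the reference, and a 30-second technique that picks the most representative segments of a speaker's utterances. The thesis does not specify the similarity metric here, so the sketch below assumes cosine similarity over fixed-dimensional speaker embeddings (a common choice for speaker verification); the function names and the centroid-based notion of "most representative" are illustrative assumptions, not the author's exact method.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_representative(segment_embs):
    """Among embeddings extracted from utterance segments, return the one
    closest to the centroid of all segments, as a stand-in for picking the
    segment that best represents the speaker's voice."""
    segment_embs = np.asarray(segment_embs, dtype=float)
    centroid = segment_embs.mean(axis=0)
    sims = [cosine_similarity(e, centroid) for e in segment_embs]
    return segment_embs[int(np.argmax(sims))]

def select_closest_voice(reference_emb, voice_set_embs):
    """Return the index of the voice in the preselected set (e.g. the
    20-speaker set) whose embedding is most similar to the reference."""
    sims = [cosine_similarity(reference_emb, v) for v in voice_set_embs]
    return int(np.argmax(sims))
```

In such a pipeline, `most_representative` would run over embeddings of 30-second windows of the reference audio, and `select_closest_voice` would map the resulting embedding onto one of the preselected target voices before GAN-based re-voicing.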
dc.description.sponsorship: ASTU
dc.identifier.uri: http://10.240.1.28:4000/handle/123456789/1623
dc.language.iso: en_US
dc.publisher: ASTU
dc.subject: Media translation, voice variation, text-to-speech synthesis, gender similarity, voice cloning, Generative Adversarial Network (GAN), complete reference voice cloning, 30-second embeddings, customized synthesized speech
dc.title: Transfer Speaker Voice to Amharic Text to Speech for Real-time Applications Using Deep Learning
dc.type: Thesis

Files

Original bundle

Name: Daniel Mitiku.pdf
Size: 6 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Plain Text
