Transfer Speaker Voice to Amharic Text to Speech for Real-time Applications Using Deep Learning

dc.contributor.advisor: Dr. Bahiru Asfaw
dc.contributor.author: Daniel, Mitiku
dc.date.accessioned: 2025-12-17T10:54:29Z
dc.date.issued: 2023-06
dc.description.abstract: Achieving balanced and controlled voice variation is crucial in media translation, particularly in video and audio dubbing. Existing approaches fall short: single-voice text-to-speech (TTS) systems lack speaker-voice awareness, while voice-cloning techniques restrict variation to a single speaker, making them less suitable for this domain. To address this challenge, we propose a three-way pipeline designed specifically for media-translation applications. Our approach incorporates three distinct levels of voice transfer: gender similarity, 20-speaker-set voice cloning, and complete reference voice cloning. At the gender-similarity level, a speaker-verification module detects the gender of the reference speaker and selects a gender-specific TTS model, ensuring natural and appropriate gender-based variation in the synthesized speech. For 20-speaker-set voice cloning, we employ a Generative Adversarial Network (GAN) that maps any speaker's voice to a preselected set of 20 voices. Using speaker verification and cross-similarity calculations, we identify the voice within the set that most closely resembles the reference speaker, enabling synthesis that balances preserving the essence of the reference voice with introducing diversity. Additionally, our model offers complete reference voice cloning, re-voicing the synthesized speech to match the reference audio. By leveraging transfer learning and disentangling speaker and linguistic features, we achieve a high degree of similarity between the synthesized speech and the reference audio while preserving the desired voice style. To ensure optimal voice representation, we introduce a 30-second embedding technique that extracts vocal features from segments of a speaker's utterances, selecting the most representative segments to effectively capture the speaker's voice characteristics.
Experimental evaluations demonstrate the effectiveness of the proposed method in generating customized and personalized synthesized speech for media-translation purposes. We compare our system with previous methods on multiple benchmarks, focusing on naturalness and intelligibility. The results show that our approach achieves more efficient training configurations and is better suited to voice variation for media translation.
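The abstract describes selecting, via "cross-similarity calculations" over speaker-verification embeddings, the voice in the 20-speaker set that most resembles the reference, and a 30-second technique that picks the most representative segments of a speaker's utterances. The thesis does not specify the similarity metric here, so the sketch below assumes cosine similarity over fixed-dimensional speaker embeddings (a common choice for speaker verification); the function names and the centroid-based notion of "most representative" are illustrative assumptions, not the author's exact method.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_representative(segment_embs):
    """Among embeddings extracted from utterance segments, return the one
    closest to the centroid of all segments, as a stand-in for picking the
    segment that best represents the speaker's voice."""
    segment_embs = np.asarray(segment_embs, dtype=float)
    centroid = segment_embs.mean(axis=0)
    sims = [cosine_similarity(e, centroid) for e in segment_embs]
    return segment_embs[int(np.argmax(sims))]

def select_closest_voice(reference_emb, voice_set_embs):
    """Return the index of the voice in the preselected set (e.g. the
    20-speaker set) whose embedding is most similar to the reference."""
    sims = [cosine_similarity(reference_emb, v) for v in voice_set_embs]
    return int(np.argmax(sims))
```

In such a pipeline, `most_representative` would run over embeddings of 30-second windows of the reference audio, and `select_closest_voice` would map the resulting embedding onto one of the preselected target voices before GAN-based re-voicing.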
dc.description.sponsorship: ASTU
dc.identifier.uri: http://10.240.1.28:4000/handle/123456789/1623
dc.language.iso: en_US
dc.publisher: ASTU
dc.subject: Media translation, voice variation, text-to-speech synthesis, gender similarity, voice cloning, Generative Adversarial Network (GAN), complete reference voice cloning, 30-second embeddings, customized synthesized speech
dc.title: Transfer Speaker Voice to Amharic Text to Speech for Real-time Applications Using Deep Learning
dc.type: Thesis

Files

Original bundle

Name: Daniel Mitiku.pdf
Size: 6 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Plain Text
