Disentangled Representation Based One-Shot Realistic Neural Talking Head Synthesis
| dc.contributor.advisor | Getinet Yilma (PhD) | |
| dc.contributor.author | Adugna, Abe | |
| dc.date.accessioned | 2025-12-17T10:54:24Z | |
| dc.date.issued | 2023-09 | |
| dc.description.abstract | This study introduces a novel deep-learning model for realistic neural talking-head synthesis. A neural talking head synthesizes a video of the target person using the appearance taken from a single source image, while the motion is controlled by a driving video. The primary goal of this study is to generate a video that preserves the source image's appearance while acquiring motion information from the driving video. Prior approaches in this area rely mainly on 2D representations, such as appearance and motion, extracted from the input image. Recent work performs motion transfer on arbitrary objects with unsupervised techniques that require no prior information; however, a large pose difference between the objects in the source and driving images remains a significant challenge for current unsupervised algorithms, and even the most recent methods fail to produce good visual quality in this setting. To address the poor visual quality of videos with large pose changes, a GAN-based one-shot realistic neural talking-head model is proposed. The proposed model employs cross-modal attention to preserve identity-related information and enhance the quality of the generated images, and uses background and warp losses to reduce noisy background motion and encourage the network to produce high-quality images. Additionally, to provide more precise and vivid visual results, a multi-scale occlusion restoration module upsamples the low-resolution occlusion map into a multi-resolution occlusion map. Finally, disentangled representations are employed to facilitate animation and prevent leakage of the driving object's appearance or shape. Experimental results show that the proposed approach improves several evaluation metrics and that the visual quality of the animated videos notably surpasses that of MRAA: relative to the MRAA baseline, L1 improves from 0.040 to 0.034, AKD from 1.28 to 1.13, and AED from 0.133 to 0.115. Experiments on the VoxCeleb1 benchmark demonstrate the superiority of the proposed solution, combining cross-modal attention, background and warp losses, the multi-scale occlusion network, and disentangled representations, over existing state-of-the-art methods. | en_US |
| dc.description.sponsorship | ASTU | en_US |
| dc.identifier.uri | http://10.240.1.28:4000/handle/123456789/1606 | |
| dc.language.iso | en_US | en_US |
| dc.publisher | ASTU | en_US |
| dc.subject | Generative Adversarial Network, One-shot, MRAA, Disentangled Representation, 2D Representations | en_US |
| dc.title | Disentangled Representation Based One-Shot Realistic Neural Talking Head Synthesis | en_US |
| dc.type | Thesis | en_US |

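The abstract describes a multi-scale occlusion restoration module that upsamples a single low-resolution occlusion map into a multi-resolution occlusion map used during generation. The snippet below is a minimal PyTorch-style sketch of that general idea, not the thesis's actual implementation; the function name, tensor shapes, and the choice of bilinear upsampling are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the thesis's code) of multi-scale occlusion:
# one low-resolution occlusion map is upsampled to each feature resolution and
# used to mask the warped source features that the decoder must then restore.
import torch
import torch.nn.functional as F


def multi_scale_occlusion(occlusion_lowres, warped_features):
    """Apply an upsampled occlusion map to warped features at every scale.

    occlusion_lowres: (B, 1, h, w) map in [0, 1], e.g. from a motion network.
    warped_features:  list of (B, C_i, H_i, W_i) source features already warped
                      by the estimated motion, ordered from coarse to fine.
    Returns the masked features; low-valued (occluded) regions are suppressed
    and left for the generator to inpaint.
    """
    occluded = []
    for feat in warped_features:
        # Upsample the occlusion map to this feature map's spatial resolution.
        occ = F.interpolate(occlusion_lowres, size=feat.shape[-2:],
                            mode="bilinear", align_corners=False)
        occluded.append(feat * occ)
    return occluded


if __name__ == "__main__":
    # Hypothetical shapes for a quick sanity check.
    occ = torch.sigmoid(torch.randn(1, 1, 16, 16))
    feats = [torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64)]
    out = multi_scale_occlusion(occ, feats)
    print([o.shape for o in out])
```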