Multimodal Understanding Amharic Video Question Answering using Bidirectional Cross Modal Attention

Helina Tefera

dc.contributor.author	Helina Tefera
dc.date.accessioned	2026-04-09T11:49:41Z
dc.date.issued	2025-11
dc.description.abstract	This thesis presents a novel deep learning approach to Amharic Video Question Answering that uses Bidirectional Cross Modal Attention to improve comprehension of Amharic video content by fusing visual and textual modalities. One of the primary challenges in video question answering is the heterogeneous nature of visual and textual data, especially in low resource languages like Amharic. Conventional approaches often rely on randomly sampled video frames, do not consider the semantic relations between objects, and target English only. To overcome these limitations, this study introduces a Bidirectional Cross Modal Attention mechanism with CLIP based best frame selection, which models fine grained interactions between the video representations (CLIP features, temporal embeddings, and object features) and the question encoding produced by BERT. Previous models either aggregate all visual features at once or treat the question as a single global embedding, which loses word level alignment and spatial temporal correspondence. In contrast, the Bidirectional Cross Modal Attention model allows visual and textual tokens to attend to each other iteratively, improving semantic alignment between questions and the relevant visual content. To further enhance understanding, multiple visual cues are integrated: CLS tokens, CLIP embeddings, object detections from FastRCNN, and temporal spatial features. A bidirectional cross modal attention based fusion layer selectively combines these features. The Bidirectional Cross Modal Attention VQA model not only introduces the first benchmark for Amharic Video Question Answering (Amharic VQA) but also achieves significant improvements over current state of the art methods on the English MSVD QA dataset.
The Amharic Video Question Answering model achieved 48.21% accuracy, and the English model trained on English MSVD QA reached 58.71%. This is a notable improvement compared with Yu et al. 2024, who report a final accuracy of 48.2% on MSVD QA, and Tang et al. 2024, who report 39.1%. These results highlight the effectiveness of fine grained, bidirectional attention in enhancing semantic fusion between video content and questions and in improving Video Question Answering performance, particularly in English.
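The iterative, two-way attention described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes plain scaled dot-product attention with no learned projections, a residual update after each round, and toy two-dimensional vectors standing in for the BERT question tokens and the CLIP or object features. The function names and the `rounds` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention: each query becomes a weighted mix of context rows.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return attended


def bidirectional_cross_attention(text_tokens, visual_tokens, rounds=2):
    """Let text attend to vision and vision attend to text, repeated `rounds` times.

    Both directions read the same pre-update tokens, then a residual
    connection adds the attended vectors back (an assumed design choice).
    """
    for _ in range(rounds):
        text_from_visual = attend(text_tokens, visual_tokens)
        visual_from_text = attend(visual_tokens, text_tokens)
        text_tokens = [
            [a + b for a, b in zip(t, u)] for t, u in zip(text_tokens, text_from_visual)
        ]
        visual_tokens = [
            [a + b for a, b in zip(v, u)] for v, u in zip(visual_tokens, visual_from_text)
        ]
    return text_tokens, visual_tokens


# Toy input: two question tokens and one visual token, each of dimension 2.
question = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.5, 0.5]]
fused_text, fused_visual = bidirectional_cross_attention(question, frames, rounds=1)
```

Each modality keeps its own token count, so word level and frame level detail survive the fusion step instead of being collapsed into one global vector. A practical model would add learned projections, multiple heads, and normalization.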
dc.description.sponsorship	ASTU
dc.identifier.uri	https://etd.astu.edu.et/handle/123456789/3084
dc.language.iso	en_US
dc.publisher	ASTU
dc.subject	Amharic Video Question Answering
dc.subject	Multilingual Video Understanding
dc.subject	Bidirectional Cross Modal Attention
dc.subject	BERT
dc.subject	CLIP
dc.subject	FastRCNN
dc.subject	Multimodal Fusion
dc.subject	MSVD QA
dc.subject	Low Resource Language Benchmarking
dc.title	Multimodal Understanding Amharic Video Question Answering using Bidirectional Cross Modal Attention
dc.type	Thesis

Files

Original bundle

Name: Helina Tefera.pdf
Size: 8.07 MB
Format: Adobe Portable Document Format

Collections

Information System Engineering