Multimodal Understanding Amharic Video Question Answering using Bidirectional Cross Modal Attention

dc.contributor.authorHelina Tefera
dc.date.accessioned2026-04-09T11:49:41Z
dc.date.issued2025-11
dc.description.abstractThis thesis presents a deep learning approach to Amharic video question answering (Amharic VQA) that fuses visual and textual modalities through Bidirectional Cross Modal Attention. A primary challenge in video question answering is the heterogeneous nature of visual and textual data, especially in low resource languages such as Amharic. Conventional approaches often rely on randomly sampled video frames, do not consider semantic relations between objects, and target English only. To overcome these limitations, this study introduces a Bidirectional Cross Modal Attention mechanism with CLIP based best frame selection, which models fine grained interactions between the video representations (CLIP features, temporal embeddings, and object features) and the question encoding produced by BERT. Previous models either aggregate all visual features at once or treat the question as a single global embedding, which loses word level alignment and spatial temporal correspondence. In contrast, the Bidirectional Cross Modal Attention model lets visual and textual tokens attend to each other iteratively, improving semantic alignment between questions and the relevant visual content. To further enhance understanding, multiple visual cues are integrated: CLS tokens, CLIP embeddings, object detections from FastRCNN, and temporal spatial features. A fusion layer based on bidirectional cross modal attention selectively combines these features. The resulting model introduces the first benchmark for Amharic Video Question Answering and also improves on current state of the art methods on the English MSVD QA dataset.
The Amharic Video Question Answering model achieved 48.21% accuracy, and the English model reached 58.71% on MSVD QA, a notable improvement compared with Yu et al. (2024), with a final accuracy of 48.2% on MSVD QA, and Tang et al. (2024), with a final result of 39.1%. These results highlight the effectiveness of fine grained, bidirectional attention in enhancing semantic fusion between video content and questions and in improving Video Question Answering performance, particularly in English.
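The two mechanisms named in the abstract, question-guided frame selection and bidirectional cross modal attention, can be illustrated with a minimal NumPy sketch. This is not the thesis implementation. It assumes that frames are ranked by cosine similarity to a pooled question embedding (the abstract does not state the selection criterion), it omits learned projection matrices, and it uses random arrays in place of real CLIP and BERT features. All function names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_top_frames(frame_emb, question_emb, k):
    """Assumed criterion: keep the k frames most similar to the question,
    returned in temporal order."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    scores = f @ q                      # cosine similarity per frame
    top = np.argsort(-scores)[:k]       # indices of the k highest scores
    return np.sort(top)

def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return weights @ keys_values, weights

def bidirectional_cross_attention(visual, text, num_rounds=2):
    """Visual tokens attend to text tokens and vice versa, repeated
    num_rounds times with residual updates."""
    for _ in range(num_rounds):
        v_ctx, _ = cross_attend(visual, text)   # vision queries text
        t_ctx, _ = cross_attend(text, visual)   # text queries vision
        visual = visual + v_ctx
        text = text + t_ctx
    return visual, text

# Stand-in features: 16 frames and 6 question tokens, 32 dimensions each.
rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 32))
question_tokens = rng.normal(size=(6, 32))
question_vec = question_tokens.mean(axis=0)     # pooled question embedding

keep = select_top_frames(frames, question_vec, k=4)
visual_out, text_out = bidirectional_cross_attention(frames[keep], question_tokens)
print(keep, visual_out.shape, text_out.shape)
```

Because both directions are computed from the same inputs before either side is updated, each round is symmetric: the question tokens and the selected frames are refined from each other's previous state rather than from a partially updated one.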
dc.description.sponsorshipASTU
dc.identifier.urihttps://etd.astu.edu.et/handle/123456789/3084
dc.language.isoen_US
dc.publisherASTU
dc.subjectAmharic Video Question Answering
dc.subjectMultilingual Video Understanding
dc.subjectBidirectional Cross Modal Attention
dc.subjectBERT
dc.subjectCLIP
dc.subjectFastRCNN
dc.subjectmulti modal Fusion
dc.subjectMSVD QA
dc.subjectLow Resource Language Benchmarking
dc.titleMultimodal Understanding Amharic Video Question Answering using Bidirectional Cross Modal Attention
dc.typeThesis

Files

Original bundle

Name:
Helina Tefera.pdf
Size:
8.07 MB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed to upon submission