Multimodal Understanding Amharic Video Question Answering using Bidirectional Cross Modal Attention
Publisher
ASTU
Abstract
Amharic Video Question Answering using Bidirectional Cross Modal Attention is a novel
deep learning approach designed to improve the comprehension of Amharic video content
by fusing visual and textual modalities. One of the primary challenges in video question
answering is the heterogeneous nature of visual and textual data, especially in
low-resource languages like Amharic. Conventional approaches often rely on randomly
sampled video frames, do not consider the semantic relations between objects, and all
target English. To overcome these limitations, this study introduces a Bidirectional
Cross Modal Attention mechanism with CLIP-based best-frame selection, which models
fine-grained interactions between the video representations (CLIP features, temporal
embeddings, and object features) and the question encoding produced by BERT. Previous
models either aggregate all visual features at once or treat the question as a single
global embedding, which results in a loss of word-level alignment and spatio-temporal
correspondence. In contrast, the Bidirectional Cross Modal Attention model allows
visual and textual tokens to attend to each other iteratively, improving semantic
alignment between questions and relevant visual content. To further enhance
understanding, multiple visual cues such as CLS tokens, CLIP embeddings, object
detections from FastRCNN, and spatio-temporal features are integrated. A bidirectional
cross-modal attention-based fusion layer selectively combines these features.
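
The idea of visual and textual tokens attending to each other iteratively can be
illustrated with a minimal sketch. It is not the thesis implementation: the function
names (softmax, attend, bidirectional_cross_attention) and the num_rounds parameter are
hypothetical, the learned projections, multiple heads, and normalization layers of a
real model are omitted, and toy vectors stand in for BERT and CLIP features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: each query becomes a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


def add_rows(rows, deltas):
    """Residual connection: element-wise sum of two equally shaped matrices."""
    return [[x + d for x, d in zip(r, dr)] for r, dr in zip(rows, deltas)]


def bidirectional_cross_attention(text_tokens, visual_tokens, num_rounds=2):
    """Let question tokens and visual tokens attend to each other repeatedly.

    Assumption: both directions read the same pre-update state in each round,
    and no learned weights are used (identity projections).
    """
    for _ in range(num_rounds):
        text_ctx = attend(text_tokens, visual_tokens, visual_tokens)   # text -> video
        visual_ctx = attend(visual_tokens, text_tokens, text_tokens)   # video -> text
        text_tokens = add_rows(text_tokens, text_ctx)
        visual_tokens = add_rows(visual_tokens, visual_ctx)
    return text_tokens, visual_tokens


# Toy example: 2 question tokens and 3 visual tokens, each of dimension 2.
question = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_text, fused_video = bidirectional_cross_attention(question, frames, num_rounds=1)
```

Both outputs keep one row per input token, so word-level and frame-level positions
remain available to a later fusion or answer-prediction layer, unlike a single pooled
question embedding.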
The Bidirectional Cross Modal Attention VQA model not only introduces the first
benchmark for Amharic Video Question Answering (Amharic VQA) but also achieves
significant improvements over current state-of-the-art methods on the English MSVD-QA
dataset. The Amharic Video Question Answering model achieved 48.21% accuracy, and the
English model reached 58.71% on English MSVD-QA, a notable improvement compared with
Yu et al. (2024), with a final accuracy of 48.2% on MSVD-QA, and Tang et al. (2024),
with a final result of 39.1%. These results highlight the effectiveness of
fine-grained, bidirectional attention in enhancing semantic fusion between video
content and questions and in improving Video Question Answering performance,
particularly in English.
