Multimodal Understanding Amharic Video Question Answering using Bidirectional Cross Modal Attention

Publisher

ASTU

Abstract

This study presents a novel deep learning approach to Amharic Video Question Answering that uses Bidirectional Cross Modal Attention to improve comprehension of Amharic video content by fusing visual and textual modalities. A primary challenge in video question answering is the heterogeneous nature of visual and textual data, especially for low resource languages such as Amharic. Conventional approaches often rely on randomly sampled video frames, do not consider semantic relations between objects, and target English only. To overcome these limitations, this study introduces a Bidirectional Cross Modal Attention mechanism with CLIP based best frame selection, which models fine grained interactions between the video representations (CLIP features, temporal embeddings, and object features) and the question encoding produced by BERT. Previous models either aggregate all visual features at once or treat the question as a single global embedding, which loses word level alignment and spatial temporal correspondence. In contrast, the Bidirectional Cross Modal Attention model allows visual and textual tokens to attend to each other iteratively, improving semantic alignment between questions and relevant visual content. To further enhance understanding, multiple visual cues are integrated, including CLS tokens, CLIP embeddings, object detections from FastRCNN, and temporal spatial features. A Bidirectional Cross Modal Attention based fusion layer selectively combines these features. The resulting model introduces the first benchmark for Amharic Video Question Answering (Amharic VQA) and also achieves significant improvements over current state of the art methods on the English MSVD QA dataset.
The Amharic Video Question Answering model achieved 48.21% accuracy, and the English model using the English MSVD QA dataset reached 58.71%, a notable improvement over Yu et al. (2024), with a final accuracy of 48.2% on MSVD QA, and Tang et al. (2024), with a final result of 39.1%. These results highlight the effectiveness of fine grained, bidirectional attention in enhancing semantic fusion between video content and questions and in improving Video Question Answering performance, particularly in English.
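
The two mechanisms named in the abstract, best frame selection and bidirectional cross modal attention, can be illustrated with a minimal NumPy sketch. This is not the thesis implementation. It assumes that frame and question embeddings already live in a shared space (as CLIP image and text embeddings do), that frames are ranked by cosine similarity to the question, and that each attention round is a plain scaled dot product with a residual update and no learned projections. The function names `select_best_frames`, `cross_attention`, and `bidirectional_cross_modal_attention`, and the parameter `num_rounds`, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_best_frames(frame_emb, question_emb, k):
    # Assumption: rank frames by cosine similarity to the question embedding
    # and keep the top k, returned in temporal (index) order.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    sims = f @ q
    top = np.argsort(-sims)[:k]
    return np.sort(top)


def cross_attention(queries, keys_values):
    # Scaled dot product attention: each query token gathers context
    # from the tokens of the other modality.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights


def bidirectional_cross_modal_attention(visual, text, num_rounds=2):
    # Assumption: "iterative" is modeled as a fixed number of rounds in which
    # both directions are computed from the same state, then added residually.
    for _ in range(num_rounds):
        v_ctx, _ = cross_attention(visual, text)   # visual attends to text
        t_ctx, _ = cross_attention(text, visual)   # text attends to visual
        visual = visual + v_ctx
        text = text + t_ctx
    return visual, text


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(16, 8))     # stand-in for per-frame embeddings
    question = rng.normal(size=(8,))      # stand-in for a pooled question embedding
    words = rng.normal(size=(6, 8))       # stand-in for per-word question tokens
    keep = select_best_frames(frames, question, k=4)
    fused_v, fused_t = bidirectional_cross_modal_attention(frames[keep], words)
    print(keep.shape, fused_v.shape, fused_t.shape)
```

Because the text tokens stay separate rather than being pooled into one vector, each word can weight the selected frames differently, which is the word level alignment the abstract says global question embeddings lose. A trained model would add learned query, key, and value projections and multiple heads on top of this skeleton.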
