Generating Image Captions in Amharic Language Using Hybridized Attention-Based Deep Neural Networks


Publisher

ASTU

Abstract

Amharic, a Semitic language, is spoken by more than thirty million people in Ethiopia. This research aims to generate image captions in Amharic using a deep learning approach. Amharic caption generation serves several application tasks, primarily helping people with visual impairments understand the content of an image. Although caption generation in English and other languages has been studied extensively in recent years, generating captions in Amharic has received little attention. Recently, image captioning has shifted toward encoder-decoder deep learning approaches. In general, such an approach uses a pretrained CNN encoder to extract the visual features of an image and represent them in a fixed dimension, then passes these features to a decoder (such as an LSTM or GRU) to generate the caption. Various experiments have been carried out over the past few years to improve encoder-decoder approaches with different methods. Attention mechanisms (AM) and bidirectional language decoders are among these methods and play a key role in producing better captions. Bridging the modality gap between visual and textual features remains one of the significant challenges for these methods in producing semantically correct captions: as the gap between the two feature spaces grows, the model fails to predict words that closely describe the image content. To address this issue, this study proposes a hybridized attention-based deep neural network (DNN) for generating Amharic captions. The proposed model consists of an Inception-V3 CNN encoder that extracts image features, a visual attention mechanism that concentrates on the essential regions of the image, and a Bi-GRU decoder with an AM for language generation. Compared to Bi-LSTM, Bi-GRU learns long-term relationships between visual and textual features more effectively. For the dataset, we translated the English version of the Flickr8k dataset into Amharic and performed experiments. The hybridized attention-based approach achieves better results on 1G-BLEU, 2G-BLEU, 3G-BLEU, and 4G-BLEU, scoring 60.6, 50.1, 43.7, and 38.8, respectively. In addition, the proposed model achieved better results in experiments on the BNATURE and Flickr8k (English) datasets. The results demonstrate the significance of the proposed approach compared with the baseline models: it is 0.21% higher than both CNN-Bi-GRU and Bag-LSTM on the 4G-BLEU metric.
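To make the described architecture concrete, the following is a minimal sketch of such a hybridized attention-based encoder-decoder in TensorFlow/Keras. The additive (Bahdanau-style) attention formulation, the layer sizes, and all class and variable names are illustrative assumptions for this sketch, not the thesis's actual implementation.

```python
import tensorflow as tf

# Pretrained Inception-V3 as the visual encoder (classification head removed).
# For a 299x299 input it yields an 8x8x2048 feature map, which is reshaped
# into 64 image regions of 2048 features each for the attention mechanism.
encoder = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")


class VisualAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau-style) attention over image regions.
    The exact attention formulation is an assumption of this sketch."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, 2048); hidden: (batch, units)
        hidden_t = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_t)))
        weights = tf.nn.softmax(scores, axis=1)  # one weight per image region
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights


class BiGRUDecoder(tf.keras.Model):
    """Bi-GRU language decoder conditioned on the attended visual context."""

    def __init__(self, vocab_size, embedding_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.attention = VisualAttention(units)
        self.bigru = tf.keras.layers.Bidirectional(
            tf.keras.layers.GRU(units, return_sequences=True, return_state=True))
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, features, hidden):
        # Attend to image regions using the previous decoder state.
        context, weights = self.attention(features, hidden)
        x = self.embedding(word_ids)  # (batch, 1, embedding_dim)
        # Concatenate the attended visual context with the word embedding.
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        out, fwd_state, bwd_state = self.bigru(x)
        logits = self.fc(out[:, -1, :])  # distribution over the Amharic vocabulary
        return logits, fwd_state, weights
```

At inference time the decoder would run one token at a time, feeding back the forward GRU state and the sampled word until an end-of-sentence token is produced, after which the generated caption could be scored with BLEU against the reference Amharic captions.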
