Generating Image Captions in Amharic Language Using Hybridized Attention-Based Deep Neural Networks
Publisher: ASTU
Abstract
Amharic, a Semitic language, is spoken by more than thirty million people in Ethiopia. This research aims to generate image captions in the Amharic language using a deep learning approach. Generating captions in Amharic serves a variety of application tasks, primarily helping people with visual impairments understand image content. Although caption generation in English and other languages has been studied extensively in recent years, generating captions in Amharic has not been widely explored. Recently, image captioning has been driven by encoder-decoder deep learning approaches. In general, this approach uses a pretrained CNN encoder to extract the visual features of an image and represent them in a fixed dimension, and then passes these features to a decoder (such as an LSTM or GRU) to generate the caption, as sketched below.
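To make this pipeline concrete, the following is a minimal Keras sketch of such an encoder-decoder captioner, assuming a frozen Inception-V3 encoder and a GRU decoder; the vocabulary size, embedding dimension, and other hyperparameters are illustrative, not the settings used in this thesis.

```python
import tensorflow as tf

# Encoder: a pretrained CNN (Inception-V3 here) with the classification
# head removed; it maps a 299x299 image to a 2048-d feature vector.
cnn = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
cnn.trainable = False  # used as a frozen feature extractor

# Hypothetical sizes, chosen only for illustration.
vocab_size, embed_dim, units, max_len = 5000, 256, 512, 30

# Decoder: embeds the partial caption, conditions on the image features,
# and predicts the next word.
image_features = tf.keras.Input(shape=(2048,))   # pooled CNN features
caption_tokens = tf.keras.Input(shape=(max_len,))
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(caption_tokens)
init_state = tf.keras.layers.Dense(units)(image_features)  # image seeds the GRU state
x = tf.keras.layers.GRU(units)(x, initial_state=init_state)
next_word = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)

decoder = tf.keras.Model([image_features, caption_tokens], next_word)
```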
Over the past few years, various methods have been explored to improve this encoder-decoder framework. Among them, attention mechanisms (AM) and bidirectional language decoders play a key role in producing better captions. Bridging the modality gap between visual and textual features remains a significant challenge for these previous methods in producing semantically correct captions: when the gap between the two feature spaces widens, the model fails to predict words that closely describe the image content. To address this issue, this study proposes a hybridized attention-based deep neural network (DNN) for generating Amharic captions. The proposed model consists of an Inception-V3 CNN encoder that extracts the image features, a visual attention mechanism that concentrates on the essential regions of the image, and a Bi-GRU with an attention mechanism as the decoder for language generation; a sketch of the attention component follows below. Compared to a Bi-LSTM, a Bi-GRU can learn long-term relationships between visual and textual features more effectively.
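As an illustration of the visual-attention component, here is a minimal sketch of Bahdanau-style additive attention over the spatial feature map of Inception-V3, assuming a (batch, 64, 2048) feature grid; the layer names and sizes are assumptions for illustration, not the thesis's exact implementation.

```python
import tensorflow as tf

class VisualAttention(tf.keras.layers.Layer):
    """Bahdanau-style additive attention over spatial image features."""

    def __init__(self, units):
        super().__init__()
        self.w_feat = tf.keras.layers.Dense(units)
        self.w_state = tf.keras.layers.Dense(units)
        self.score = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, 2048) spatial grid from the CNN encoder
        # hidden:   (batch, units) current decoder hidden state
        hidden = tf.expand_dims(hidden, 1)  # (batch, 1, units)
        # Score every spatial location against the decoder state.
        e = self.score(tf.nn.tanh(self.w_feat(features) + self.w_state(hidden)))
        alpha = tf.nn.softmax(e, axis=1)              # attention weights over locations
        context = tf.reduce_sum(alpha * features, 1)  # (batch, 2048) weighted context
        return context, alpha
```

At each decoding step, the context vector returned by such a layer is combined with the current word embedding; the bidirectional recurrent part can be built with `tf.keras.layers.Bidirectional(tf.keras.layers.GRU(units, return_sequences=True))`.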
To build a dataset for the proposed model, we translated the English captions of the Flickr8k dataset into Amharic and conducted experiments on the result. The hybridized attention-based approach achieves better results on 1G-BLEU, 2G-BLEU, 3G-BLEU, and 4G-BLEU, with scores of 60.6, 50.1, 43.7, and 38.8, respectively. In addition, the proposed model achieved better results in experiments on the BNATURE and Flickr8k (English) datasets. These results demonstrate the significance of the proposed approach compared with the baseline models: on the 4G-BLEU metric, it is 0.21% higher than both CNN-Bi-GRU and Bag-LSTM.
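For reference, n-gram BLEU scores such as 1G-BLEU through 4G-BLEU can be computed with NLTK's `corpus_bleu`; the toy captions below are made up for illustration and are unrelated to the reported results.

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy captions, made up for illustration only.
references = [[["a", "dog", "runs", "on", "the", "grass"]]]         # references per image
hypotheses = [["a", "dog", "runs", "on", "the", "green", "grass"]]  # model outputs

# n-gram BLEU: uniform weights up to the chosen order.
bleu_1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu_4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"1G-BLEU: {bleu_1:.3f}  4G-BLEU: {bleu_4:.3f}")
```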
