Tigrinya Hate Speech Detection and Classification from Facebook Posts and Comments Using Deep Learning Approaches
Publisher: ASTU
Abstract
Nowadays, sharing information with loved ones, friends, and coworkers through social media is one of the easiest things to do. Although people are free to express their feelings as they choose, these platforms have also made it more convenient to spread hate speech. Hate speech detection is the task in which a machine with a trained model analyzes a piece of text and classifies it as hate or hate-free speech. To the best of our knowledge, no study has yet been conducted on Tigrinya hate speech detection. The main objective of this study is therefore to design and develop a hate speech detection model for the Tigrinya language from Facebook posts and comments. Hate speech detection models have been developed worldwide, including in our own country of Ethiopia, for a variety of languages by various scholars. However, because of the differences between the languages, a hate speech detection model designed for Amharic or Afan Oromo is not applicable to Tigrinya. We have therefore developed a model that detects Tigrinya hate speech for both binary and multi-class classification. To achieve this objective, we collected a dataset of 5,400 posts and comments from different Facebook pages using the Facepager tool and prepared it in CSV file format. While all 5,400 items were used for multi-class classification, only 3,608 of them were used for binary classification. We used stratified k-fold cross-validation to split the dataset for training and testing. We designed the model using three deep learning approaches combined with three feature extraction techniques: the Bi-LSTM, CNN-LSTM, and CNN models were applied with Word2vec, FastText, and Keras Embedding-layer features. We applied dropout and L2 regularization to overcome the overfitting observed in each model. For multi-class classification, the Bi-LSTM model with FastText CBOW embeddings outperformed the CNN-LSTM and CNN models, with an accuracy of 87.41%, a precision of 88.02%, a recall of 85.74%, and an F1-score of 86.86%. For binary classification, the Bi-LSTM model with FastText skip-gram embeddings outperformed the CNN-LSTM and CNN models, with an accuracy of 96.11%.
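The abstract states that stratified k-fold cross-validation was used to split the dataset for training and testing. A minimal scikit-learn sketch of that splitting step, with toy texts and labels standing in for the 5,400-item Facebook dataset (which is not reproduced here):

```python
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the real dataset: 0 = hate-free, 1 = hate.
texts = ["post a", "post b", "post c", "post d", "post e", "post f"]
labels = [0, 1, 0, 1, 0, 1]

# Stratified splits keep the hate / hate-free ratio similar in every fold.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_test_labels = []
for fold, (train_idx, test_idx) in enumerate(skf.split(texts, labels)):
    fold_test_labels.append(sorted(labels[i] for i in test_idx))
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Because each fold preserves the class ratio, every test split above contains one example of each class, which matters when hate posts are a minority of the data.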

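The best-performing architecture is described as a Bi-LSTM with dropout and L2 regularization. Below is a minimal Keras sketch of such a classifier; all hyperparameters (vocabulary size, embedding dimension, LSTM units, dropout rate, L2 factor) are illustrative assumptions, not the thesis settings:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

VOCAB_SIZE = 20000   # assumed vocabulary size
NUM_CLASSES = 3      # multi-class setting; a single sigmoid unit would serve binary

model = tf.keras.Sequential([
    # Trainable Keras Embedding layer; the thesis also evaluated
    # pretrained Word2vec and FastText vectors as alternatives.
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.5),                                      # dropout against overfitting
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 regularization
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Forward pass on a dummy batch of padded token-id sequences.
dummy = np.zeros((2, 100), dtype="int32")
probs = model(dummy)
print(probs.shape)
```

The bidirectional wrapper lets the LSTM read each post in both directions, which is one plausible reason the Bi-LSTM edged out the CNN and CNN-LSTM variants reported in the abstract.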