Hate Speech Detection for Amharic Language on Social Media Using Machine Learning Techniques
Publisher
ASTU
Abstract
Hate speech on social media has unfortunately become common in the Ethiopian online community, largely due to the substantial growth in the number of social media users in recent years. Such speech can spread quickly among online users and may escalate into acts of violence and hate crime on the ground. Determining whether a portion of text contains hate speech is not a simple task for humans: it is time-consuming and introduces subjective notions of what makes a text hate speech or offensive speech. As a solution to this problem, this research proposes hate speech detection using machine learning and text-mining feature extraction techniques to build a detection model. Hate speech data were collected from the Facebook public page and manually labeled into three classes; the labels were then also converted into two classes, yielding a ternary and a binary dataset. The research employed an experimental approach to determine the best combination of machine learning algorithm and feature extraction method. Support vector machine (SVM), naive Bayes (NB), and random forest (RF) models were trained on the whole dataset, for both datasets, using features based on word unigrams, bigrams, trigrams, combined n-grams, TF-IDF, combined n-grams weighted by TF-IDF, and word2vec. The models were evaluated using 5-fold cross-validation, and classification performance was used to compare them. Using the two datasets, the study developed two kinds of models for each feature set: binary models and ternary models. The SVM models with word2vec features achieved slightly better performance than the NB and RF models in both the binary and ternary settings. According to the classification results, the ternary models show less confusion between hate and non-hate speech than the binary models. However, the models tend to misclassify offensive speech as hate speech.
Overall, hate speech detection using machine learning and text feature extraction achieves better performance when based on the multi-class dataset than when based on the binary dataset.
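To make the feature extraction and evaluation steps named in the abstract more concrete, the following is a minimal sketch in plain Python of combined word n-grams (unigram to trigram) weighted by TF-IDF, together with a 5-fold index split. It is an illustration under stated assumptions, not the thesis's implementation: the function names `word_ngrams`, `tfidf_vectors`, and `k_fold_indices` are hypothetical, whitespace tokenization is assumed, and the smoothed idf formula is one common variant that may differ from the weighting actually used.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams for n in [n_min, n_max], joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs, n_min=1, n_max=3):
    """Map each document to a sparse {ngram: tf-idf weight} dict.

    Assumptions (not from the source): whitespace tokenization, raw term
    frequency, and the smoothed idf  ln((1 + N) / (1 + df)) + 1.
    """
    grams_per_doc = [word_ngrams(d.split(), n_min, n_max) for d in docs]

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for grams in grams_per_doc:
        df.update(set(grams))

    n_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        tf = Counter(grams)
        vectors.append({
            g: count * (math.log((1 + n_docs) / (1 + df[g])) + 1)
            for g, count in tf.items()
        })
    return vectors


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size
```

Each resulting sparse vector could then be passed to a classifier such as SVM, NB, or RF, with one model trained and scored per fold; the classifier step is omitted here. In practice the data would normally be shuffled (or stratified by class) before splitting, which this sketch leaves out for brevity.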
