Enhancing the Performance of Email Spam Detection and Classification using Machine Learning Techniques
Publisher: ASTU
Abstract
Email is regarded as the primary channel of professional communication, enabling
businesspeople and both commercial and nonprofit organizations to communicate with one
another and exchange important official documents and reports on a global scale. This global
reach attracts many attackers and intruders who exploit it to commit crimes; in
particular, spammers create fake emails that contain enticing information and distribute
them to recipients worldwide. Current approaches primarily focus on widely spoken languages
like English, resulting in a lack of comprehensive models tailored to underrepresented
languages such as Amharic. This linguistic gap leads to poor performance in identifying spam
emails in Amharic due to limited availability of preprocessed datasets and language-specific
tools. Additionally, challenges such as imbalanced datasets, noise in translated data, and the
absence of robust feature extraction techniques further exacerbate the problem. While methods
like TF-IDF and ML algorithms, including SVM and RF, show promise, they still struggle with
precision and recall trade-offs, especially for imbalanced classes. These limitations underscore
the need for a systematic enhancement of data preprocessing, balanced dataset creation, and
model optimization to bridge the gap and improve email spam detection and classification for
Amharic and other low-resource languages. ML and NLP techniques are used to identify spam, and ML
techniques are frequently used in spam filtering to classify emails as either legitimate (ham) or
spam (unsolicited messages). We studied existing features and models to improve
accuracy, precision, recall, and F-score. We created an Amharic dataset by translating local
and publicly available English data using the Google Translate API. We split the data into 70% for
training and 30% for testing, and we used 5-fold cross-validation (CV) to
check the models' generalizability. We evaluated four ML models: MNB, BNB, SVM, and RF. The
SVM model outperformed the other models, attaining 98.52% accuracy. We also
compared our results with baseline works, and the proposed model performs better in
detecting and classifying email spam.
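
The evaluation protocol described in the abstract (a 70/30 split, 5-fold CV, and reporting accuracy, precision, recall, and F-score) can be illustrated with a minimal, standard-library sketch. This is not the authors' code. The function names, the toy labels, the choice to stratify the split by class, and the choice to run the folds on the training portion are all assumptions made for illustration only.

```python
import random


def stratified_split(labels, test_fraction=0.30, seed=0):
    """Split indices into train/test, keeping the class ratio in both parts.
    Stratification is an assumption; the abstract only states 70%/30%."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold_indices(indices, k=5, seed=0):
    """Yield (fitting, validation) index lists for k-fold cross-validation."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fitting = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield fitting, validation


def binary_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall, and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical, imbalanced toy labels (20% spam); not the thesis dataset.
labels = ["spam"] * 20 + ["ham"] * 80
train_idx, test_idx = stratified_split(labels)
folds = list(k_fold_indices(train_idx, k=5))

# A trivial baseline that labels every test email as ham.
y_test = [labels[i] for i in test_idx]
always_ham = ["ham"] * len(y_test)
scores = binary_metrics(y_test, always_ham)
```

On these toy labels the always-ham baseline reaches 0.8 accuracy while its spam recall and F1 are both 0, which is why accuracy alone is a weak measure on imbalanced classes and why precision, recall, and F-score are reported alongside it.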
