Enhancing the Performance of Email Spam Detection and Classification using Machine Learning Techniques

Publisher

ASTU

Abstract

Email is regarded as the primary channel for professional communication, enabling businesspeople and both commercial and nonprofit organizations to communicate with one another and exchange important official documents and reports on a global scale. This global reach attracts many attackers and intruders who exploit it to commit crimes; in particular, spammers create fake emails that contain attractive information and distribute them to recipients worldwide. Current approaches focus primarily on widely spoken languages such as English, so comprehensive models tailored to underrepresented languages such as Amharic are lacking. This linguistic gap leads to poor performance in identifying Amharic spam emails, owing to the limited availability of preprocessed datasets and language-specific tools. Challenges such as imbalanced datasets, noise in translated data, and the absence of robust feature extraction techniques further exacerbate the problem. Methods such as TF-IDF combined with ML algorithms, including SVM and RF, show promise, but they still struggle with precision and recall trade-offs, especially for imbalanced classes. These limitations underscore the need for systematic improvements in data preprocessing, balanced dataset creation, and model optimization to improve email spam detection and classification for Amharic and other low-resource languages. ML and NLP techniques are used to identify spam, and ML techniques are frequently used in spam filtering to classify emails as either legitimate (ham) or spam (unsolicited messages). We studied existing features and models with the aim of improving accuracy, precision, recall, and F-score. We created an Amharic dataset by translating local and publicly available English datasets using the Google Translate API. We split the data into 70% for training and 30% for testing, and used 5-fold cross-validation to check that the models generalize.
We evaluated four ML models: MNB, BNB, SVM, and RF. The SVM model outperformed the others, attaining 98.52% accuracy. We also compared our results with baseline works, and the proposed model performs better in detecting and classifying email spam.
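
The abstract notes that precision and recall trade-offs are hardest for imbalanced classes. A minimal sketch of why accuracy alone can mislead in that setting is given below. It uses only the Python standard library; the function name `binary_metrics` and the toy labels are hypothetical illustrations and are not taken from the thesis or its dataset.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced toy labels: 8 ham messages, 2 spam messages.
y_true = ["ham"] * 8 + ["spam"] * 2
# A hypothetical classifier that catches only one of the two spam messages
# and raises no false alarms on ham.
y_pred = ["ham"] * 8 + ["spam", "ham"]

scores = binary_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is high because ham dominates, while recall shows that half of the spam was missed. That gap is the reason the study reports precision, recall, and F-score alongside accuracy.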
