A Comparative Analysis of Machine Learning Algorithms for Word Sense Disambiguation: In the Case of Wolaita Language
Publisher
ASTU
Abstract
All human languages contain words that can mean different things in different contexts.
In natural language processing (NLP), the term "word sense" refers to the different
meanings that a word may have depending on the context in which it is used.
Word Sense Disambiguation (WSD) is the NLP task of determining the correct meaning
of a word within a given context.
Word ambiguity is a persistent difficulty for NLP, because computers cannot resolve
ambiguous words the way human beings do. To address this challenge, researchers have
developed WSD systems for many languages. Like all other languages, Wolaita contains
ambiguous words, so this thesis presents research on Word Sense Disambiguation for the
Wolaita language. We adopted a corpus-based machine-learning approach, using 3560
sentences collected from different data sources in the language. We selected seven
ambiguous words from the language, namely “Sintta”, “Haytta”, “Ayfiya”, “Doona”,
“Aadhdha”, “Naaga”, and “Ogiya”, and prepared seven different datasets. After preparing
the datasets, we applied preprocessing techniques such as tokenization, stopword removal,
stemming, and normalization. For feature extraction, we used bag-of-words (BOW), Word2vec,
and TF-IDF combined with N-grams. For comparison, we tested four clustering algorithms
(EM, simple k-means, farthest first, and hierarchical clustering) on unlabeled data, and
we selected six algorithms (SVM, NB, NN, AdaBoost, RFC, and Bagging) for the supervised
approach.
Finally, we compared the performance of the algorithms in both the clustering and the
classification models. Among the selected clustering algorithms, EM had the best
performance with 63.4%, and among the six supervised algorithms, SVM and NB achieved
good performance, with accuracies of 86.5% and 84.1% respectively, at an optimum window
size of 5-5 for Wolaita language WSD.
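
The context-window and N-gram steps mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a window size of 5-5 means up to five tokens on each side of the ambiguous word, which the abstract does not spell out. The function names context_window and ngram_bag are hypothetical, the sentence is made of placeholder tokens rather than real Wolaita text, and the sketch is not the thesis implementation.

```python
from collections import Counter


def context_window(tokens, target, left=5, right=5):
    """Return (before, after): up to `left` tokens preceding and `right`
    tokens following the first occurrence of `target` (case-insensitive)."""
    lowered = [t.lower() for t in tokens]
    idx = lowered.index(target.lower())  # raises ValueError if target is absent
    before = lowered[max(0, idx - left):idx]
    after = lowered[idx + 1:idx + 1 + right]
    return before, after


def ngram_bag(tokens, n_max=2):
    """Count every n-gram of length 1..n_max in `tokens` (a simple BOW)."""
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


# Placeholder tokens (w1..w12), not real Wolaita words; "doona" is one of
# the seven target words listed in the abstract.
sentence = "w1 w2 w3 w4 w5 w6 doona w7 w8 w9 w10 w11 w12".split()
before, after = context_window(sentence, "doona")

# Build n-grams separately on each side so that no bigram spans the target.
features = ngram_bag(before) + ngram_bag(after)
```

Counts like these could then be weighted (for example with TF-IDF) and passed to a classifier such as SVM or NB. Building the N-grams separately on each side of the target is a design choice of this sketch, not something the abstract specifies.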
