Word Sense Disambiguation for Wolaita Language Using Machine Learning Approach
Publisher
ASTU
Abstract
The amount of data accessible online keeps increasing, and with it the need for Natural Language
Processing to access and process this data. However, ambiguity remains a major difficulty
for Natural Language Processing: unlike human beings, computers cannot readily interpret
one word in different ways depending on its context. As a solution, Word Sense
Disambiguation models have been developed for many languages to address the problem of lexical
ambiguity. The Wolaita language also has many polysemous words, and these can
cause difficulties for the Natural Language Processing applications developed by previous
researchers. Therefore, this study proposes a Word Sense Disambiguation model for the Wolaita
language using a machine learning approach. To conduct the research, a total of 2797 sense
examples were collected from the Holy Bible, academic books, media agencies (sport, health,
business, and national and international news), and data from prior researchers. The collected
data were annotated by language experts, and five datasets were then prepared for five ambiguous
words: “Doona”, “Ayfiya”, “Aadhdha”, “Naaga” and “Ogiya”. We employed a
quantitative experimental research approach to determine the best combination of machine
learning algorithms and feature extraction techniques. Support Vector Classifier, Bagging,
Random Forest, and AdaBoost classifiers were selected, combined with BOW, TF-IDF and Word2Vec
feature extraction techniques, and trained on the five datasets at six window sizes (WS3, WS5,
WS7, WS9, WS11 and WS13). Of the six window sizes, WS11 (5-5) was selected as the optimal
window size in terms of accuracy and computational time. Among the four algorithms, the
Support Vector Classifier and Bagging classifiers with TF-IDF achieved accuracies of 83.22%
and 82.82%, respectively, on WS11 (5-5) using 10-fold cross-validation.
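
To illustrate what a window size such as WS11 (5-5) means, the following is a minimal Python sketch of context-window extraction. It assumes that "5-5" denotes up to five tokens on each side of the ambiguous word (eleven tokens including the target) and that sentences are tokenized on whitespace. The function names and the placeholder tokens are hypothetical and are not taken from the thesis.

```python
def context_window(tokens, target_index, left=5, right=5):
    """Return the target token with up to `left` tokens before and `right` after it.

    Assumption: WS11 (5-5) means 5 tokens on each side plus the target itself.
    The window is simply truncated at the sentence boundaries.
    """
    start = max(0, target_index - left)
    end = min(len(tokens), target_index + right + 1)
    return tokens[start:end]


def windows_for_word(sentence, target, left=5, right=5):
    """Collect one context window for every occurrence of `target` in `sentence`.

    Assumption: lowercasing and whitespace splitting stand in for whatever
    preprocessing the original study applied.
    """
    tokens = sentence.lower().split()
    target = target.lower()
    return [
        context_window(tokens, i, left, right)
        for i, token in enumerate(tokens)
        if token == target
    ]


if __name__ == "__main__":
    # Placeholder tokens (w1, w2, ...) surround the ambiguous word "Doona".
    sentence = "w1 w2 w3 w4 w5 w6 w7 Doona w9 w10 w11 w12 w13 w14 w15"
    for window in windows_for_word(sentence, "Doona"):
        print(len(window), window)
```

Each extracted window would then be turned into a feature vector (for example with TF-IDF) before being passed to a classifier. Smaller or larger settings such as WS3 or WS13 correspond to different values of `left` and `right`.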
