Named Entity Recognition for Wolaytta Language Using Machine Learning Approach

Publisher

ASTU

Abstract

The rapid growth of digital data has made it difficult to extract relevant structured information. Information extraction techniques were introduced to ease the otherwise laborious process of searching for relevant content. One of these techniques, Named Entity Recognition (NER), extracts proper names from unstructured text and is an important step in identifying named entities in large documents. Substantial research has been conducted on NER for well-studied languages such as English. However, it is still in its infancy for Ethiopian languages. The objective of this study is to develop Named Entity Recognition for the Wolaytta language. Wolaytta is morphologically rich but highly disadvantaged in terms of computational linguistic resources. In this study, we collected data from three sources: Wolaytta Wogeta Radio Station, Wolaytta Fana Broadcasting Corporation Radio Station, and the Wolaytta Language and Literature Department. The resulting newly labeled dataset of 16,420 words is used for the study. The work focuses on three main named entity types, Person, Location, and Organization, in unstructured Wolaytta text. Word-level feature extraction and DictVectorizer are used for feature engineering. On the extracted feature representation, we employed three machine-learning algorithms, Conditional Random Field, Support Vector Machine, and Random Forest, to classify named entities into their predefined classes. We conducted several experiments to determine the best-performing model. The Conditional Random Field model performed best, with 93.8%, 94.9%, and 91.4% precision, recall, and F1-score respectively, ahead of the other classifiers. After identifying the model best suited to our problem, we examined combinations of feature sets to determine which word-level features have the most influence on recognizing named entities in Wolaytta text.
We dropped one feature at a time: the larger the decrease in the model's performance, the more influential that feature is. The experimental results showed that dropping the Part-of-Speech tag decreased precision, recall, and F1-score by 2.2%, 1.2%, and 0.8%, and that dropping word-shape features decreased them by 1.6%, 2.6%, and 0.2% respectively, relative to baseline performance. The suffix feature, however, had less effect on our model. Furthermore, beyond the current model, our newly prepared dataset can play a vital role for future res
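The word-level feature extraction and the drop-one-feature experiment described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the feature names (`word.lower`, `pos`, `shape`, `suffix3`, `is_first`), the `FEATURE_GROUPS` mapping, the placeholder POS tag `"N"`, and the caller-supplied `evaluate` function are all hypothetical. Feature dictionaries of this shape are the kind of input that scikit-learn's `DictVectorizer` accepts.

```python
import re

# Hypothetical grouping of feature keys; the abstract names POS tag,
# word shape, and suffix as the features that were dropped in turn.
FEATURE_GROUPS = {"pos": ["pos"], "shape": ["shape"], "suffix": ["suffix3"]}


def word_shape(token):
    """Map a token to a coarse shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse runs of the same symbol so long and short words share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(sentence, i):
    """Build a feature dict for the i-th (word, pos_tag) pair of a sentence."""
    word, pos = sentence[i]
    return {
        "word.lower": word.lower(),
        "pos": pos,
        "shape": word_shape(word),
        "suffix3": word[-3:],
        "is_first": i == 0,
    }


def drop_group(feature_dicts, group):
    """Return copies of the feature dicts without the keys of one feature group."""
    dropped = set(FEATURE_GROUPS[group])
    return [{k: v for k, v in f.items() if k not in dropped} for f in feature_dicts]


def ablation_deltas(feature_dicts, labels, evaluate):
    """Score decrease per dropped group; a larger decrease means more influence.

    `evaluate(feature_dicts, labels)` is a hypothetical caller-supplied function
    that trains a model on the given features and returns a single score.
    """
    baseline = evaluate(feature_dicts, labels)
    return {
        group: baseline - evaluate(drop_group(feature_dicts, group), labels)
        for group in FEATURE_GROUPS
    }


# Usage: an organization name taken from the abstract; the "N" tags are placeholders.
sentence = [("Wolaytta", "N"), ("Fana", "N"), ("Broadcasting", "N"), ("Corporation", "N")]
features = [token_features(sentence, i) for i in range(len(sentence))]
without_pos = drop_group(features, "pos")
```

Keeping each feature as a named dictionary key makes the ablation a matter of removing keys, so the same training and scoring routine can be reused unchanged for every feature combination.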
