Automatic Complex Sentence Parsing For Afan Oromo Text: An Experiment Using Hidden Markov Model (HMM).
Abstract
Natural language processing (NLP) is a research area that is growing steadily in popularity for both academic and commercial reasons. Higher-level NLP systems (e.g., machine translation) can be realized only once the lower-level ones (e.g., part-of-speech taggers, syntactic parsers) have been built successfully. This functional dependency exists even among the lower-level NLP systems.
This thesis attempts to integrate the ideas and outputs of a previously attempted Afan Oromo part-of-speech tagger in order to address a further problem: the parsing of complex sentences in the language.
Syntactic parsing underlies most applications in natural language processing. Parsers are already used extensively in a number of disciplines, such as computer science (for compiler construction, database interfaces, artificial intelligence, etc.) and computational linguistics (for text analysis, corpus analysis, machine translation, etc.).
Although there have been some comprehensive studies of Afan Oromo syntax from a linguistic perspective, attempts to investigate it from a computational point of view are very recent. This thesis presents an attempt to extract features, such as Afan Oromo word and phrase classes and sentence formalisms, that enable the implementation of an automatic Afan Oromo complex sentence parser.
The sample data used in this study were taken from references that are widely used in the teaching and learning of the language. The data were manually analyzed, tagged, and parsed, and then used as a corpus from which to extract the grammar rules and assign probabilities.
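The step of extracting grammar rules from a parsed corpus and assigning them probabilities can be illustrated with a minimal sketch. It assumes relative-frequency (maximum likelihood) estimation over a small treebank, which the abstract does not confirm as the method used. The nested-tuple tree format, the function names, the labels S, NP, VP, N, V, and the placeholder words w1 to w5 are all hypothetical and are not taken from the thesis corpus.

```python
from collections import Counter, defaultdict


def extract_rules(tree):
    """Yield (lhs, rhs) for every phrase-level node of a nested-tuple tree.

    A tree is (label, child, child, ...); a preterminal is (tag, word).
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # preterminal over a word: lexical rules are skipped here
    rhs = tuple(child[0] for child in children)
    yield (label, rhs)
    for child in children:
        yield from extract_rules(child)


def estimate_probabilities(trees):
    """Relative frequency: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule for tree in trees for rule in extract_rules(tree))
    lhs_totals = defaultdict(int)
    for (lhs, _), count in counts.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in counts.items()}


# Hypothetical toy treebank with placeholder words (not real corpus data).
toy_trees = [
    ("S", ("NP", ("N", "w1")), ("VP", ("V", "w2"))),
    ("S", ("NP", ("N", "w3")), ("VP", ("NP", ("N", "w4")), ("V", "w5"))),
]
probs = estimate_probabilities(toy_trees)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

In this toy treebank the two VP expansions each occur once, so each receives probability 0.5, while the only S and NP expansions receive probability 1.0.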
Experiments were conducted on 300 complex sentences containing 3029 words. In this study, 80% of the corpus (240 sentences) was used as the training set and 20% (60 sentences) as the test set.
The part-of-speech tagger integrated in this study was built from the 3029 manually annotated words and uses a tag set of 28 categories. Of these, 27 are actual tag categories, whereas the 28th, “X”, is used for unknown words. The experiment showed that the tagger attained accuracies of 89.7% on the training set and 84% on the test set.
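How an HMM tagger with a reserved unknown-word tag might operate can be sketched as follows. This is an illustration under stated assumptions, not the tagger built in the thesis: it assumes a bigram HMM with add-one smoothing decoded by the Viterbi algorithm, and it assumes that any word unseen in training is simply labeled “X” after decoding. The function names, the tags N and V, and the placeholder words are hypothetical.

```python
import math
from collections import Counter, defaultdict

UNKNOWN_TAG = "X"  # the tag the abstract reserves for unknown words
START = "<s>"      # hypothetical sentence-start state


def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence; unseen words are labeled X."""
    if not words:
        return []
    tags = sorted(emit)
    vocab = {w for counter in emit.values() for w in counter}
    best = []  # best[i][tag] = (log score, previous tag)
    for i, word in enumerate(words):
        column = {}
        for tag in tags:
            # Unseen words contribute no emission evidence (assumption).
            e = log_prob(emit[tag], word, len(vocab)) if word in vocab else 0.0
            if i == 0:
                column[tag] = (log_prob(trans[START], tag, len(tags)) + e, None)
            else:
                score, prev = max(
                    (best[i - 1][p][0] + log_prob(trans[p], tag, len(tags)), p)
                    for p in tags
                )
                column[tag] = (score + e, prev)
        best.append(column)
    # Follow back-pointers from the best final state.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    path.reverse()
    return [t if w in vocab else UNKNOWN_TAG for w, t in zip(words, path)]


# Hypothetical toy training data with placeholder words and tags.
toy_corpus = [
    [("w1", "N"), ("w2", "V")],
    [("w3", "N"), ("w2", "V")],
]
trans, emit = train(toy_corpus)
print(viterbi(["w1", "w2"], trans, emit))
print(viterbi(["w1", "zzz"], trans, emit))
```

Labeling unseen words after decoding keeps the transition model intact for the surrounding known words; other designs, such as training an emission distribution for “X” on rare words, are equally plausible readings of the abstract.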
The experiments on complex sentence parsing showed an accuracy of 80.0% on the training set and 71.6% on the test set prepared for this purpose.
