Development of part-of-speech tagger for Xhosa

Delman, Xolani

Title: Development of part-of-speech tagger for Xhosa
Creator: Delman, Xolani
Subject: Computational linguistics -- Methodology; Natural language processing (Computer science); Linguistic models
Date: 2016
Type: Thesis
Type: Masters
Type: MSc
Identifier: http://hdl.handle.net/10353/11872
Identifier: vital:39114
Description: Part-of-Speech (POS) tagging is the process of assigning an appropriate part of speech or lexical category to each word in a given sentence of a particular natural language. Natural languages are the languages that human beings use to communicate with one another, such as Xhosa, Zulu and English. POS tagging plays an important role in natural language processing applications. The main applications of POS tagging include machine translation, parsing, text chunking, spell checking and grammar checking.
Xhosa (sometimes referred to as isiXhosa) is one of the eleven official languages of South Africa and is spoken by over 8 million South Africans. The language is mainly spoken in the Eastern Cape and Western Cape provinces of the country. It is the second most widely spoken native language in South Africa after Zulu (sometimes called isiZulu). Although the number of speakers is high, Xhosa is considerably under-resourced. Very few publications and books exist in the language, and the domains that use it as a medium of instruction are very limited. However, the language is now gaining momentum. An Oxford-approved Xhosa dictionary has been developed recently, and Xhosa newspapers that did not exist in the recent past are now published. Text from these sources can be combined into a larger corpus that can be used to train the tagger. This work aims to develop an effective POS tagger for Xhosa.
This thesis describes the work that needed to be done to produce an automatic POS tagger for Xhosa. A tagset consisting of 36 POS tags for the language was used for this purpose; the tags are listed in the thesis. A total of 5000 words were manually tagged for the purpose of training the tagger. Another 3000 words, disjoint from the manually tagged training data, were used for testing the tagger.
The open-source Stanford CoreNLP toolkit was used to create the tagger. The toolkit implements a Maximum Entropy machine learning model, which was applied in the development of the tagger presented in this thesis. The thesis describes the implementation and testing of the model in detail. The results show that the development of the Xhosa POS tagging model was successful: the model obtained a tagging accuracy of 87.71 percent.
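To illustrate how a tagging accuracy like the one reported above is commonly computed, here is a minimal sketch. It assumes accuracy means the fraction of test tokens whose predicted tag matches the manually assigned tag; the abstract does not state the exact evaluation procedure used in the thesis. The function name, the placeholder tokens, and the tag labels are hypothetical and are not taken from the thesis tagset or data.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, tag)


def tagging_accuracy(gold: List[TaggedToken], predicted: List[TaggedToken]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    # Count the positions where the two tag sequences agree.
    correct = sum(1 for (_, g_tag), (_, p_tag) in zip(gold, predicted) if g_tag == p_tag)
    return correct / len(gold)


# Hypothetical toy data: placeholder tokens and tag labels, not thesis data.
gold = [("w1", "NOUN"), ("w2", "VERB"), ("w3", "NOUN"), ("w4", "ADJ")]
predicted = [("w1", "NOUN"), ("w2", "VERB"), ("w3", "VERB"), ("w4", "ADJ")]

accuracy = tagging_accuracy(gold, predicted)
print(f"{accuracy * 100:.2f} percent")
```

In this toy case three of the four tags agree, so the script prints 75.00 percent. Keeping the test tokens disjoint from the training tokens, as the abstract describes, means the score measures performance on text the model was not trained on.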
Format: 88 leaves
Format: pdf
Publisher: University of Fort Hare
Publisher: Faculty of Science and Agriculture
Language: English
Rights: University of Fort Hare

Collections

UFH Department of Computer Science
