Morphological Analyzer And Generator For Tamil Language Computer Science

Essay add: 28-10-2015, 16:56

Natural Language Processing (NLP) is the computer-based approach to analyzing text and speech, drawing on both a set of theories and a set of technologies. Because it is an active and still developing area of research, there is no single agreed-upon definition that would satisfy everyone. Taking its various aspects into account, however, NLP can be described as a field combining computer science and linguistics that is concerned with the interactions between computers and human languages. The main purpose of NLP is to achieve human-like language processing for a range of tasks or applications, and it is therefore considered a discipline within Artificial Intelligence (AI). Although the entire field is referred to as Natural Language Processing, it in fact covers two distinct areas: Natural Language Analysis and Natural Language Generation. Natural Language Analysis analyzes text in a particular language in order to produce a meaningful representation, whereas Natural Language Generation produces language from a given representation.

The goal of NLP is to design and build computational models that analyze, understand and generate human languages. Applications of NLP include machine translation of text from one language to another; generation of human language text such as fiction and manuals; communication with other systems, such as databases and robotic systems, through human-language commands and queries; text summarization and drawing conclusions from a given text; data retrieval for search engines; speech recognition; and text-to-speech (TTS) synthesis. Parsing a sentence to determine its syntax may be comparatively easy, but determining the semantic meaning of a sentence, or analyzing its context to determine the exact meaning, is a difficult task.

In NLP, Morphology is the level that deals with the structure of words and how they are formed. Words are composed of morphemes, the smallest meaningful units. Morphology lies between Phonology and Syntax in the NLP cone. In Computational Morphology, two major models are used to study the formation of words: two-level Morphological analysis and stemming. Two-level Morphological analysis covers both the analysis and the generation of words. In analysis, words are broken down into morphemes; in generation, words are formed from given morphemes together with rules on how those morphemes attach to one another. In stemming, which is closely related to (though not identical with) lemmatization, affixes are stripped off to obtain the stems of words.

Morphological analysis is an essential component in language applications ranging from spelling error correction to machine translation. As described above, a morphological analysis segments a word into morphemes and analyzes how these morphemes are attached to one another. In English, word formation is not very complex compared with other languages. Indic languages, by contrast, are highly complex in this respect. Tamil, a morphologically rich language, is no exception, and the need for systems that can predict such changes motivates research in this area. The morphemes of the language, the rules by which these morphemes are connected (orthographic rules) and the changes that occur when they attach together are important and interesting factors that need to be considered.

Considerable research has been carried out to date on the morphology of many of the world's languages. Finnish has been one of the major languages to which various morphological methods have been applied [6] [8]. English [17], Spanish, Hebrew [15], Arabic [11], Japanese [17], Croatian [3], Zamudio Basque (for verbs) [7] and Tigrinya [12] are some of the other languages that have been well analyzed. Some research has also been carried out on morphologically rich Indic languages, for example Hindi [16], Kannada [9] and Urdu [16]. Morphological analysis of Tamil has also been undertaken by various research groups, but with limited exposure to other research communities. A project on Sinhala Morphological analysis is presently being carried out by the Language Technology Research Laboratory of UCSC [21].

Besides the two major approaches mentioned earlier, two-level Morphological analysis and stemming, various other approaches have been applied to some languages. Statistical methods, in which a Hidden Markov Model is used for Morphological disambiguation [5], are one example. Research has also been done on memory-based Morphological analysis, in which memory-based learning algorithms learn mapping classifications when given an adequate number of instances of these mappings as input [1]; strategies such as the windowing method and 10-fold cross validation are used there. Language-independent Morphological analysis has also been attempted. That work focuses mainly on tokenization and employs a Hidden Markov Model [17]. Under this model, languages are broadly categorized as segmented (e.g. English) or non-segmented (e.g. Chinese, Japanese). As the paper discusses, however, segmentation ambiguity gives rise to many problems. Noise-robust supervised Morphological analysis using the WordFrame model is another methodology reported in past work [13]. Stemming is a widely used strategy for Morphological analysis. A later version of the stemming algorithm was written by Martin Porter (July 1980) and is widely known as Porter's Stemmer. Brute-force stemming, the production algorithm and suffix-stripping algorithms are other techniques used in stemming.
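As a rough illustration of suffix stripping, the sketch below removes the longest matching suffix from a small suffix list. It is not Porter's algorithm, which applies ordered groups of rules with conditions on the remaining stem; the suffix list, the minimum stem length and the function name are assumptions made only for this example.

```python
# Toy suffix-stripping stemmer. SUFFIXES and MIN_STEM_LEN are illustrative
# assumptions, not taken from Porter's stemmer or any published rule set.
SUFFIXES = ["ing", "ed", "es", "s"]
MIN_STEM_LEN = 3


def strip_suffix(word: str) -> str:
    """Remove the longest listed suffix, provided enough of the stem remains."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word  # no suffix applies; the word is returned unchanged


for w in ["boys", "walked", "caused", "is"]:
    print(w, "->", strip_suffix(w))
```

Note that "caused" is reduced to "caus", which is not a dictionary word. A stemmer of this kind only needs to map related forms to a common stem, whereas two-level analysis aims to recover the actual lexical form.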

Introducing Two-level Morphology

Two-level Morphological analysis is one of the dominant models in use and is popularly known as the PC-KIMMO model [8]. Kimmo Koskenniemi's model of two-level morphology was based on the traditional distinction that linguists make between morphotactics, which enumerates the inventory of morphemes and specifies the order in which they can occur, and morphophonemics, which accounts for alternate forms or spellings of morphemes according to the phonological context in which they occur. Morphological analysis is simply a model of the mapping between the surface form and the lexical form of words [4]. For example:

Surface form: Boys

(The Analyzer maps the surface form to the lexical form; the Generator maps the lexical form to the surface form.)

Lexical form: Boy + N + PL

As a more complex example, the word "caused" is analyzed as the stem "cause" followed by the suffix "-ed". However, adding the suffix "-ed" eliminates the final letter e of "cause". In Koskenniemi's model of two-level morphology, a word is represented as a direct, letter-for-letter correspondence between its lexical form and its surface form, with 0 standing for a symbol that has no counterpart on the other level. So, for instance, the word "caused" can be given the two-level representation:

Lexical form: c a u s e + e d

Surface form: c a u s 0 0 e d
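The letter-for-letter correspondence can be made concrete by pairing the symbols of the two forms position by position. The following is a minimal sketch, assuming "0" is the null symbol and "+" the morpheme boundary as in the example above; the function names are hypothetical.

```python
NULL = "0"  # null symbol, as used in the "caused" example


def align(lexical: str, surface: str) -> list:
    """Pair space-separated lexical and surface symbols position by position."""
    lex, sur = lexical.split(), surface.split()
    if len(lex) != len(sur):
        raise ValueError("two-level forms must contain the same number of symbols")
    return list(zip(lex, sur))


def surface_string(pairs) -> str:
    """Spell out the surface word by dropping null symbols."""
    return "".join(s for _, s in pairs if s != NULL)


def lexical_string(pairs) -> str:
    """Spell out the lexical form by dropping null symbols."""
    return "".join(l for l, _ in pairs if l != NULL)


pairs = align("c a u s e + e d", "c a u s 0 0 e d")
print(pairs)
print(lexical_string(pairs), "<->", surface_string(pairs))
```

The pairs ("e", "0") and ("+", "0") are where the two levels differ: the stem-final e and the morpheme boundary exist only on the lexical side.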

Such a Morphological model needs three major components: the lexicon, which holds the stems and the affixes; the Morphotactics, which specifies the order in which the morphemes may be arranged; and the Orthographic or spelling rules, which explain the variation found in the final surface form.
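The interplay of the three components can be sketched for a handful of English words. All entries, tags and the single e-deletion rule below are assumptions made for illustration; a real system would need a far larger lexicon and a carefully conditioned rule set, and the rule as written would over-apply to words such as "see".

```python
import re

# 1. Lexicon: stems and suffixes with their categories (toy entries).
STEMS = {"boy": "N", "cause": "V", "walk": "V"}
SUFFIXES = {"s": "PL", "ed": "PAST"}

# 2. Morphotactics: which suffix tags may follow which stem category.
MORPHOTACTICS = {"N": {"PL"}, "V": {"PAST"}}


# 3. Orthographic rule: a stem-final e is deleted before a suffix starting with e.
def apply_spelling(lexical: str) -> str:
    surface = re.sub(r"e\+(?=e)", "", lexical)  # cause+ed -> caused
    return surface.replace("+", "")  # remove any remaining boundary


def generate(stem: str, suffix: str) -> str:
    """Build a surface form, rejecting combinations the morphotactics forbid."""
    category, tag = STEMS[stem], SUFFIXES[suffix]
    if tag not in MORPHOTACTICS[category]:
        raise ValueError(f"{tag} cannot attach to a stem of category {category}")
    return apply_spelling(f"{stem}+{suffix}")


print(generate("boy", "s"))
print(generate("cause", "ed"))
print(generate("walk", "ed"))
```

Calling generate("boy", "ed") raises a ValueError, because the toy morphotactics does not allow a past-tense suffix on a noun stem.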

Two-level Morphology is implemented using Finite State Transducers (FSTs). A finite state network, or machine, consists of states, including one start state and one or more final states. A transition between states is possible only if the required input is recognized. A path is a sequence of transitions over arcs leading to a particular state. A finite state transducer is a finite state machine that produces a set of outputs for an accepted input and thus expresses a relation between two languages. With an FST we can both analyze (look up) and generate (look down). When looking up, the input is matched against the lower side of each symbol pair and the output is read from the upper side; looking down reverses the roles. In this way the FST defines a relation between strings. The finite state transducers built at Xerox are inherently bidirectional: there is no privileged input side [10].

[Figure: A simple FST Network, a single path of arcs labelled with the symbol pairs t:t, a:a, b:b, l:l, e:e, +Noun:ε, +Sg:ε]
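A network of this kind, pairing the letters of "table" with themselves and the tags +Noun and +Sg with ε, can be sketched as a table of arcs that is traversed in either direction. This is a simplified illustration and not the Xerox or PC-KIMMO implementation: the state numbers and function names are assumptions, and multi-character symbols such as +Noun are matched as plain substrings rather than as atomic symbols.

```python
EPS = ""  # epsilon is represented by the empty string

# Arcs: state -> list of (upper symbol, lower symbol, next state).
ARCS = {
    0: [("t", "t", 1)],
    1: [("a", "a", 2)],
    2: [("b", "b", 3)],
    3: [("l", "l", 4)],
    4: [("e", "e", 5)],
    5: [("+Noun", EPS, 6)],
    6: [("+Sg", EPS, 7)],
}
START, FINALS = 0, {7}


def transduce(text: str, read_upper: bool) -> list:
    """Return every string the network relates to text on the chosen side."""
    results = []

    def walk(state: int, pos: int, out: str) -> None:
        if state in FINALS and pos == len(text):
            results.append(out)
        for upper, lower, nxt in ARCS.get(state, []):
            inp, outp = (upper, lower) if read_upper else (lower, upper)
            if text.startswith(inp, pos):  # an epsilon input always matches
                walk(nxt, pos + len(inp), out + outp)

    walk(START, 0, "")
    return results


def look_up(surface: str) -> list:
    """Analyze: read the lower side, emit the upper side."""
    return transduce(surface, read_upper=False)


def look_down(lexical: str) -> list:
    """Generate: read the upper side, emit the lower side."""
    return transduce(lexical, read_upper=True)


print(look_up("table"))
print(look_down("table+Noun+Sg"))
```

The same arc table serves both directions, which is the sense in which such a transducer has no privileged input side. An input that the network does not accept, such as "tab", yields an empty list.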

A single surface string can be related to multiple lexical strings; that is, a single word may yield more analyses than intended. Some of these forms may be wrong, so such unwanted over-generation has to be identified and removed by making changes to the network. Unknown words have to be analyzed by suffix generalization, that is, by comparing them with similar known patterns. The style of Tamil Morphology, the categories into which each component is divided, the construction of a network for each category and the final combination of these into one large network are interesting topics to be discussed further. The main question is whether two-level morphological analysis is directly applicable [14] to the Tamil language, or what modifications are needed to apply it; this will be the main focus of this research and is addressed in the later chapters.
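One possible reading of these two points, multiple analyses for a known word and suffix-based guessing for an unknown one, is sketched below with English data. The analyses, suffix patterns, tags and the trailing "?" that marks a guess are all assumptions for illustration; they are not the method developed in this research.

```python
# Known analyses (toy data). One surface form may have several lexical forms.
ANALYSES = {
    "walks": ["walk+N+PL", "walk+V+3SG"],
    "boys": ["boy+N+PL"],
}

# Suffix patterns generalized from known words (assumed for this example).
SUFFIX_PATTERNS = [("s", ["+N+PL", "+V+3SG"]), ("ed", ["+V+PAST"])]


def analyze(word: str) -> list:
    """Return known analyses, or suffix-based guesses marked with '?'."""
    if word in ANALYSES:
        return list(ANALYSES[word])
    guesses = []
    for suffix, tags in SUFFIX_PATTERNS:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            guesses.extend(f"{stem}{tag}?" for tag in tags)
    return guesses


print(analyze("walks"))    # known word with two competing analyses
print(analyze("blorked"))  # unknown word, analysis guessed from its suffix
```

A later disambiguation or filtering step would still be needed to discard analyses that are wrong in context.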

Chapter 3 Design 3.1
