Morphological Analyzer And Generator For Tamil Language Computer Science Essay


Natural Language Processing (NLP) is a computer-based approach to analyzing text and speech, grounded in both a set of theories and a set of technologies. Because it is an active and still-developing area of research, there is no single definition that satisfies everyone. Taking its various aspects into account, however, NLP can be described as a field combining computer science and linguistics that is concerned with the interactions between computers and human languages. Its main purpose is to achieve human-like language processing for a range of tasks or applications, and so NLP is considered a discipline within Artificial Intelligence (AI). Although the entire field is referred to as Natural Language Processing, it in fact covers two distinct areas: Natural Language Analysis and Natural Language Generation. Natural Language Analysis analyzes text in a particular language in order to produce a meaningful representation, whereas Natural Language Generation produces language from a given representation.

The goal of NLP is to design and build computational models that analyze, understand and generate human languages. Applications of NLP include machine translation of text from one language to another; generation of human-language text such as fiction and manuals; communication with other systems such as databases and robots through human-language commands and queries; text summarization and drawing conclusions from a text; information retrieval for search engines; speech recognition; and text-to-speech (TTS). Parsing a sentence and determining its syntax may be comparatively easy, but determining its semantic meaning, or analyzing the context to find the exact intended meaning, is a difficult task.

In NLP, morphology is the level that deals with the structure of words and how they are formed. Words are composed of morphemes, the smallest meaningful units. Morphology sits between phonology and syntax in the NLP cone. In computational morphology, two major models are used to study the formation of words: two-level morphological analysis and stemming. Two-level morphology covers both the analysis and the generation of words. In analysis, words are broken down into morphemes; in generation, words are formed from given morphemes together with rules on how the morphemes attach to one another. In stemming, which is closely related to (though not identical with) lemmatization, affixes are stripped off to obtain the stems of words.

Morphological analysis is an essential component of language applications ranging from spelling correction to machine translation. As mentioned earlier, it segments a word into morphemes and analyzes how those morphemes attach to each other. In English, word formation is not very complex compared with many other languages, whereas in Indic languages it is far more complex. Tamil, a morphologically rich language, is no exception. The need for a system that can predict the changes that occur in word formation motivates research in this area. The morphemes of the language, the rules by which these morphemes are connected (orthographic rules) and the changes that occur when they attach together are important and interesting factors that need to be considered.

To date, much research has been carried out on different languages of the world. Finnish has been one of the major languages to which various morphological methods have been applied [6] [8]. English [17], Spanish, Hebrew [15], Arabic [11], Japanese [17], Croatian [3], Zamudio Basque (for verbs) [7] and Tigrinya [12] are some of the other languages that have been well analyzed. Some research has also been carried out on morphologically rich Indic languages, for example Hindi [16], Kannada [9] and Urdu [16]. Tamil morphological analysis has also been undertaken by various research groups, but with limited exposure to other research communities. A project on Sinhala morphological analysis is presently being carried out by the Language Technology Research Laboratory of UCSC [21].

Besides the two major approaches mentioned earlier, two-level morphological analysis and stemming, various other approaches have been applied to some languages. Statistical methods, in which a Hidden Markov Model is used for morphological disambiguation [5], are one example. Research has also been done on memory-based morphological analysis, in which memory-based learning algorithms learn mapping classifications when given an adequate number of instances of these mappings as input [1]; strategies such as the windowing method and 10-fold cross-validation are used there. Language-independent morphological analysis has also been attempted, focusing mainly on tokenization and employing a Hidden Markov Model [17]. Under this model, languages are broadly categorized as segmented (e.g. English) and non-segmented (e.g. Chinese, Japanese); as the paper discusses, however, segmentation ambiguity causes many problems. Noise-robust supervised morphological analysis using the WordFrame model is another methodology reported in past work [13]. Stemming is a widely used strategy for morphological analysis. A later version of the stemming algorithm was written by Martin Porter (July 1980) and is widely known as Porter's Stemmer. Brute-force stemming, the production algorithm and suffix-stripping algorithms are other techniques used in stemming.

Introducing Two-level Morphology

Two-level morphological analysis is one of the dominant models in use and is popularly known as the PC-KIMMO model [8]. Kimmo Koskenniemi's model of two-level morphology is based on the traditional distinction that linguists make between morphotactics, which enumerates the inventory of morphemes and specifies the order in which they can occur, and morphophonemics, which accounts for alternate forms or spellings of morphemes according to the phonological context in which they occur. Morphological analysis is simply a model of the mapping between the surface form and the lexical form of words [4], for example:


Surface form: boys

Lexical form: Boy + N + PL

As a more complex example, the word "caused" is analyzed as the stem "cause" followed by the suffix "-ed". However, adding the suffix "-ed" eliminates the final letter "e" of "cause". In Koskenniemi's model of two-level morphology, a word is represented as a direct, letter-for-letter correspondence between its lexical form and its surface form. So, for instance, the word "caused" can be given the following two-level representation, in which 0 stands for a null (empty) symbol,

Lexical form: c a u s e + e d

Surface form: c a u s 0 0 e d
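The letter-for-letter pairing above can be sketched in a few lines of Python. This is only an illustration of the alignment idea, not the PC-KIMMO implementation; the function names `pairs` and `realize` are made up for this example, and "0" and "+" are treated as the null symbol and the morpheme boundary.

```python
# Two-level pairing for "caused", with "0" as the null (empty) symbol
# and "+" as the morpheme boundary, as in the alignment above.
LEXICAL = "c a u s e + e d".split()
SURFACE = "c a u s 0 0 e d".split()


def pairs(lexical, surface):
    """Return the letter-for-letter lexical:surface correspondences."""
    if len(lexical) != len(surface):
        raise ValueError("two-level forms must have equal length")
    return list(zip(lexical, surface))


def realize(symbols):
    """Drop null symbols to obtain the plain string for one level."""
    return "".join(s for s in symbols if s != "0")


alignment = pairs(LEXICAL, SURFACE)   # [('c', 'c'), ..., ('e', '0'), ('+', '0'), ...]
surface_word = realize(SURFACE)       # "caused"
lexical_word = realize(LEXICAL)       # "cause+ed"
```

Each pair such as `('e', '0')` is one lexical:surface correspondence; the spelling rules of a two-level system decide in which contexts such a pair is allowed.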

Such a morphological model needs three major components: the lexicon, which holds the stems and the affixes; the morphotactics, which specify the order in which the morphemes may be arranged; and the orthographic (spelling) rules, which explain the variation found in the final surface form.

Two-level morphology is implemented using Finite State Transducers (FSTs). A finite state network (machine) consists of states, including one start state and one or more final states. A transition between states is possible only if the required input is recognized. A path is a sequence of transitions over arcs to a particular state. A finite state transducer is a finite state machine that produces output for each accepted input and thereby expresses a relation between two languages. With an FST we can both analyze (look up) and generate (look down). In analysis the input is matched against the lower side of each arc symbol and the output is taken from the upper side, so the FST relates strings to strings. The finite state transducers built at Xerox are inherently bidirectional: there is no privileged input side [10].















A simple FST Network

A single surface string can be related to multiple lexical strings, i.e. several analyses may be generated for one instance of a word. Some of these forms may be wrong, so such over-generation has to be identified and removed by making changes to the network. "Unknown words" have to be analyzed by suffix generalization, comparing them with similar patterns. The structure of Tamil morphology, the categories into which each component is divided, the construction of a network for each category and, finally, their combination into one large network are topics to be discussed further. The main question is whether two-level morphological analysis is directly applicable [14] to the Tamil language, or with what modifications it needs to be applied; this will be the main focus of this research and is addressed in later chapters.

Chapter 3


3.1 Introducing Finite State Transducers

What are Finite State networks?

People, materials and machines are often said to be in different states, and they change from one state to another depending on given conditions or actions. For example, a child can be in a happy, bored, excited or mischievous mood; these are his or her states. Likewise, in chemistry, H2O is said to exist in different states: as the temperature increases, it changes from its solid state (ice) to its liquid state (water) and later to its gaseous state (vapor). These changes of state take place at particular temperatures, which are the conditions.

Machines are not much different. In a simple mechanical machine such as a light switch, the states are ON and OFF, and the machine moves from one to the other depending on the current state and the action performed. When the switch is in the ON state you need to push down to change it to the OFF state, and for the reverse you need to push up; these are the actions, or inputs. Note that if we push up when the switch is already in the ON state, nothing happens: the switch does not accept the input. Such inputs are called illegal inputs. When a machine is modeled as a diagram, the states are drawn as circles and the inputs or actions are used as labels on the arrows (arcs); illegal inputs are omitted. Shown below is such a model of the On-Off light switch with its states and transitions.





The model of the On-Off light switch with its states and transitions
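The same light-switch model can be written as a transition table. The sketch below is a minimal illustration, assuming the two states and two actions described above; the names `TRANSITIONS` and `step` are made up for this example. Illegal inputs are simply absent from the table, just as they are omitted from the diagram.

```python
# (current state, action) -> next state; illegal inputs have no entry.
TRANSITIONS = {
    ("OFF", "push up"): "ON",
    ("ON", "push down"): "OFF",
}


def step(state, action):
    """Return the next state, or None if the action is illegal in this state."""
    return TRANSITIONS.get((state, action))


state = "OFF"
state = step(state, "push up")       # switch is now ON
rejected = step(state, "push up")    # illegal input while ON, so None
```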

Before we go deeper, let us examine a simple coin-operated booth telephone, which accepts coins and, depending on our inputs, decides whether we can make a call. This coin-accepting machine is in fact a finite state machine. To simplify further, assume that,

A telephone call costs 15 rupees, and the telephone machine only accepts

Five rupee coins (N)

Ten rupee coins (D)

The machine accepts any combination of these coins that adds up to 15 rupees.

The machine needs exact change.

In this machine the start state (left-most) can be labeled zero, and the machine will allow you to make a call only once it enters the final state, also known as the acceptance state, which is labeled 15. Conventionally, final states are double-circled. Since the machine accepts only five and ten rupee coins, the possible intermediate states are 5 and 10; these are non-final states. The model giving all the possible sequences of input coins is drawn below,










A Simple Telephone Machine

Possible sequences of coins accepted by this telephone machine are N N N, N D and D N.




When finite state networks are applied to linguistics, three notions need to be defined: the alphabet, the words accepted by the network and the language of that network. These terms map onto the coin-accepting machine described earlier as follows,

ALPHABET - The set of valid input symbols the machine accepts. - {N, D}

WORDS - The sequences of symbols the machine accepts. - NNN, ND, DN

LANGUAGE - The entire set of words the machine accepts. - {NNN, ND, DN}
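These three definitions can be checked with a small sketch of the coin machine. It assumes only what is stated above (states are the running total, the start state is 0, the final state is 15, and overpaying is not allowed); the function names are illustrative.

```python
COIN_VALUE = {"N": 5, "D": 10}   # the alphabet: N = five rupees, D = ten rupees
START, FINAL = 0, 15             # states are named after the running total


def next_state(state, coin):
    """Follow one arc; return None when the machine has no such arc."""
    if coin not in COIN_VALUE:
        return None
    total = state + COIN_VALUE[coin]
    return total if total <= FINAL else None


def accepts(word):
    """A word is accepted if it leads from the start state to the final state."""
    state = START
    for coin in word:
        state = next_state(state, coin)
        if state is None:
            return False
    return state == FINAL


def language():
    """Enumerate every accepted word by walking all paths from the start state."""
    words = []

    def walk(state, prefix):
        if state == FINAL:
            words.append(prefix)
            return
        for coin in sorted(COIN_VALUE):
            target = next_state(state, coin)
            if target is not None:
                walk(target, prefix + coin)

    walk(START, "")
    return words
```

Calling `language()` walks every path through the network and yields exactly the three words listed above, while a sequence such as "DD" is rejected because the machine needs exact change.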

Finite State Languages Vs Natural Languages






The above finite state network accepts the single word "chair". The alphabet of this machine consists of the symbols "c", "h", "a", "i" and "r", and its language consists of the single word "chair". We can add more words to the network so that the machine recognizes a language with a bigger alphabet and vocabulary. With a very large network of this kind we could build a spell checker: if the network accepts a string, it is taken to be a valid word. The quality of such a spell checker therefore depends on the coverage of the network.
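A word-accepting network of this kind can be sketched as a dictionary of arcs built from a word list. This is a minimal illustration rather than a production spell checker; apart from "chair", the words in the list are made-up additions, and the function names are hypothetical.

```python
def build_network(words):
    """Build a letter-by-letter acceptor: state -> {symbol: next state}."""
    arcs = {0: {}}            # state 0 is the start state
    finals = set()
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in arcs[state]:
                new_state = len(arcs)
                arcs[new_state] = {}
                arcs[state][symbol] = new_state
            state = arcs[state][symbol]
        finals.add(state)     # the state reached at the end of a word is final
    return arcs, finals


def accepts(network, word):
    """Return True if the word follows a path from the start to a final state."""
    arcs, finals = network
    state = 0
    for symbol in word:
        state = arcs[state].get(symbol)
        if state is None:
            return False
    return state in finals


# "chair" is from the text; "chain" and "char" are illustrative additions.
network = build_network(["chair", "chain", "char"])
is_valid = accepts(network, "chair")      # accepted
is_misspelt = not accepts(network, "chiar")  # no path, so flagged
```

Words that share a prefix share states, so adding vocabulary grows the network more slowly than the word list itself.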















To make this more interesting, the network can be viewed as a two-level network in the style of Koskenniemi.

In the analyzing procedure,

Match the input symbols with the lower side symbols on the arrows/arcs.

When you succeed in finding a path return the upper side symbols on the path, else do not return anything.

This is called look up. In the example given, the input is "chairs" and the output is "chair+Noun+Pl".

In the generation procedure,

Match the input symbols with the upper side symbols on the arrows/arcs.

When you succeed in finding a path return the lower side symbols on the path, else do not return anything.

This is called look down. In the example given, the input is "chair+Noun+Pl" and the output is "chairs".

Here we convert one string of symbols into another. The upper-side language consists of the roots and the tags, whereas the lower-side language contains the surface forms, i.e. valid words.
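The look up and look down procedures can be sketched with a tiny transducer for "chair". The arc labels are an assumption, since the original network diagram is not available: each arc carries an upper:lower symbol pair, the empty string stands for the empty symbol, and the "+Sg" arc is a hypothetical addition so that the network has more than one path.

```python
EPS = ""  # the empty symbol on one side of an arc

# (source state, upper symbol, lower symbol, target state)
ARCS = [
    (0, "c", "c", 1), (1, "h", "h", 2), (2, "a", "a", 3),
    (3, "i", "i", 4), (4, "r", "r", 5),
    (5, "+Noun", EPS, 6),
    (6, "+Sg", EPS, 7),   # hypothetical singular arc, not in the text
    (6, "+Pl", "s", 7),
]
FINAL = {7}


def transduce(symbols, match_side):
    """Match symbols against one side of the arcs; collect the other side."""
    results = []

    def walk(state, pos, out):
        if state in FINAL and pos == len(symbols):
            results.append("".join(out))
        for src, upper, lower, dst in ARCS:
            if src != state:
                continue
            match, emit = (lower, upper) if match_side == "lower" else (upper, lower)
            if match == EPS:
                walk(dst, pos, out + [emit])        # consume no input
            elif pos < len(symbols) and symbols[pos] == match:
                walk(dst, pos + 1, out + [emit])

    walk(0, 0, [])
    return results


def look_up(surface):
    """Analysis: match the lower side, return the upper-side strings."""
    return transduce(list(surface), "lower")


def look_down(lexical_symbols):
    """Generation: match the upper side, return the lower-side strings."""
    return transduce(lexical_symbols, "upper")
```

With these arcs, `look_up("chairs")` returns the single analysis "chair+Noun+Pl", and `look_down(["c", "h", "a", "i", "r", "+Noun", "+Pl"])` returns "chairs". When no path is found, the result is an empty list, which corresponds to "do not return anything" in the procedures above.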


Since the time of grammarians such as Panini, linguists have described the change from a lexical form to a surface form as a set of rewrite rules applied in a specific order, producing a series of intermediate strings. The output derived from one rewrite rule is given as input to the next rewrite rule.

Lexicon string → Rewrite Rule 1 → Intermediate string → Rewrite Rule 2 → Intermediate string → ... → Rewrite Rule n → Surface string

As noted in the literature review, it was Johnson (1972), in his dissertation "Formal Aspects of Phonological Description", who first showed that phonological rewrite rules are less powerful than they appear and can be modeled by a finite state transducer. He further argued that a cascade of rewrite rules can be compiled into one single FST that performs the same operation.

Cascade: Lexicon string → Rule FST 1 → Intermediate string → Rule FST 2 → Intermediate string → ... → Rule FST n → Surface string

Compiled: Lexicon string → Single Rule FST → Surface string


Cascade of rules compiled into one finite state transducer
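The idea of a cascade can be illustrated with ordinary Python functions. This is only an analogy: plain function composition stands in for transducer composition, so intermediate strings are still computed at run time, whereas a real rule compiler would produce one network that needs none. Both rules below are hypothetical and are chosen to reproduce the "cause+ed" to "caused" example from earlier.

```python
import re
from functools import reduce


def e_deletion(s):
    """Hypothetical rule 1: drop a stem-final "e" before a suffix starting with "e"."""
    return re.sub(r"e\+e", "+e", s)


def boundary_removal(s):
    """Hypothetical rule 2: erase the morpheme boundary symbol."""
    return s.replace("+", "")


def cascade(rules):
    """Combine an ordered list of rules into one callable."""
    return lambda s: reduce(lambda acc, rule: rule(acc), rules, s)


single_rule = cascade([e_deletion, boundary_removal])

intermediate = e_deletion("cause+ed")   # "caus+ed"
surface = single_rule("cause+ed")       # "caused"
```

The order of the rules matters: if the boundary were removed first, the e-deletion rule would no longer find its context.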

3.2 Understanding the Tamil Language

As the morphological analyzer handles both nouns and verbs, each is analyzed separately in this section under three major topics: the suffix categories, the noun/verb categories and, finally, the orthographic rules for each of these categories.

3.2.1 Tamil Nouns

Here the Noun suffix categories, the Noun categories identified and the orthographic rules for each of these noun categories will be discussed.


Tamil literature books [22] define eight different suffix classes that a noun can take in a sentence. Each class consists of one suffix, more than one suffix or no suffix at all. They are discussed below.

* marks the noun in focus.

a* marks alternative suffixes, which, unlike traditional suffixes, can stand alone in a sentence.

Type 1:

No suffixes, where the noun is the subject of the sentence.

E.g: ரவி* வாசிக்கின்றான். - Ravi va:sikkindra:n.

Ravi is reading.

Type 2:

"ஐ" (ai) suffix where the noun becomes the object.

E.g: ரவி புத்தகத்தை* வாசிக்கின்றான். - Ravi puththagaththai va:sikkindra:n.

Ravi is reading the book.

Type 3:

Four suffixes are used here to give the meanings "with" and "by". They are,

"ஆல் " -(a:l) -by

"ஆன் " -(a:n) -not in use

"ஒடு " -(odu) -not in use

"ஓடு " -(o:du) -with


1. ரவியால்* புத்தகம் வாசிக்கப்பட்டது. - Raviya:l puththagam va:sikkappattadhu.

The book was read by Ravi.

2. ரவி மாலாவோடு* சென்றான். - Ravi Malavo:du sendra:n.

Ravi went with Mala.

Type 4:

Here the suffix used is "கு" (Ku), which gives the meaning "to", "for".

E.g: ரவிக்கு* புத்தகத்தை கொடு. - Ravikku puththagaththai(type2) kodu.

Give the book to Ravi.

However, it may also appear as the allomorphs "க்கு" (ikku), "உக்கு" (ukku) and "அக்கு" (akku).

The "கு " (ku) suffix is used to express several grammatical senses in Tamil: friendship, envy, ability, reasoning and relationship.

The suffix "ஆக " a* (a:ga) can also be used in this type to convey the sense of "reasoning".

E.g: கூலிக்கு* வேலை செய்தான். - kulikku velai seidha:n.


கூலிக்காக* வேலை செய்தான்.- kulikka:ga velai seidha:n.

He worked for wages.

Type 5:

Suffixes: "இல்" (il) and "இன்" (in), which are used to convey the meanings "off the", "limitation", "reasoning" and "comparison".


1. தலையின்* விழுந்த மயிர். - Thalaiyin vilundha mayir.

The hair that fell off the head. - off the

2. இலங்கையின்* வடக்கே இந்தியா உள்ளது. - ilangayin vadakke indhiya ulladhu.

India is situated north of Sri Lanka. - limitation

3. மன்னனிற்* கற்றோன் சிறப்புடையோன். - Mannanitr katro:n sirappudaiyon.

A learned person is greater (more special/sublime) than a king. - comparison

4. கல்வியிற்* பெரியோன் கம்பன். - Kalviyitr periyon Kamban.

However, to convey the meanings "off the" and "limitation", the suffix "இருந்து " a* (irundhu) can also be used, for both animate and inanimate nouns.

Type 6:

"அது" (adhu), "ஆது" (a:dhu), "அ" (a) are used as suffixes. "ஆது " (a:dhu) and "அ" (a) are not used these days. "அது" (adhu) gives the meaning of "of the". The suffix "உடைய" a* (udaya) can also be used for this purpose.

E.g: புத்தகத்தினுடைய* மட்டை. - Puththagathinudaiya mattai.


புத்தகத்தின் உடைய a* மட்டை. - Puththagathin udaiya mattai.

The cover of the book….

Type 7:

Suffixes: "கண்" (kan), "இல்" (il), "உள்" (ul) and "இடம்" (edam), which are used to convey the meanings "in between" and "inside".

E.g: மாணவர்களுள்* சிறந்தவன் ரவி. - Ma:navargalul sirandhavan Ravi.

Ravi is the best student among (in between) them.

Type 8:

"ஏ" (e:), "ஓ" (o:) suffixes are used for exclamation purposes.


1. மகனே* வா. - Magane va:

Son, come here.

2. அப்பனோ* உண்ணாய். - Appano: unnai.
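The eight suffix classes described above can be collected into a lookup table and used for a naive suffix match on the transliterated examples. This is a rough sketch under strong assumptions: it works on the Latin transliterations used in this section rather than on Tamil script, it ignores orthographic changes and allomorphs, it leaves out the bare "a" suffix of Type 6, and a stem that merely happens to end in the same letters as a suffix will be reported as well. The names `SUFFIX_CATEGORIES` and `candidate_categories` are made up for this example.

```python
# Transliterated suffixes per category, taken from the Type 1-8 descriptions.
# Category 1 (nominative) has no suffix and is therefore not listed.
SUFFIX_CATEGORIES = {
    2: ["ai"],
    3: ["a:l", "a:n", "odu", "o:du"],
    4: ["ku", "a:ga"],
    5: ["il", "in", "irundhu"],
    6: ["adhu", "a:dhu", "udaya"],
    7: ["kan", "il", "ul", "edam"],
    8: ["e:", "o:"],
}


def candidate_categories(word):
    """Return (category, suffix) pairs whose suffix matches the end of the word."""
    hits = []
    for category, suffixes in SUFFIX_CATEGORIES.items():
        for suffix in suffixes:
            # require a non-empty remainder so a bare suffix is not a word
            if word.endswith(suffix) and len(word) > len(suffix):
                hits.append((category, suffix))
    return hits


by_ravi = candidate_categories("Raviya:l")          # [(3, "a:l")]
the_book = candidate_categories("puththagaththai")  # [(2, "ai")]
```

Because "il" belongs to both Type 5 and Type 7, a word ending in "il" yields two candidates; resolving such cases needs the sentence context, which a suffix match alone cannot supply.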

These suffixes and their categories can be summarized as below,

Category: Suf 1, Suf 2, Suf 3, Suf 4

Category 1: No suffixes (Nominative)

Category 2: "ஐ" (ai)

Category 3: "ஆல்" (a:l), "ஆன்" (a:n), "ஒடு" (odu), "ஓடு" (o:du)

Category 4: "கு" (ku), "ஆக" (a:ga)

Category 5: "இல்" (il), "இன்" (in), "இருந்து" (irundhu)

Category 6: "அது" (adhu), "ஆது" (a:dhu), "உடைய" (udaya)

Category 7: "கண்" (kan), "இல்" (il), "உள்" (ul), "இடம்" (edam)

Category 8: "ஏ" (e:), "ஓ" (o:)

Noun Stem Categories

Noun stems in Tamil language can be divided based on the phonological ending of those stems. They can be divided into six categories as mentioned below.


Classification according to

Type 1: "m" ending

Type 2: "i", "i:", "ai" endings

Type 3: "l", "L", "n", "N" endings - Consonant doubling

Type 4: "du", "Ru" endings - Consonant doubling

Type 5: "a:", "u", "u:", "o" endings

Type 6: "u" endings