Improving word coverage using Unsupervised Morphological Analyzer

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract Powerful computers are needed for processing tasks related to human languages these days. Human languages, also called natural languages, are highly versatile systems of encoding information and can capture information of various domains. To enable a computer to process information in human languages, the language needs to be appropriately "described" to the computer, i.e., the language needs to be "modeled". In this present work we present an approach for acquisition of morphology of inflectional language like Hindi. It is an unsupervised learning approach, suitable for languages with a rich concatenative morphology. Broadly, our work is carried in three steps 1.Acquire the morphology of Hindi from a raw (un annotated) CIIL Mysore text corpus, 2. prepare clusters and prepare stem bag and suffix bag, 3. use the morphological knowledge to decompose given word as stems and suffixes according to their morphological behavior and add new words. A prime motivation behind this work is to eventually develop an unsupervised morphological analyzer which is language independent (used for Hindi). Second motivation is to develop a Morphological segmentation which is language independent as it is shown that study of morphology would benefit to a range of NLP tasks such as Speech recognition, speech synthesis, machine translation and information retrieval.

Though Hindi is an important and a national language in India, little computational work has been done so far in this direction. Our work is one of the first efforts in this regard and can be considered pioneering. There are many such languages for which it is very important to have a suitable but inexpensive computational acquisition process. Languages receive very little attention of computational linguistic research both in terms of availability of funds and number of researchers. We however do not claim that our approach is a solution for all such languages. Different languages have characteristics that require individual research attention.

Keywords. Human languages; unsupervised morphological analyzer; Clustering, Morphological segmentation, linguistic research.

1. Introduction

Morphological analysis is an integral part of language processing projects such as text-to-speech synthesis, information extraction, syllable identification or machine translation. Tasks include the separation of stems and affixes (prefixes, suffixes, infixes, crucifixes) and the identification of inflectional and derivational processes. These processes may be productive (i.e. apply to new words entering a language) and may also be combined (especially in agglutinative languages like Turkish) making an enumeration of morphological forms unfeasible as described in Jurafsky et al., (2000).

Unsupervised approaches to MA are important for less studied (and corpus-poor) languages, where we have small or no machine-readable dictionaries and tools. Ideally, an unsupervised morphological analyzer (UMA) would learn how to analyze a language just by looking at a large text in that language, without any additional resources, not even mentioning an expert or speaker of the language. The advantages of unsupervised approach for morphological analysis have been stressed by Hammarstrom et al (2006) and Goldsmith (2001). These advantages include elegance, economy, time and money, no additional resources, employment of same technology for new languages, appropriateness for languages with fewer resources.

There is a body of related work that grows faster and faster as briefed in Déjean (1998) first induces a list of 100 most frequent morphemes and then uses those morphemes for word segmentation. His approach is thus not fully unsupervised. Keshava & Pitler (2006) combine the ideas of Déjean (1998). On the Morpho Challenge 2005 datasets, they achieved the best result for English, but they did remarkably worse for Finnish and Turkish. Other UMA learning algorithms exploit the Minimum Description Length (MDL) principle Mathias Creutz & Krista Lagus (2002). Specifically, EM is used to iteratively segment a list of words using some pre-defined heuristics until the length of the morphological grammar converges to a minimum. Brent et al (1995) were the first to introduce an information theoretic notion of compression to represent the MDL framework. Goldsmith (2001) also used an MDL-based approach but applied a new compression system with different measuring of the length of the grammar. Akshar Bharathi (2001) uses a concept of 'observational paradigm' which considers the frequency of occurrence of word forms in a raw corpus. It generates its features-structure set to obtain a suitable vector and compares it with reference vector to find most likely stem paradigm. There is a chance of misclassification for low frequency words and rare word forms. Creutz (2003) uses probabilistic distribution of morpheme length and frequency to rank-induced morphemes. He outperforms Goldsmith for Finnish but gets worse results for English.

Various approaches for unsupervised extraction of stems and suffixes have been reported for English, Assamese Utpal Sharma (2006), Dutch, German and Finnish among others. For the best of our knowledge, this is the first time such approach has been employed on extracting Hindi stems and suffixes. Our approach presents a simple algorithm for unsupervised learning of morphological analysis for inflectionally rich languages like Hindi, given a low coverage morph and a corpus of raw text. It assumes no particular theoretical model of morph, but can used for any language.

We present a brief survey on acquisition of morphology from text corpus, our methodology for extracting valid stems and suffixes from a given text corpus. Technique used for generating new words based on probabilistic models as in Jhon Chen et al (2000). Apart from the assumptions given by Hammarstrom, our methodology further assumes words constitute only of stems and suffixes, maximum length of a suffix is 8 and word can consist of only a single suffix. The experiments were conducted on CII Hindi corpus and the corresponding results are briefed in next section. Last section consist conclusion and future scope.

2. Steps in processing of raw text

The details of the complete system is shown in the following figure 1.

2.1 Step -1 Acquisition of Morphology from a raw text corpus.

For acquiring morphology of a language from an unannotated text corpus we perform a surface level analysis of the corpus. An unannotated text corpus presents two kinds of information about the language- first, the lexical space of the language, i.e., of the infinite possible letter sequences, the ones that form valid words in the language, and second, the morphological phenomena, i.e., the noticeable similarity in the structure of groups of words. In case of inflectional languages like Assamese, Hindi the predominant morphological phenomenon is suffixation. We model the morphology acquired through analysis of the input training corpus, in the form of a collection of suffixes and the criteria for identifying the presence of these suffixes in different words. This knowledge of morphology is used in building a lexicon that is compact as well as provides more insight about words than a plain listing of the words encountered does. The morphological model and the lexicon can be subsequently used for morphological analysis of words in texts.

Figure 1. Proposed system

Our first task is to identify the underlying suffixes in the language. Suppose, SC is the set of suffixes identified by the computational process, and SL is the set of suffixes that are actually there in the language. The ideal goal of the morphology acquisition process is to have SC be the same as SL. However, due to the constraints on the available evidence and the methods applied, SL is usually not the same as SC. Letter strings that are not really suffixes are identified as ones, while several valid suffixes are left unidentified. A morphology acquisition method is useful only if the SC obtained is a close approximation of an underlying SL. Similar issues arises in the next task of our approach.

The next task is to build a lexicon from a corpus- possibly, the same training corpus. We use the suffixes acquired to decompose the words in the corpus. Simply, looking for matching of the suffixes at the end portion of words leads to a large number of invalid decompositions. Methods discussed in John Goldsmith (2001) and Gaussier (1999) are representative of reported approaches to tackle the problem. In our approach, we apply heuristics based on statistics as well as other language specific and script specific aspects. We find that in case of Hindi, and possibly in other languages with similar features, our approach produces better results. We first briefly discuss the two approaches proposed by Gaussier and Goldsmith for identification of suffixes from a text corpus.

2.1a Gaussier's Approach

Gaussier(1999) presents a method to acquire the suffixes used in a raw text corpus. In the method, a pair of decompositions using a common base is obtained for words Wi and Wj in the input corpus such that both are formed with common base or stem β1 and suffixes α1 and α2, where β1 is a accepted word in the corpus. The morphological extensions, α1 and α2, together referred to as a pseudo-suffix pair, are accepted as valid, if for some words Wk and Wl in the input, can be decomposed with valid stem β2 and suffixes α1 and α2.

To ensure the valid morphological extensions the criteria used are:

Minimum base length, p = 5.

Minimum morphological extension co-occurrence, c = 2, i.e., two suffixes occurring with the same base are considered as a pair. In other words, each base must have occurred with at least two morphological extensions3.

Minimum morphological extension occurrence, f = 2, i.e., each morphological extension must occur with at least two bases.

2.1b Goldsmith's approach

Some unsupervised morphology acquisition methods Matthew et al (2002) are based on probabilistic models. A particularly interesting approach which can be seen as a special case of probabilistic idea is presented by Goldsmith (2001). It is based on the Minimum Description Length (MDL) concept. The main intuition in this concept is that if all the morphemes, which are the basic elements of all words, involved in an input are assigned distinct numeric values in the smallest possible number space, then the input can be represented as a sequence of these numbers. Identification of morphemes can be guided by the goal of minimizing the length of the representation of the input corpus, which depends on the total number of morphemes as well as the representation lengths of the individual morphemes in number of bits. The implementation of this approach is available as free downloadable software called Linguistica. The information of Morphemes would help in modeling the language at word level for improving the speech recognition system as discussed in Saraswathi et al (2007).

2.1c Our approach for morphology acquisition : To discover the set of suffixes in Hindi from a raw text corpus, our first step is somewhat similar to Gaussier's approach We obtain all the decompositions in each of which a word, W1 : of the corpus is obtained by appending a string of letters α to another word, W2, of the corpus. That is

W1 = W2 + α

We refer to this exercise as initial decomposition. The idea is that since α occurs as a suffix for other word in the corpus. Hence, it can be the valid suffix for some decomposition. Since word W1 has W2 as its leading portion, it is likely that W1 is derived from W2 with α as a morphological extension. The method of Gaussier ( 1999) detects a morphological extension, xs , if at least two bases that occur with xs also occur with some other, possibly NULL, suffix ys. i.e., if the words axs , ays , bxs and bys occur in the corpus, we get the pseudo-suffix pair (x, y)s . In our initial decomposition we miss a suffix xs even if it occurs adequate number of times unless the corresponding bases also occur independently. In Hindi this does not adversely affect the recall since given a good corpus size, stems do occur without the suffixes, too.

Suppose, the set of words in the input corpus is w. We find the set of initial decompositions, D, as

D : {[w = b + x] | w = bx, and w, b Є W}.

The set of morphological extensions obtained is

E : {x | [w = b + x] Є D}.

A few sample decompositions are shown below these are samples of Hindi transcribed text.

acCa = acC + a (अच्छ = अच्छ् + अ)

acCA = acC + A (अच्छा = अच्छ् + आ)

acCA = acCA + @ (अच्छा = अच्छा + @)

acCAaba acC + Aaba (अच्छाअब = अच्छ् + आअब)

2.1d Algorithm for implementation

Read text corpus and clean the data

ex: In Hindi transciption, some English text is there under header < > Remove that.

Create unique word list (L) from Corpus

Generate all possible suffixes and stems using following procedure:

For each unique word (w) in List (L)

(a)Decompose (w) into possible stems and suffixes assuming maximum length of a suffix is maxSuffixLength





























Note: In this algorithm, maxSuffixLength was taken as 8 and min. length of stem is 2. Examples are quoted in English

For each stem, corresponding suffixes were identified ie. bag of suffixes for each stem using the following procedure.

for each generated stem (S)


if (S) is not in the (stemList) append(S) into (stemList)

if (S) is in the (stemList)

if corresponding (suffix) is not in the (suffixesList) of particular stem (S) append (suffix) in to the (suffixList) of particular stem (S)


2.2 Step -2: Clustering stems and suffixes.

Clustering algorithms divide data into meaningful or useful groups, called clusters, such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. These discovered clusters can be used to explain the characteristics of the underlying data distribution and thus serve as the foundation for various data mining and analysis techniques. The applications of clustering include characterization of different customer groups based upon purchasing patterns, categorization of documents on the World Wide Web, grouping of genes and proteins that have similar functionality, grouping of spatial locations prone to earth quakes from seismological data, etc.

CLUTO is a software package for clustering low and high dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO provides three different classes of clustering algorithms that operate either directly in the object's feature space or in the object's similarity space. These algorithms are based on the partition, agglomerative, and graph partitioning paradigms. A key feature in most of CLUTO's clustering algorithms is that they treat the clustering problem as an optimization process which seeks to maximize or minimize a particular clustering criterion function defined either globally or locally over the entire clustering solution space. CLUTO provides a total of seven different criterion functions that can be used to drive both partitioned and agglomerative clustering algorithms, that are described and analyzed in Zhao & Karypis(2001, 2002). Most of these criterion functions have been shown to produce high quality clustering solutions in high dimensional datasets, especially those arising in document clustering. In addition to these criterion functions, CLUTO provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) that can be used in the context of agglomerative clustering. Furthermore, CLUTO provides graph-partitioning-based clustering algorithms that are well-suited for finding clusters that form contiguous regions that span different dimensions of the underlying feature space.

Our approach uses partition algorithm where the number of partitions or clusters are specified. In our experiments we choose number of clusters as 25, 50, 60, 70, 75, and 100 for small corpus to make a study on how language generation gets effected. For corpus of size 40K we used 1000 clusters to generate large vocabulary.

To use CLUTO tool the data should be converted to matrix form which is done by using doc2mat tool is used

2.2a Algorithm for step two :

(i) Drop all the stems that occur below a certain frequency in the entire corpus.

(ii) Drop all the suffixes that occur below a certain frequency in the entire corpus.

(iii) Generate vector matrix file for the pruned data

* Note: DOC2MAT tool was modified to retain the cases of suffixes unchanged (usually DOC2MAT changes all words in to small case)

(iv) Clustering is carried out by using CLUTO tool, with matrix file as input. By analyzing the output of Cluto tool, different clusters of stems are identified (results in table 1).

Stem bag and suffix bag is prepared as shown in the figure 2 where the input words are printing, walking, walked, prints, wished watched, watches.


Corpus of size 2000 words


1-2 million words

Number of input lines



Number of input words



Number of distinct input words

(including hyphenated words)



Number of stems identified



Number of unique stems identified



Number of suffixes identified



Number of unique suffixes identified



Number used for clustering



Number of stems clustered



Number of stems not clustered



Table 1: Statistical details of corpus of size 2000 words and 1-2 million words.

2.3 Step -3 :Technique to improve word list.

Words are generated by combining the stems belonging to a particular stem bag with the corresponding suffix bag. These words are treated as the training data for the next set of new samples.

The main advantage of generating the new words is that as the languages keep growing in the vocabulary size this would help in identifying the new set of words that get generated.

Figure 2: Example of words represented as stems and suffix.

Figure 3: Example of English words generated

Disadvantage is that it generates many words that are not valid. This can be controlled by having the control on the word generation. The best method is while forming the new word selecting a stem from stem bag and a suffix from corresponding suffix bag choose only those stems and suffixes that have a frequency count that is more than a threshold value. The best threshold value to be chosen is 4 or 5.

2.3a Algorithm for language generation:

(i) Group the suffixes of the same cluster.

(ii) Generate wordlist.

(iii) Comparison of unsupervised method with the rule based Hindi Analyzer tool is performed.

3. Testing and Results

The experiments were conducted on raw CII Hindi Corpus, collected from Indian Institute of Information Technology (IIIT) Hyderabad. The study was made in various directions to compare the results our system developed with the out put of Hindi Analyzer developed by IIIT Hyderabad.

3.1 Comparison of rule based morphological analyzer with unsupervised morphological analyzer

3.1.a Calculation of recall precision and F measure : For two class classification problem differential features are used which are the actual values of the different features of both the classes In case of decision variables for the two classes are chosen symmetrically around zero and the parameters are estimated from training data. A test sample can then be classified as belonging to class c1 or c2 depending upon whether the value of the decision variable is positive or negative.

μ- Number of words in the given test data.

α - Number of stems identified by the Rule based Hindi analyzer.

β - Number of stems identified by the Unsupervised Hindi analyzer, which are also identified by Rule based HA.

γ - Number of stems not identified by the Unsupervised Hindi analyzer, which are identified by Rule based HA.

δ - Number of stems identified by Hindi Analyzer that are in modified form.

η - Number of stems identified by Hindi Analyzer which are not in modified form

η = α - δ

recall(R) = β / η * 100

precision(P) = β / (α - γ ) * 100

F -measure = 2*P*R /(P + R)

The results are as follows,

Description (2000 word corpus)



Total no of words in input



Number of stems that are identified by HA



Number of stems not identified by Hindi analyzer



Number of stems identified by UMA



Number of stems that were not identified by UMA which are identified by HA



Number of stems in modified form



Description (1-2 million word corpus)



Total no of words in input



Number of stems that are identified by HA



Number of stems not identified by Hindi analyzer



Number of stems identified by UMA



Number of stems that were not identified by UMA which are identified by HA



Number of stems in modified form



Table 2: Details of stems identified

Treating only those words that are recognized by the rule-based Hindi analyzer











small data










large data










3.2 Generation of new words with limited words set.

Words are generated by combining the stems belonging to a particular stem bag with the corresponding suffix bag. These words are treated as the training data for the next set of new samples.

Total number of words used by UMA (A1)= 940

Total number of words generated by UMA (A2)= 43343

Test data size (A3) = 40119

Test data words that are present in the A2 = 1191

Test data that has common words with small set (A3) = 741

New words that are present in the generated set = 1191 -741 = 450

Comparisons of different cluster numbers

A0 - Clustering number used.

A1 - Number of words analyzed by UMA.

A2 - Number of words generated by UMA.

A4 - Percentage coverage of stems.

A3 - Test data size.

A5 - Total words that are covered in the test data.

A6 - Test data that has common words with small set.

A7 - New words that are present in the generated set. A7 = A5 - A6









































On observing the results it is clear that recognition of stems does not depend on the number of clusters used. The word coverage depends on the cluster number used. If the low number is used then it results in generating large number of new words which would improve the range of coverage. If cluster number id high, then it results in generating less number of new words ultimately affecting the coverage of new words.

3.3 Extracting the word decomposition with the possible weight.

It is observed that very few words had multiple decompositions indicating the ambiguity. To resolve this we have assigned the weight, to decomposition based upon the following criteria.

Probability of decomposition is calculated as


where W1 W2 W3 are assumed to be 1 initially.

P(stem)=frequency of stem / total number of stems.

P(suff)=frequency of suffix /total number of suffixes

p(stem,suff)=frequency of the given stem for the given suffix/total number of stems and suffixes.

It is observed that the decomposition that has higher value is the a valid decomposition.


Actual word


Possibilities given by UMA





alaga + @ (अलà¤-+@)

ala + ga (अल + à¤-)



ammAz (अम्माँ)

ammAz (अम्माँ)

ammAz + @ ((अम्माँ+@)


AzsU (आँसू)

AzsU (आँसू)

AzsU + @ (आँसू +@)

Azs + U (आँस् + ऊ)







barawAva + @ (बरताव + @)

bara + wAva (बर + ताव)



kahIM (कहीं)

kahIM (कहीं)

kahIM + @ (कहीं + @)

ka + hIM (क + हीं)





laga +@ (लà¤- +@)

lag + a (लà¤-् + अ)



3.4 Stem cluster and suffix cluster.

We have used CLUTO clustering tool for clustering the stems that have common suffixes.


stem bag

suffix bag

Words formed

bulAneh, xewA, WA, sUJA, rahe, nahIMhE, karUz, karoge, jagAyA, jAegA, hUz, hE


bulAnehariXana, xewAariXana, WAariXana, sUJAariXana, raheariXana, nahIMhEariXana, karUzariXana, karogeariXana, jagAyAariXana, jAegAariXana, hUzariXana, hEariXana

aKwiy, yahAzwum, ya, xinoM, wum, wowum, wakana, wakagA, un, samaJowum, Ora, na, kI, ja, iwa, in, gAli, BIun, BIna, BE, bedZi, be, bA, Ba, apanI, am

ArA, ArI, Are, Az, AzKAne, Azjo, Azwuma, I, IM, IMAwI, IMhE, IMliyA, a, eM

naArA, naArI, naAre, naAz, naAzKAne, naAzjo, naAzwuma, naI, naIM, naIMAwI, naIMhE, naIMliyA, naa, name

BaArA, BaArI, BaAre, BaAz, BaAzKAne, BaAzjo, BaAzwuma, BaI, BaIM, BaIMAwI, BaIMhE, BaIMliyA, Baa, BaeM………..

acCA, xinaus, xevaw, Xa, warahaus, us, un, sAs, sab, rUpayo, rAtiyo, rah, neun, me, logaus, lap, koIus, kis, KAw, kareus, jIwum, jis, jEseus, in, hi, hEus, hEab, ha, GudZ, cO, ba, bacc, Ap,

WA, jAwe, jI, kA, kAapanA, kI, kIsAsa, ke, kepAzva, kiyoM, ko, koTarI, logoM, ne, sahasA, se,

acCAWA, acCAjAwe, acCAjI, acCAkA, acCAkAapanA, acCAkI, acCAkIsAsa, acCAke, acCAkepAzva, acCAkiyoM, acCAko, acCAkoTarI, acCAlogoM, acCAne, acCAsahasA, acCAse,

baccWA, baccjAwe, baccjI, bacckA, bacckAapanA, bacckI, bacckIsAsa, baccke, bacckepAzva, bacckiyoM, baccko, bacckoTarI, bacclogoM, baccne, baccsahasA, baccse,

3.5 Failure of UMA with respect to HA

This system could not recognize those stems which were in modified form because of sandhi rules in the language.

Actual word

Stem by Hindi Analyzer

Stem given by UMA

Probability Value





ladZakara + @







ladZake + @







lambe + @





jA जा

gaye + @

à¤-ये +@




jA जा

gayI + @

à¤-यी +@


4. Conclusion

Morphological analysis is a very significant step of NLP for highly inflectional languages such as Hindi. Morphology and syntax are two complementary parts of the structural aspects of natural language expression. Because of the structural nature of morphology, simple computational methods can serve as the initial steps for acquisition of morphology of a language and morphological analysis. We have been largely successful in computational acquisition of the morphology of Hindi using unsupervised method.

We have identified and tackled several issues inherent in using a raw text corpus for acquisition of morphology of a language. We have also dealt with language specific and script specific issues in the process.

Our work is particularly significant because apart from acquisition of morphology we could focus on the natural language generation and also on the possible decompositions of the given word with probability of that occurrence.

The Unsupervised Morphological analyzer provided by our method can be directly used for processing of different languages of similar behavior. This system can be improved by making it semi supervised. That is adding the learning algorithms which could learn the sandhi rules based on the output generated by the Rule based HA and applying these rules while decomposing. If such strategy is used then by having small data we can generate large vocabulary and also improve the performance of morph analyzer.

We are grateful to Mr.Srinivas Bangalore of AT&T labs, Mr Sriram Research scholar from the LTRC group of IIIT and Prof. Rajeev Sangal Director of IIIT Hyderabad who gave us good suggestions through Natural Language Processing(NLP) Winter school organized at IIIT Hyderabad which is sponsored by Tata Consultance Service(TCS). We also thank Central institute of Indian languages(CIIL), Mysore for providing text corpus.