Take All Splits Heuristic Computer Science Essay


The term conflation denotes the act of mapping variants of a word to a single term or stem. Stemmers are used to convert inflected words into their root or stem; the stem does not necessarily correspond to the linguistic root of the word. Stemming improves IR performance because morphological variants of a word are reduced to a single form. This paper presents an algorithm for developing an unsupervised Telugu stemmer. The take-all-splits heuristic proposed in this paper aims at the development of stemmers for inflectional languages such as Telugu. Between 1000 and 2000 randomly extracted words are considered and stemming is applied to them. The training data are drawn from a large corpus containing 106,400 words. The corpus is used to generate suffixes, and these suffixes are used at the time of actual stemming of Telugu queries. Accuracy, F-score, precision and recall are evaluated. As the algorithm does not require any language-specific information, it can be applied to other Indian languages as well. The effect of the stemmer is also evaluated in terms of the reduction in index size for a Telugu information retrieval task. The results are compared with existing rule-based Telugu stemmers, and the test runs may show that this stemmer outperforms the existing Telugu stemmer.


Telugu is an Indian language spoken by over 50 million people in the country. The language is rich in literature and has been studied by native and foreign linguists significantly, yet it has not benefited significantly from the recent advances in computational approaches for linguistic or statistical processing of natural language texts. Andhra Pradesh state, where Telugu is spoken, shares borders with 5 different states which speak Tamil, Kannada, Marathi, Hindi and Oriya. Thus, in regions along the borders with these states, the dialect of Telugu is different, although the script and formal (written) language are the same. It has recorded origins around 7th century AD and became a literary language around 11th century AD.

However with the recent progress in standardization of machine representation of text, applications like machine translation and information retrieval are beginning to surface. Multilingual applications such as machine translation, multilingual book reader with transliteration and translation capabilities, and Indian language search engine have all been built over the transliteration tools and Indian language data generated by DLI (Digital Library of India).

An integral part of many applications based on natural language processing is the morphological analyzer/generator. Although the Telugu language has ancient origins, is spoken today by a very large number of people, and is spoken in a state with significant advancements in information technology, few advances have been made in computational linguistic processing or computational natural language processing for this language.

An information retrieval (IR) system, which manages text resources, processes words to extract and assign content-descriptive index terms to documents or queries. In naturally spoken or written language, words appear in many morphological variants even when they refer to a common concept. Stemming therefore has to be performed so that words that are morphological variants of one another can be grouped together, indicating a similar meaning. Most stemming algorithms reduce word forms to an approximation of a common morphological root, called the "stem", so that a relevant document can be retrieved even if its word forms differ morphologically from those of a given query. It is interesting to note that a stemming algorithm is a special case of query expansion, since it provides searchers a way to expand the set of query terms with their morphological variants.

Speech recognition and speech synthesis also need a powerful stemmer to give stems and derivations. Stemmers can improve the retrieval effectiveness of information retrieval systems; however, designing and implementing stemmers requires a laborious amount of effort, because documents and queries are often written or spoken in several different languages. No standard Telugu stemmer is available, which is why there is a need to develop one.


Strong (or aggressive) stemmers may reduce the size of the index for a given corpus drastically. However, stemming is afflicted with two kinds of errors: understemming and overstemming. Understemming is the case where words that should have been grouped into the same class are not so, and the performance is suboptimal. When too many unrelated words are merged together, then it is the case of overstemming. This leads to a reduction in precision during retrieval and an increase in the error rate for classification. One may also note that with increase in the strength of a stemmer, recall is increased, but either retrieval precision or classification accuracy (or both) is degraded. This may be concluded by observing that increase in stemming strength gradually leads to a reduced number of stem classes, and in an extreme case, all the words belong to a single stem class, which provides none of the information required for classification or retrieval.

Thus, the general objective in designing a stemmer is to ensure that classification accuracy and retrieval precision are maintained. In this paper, we describe the design of such a stemmer.

Our approach is partly in line with Goldsmith's approach. It is based on the take-all-splits method, which uses a Boltzmann distribution. The work presented in this paper focuses on the development of an unsupervised Telugu stemmer to be used in a retrieval task. This work is part of our ongoing work on cross-lingual information retrieval (CLIR).


Consider a corpus of words. A heuristic called the take-all-splits heuristic is applied; it considers all cuts of a word of length l into stem+suffix w1,i + wi+1,l, where 1 ≤ i < l.

The function H(.) assigns a value to a split of a word w of length l into w1,i + wi+1,l. H does not itself define a proper distribution; we use it to assign a probability to the cut of w into w1,i + wi+1,l via a Boltzmann distribution. Clearly the effect of this model is to encourage splits containing relatively long suffixes and stems.

H(w1,i + wi+1,l) = -(i log freq (stem = w1,i) + (l - i) log freq (suffix = wi+1,l))

prob(w1,i + wi+1,l) = exp(-H(w1,i + wi+1,l)) / Σj=1..n exp(-H(w1,j + wj+1,l))

Where n is the number of splits of the word (n = l - 1).

For each word, the best parse is considered, that is, the parse rated best by the H-function (the one with the highest probability under the Boltzmann distribution). An initial split of all words into stem plus suffix is generated, and the process is then iterated until no word changes its optimal parse, which empirically takes fewer than five iterations on the entire lexicon.
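To make the scoring concrete, the following is a minimal Python sketch of our own (not the original implementation). It picks the cut maximizing i·log freq(stem) + (l−i)·log freq(suffix), which is the same as minimizing the energy H above; `stem_freq` and `suffix_freq` are assumed to be frequency tables built from the current splits of the corpus, with unseen items smoothed to a count of 1.

```python
import math

def best_split(word, stem_freq, suffix_freq):
    """Return the (stem, suffix) cut of `word` with the best rating
    under the take-all-splits heuristic, i.e. the cut that maximizes
    i*log freq(stem) + (l-i)*log freq(suffix) (minimal energy H)."""
    l = len(word)
    best, best_score = None, float("-inf")
    for i in range(1, l):                       # all cuts 1 <= i < l
        stem, suffix = word[:i], word[i:]
        # Unseen stems/suffixes are smoothed to frequency 1 (log 1 = 0).
        score = (i * math.log(stem_freq.get(stem, 1))
                 + (l - i) * math.log(suffix_freq.get(suffix, 1)))
        if score > best_score:
            best, best_score = (stem, suffix), score
    return best
```

The longer a frequent stem or suffix is, the larger its multiplier, which is exactly the length compensation discussed later in the paper.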


Each word is assigned an optimal split into stem and suffix by the initial heuristic chosen; henceforth we consider only the best parse for that word, and we retain only those stems and suffixes that were optimal for at least one word. For each stem, we make a list of the suffixes that appear with it, and we call an alphabetized list of such suffixes (separated by an arbitrary symbol, such as a period) the stem's signature; we may think of it as a mini-paradigm. Some stems will also appear as freestanding words, and so we use the word NULL to indicate a zero suffix.

Each stem is associated with exactly one signature; we will also use the term signature to refer to the set of affixes along with the associated set of stems when no ambiguity arises. We will establish a data structure of all signatures, keeping track for each signature of which stems are associated with that signature. As an initial heuristic, we will discard all signatures that are associated with only one stem and all signatures with only one suffix. The remaining signatures we shall call regular signatures, and we will call all of the suffixes that we find in them the regular suffixes. As we shall see, the regular suffixes are not quite the suffixes we would like to establish for the language, but they are a very good approximation, and constitute a good initial analysis. The nonregular signatures produced by the take-all-splits approach are typically of no interest.

The regular signatures are thus those that specify exactly the entire set of suffixes used by at least two stems in the corpus. The presence of a signature rests upon the existence of a structure as in (1), where there are at least two members present in each column, and all combinations indicated in this structure are present in the corpus, and, in addition, each stem is found with no other suffix. (This last condition does not hold for the suffixes; a suffix may well appear in other signatures, and this is the difference between stems and affixes.)




(1)   stem1     suffix1
      stem2     suffix2
      stem3


A signature Ω has both a stem count (the number of stems associated with it) and an affix count (the number of affixes it contains), and we use log (stem count) * log (affix count) as a rough guide to the centrality of a signature in the corpus.
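The signature bookkeeping described above can be sketched as follows (an illustrative sketch under our own naming: `parses` is assumed to map each word to its optimal (stem, suffix) pair, with an empty suffix standing for NULL):

```python
import math
from collections import defaultdict

def build_signatures(parses):
    """parses: dict word -> (stem, suffix); "" means the NULL suffix.
    Returns a dict mapping each signature (alphabetized suffix tuple)
    to its set of stems, keeping only the 'regular' signatures, i.e.
    those with at least two stems and at least two suffixes."""
    by_stem = defaultdict(set)
    for stem, suffix in parses.values():
        by_stem[stem].add(suffix or "NULL")
    sigs = defaultdict(set)
    for stem, suffixes in by_stem.items():
        sigs[tuple(sorted(suffixes))].add(stem)   # alphabetized list
    return {sig: stems for sig, stems in sigs.items()
            if len(stems) >= 2 and len(sig) >= 2}

def centrality(sig, stems):
    """Rough centrality guide: log(stem count) * log(affix count)."""
    return math.log(len(stems)) * math.log(len(sig))
```

Note that each stem lands in exactly one signature, while the same suffix may appear in several signatures, mirroring the stem/affix asymmetry noted above.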


Clustering algorithms divide data into meaningful or useful groups, called clusters, such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. These discovered clusters can be used to explain the characteristics of the underlying data distribution and thus serve as the foundation for various data mining and analysis techniques.

CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the resulting clusters. It uses a partitional algorithm, an iterative procedure that divides the data into two groups and repeats until the number of clusters equals the specified number. To use the CLUTO tool, the data must be converted to matrix form, which is done with the doc2mat tool.


Drop all stems and suffixes that occur fewer than 2 times in the entire corpus. Generate a vector matrix file for the pruned data. Clustering is carried out using the CLUTO tool, with the matrix file as input. By analyzing the output of the CLUTO tool, different clusters of stems are identified. These clusters are used at IR time to increase the recall rate.
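The pruning step might be sketched as follows (our own illustration; `parses` is again assumed to map words to (stem, suffix) pairs):

```python
from collections import Counter

def prune(parses):
    """Drop stems and suffixes whose corpus frequency is below 2,
    keeping only parses whose stem and suffix both survive."""
    stem_counts = Counter(stem for stem, _ in parses.values())
    suffix_counts = Counter(suffix for _, suffix in parses.values())
    return {w: (s, x) for w, (s, x) in parses.items()
            if stem_counts[s] >= 2 and suffix_counts[x] >= 2}
```

Only the surviving stems would then be written out (e.g. via doc2mat) as the matrix handed to the clustering tool.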


The output of our unsupervised stemming algorithm is the set of suffixes. Each suffix roughly captures some morphological variation. We use longest suffix matching during stemming: the longest possible suffix of the target word that matches some suffix in the list is dropped. To facilitate this, the suffix list is arranged in decreasing order of length. We start matching from the first suffix to the last; where a match is found, we segment the word into its stem and suffix.
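Longest-suffix stripping can be sketched as follows (an illustrative sketch; `suffixes` stands for the learned suffix list):

```python
def stem_word(word, suffixes):
    """Strip the longest matching suffix from `word`.  Matching is
    attempted in decreasing order of suffix length, so the longest
    possible suffix is the one that gets dropped."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[:-len(suffix)]
    return word          # no suffix matched: the word is its own stem
```

The length guard ensures a word is never reduced to an empty stem.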


Accuracy is defined as the fraction of words stemmed correctly.

Recall is defined as the ratio of the length of the stem produced by our stemmer to the length of the standard stem.

Precision is defined as the ratio of the length of the standard stem to the length of the stem produced by our stemmer.

F-score is defined as the harmonic mean of recall and precision.

Precision = 1, if length(standard stem) > length(actual stem)
          = length(standard stem) / length(actual stem), otherwise

Recall = 1, if length(standard stem) < length(actual stem)
       = length(actual stem) / length(standard stem), otherwise

Precision and recall are averaged over all test words, where m is the total number of test words.
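Under these definitions, the averaged metrics might be computed as follows (our own sketch; `pairs` holds (stem produced by the stemmer, standard stem) for the m test words):

```python
def evaluate(pairs):
    """pairs: list of (produced_stem, standard_stem) for m test words.
    Returns mean precision, mean recall, and their harmonic mean
    (F-score), per the length-ratio definitions above."""
    m = len(pairs)
    precision = recall = 0.0
    for produced, standard in pairs:
        # Precision is capped at 1 when the standard stem is longer.
        p = 1.0 if len(standard) > len(produced) else len(standard) / len(produced)
        # Recall is capped at 1 when the produced stem is longer.
        r = 1.0 if len(standard) < len(produced) else len(produced) / len(standard)
        precision += p
        recall += r
    precision /= m
    recall /= m
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

An understemmed word (produced stem too long) lowers precision, while an overstemmed word (produced stem too short) lowers recall, matching the discussion of stemmer strength above.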


A large volume of text is now available in electronic form in multiple languages, including Telugu, an Indian language spoken by more than 50 million people in the country. The language is rich in literature, yet it still requires advancements in computational approaches. In order to provide access to this information, the need for multilingual (cross-lingual) text retrieval systems is increasingly felt. This has made language processing an active area of research. Early work in this area was mainly focused on English; the past decade, however, has witnessed a proliferation of research on Asian language processing. Still, the scarce availability of tools and other lexical resources for languages other than English and the major European languages puts a major obstacle in the way of work in this direction. This is especially true for languages from the Indian subcontinent. This work focuses on the development of one such tool, namely a stemmer for the Telugu language. Almost all search and retrieval systems use a stemmer to reduce morphological variants to their stems.

Stemmers are the simplest morphological systems; they collapse morphological variants of a given word to one lemma or stem. The stem need not be a valid linguistic root of the word; it is usually enough that related words map to the same stem. For example, the words వెలుతునాం, వెలాలి, వెలాము are reduced to the same stem, వెల. Stemmers are used in text retrieval systems to reduce index size: reducing morphological variants to the same stem eliminates the need to use all variants for indexing purposes. This helps improve both recall and precision.

Telugu is an inflectionally rich language, and Telugu words may have many morphological variants. These variants are mostly generated by adding suffixes to the root word; for example, Telugu plural nouns are generated by adding suffixes such as ం, లు, రు, డు. A word may have a number of morphological variants; for example, అధికారము, అధికారాలు, అధికారి and అధికారుడు all reduce to అధికార. Similarly, the inflected words రామ, రాము, రాముడు, రాములు are reduced to the same stem రామ. Variability in Unicode encoding is another problem: both spelling variations and Unicode variations create problems in stemming, so we use the UTF-8 encoding.

Standard stemmers are available for English and other European languages. However, no such standard stemmers are available for Telugu and most other Indian languages. Two widely known stemmers for English were developed by Lovins and Porter. Both Porter's and Lovins's stemming algorithms rely on linguistic rules for stemming. It is possible to devise such rules for Telugu that would reduce an inflectional variant to its stem; however, this is a time-consuming process. Further, the effort has to be duplicated for each new language, as the rules are highly language-dependent.

A supervised approach to stemming requires a set of inflection-root pairs to learn suffixes. The training set has to be created manually, which is a time-consuming process and requires linguistic knowledge. Therefore, this paper focuses on the development of an unsupervised approach, which only requires a list of words for learning suffixes. Such a list can easily be extracted from a raw corpus.

This paper explains the design of an unsupervised Telugu stemmer. The approach uses a set of words to learn stemming rules automatically. 106,400 words are considered from a Telugu corpus. Simple heuristics are applied to avoid overstemming of Telugu words. For testing, 1000-2000 words are selected from a Telugu database. Test cases are generated, and accuracy, recall, precision and F-score are evaluated. The stemmer's performance is compared with an existing rule-based stemmer (TelMore). A second evaluation measures the performance of the stemmer on an indexing task: the percentage reduction in index size is compared with the existing stemmer. The algorithm is computationally inexpensive. Suffix lists are generated, and longest suffix stripping is used to perform stemming. Some post-processing heuristics are applied to further refine the learned suffixes.

Related work

The development and use of stemming algorithms are not new initiatives. The first ever published stemmer was written by Julie Beth Lovins in 1968. This paper was remarkable for its early date and had great influence on later work in this area. Another widely known stemmer was developed by Martin Porter. These early stemmers were rule-based, and formulating such rules for a new language is time consuming. Supervised stemming requires a set of inflection-root pairs to learn a set of string transductions, which are then used to reduce unseen inflections to their roots.

A lot of work on morphology focuses on unsupervised approaches. Brent et al. proposed an information theoretic approach based on Minimum Description Length (MDL) framework. Brent and Snover and Brent later proposed a Bayesian model for MDL for English and French. Goldsmith's work is based on minimum description length principle which prefers a model that gives rise to minimal length coding of observed data set whereas the work in Creutz uses prior distribution of morpheme length and frequency to find a good morpheme boundary. Yet another work on stemming was reported by Freitag. His work is based on automatic clustering of words using co-occurrence information. The suffixes have been identified using orthographic dissimilarity between the words in different clusters.

Earlier work on Indian language morphology includes a Telugu morphological generator, TelMore. TelMore is a rule-based approach; it builds on Brown's and Krishnamurti's analyses of Telugu.

Proposed Paradigm

The proposed paradigm broadly follows Goldsmith's approach and is based on the take-all-splits heuristic. Goldsmith's heuristic is completely unsupervised; for unsupervised learning, words from a Telugu corpus are considered. Each word is separated into all splits (i = 1, 2, 3, …, l-1), where l is the length of the word. The frequencies of the stems and suffixes of each split are then computed, and these frequencies are used to obtain the optimal split of each word; the optimal segmentation identifies the stem and suffix. Some normalization steps and benchmarks are applied to refine the learned suffixes. Fig. 1 shows the layout of our paradigm. The details of these steps are given below:

Word Segmentation

The take-all-splits heuristic is used for word segmentation. It considers all cuts of a word W of length l into stem+suffix W1,i + Wi+1,l, where 1 ≤ i < l and W is a word from the corpus; the heuristic is based on the Boltzmann distribution. The heuristic assigns a value to each split of W into W1,i + Wi+1,l. Clearly the effect of this paradigm is to encourage splits containing relatively long suffixes and stems.

For each word, the best split into stem and suffix is the parse with the highest rating under the heuristic.

As iteration proceeds, split heuristic value changes.

The frequencies of shorter stems and suffixes are very high when compared to the slightly longer stems and suffixes. Thus the multipliers i (length of stem) and l-i (length of suffix) are introduced in this heuristic in order to compensate for this disparity.

This heuristic takes all splits of a word Wl of length l into stems and suffixes as:

Wl = stem1,i + suffixi+1,l,   i = 1, 2, …, l-1

Where stem1,i refers to the prefix of the word Wl with length i, and suffixi+1,l refers to the suffix of Wl with length l-i.

For example, the word 'అధికారము' is split into every possible stem and suffix pair (అ + ధికారము, అధి + కారము, and so on), one split for each i = 1, 2, 3, …, l-1, where l is the length of the word.

Input: Corpus

Output: Suffix list

Step 1: Segment each word into all possible stem+suffix splits.

Step 2: Compute the frequencies of the stems and suffixes, and calculate the heuristic value H of each split. The split with the highest heuristic value gives the stem and suffix of the word.

Step 3: Repeat Step 2 for all words in the corpus, iterating until no word changes its optimal parse.

Step 4: Create analogous signatures. This is a two-step process: in the first step, we group suffixes sharing the same stem; in the second step, we group stems having similar suffix lists.

Step 5: Normalization.

Step 6: The suffix list is the output of Step 5.
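Steps 1-3 can be sketched as an iterative loop (our own simplified sketch: it seeds each word with a midpoint cut, uses add-one smoothing inside the logarithms, and stops when no parse changes; the actual seeding and smoothing choices may differ):

```python
import math
from collections import Counter

def train(words, max_iters=5):
    """Iterate the take-all-splits heuristic until no word changes its
    optimal parse (empirically this takes fewer than five passes)."""
    # Seed: an arbitrary initial cut (here the midpoint) for every word.
    parses = {w: (w[:max(1, len(w) // 2)], w[max(1, len(w) // 2):])
              for w in words}
    for _ in range(max_iters):
        stems = Counter(s for s, _ in parses.values())
        suffixes = Counter(x for _, x in parses.values())
        changed = False
        for w in words:
            l = len(w)
            best, best_score = parses[w], float("-inf")
            for i in range(1, l):
                # Add-one smoothing keeps log() defined for unseen pieces.
                score = (i * math.log(stems.get(w[:i], 0) + 1)
                         + (l - i) * math.log(suffixes.get(w[i:], 0) + 1))
                if score > best_score:
                    best, best_score = (w[:i], w[i:]), score
            if best != parses[w]:
                parses[w], changed = best, True
        if not changed:
            break                      # no parse changed: converged
    return parses
```

The returned parses then feed the signature-building and normalization steps (Steps 4-6).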

Optimal split

The split with the highest heuristic value is treated as the optimal segmentation of the word. This segmentation depends mainly on the frequency distribution of the stems and suffixes of the word, on the language, and on the corpus. Some restrictions are placed on stem length while considering the optimal split: the minimum length of a stem should be three. This restriction rules out overly short stems for words like khata, khas, khara, etc. The second heuristic is based on the frequency distribution of the prefixes and suffixes of the word. A rough characterization of the heuristic values of all splits is shown in Fig. 2. This figure is not a plot of actual data; it is a conceptual representation of the heuristic values of all splits of a word. As evident from Fig. 2, the heuristic value first increases and then starts decreasing. The point where the heuristic value starts decreasing, shown in bold in Fig. 2, is considered the optimal split of the word.


Analogous Signatures

After obtaining stems and suffixes, the suffixes sharing the same stem are grouped together. This process is shown below:

{stem1 : {suffix1, suffix2, …}}

Where suffix1, suffix2, … all occur with the same stem, stem1. For example, ి, ం, ము and ుడు occur with the same stem అధికార. This step yields one such group per stem.


Next, stems having the same set of suffixes are grouped. For example, if stemi and stemj both have the same set of suffixes, then we group them together as shown below:

{{stemi, stemj} : {suffix1, suffix2, …}}

Table 1 shows sample classes derived after this step. We observed some error in grouping of suffix and stem. Hence, we apply some heuristics to further improve these classes as discussed in the following section.

Table 1: Sample classes (left side of # is stems; right side of # is suffixes)

1. जावे सोचे # NULL ं ंगे गा गी

2. गाये सुने # NULL ं ंगे गा

3. अंतत मूलत # NULL ः

4. चौथा चौड़ा रंगा रोपा लंबा # NULL ई

5. उपजा टिका # NULL ऊ ने

6. जमीं ठेके # NULL दार दारी दारों

7. ईमान चौकी # NULL दार दारी

8. अछूत छोकर ठण्ड तख्त बरतन # NULL ा ी े

9. लिखत लिपट # NULL ा ी

10. अँगरेज अँगरेज़ गवाह चाकर डकैत दलाल # NULL ी ों


In order to improve the performance of our algorithm, we apply a heuristic that attempts to concatenate stems with a common prefix string of their suffixes. It is observed that in many cases the set of suffixes sharing a stem begins with a common prefix, as depicted by the following expressions:

suffixi = c1 c2 xi yi …

suffixj = c1 c2 xj yj …

suffixk = c1 c2 xk yk …

Here suffixi, suffixj and suffixk occur with the same stem, stemi, and share the prefix substring c1c2. The application of this heuristic concatenates stemi with the prefix c1c2 and modifies the suffixes accordingly by dropping the common prefix from them. For example, the suffixes यक, यकता and यकतानुसार occur with the same stem आवश् and share the prefix substring यक. The heuristic concatenates the stem आवश् with the common prefix string यक to produce the stem आवश्यक, and it drops the common prefix from the suffixes to give NULL, ता and तानुसार. Table 2 shows the final suffix list.
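This prefix-folding heuristic can be sketched as follows (our own illustration; the Latin strings in the test below are transliteration stand-ins for the Devanagari forms above):

```python
import os.path

def fold_common_prefix(stem, suffixes):
    """If all suffixes of `stem` share a common prefix, move that
    prefix onto the stem and drop it from each suffix; a suffix that
    becomes empty is recorded as NULL."""
    prefix = os.path.commonprefix(suffixes)
    if not prefix:
        return stem, suffixes           # nothing shared: leave unchanged
    return stem + prefix, [s[len(prefix):] or "NULL" for s in suffixes]
```

`os.path.commonprefix` does plain character-wise prefix matching, which is exactly what is needed here.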

Table 2: Suffix list of our stemmer (the individual suffix entries are not recoverable from the source encoding)

Stemming of unseen words

For the Hungarian language, our suggested stemmer [5] mainly involves inflectional removal (gender, number and 23 grammatical cases, as for example in "házakat" → "ház" (house)) and also some pronouns (e.g., "házamat" (my house) → "ház") and a few derivational suffixes (e.g., "temetés" (burial) → "temet" (to bury)). Because the Hungarian language uses compound constructions (e.g., "hétvége" (weekend) = "hét" (week / seven) + "vég" (end)), we increase matching possibilities between search keywords and document representations by automatically decompounding Hungarian words.

As described earlier, we use longest suffix matching during stemming, with the suffix list arranged in decreasing order of length. Fig. 3 shows sample output of our system.


Results and Discussion

Conclusions and Future Work