This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
A large volume of text is available in electronic form in multiple languages including Telugu which is an Indian language, spoken by more than 50 million people in the country. Linguist experts said that there can be so many different forms available for a single Telugu word verb. Language is very rich in literature and morphology, and it requires advancements in computational approaches. To access this information the need of multi-lingual (cross-lingual) text retrieval system is necessary. Language processing system has become an active area of research due to this. Earlier a lot of research was done on English. However, from the last decade there is sudden increment in research on Asian language processing. There are so many tools and other lexical resources available for English and other major European languages. There are no such standard tools available for Indian languages. This paper focuses on development of one such tool, namely stemmer for Telugu language.
Stemming is process that associates morphological variants of a word. Stemmers are used in query systems, indexing, web search engines, machine translation and information retrieval systems. Stemming also offers the benefits of minimizing storage requirements by eliminating redundant terms and reduces word forms to approximation of a common morphological root, called stem, so that relevant documents can be retrieved even if the morphology of their own words are different from one of the words of given query by improving recall and precision.
The stem is not necessarily a valid linguistic root of the word, it is used to map the variant forms of the same stem. For example, the words °à±†°à±°à±°°°, °à±†°°°°, °à±†°°°à± are reduced to same stem, °à±†°.
Already said Telugu words may have many morphological variants. For example, °°°°°°°°à±, °°°°°°°°°à±, °°°°°°°°, °°°°°°°à±°à±, are reduced into °°°°°°°. These variants are mostly generated by adding suffixes to the root word for example, Telugu plural nouns are generated by adding suffixes like °, °à±, °°à±, °à±. Similarly following inflected words °°°°, °°°°à±, °°°°à±°à±, °°°°à±°à± are reduced to same stem °°°°.
English and other European languages have standard stemmers but Telugu and most of the other Indian languages do not have such standard stemmers. The most widely known stemmers have been developed by Lovins and Porter for English. Both Porter's and Lovin's stemming algorithms depend on linguistic knowledge.
There are two approaches to design stemming algorithm, it may be linguistic approach which is based on prior knowledge of the morphology of the specific language or non-linguistic approach which does not need any prior knowledge of morphology. The linguistic approaches are likely to be more powerful because the quality of the morphological analysis has been guaranteed by linguistic experts but it is hard to develop and codify that type of rules if a new language is to be added to an information retrieval system. This process is a time taking process and it is not always possible to have an expert for each language. In recent times, corpus based, without any linguistic knowledge approaches have been emerged as promising alternative to traditional linguistic approaches. Therefore, this paper focuses on the development of unsupervised approach which only relies on corpus for learning powerful suffixes.
This paper explains sequence of steps needed to develop an unsupervised Telugu stemmer. Proposed paradigm uses set of words to derive powerful suffixes that are used in stemming of strange words. Telugu corpus is taken from CIIL Mysore. This Telugu corpus consists of 106400 words. Improved heuristics are applied to drop and to refine set of suffixes to avoid over stemming. In order to test the performance of proposed stemmer two sets of 1000 words are selected from Telugu daily newspapers and these words are manually segmented for creation of actual root. Two different experiments are conducted. First experiment shows the performance of proposed stemmer in terms of accuracy, precision, recall and F-score. Two test-runs are performed in the first experiment. Accuracy, F-score are observed 89.9% and 90.4% respectively. These results are compared with YASS stemmer. In the second experiment the performance of proposed stemmer is evaluated for indexing task. The percentage reduction in index size is measured by proposed stemmer and compared with reduction attained by YASS stemmer. Results show that proposed stemmer outperforms unsupervised YASS stemmer. The proposed paradigm does not rely on linguistic knowledge. Hence, this approach can be used for other languages that are morphologically rich. Proposed paradigm is language independent and computationally inexpensive. Thus this approach is known as unsupervised. The number of suffixes in the dropped list is 720 and longest suffix matching algorithm is used for stemming of strange words.
The rest of the paper is organized as follows: In the section two previous works on stemming algorithms is explained. Proposed paradigm to develop Telugu stemmer is described in section three. Section four presents experimental investigation. Section five clarifies the accomplishments. Section six concludes the paper.
From the last five decades many stemming algorithms were proposed and implemented to improve information retrieval task. First stemming algorithm was written by Julie Beth Lovins of Massachusetts Institute of Technology in1968. This is single pass, context sensitive and uses longest suffix stripping algorithm. Lovins stemmer maintains 250 suffixes. These suffixes were used in stemming of strange words. This stemmer was remarkable for its early date and had great influence on later work in this area. Later Martin Porter designed and implemented stemming algorithm for English language at the University of Cambridge in 1980. It is five step procedure, some rules were applied under each step. If word is matched with suffix rule then the condition attached to that rule is executed to get the stem. Porter stemmer is a linear process. It is widely used as it is readily available. J.L. Dawson of the Literary and Linguistics Computing Centre at Cambridge University developed one stemmer that is based on Lovins stemmer by extending the suffix list to 1200 and modifying recoding rules. Bob Krovertz developed another stemming algorithm at the University of Massachusetts, in 1993. Krovertz stemmer is based on morphology and uses dictionary.
The techniques used in these stemmers were rule based. Preparing such rules for a new language is time consuming and these techniques are not applicable to morphologically rich languages like Telugu.
Later unsupervised approaches came into picture. Brent et al proposed unsupervised morphology technique based on Minimum Description Length (MDL). This is based on Bayesian model for English and French languages. Goldsmith uses MDL framework to learn unsupervised morphological segmentation of European languages. Mathias Creutz presented language independent and unsupervised algorithm for word segmentation that is based on prior distribution of morpheme length and frequency. Freitag was proposed stemming algorithm based on automatic clustering of words using co-occurrence information.
Previous work on morphology of Indian languages includes light weight stemmer for Hindi, statistical stemmer for Hindi, unsupervised morphological analyser for Bengali, rule based Telugu morphological generator (TelMore) for Telugu.
Word segmentation heuristic is based on Goldsmith's approach. This heuristic is based on split-all method uses Boltzmann distribution. The proposed paradigm presented in this paper focuses on the development of an unsupervised Telugu stemmer to be used in Telugu information retrieval task.
with Take-All-Splits Heuristic
Procuring optimal split to produce generic stems and suffixes
Generating analogous signatures and removing irregular signatures
Normalization to enhance proposed paradigm
Removing spurious signatures to increase no. of effective suffixes
Stemming of strange words
with longest suffix matching alg.
List of powerful suffixes
Trained Telugu corpus
Fig. 1 Proposed paradigm
Proposed paradigm is moderately in Goldsmith's approach. It is based on taking all splits of each and every word from corpus. Goldsmith's heuristic does not require any linguistic knowledge thus it is completely unsupervised, for unsupervised studying, words from Telugu corpora are considered. Each word from corpus is being considered in l-1 different ways, splitting the word into stem+suffix after i letters, where 1<=i<=l-1 and l is length of the word. Then frequencies of stems and suffixes of each split of words are computed. These frequencies are used to get optimal split of word. The optimal segment unifies stem and suffix. Making analogous signatures, normalization step and removal of spurious signatures are applied to refine the learned suffixes. Fig. 1 shows the layout of proposed paradigm. The details of these steps are shown below:
Telugu corpus is given as input to this process. Corpus consists of unique words. 106400 words are considered as corpus. Take-all-splits heuristic is used for word segmentation and it uses all cuts of a word of length l into stem+suffix w1,i + wi+1,l, where w refers the word and 1≤ i < l. This heuristic assigns a value to each split of word w of length l: w1, i +wi+1, l. This heuristic takes all splits of word wl into stems and suffixes as:
Where stem1i refers the prefix of word wl with length i and suffixi+1l refers the suffix of same word with length l-i. For example, the word '°°°°°°°°à±' is split into following stem and suffix:
Next, stems and suffixes are stored in database and heuristic value of each split is calculated by using following equation:
Heuristic value at split i = i log freq (stem = w1,i) + (l - i) log freq (suffix = wi+1,l)
As the value of i changes the heuristic value also changes. Heuristic value mainly depends upon frequency of stem and suffix in corpus. Frequency of stem and suffix are varying as length of stem and suffix is increasing. The frequencies of shorter stems and suffixes are very high when compared to the slightly longer stems and suffixes. Thus the multipliers i (length of stem) and l-i (length of suffix) are introduced in this heuristic in order to compensate for this disparity.
PROCURING OPTIMAL SPLIT
The output of first heuristic is set of heuristic values for each word. The split with highest heuristic is treated to be optimal split of the word. This segmentation mainly depends upon the frequency distribution of stems and suffixes of the word, language and corpus. Some restrictions are considered while taking optimal split. First, the word length is restricted to minimum of three. From these small words like °°, °°°, °°°-, °°° cannot be split. The second restriction is based on heuristic value of each split. If any heuristic value of the split is zero it is not taken into account. From this the words with large length like °°°-°°°°°°°°à±°°°à±°à±, °°°à±°°°°à±°°°à±°°°°à±‡ are not considered.
The characterization of heuristic values of all splits of word '°°à±°°°°à±°°' is shown in Fig. 2.
C:\Documents and Settings\siva\Desktop\wifi_3.bmp
Fig. 2 Heuristic values of all splits of word
Index is considered in the x-axis and heuristic values are considered in the y-axis. By looking into graph the heuristic values of word is increased first and then decreased. The point where the heuristic value starts decreasing is considered as optimal split of the word. As the stem length increasing up to certain limit after frequency of stem starts decreasing. That is the reason heuristic value starts decreasing. This step gives the generic stems and generic suffixes of given corpus.
An optimal split is assigned to every word of corpus into generic stem and generic suffix by the above step. After getting generic stems and generic suffixes, the suffixes that are having same stem are grouped together. This is called stem's signature. This structure is shown below:
Where suffix1, suffix2, suffix3…. are having similar stem, stem1. By considering Telugu word example, °à±°°°, °°à±°à±, °°°°à± suffixes have same stem °°°°°°°°°°à±‹. By grouping suffixes with same stem is shown below:
Signature can also be referred to set of suffixes along with the associated set of stems. Stems are grouped that are having same set of suffixes. Here stem1, stem2 have same set of suffixes suffix1, suffix2, suffix3 and suffix4. The presence of signature structure is shown below:
Here there are at least two members present in each set, and all combinations present in this structure are available in the corpus and each stem is found with no other suffix but suffix may well appear in other signatures.
By considering Telugu words example, °°°°à±°à±, °°°°°°°à±°°à±Œ° and °°°°°à±€ have same set of suffixes °à±‹, °à± and °°. So by grouping set of stems that are having same set of suffixes is shown below:
All signatures are discarded that are associated with only one stem and only one suffix. These are called irregular signatures.
It is perceived that in some of analogous signatures, set of suffixes having similar stem involve some common set of prefixes. Thus to enhance the performance of proposed paradigm, another heuristic is applied. This heuristic normalizes the prefixes of suffixes that are common to same stem. This is shown by the following expressions:
suffixi = a1b1suffix1
suffixj = a1b1suffix2
suffixk = a1b1suffix3
suffixl = c1d1e1suffix4
suffixm = c1d1e1suffix5
The application of this heuristic concatenates stemi with prefixes of suffixes a1b1, c1d1e1 and modifies stem and suffixes. The stems and suffixes after applying normalization heuristic are stemia1b1, stemic1d1e1, suffix1, suffix2, suffix3, suffix4 and suffix5.
The normalization process by considering Telugu words is displayed below:
After applying normalization heuristic to the word '°°°°°' the stems and suffixes is displayed below:
REMOVING SPURIOUS SIGNATURES
The output of above step gives set of signatures. Again all signatures that are associated with only one stem and only one suffix are discarded. These signatures are called spurious signatures. To enrich the performance of proposed paradigm these signatures are removed. The remaining signatures are called regular signatures, and the suffixes that are dropped from them are called regular suffixes. The regular suffixes are not quite the suffixes that can be used to establish the morphology, but they are very good for approximation, and useful for stemming process.
Some set of suffixes produced by proposed paradigm is shown below:
°à± °°à± à± °°à±°° °° à±°à±‹ °°°° à±° à±°à± °à±°°° °°°° °°à±°° °°à±‹ °°à±‹°° °à± °à± °-°°à±‡ °à±‡°à± °à±‡°° °-° °°° °°° °°à± °°à±‹ °°° °à±‹°à± °°° °°°° à±° à±°°à± à±°à± °à±° °°°° °°°à± °°à±°°° °°°°°-° °°à±€° °°à±‹ °° °°°à± à±° °°° °°à±€ à±°°à± à±°°à±‚ °à±ˆ°à±‡ à±°°à± °°à±‡ à±°à±°° °-°°à±‡ °° °°à± °°à± °°°à± °° °-°°°° °à±†°à± °à±‡° °à±‡°°° °à±‡°à±‡ °à±° °à±°à± °à±°à± °à±† à±°à±‡ à±°à±ˆ° à±°
STEMMING OF STRANGE WORDS
The output of unsupervised paradigm is a list of powerful suffixes. Each suffix approximately seizes some morphological variation. The longest suffix matching algorithm is used in stemming of strange words. To alleviate stemming of strange words the suffixes are organized in decreasing order of their lengths. The longest possible suffix of the strange word that is matching with some suffix in list is dropped, this way is called longest suffix matching algorithm. Matching is started from first suffix to last suffix, where a match is found the strange word is segmented into stem and suffix. Fig. 3 shows the sample output of proposed paradigm.
Fig. 3 Sample output of proposed paradigm
Always there is an argument on effectiveness of stemming; Frakes reported that different outcomes and results are possible for stemming process. Harman reported that influence of stemming in overall performance is little if effectiveness is measured by traditional precision and recall measures. It is true if language is with simple morphology like English. But Krovertz, Popovic and Willett proved that stemming process significantly improves the retrieval effectiveness, mainly precision, for short queries or for languages with high complex morphology, like romance languages.
Mainly the performance of a stemmer is measured in terms of its contribution to enrich the performance of an IR system. In order to evaluate proposed unsupervised Telugu stemmer, two different experiments were conducted. The first experiment evaluates the performance of stemmer in terms of accuracy, precision and recall rate. In the second experiment the performance of stemmer is measured in terms of its capability to reduce the size of index in an information retrieval task. Proposed stemmer is compared with YASS. YASS is also unsupervised stemmer. The training data extracted from CIIL Mysore contain 106400 words from 200 documents. Table I shows an overview of characteristics of the trained corpus. Two different tests are produced by randomly extracted words from Telugu news papers.
TRAINED TELUGU CORPUS
Stem's signatures after normalization
Experiment 1: To measure the performance of this stemmer accuracy, recall, precision and F-score metrics are considered in the first experiment. Two different test runs test1 and test2 are performed. The evaluation metrics used in this experiment are described below:
Accuracy of proposed stemmer is defined as number of words stemmed correctly. This is calculated by comparing manually stemmed test words and stems of test data that are produced by proposed stemmer.
Recall is defined as the ratio of length of stem generated by proposed stemmer and the length of actual stem that is identified manually. If length of stem that is produced by proposed stemmer is less than length of actual stem then recall is treated 100% because stem produced by proposed stemmer is part of actual stem. However, this type of words decreases the precision value because actual stem is having some extra characters. Mathematically recall is calculated using below formulae:
Recall = 1, if length of stem produced by proposed paradigm < length of actual stem
Precision is defined as the ratio of length of actual stem that is identified manually and length of the stem generated by proposed stemmer. If the length of stem that is produced by proposed stemmer is greater than the length of actual stem then precision is treated 100% because all the character present in actual stem is also present in stem that is produced by proposed stemmer. Mathematically precision is calculated using below formulae:
Precision = 1, if length stem produced by proposed paradigm > length of actual stem
Where n is total number of test words.
Harmonic mean of recall and precision is defined as F-score. Mathematically F-score is represented using below formulae:
The two test runs of proposed stemmer and YASS stemmer are shown in Table II and Table III respectively.
TEST RESULTS OF PROPOSED STEMMER
TEST RESULTS OF YASS STEMMER
Experiment 2: The performance of proposed paradigm in terms of reduction in index size of Telugu information retrieval is measured in this experiment. For this experiment documents from Telugu corpus is taken. The difference between number of index terms with and without stemming is considered as reduction in index size. Table IV describes the comparison of percentage reduction in index size of proposed stemmer and YASS stemmer.
PERCENTAGE REDUCTION IN INDEX SIZE
No. of words without stemming
Stemming with proposed paradigm
Stemming with YASS
No. of words
ACCOMPLISHMENTS AND CONVERSATION
Table II describes that maximum accuracy is found to be 89.9% and F-score is found to be 93.4% in the run-2. The recall and precision values in two test runs indicate that under stemming is main cause of difference in accuracy. Table III shows that the maximum accuracy is 79.3% and F-score is 83.5% obtained using YASS stemmer. By seeing all test runs in terms of accuracy, precision, recall and F-score proposed stemmer is better than YASS stemmer. This accomplishment is achieved with large set of powerful suffixes. YASS stemmer uses less number of suffix set.
The result of reduction in index size is shown in table IV. Index size without stemming is considered as baseline. According to number of index terms the reduction found to be 20.2% using proposed stemmer. 15.23% reduction is obtained using YASS stemmer in terms of number of index terms. By considering index size (measured in MB) the reduction found to be 16.2% and 13.5% with proposed stemmer and YASS stemmer respectively.
The results show that proposed paradigm outperforms the YASS stemmer. Proposed stemmer is unsupervised and language independent. Corpus is used to derive set of powerful suffixes and it does not need any linguistic knowledge. Hence, this approach can be used for developing stemmers of other languages that are morphologically rich.
The performance of proposed paradigm is evaluated in terms of accuracy, precision, recall and F-score and compared with unsupervised YASS stemmer. Ability to reduce index size of information retrieval task is also measured by proposed stemmer. This also shows that percentage reduction in index size is better for the proposed stemmer.