Morphological Cross Reference Method: English to Telugu


Transliteration is the technique of mapping text written in one language into the orthography of another language by means of a predefined mapping. It is useful when a user knows a language but not how to write its script, or when no direct method is available for entering text in that language. In general, a transliteration scheme maps the alphabet of one language to that of the other so that the result stays as close as possible to the pronunciation of the word. English-transliterated text has found widespread use with the growth of the internet, in the form of mail, chat, blogs and other kinds of individual online writing. This kind of transliterated text is often referred to by names formed by combining "English" with the name of the language being transliterated, such as Telugu or Hindi. Depending on factors such as the mapping and the language pair, a word in one language can have more than one possible transliteration in the other. This is seen most often in the transliteration of named entities and other proper nouns.

Telugu is one of the fifteen most spoken languages in the world and the third most spoken language in India. It is the official language of the state of Andhra Pradesh. Telugu has 56 letters, of which 18 are vowels and 38 are consonants. There are 16 additional Unicode characters to represent the phonetic variants of each vowel and consonant. Telugu Wikipedia is one of the largest of the Indian-language Wikipedias. However, Telugu still lacks an efficient, user-friendly text input method that is widely accepted and used. Many tools and applications have been designed for Indian-language text input, but the existing methods have not been evaluated in a structured manner in order to standardize on an efficient and accurate one. Most users of Indian languages on the internet are familiar with typing on an English keyboard. Hence, instead of introducing them to a new keyboard designed for Indian languages, it is easier to let them type their source language words in Roman script. In this paper, we deal with the problem of English to Telugu transliteration of text using both grapheme-based and phoneme-based transliteration models.

In the graphemic approach, the source language word is split into individually sounding elements. For example, bharath is split as bha-ra-th: b(బ్), h(హ్) and a(అ) are combined to form bha(భ); r(ర్) and a(అ) are combined to form ra(ర); and t(ట్) and h(హ్) are combined to form th(త్), using an input mapping table. The table contains each phonetically equivalent combination of target language letters, written in terms of the source language, together with the Unicode hexadecimal values of the corresponding target language letters. For a given source input, the exact hexadecimal Unicode equivalent in the target language is retrieved and displayed as the transliterated text.
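As a minimal sketch of how such an input mapping table might be consulted, the Python fragment below stores a few Roman keys alongside the hexadecimal Unicode values of their Telugu equivalents and joins them into output text. The names `MAPPING_TABLE` and `render`, and the choice of exactly these three entries (taken from the bharath example), are assumptions made for illustration and are not the system's actual table.

```python
# Hypothetical excerpt of an input mapping table:
# Roman key -> hexadecimal Unicode values of the Telugu equivalent.
MAPPING_TABLE = {
    "bha": ["0C2D"],          # భ
    "ra":  ["0C30"],          # ర
    "th":  ["0C24", "0C4D"],  # త followed by the virama sign, giving త్
}

def render(units):
    """Look up each Roman unit and join the retrieved code points into text."""
    return "".join(
        chr(int(code_point, 16))
        for unit in units
        for code_point in MAPPING_TABLE[unit]
    )

word = render(["bha", "ra", "th"])
print(word)  # భరత్
```

Storing the values as hexadecimal strings mirrors the description of the table; a real table would need an entry for every supported combination.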

Characters in the two languages generally do not map one-to-one, because English has 26 letters and Telugu has 56. Our system therefore combines the grapheme model with a phoneme-based transliteration model, in which a parallel corpus is maintained that pairs each source English word with phonetically equivalent Romanized Telugu text written in terms of the source language. For example, the English word 'period' has the Romanized text 'pEriyad'. If 'period' is transliterated using the grapheme-based model alone, the result is 'పెరిఒడ్', but by combining the grapheme model with the phoneme model we get the exact transliteration, 'పీరియడ్'.

Our system provides a user-friendly environment that is platform and browser independent. It is case insensitive for the vocabulary words placed in the parallel corpus and case sensitive for general text, so our transliteration system works quickly and provides accurate results when compared with other transliteration systems such as Google, Baraha and Quillpad.

II. Related Work

There has been a large amount of interesting work in the area of transliteration over the past few decades.

Antony P.J, Ajith V.P and Soman K.P [1] addressed the problem of transliterating English to Kannada using an SVM kernel, modeling it as a sequence labeling problem. The framework is based on a data-driven method and a one-to-one mapping approach, which simplifies the development of the transliteration system.

V.B. Sowmya and Vasudeva Varma [2] proposed a simple and efficient technique for text input in Telugu that uses a Levenshtein distance based approach. The approach is motivated by the relation between the nature of typing Telugu through English and Levenshtein distance.

Chung-Chian Hsu and Chien-Hsing Chen [3] identified a critical issue: the incomplete search-results problem, which results from the lack of a translation standard for foreign names and the existence of synonymous transliterations when searching the Web. Using only one of the synonymous transliterations as the search keyword misses the web pages that use other transliterations of the foreign name. To address this, they proposed a novel two-stage framework for mining as many synonymous transliterations as possible from Web snippets with respect to a given input transliteration.

Guo Lei, Zhou Mei-ling, Yao Jian-Min and Zhu Qiao-Ming [4] proposed a supervised process for identifying transliterated person names, which helps to classify the types of query lexicon, together with the concepts of transliteration characters and the transliteration probability of a character.

Roslan Abdul Ghani, Mohamad Shanudin Zakaria and Khairuddin Omar [5] introduced a transliteration approach to semantic languages that offers an easy and fast process for Jawi to Malay transliteration. A Jawi stemming process was developed to make a word as short as possible, focusing only on the root word and some prefixes and suffixes. Vocal filtering and Diftong filtering methods are also introduced to make a word simpler for the Unicode mapping process, in which Jawi-Malay rules are applied to make the output more accurate. In addition to these methods, a dictionary database is provided for checking words that cannot be resolved during processing. This alternative is used because the writing format of Jawi has not remained fixed.

Chun-Jen Lee, Jason S. Chang and Jyh-Shing Roger Jang [6] proposed a new statistical modeling approach to the machine transliteration problem for the Chinese language, using the EM algorithm. The parameters of this model are learned automatically from a bilingual proper name list. Moreover, the model is applicable to the extraction of proper names.

Wei Gao, Kam-Fai Wong and Wai Lam [7] modeled the statistical transliteration problem as a direct phonetic symbol transcription model plus a language model for post-adjustment. They presented an efficient algorithm for aligning phoneme chunks, giving a statistical transliteration method for automatic translation according to pronunciation similarities, i.e. mapping the phonemes that make up an English name to the phonetic representation of the corresponding Chinese name.

III. Transliteration Procedure


Fig. 1 Transliteration Model (flowchart: input as a set of characters or a file → English words → word found in the aligned parallel corpus? → if found, replace the source word with its Romanized parallel corpus entry → transliterated Telugu text by Unicode mapping)

The whole model consists of two phases: i) the preprocessing phase and ii) the transliteration phase.

A. Preprocessing Phase

In the preprocessing phase, English vocabulary words for which direct transliteration does not produce correct results are Romanized and aligned in a parallel corpus, which is used in the transliteration phase to get the correct result.

1) Romanization: During this step, the transliteration system is trained on those words that cannot be transliterated exactly using either the grapheme or the phoneme model individually. For each such word, a Telugu phonemic equivalent written in English letters is maintained according to the word's pronunciation.



English word | Romanized word







2) Alignment: XML is used to store the parallel corpus, in which English words and Romanized words are aligned with each other. Our transliteration system is platform independent because XML is used for storage and JavaScript is used for retrieval from the parallel corpus.
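To make the alignment step concrete, here is a small sketch of what an XML-backed parallel corpus could look like and how it could be loaded into a lookup table. The element names `corpus`, `entry`, `english` and `romanized` are hypothetical, since the paper does not give the schema, and the sketch uses Python with the standard `xml.etree.ElementTree` module rather than the JavaScript retrieval the system itself uses. The single entry is the 'period' example from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <entry> per aligned pair of words.
CORPUS_XML = """<corpus>
  <entry>
    <english>period</english>
    <romanized>pEriyad</romanized>
  </entry>
</corpus>"""

def load_corpus(xml_text):
    """Parse the aligned pairs into a dict keyed by the lower-cased English word."""
    root = ET.fromstring(xml_text)
    return {
        entry.findtext("english").lower(): entry.findtext("romanized")
        for entry in root.iter("entry")
    }

corpus = load_corpus(CORPUS_XML)
print(corpus)  # {'period': 'pEriyad'}
```

Keying the table by the lower-cased English word is one simple way to get the case-insensitive matching of vocabulary words, while the Romanized value keeps its case because case carries meaning there.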

B. Transliteration Phase

In the transliteration phase, the English text entered by the user, or a given file, is transliterated into Telugu text.

1) Searching the Parallel Corpus: Each word entered by the user is searched for in the parallel corpus. If the word is found, the original source word is replaced with its Romanized equivalent, which is sent to the segmentation stage; otherwise the original source word is sent to the segmentation stage.
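The lookup step can be sketched as follows. This is only an illustration under stated assumptions: `PARALLEL_CORPUS` is a hypothetical in-memory table holding the single 'period' pair from the text, words are split on whitespace, and corpus matching is case insensitive while words not in the corpus are passed through with their case unchanged.

```python
# Hypothetical in-memory parallel corpus: English word -> Romanized equivalent.
PARALLEL_CORPUS = {"period": "pEriyad"}

def romanize_words(text):
    """Replace each word found in the corpus; pass other words through unchanged."""
    result = []
    for word in text.split():
        # Corpus matching ignores case; the fallback keeps the word as typed.
        result.append(PARALLEL_CORPUS.get(word.lower(), word))
    return result

print(romanize_words("Period bharath"))  # ['pEriyad', 'bharath']
```

Each returned word would then go on to the segmentation stage, whether or not it was replaced.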

2) Segmentation: The source language text is segmented based on the combination of vowels and consonants. Generally a segmentation unit ends with a vowel. Each segmented unit is called a transliteration unit. Four rules are followed while segmenting, illustrated here with the words 'pEriyad' and 'period':

i) Consonant followed by vowel → pE

ii) Consonant followed by consonant → ri

iii) Vowel followed by consonant → ad

iv) Vowel followed by vowel → io

During segmentation, two or more letters can be phonetically combined only in the case of a consonant followed by a vowel or a consonant followed by a consonant; in the remaining two cases, each letter is mapped individually.

Before segmentation: p E r i y a d

After segmentation: pE | ri | ya | d
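One possible reading of the segmentation rules can be sketched in a few lines: consonants accumulate until a vowel closes the unit, a vowel with no pending consonant stands alone, and any trailing consonants form a final unit. The function name `segment` and the vowel set are assumptions, and the sketch does not treat doubled vowels or other multi-letter vowel spellings specially.

```python
VOWELS = set("aeiouAEIOU")

def segment(word):
    """Split a Roman-script word into transliteration units."""
    units = []
    pending = ""  # consonants waiting for a vowel to close the unit
    for ch in word:
        if ch in VOWELS:
            units.append(pending + ch)  # a unit generally ends with a vowel
            pending = ""
        else:
            pending += ch               # consonant followed by consonant combines
    if pending:
        units.append(pending)           # trailing consonants form the last unit
    return units

print(segment("pEriyad"))  # ['pE', 'ri', 'ya', 'd']
print(segment("period"))   # ['pe', 'ri', 'o', 'd']
print(segment("bharath"))  # ['bha', 'ra', 'th']
```

Under this reading the second vowel of a vowel pair, such as the 'o' in 'period', becomes its own unit, which matches the grapheme-only output containing a standalone ఒ.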

3) Unicode Mapping: Each English letter is mapped to a unique hexadecimal code, and for the transliteration units obtained from the segmentation stage these Unicode values are combined to give the phonetically equivalent Telugu letters. Using this method, we can convert English text into its phonetic equivalent in Telugu. The Telugu Unicode block spans 0C00 to 0C7F.

pE | ri | ya | d

పీ | రి | య | డ్
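The mapping of units to Telugu text shown above can be sketched as a consonant base plus an optional vowel sign, with a virama added when a consonant has no vowel. The tables below are tiny hypothetical excerpts covering only the letters in the 'pEriyad' and 'period' examples; in particular, mapping `E` to the vowel sign ీ is an assumption chosen to reproduce the paper's pE → పీ output, and single-letter consonant keys are assumed.

```python
# Hypothetical excerpts of the mapping tables (Telugu code points).
CONSONANTS = {"p": 0x0C2A, "r": 0x0C30, "y": 0x0C2F, "d": 0x0C21}
VOWEL_SIGNS = {"a": None, "i": 0x0C3F, "E": 0x0C40, "e": 0x0C46}  # 'a' is inherent
INDEPENDENT_VOWELS = {"o": 0x0C12}
VIRAMA = 0x0C4D  # suppresses the inherent vowel of a consonant

def map_unit(unit):
    """Map one transliteration unit to Telugu text."""
    if unit in INDEPENDENT_VOWELS:
        return chr(INDEPENDENT_VOWELS[unit])
    if unit[-1] in VOWEL_SIGNS:
        consonant, sign = CONSONANTS[unit[:-1]], VOWEL_SIGNS[unit[-1]]
        return chr(consonant) + (chr(sign) if sign is not None else "")
    return chr(CONSONANTS[unit]) + chr(VIRAMA)

def transliterate_units(units):
    return "".join(map_unit(unit) for unit in units)

print(transliterate_units(["pE", "ri", "ya", "d"]))  # పీరియడ్
print(transliterate_units(["pe", "ri", "o", "d"]))   # పెరిఒడ్
```

The second call shows the grapheme-only result for 'period', which is why the Romanized form from the parallel corpus is needed to reach the expected spelling.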

If the user enters text in the text area of the GUI, the output is displayed in another text area on the same GUI; otherwise the transliterated text is saved to another file in the same directory as the source file given as input.

IV. Evaluation and Results

The proposed model is trained on 1000 English vocabulary words. The model is evaluated against the Google transliteration system. The accuracy of the system is calculated using the following equation:

Accuracy = (C/N) * 100

where C is the number of test words whose transliteration is correct when compared with the Google transliteration system, and N is the total number of test words.
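As an illustration of the accuracy formula, the sketch below counts matching outputs (C) against a list of reference outputs and divides by the number of test words (N). The function name `accuracy` and the placeholder strings are hypothetical; they are not data from the experiment.

```python
def accuracy(system_outputs, reference_outputs):
    """Accuracy = (C / N) * 100, with C the number of matching outputs."""
    if not reference_outputs or len(system_outputs) != len(reference_outputs):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(s == r for s, r in zip(system_outputs, reference_outputs))
    return 100.0 * correct / len(reference_outputs)

# Placeholder strings standing in for transliterated words.
system = ["w1", "w2", "w3", "x4"]
reference = ["w1", "w2", "w3", "w4"]
score = accuracy(system, reference)
print(score)  # 75.0 (3 of the 4 placeholder words match)
```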

A. Comparison with Google Transliteration System

Comparing our Morphological Cross Reference System with the publicly available Google Indic Transliteration System, the accuracy of the two systems is observed as follows:

1) Accuracy


Transliteration System | Transliteration Accuracy

Morphological Cross Reference System (MCR) |

v-s → vocabulary words excluding silent words

v → vocabulary words

oov → out of vocabulary words

From the results, it is observed that the Morphological Cross Reference System gives an accuracy of ' %' for vocabulary words excluding silent words, which is ' %' more than the Google transliteration system, and an accuracy of ' %' for vocabulary words, which is ' %' more than the Google transliteration system. For out-of-vocabulary words, however, the Google transliteration system gives an accuracy of ' ', which is ' %' more than the Morphological Cross Reference System.

Fig 2 shows how the accuracy varies for Google and MCR systems in terms of vocabulary words excluding silent words, vocabulary words, out of vocabulary words.

Fig 2 Comparison of accuracy for Google and MCR Systems

Error rate is defined as the ratio of wrongly transliterated words to the total number of test words. The error rate of the MCR system can be calculated using the following equation:

Error Rate = (W/N)

where W is the number of wrongly transliterated words when compared with the Google Transliteration System, and N is the total number of test words.

For the MCR system, the error rate calculated using the above formula is found to be ' '.
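The error rate can be related to the accuracy defined earlier by a short derivation. It assumes that every test word is counted either as correct or as wrong, so that W = N - C; this assumption is not stated explicitly in the text.

```latex
\[
\text{Error Rate} \;=\; \frac{W}{N} \;=\; \frac{N - C}{N}
\;=\; 1 - \frac{C}{N} \;=\; 1 - \frac{\text{Accuracy}}{100}
\]
```

For example, a hypothetical accuracy of 75 would correspond to an error rate of 0.25 under this assumption.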

2) Sample Test Data

A sample of the test data used to compare the results of our model with those of the Google Transliteration System is shown in Table III.


Table III. Comparison with Google

English word | Google Transliteration | System output
















From the above table we can conclude that our MCR system performs better on vocabulary words.

V. Conclusion and Future Work

In this paper we addressed the problem of transliterating English to Telugu using the Morphological Cross Reference System. This framework, based on a data-driven method and a one-to-one mapping approach, simplifies the development of the transliteration system and improves transliteration accuracy when compared with the Google Transliteration System. The model is trained on English vocabulary words that do not have an exact transliteration under the phonemic or graphemic transliteration models individually. The system was evaluated against the Google Transliteration System, and from the experiment we found that combining the two models improves overall transliteration performance for vocabulary words. The model currently works only for English vocabulary words; in future it will be extended to work accurately for named entities and proper nouns as well. We hope this will be useful in natural language applications such as creating blogs, chatting and sending emails, and in many other areas.


[1] Antony P.J, Ajith V.P, Soman K.P, "Kernel Method for English to Kannada Transliteration", International Conference on Recent Trends in Information, Telecommunication and Computing, 2010.

[2] V.B. Sowmya, Vasudeva Varma, "Transliteration Based Text Input Methods for Telugu", Springer-Verlag Berlin Heidelberg, 2009.

[3] Chung-Chian Hsu and Chien-Hsing Chen, "Mining Synonymous Transliterations from the World Wide Web", ACM Transactions on Asian Language Information Processing, Vol. 9, No. 1, Article 1, March 2010.

[4] Guo Lei, Zhou Mei-ling, Yao Jian-Min, Zhu Qiao-Ming, "A Supervised Method for Transliterated Person Name Identification", Second International Symposium on Electronic Commerce and Security, 2009.

[5] Roslan Abdul Ghani, Mohamad Shanudin Zakaria, Khairuddin Omar, "Jawi-Malay Transliteration", International Conference on Electrical Engineering and Informatics 5-7 August 2009, Selangor, Malaysia.

[6] Chun-Jen Lee, Jason S. Chang, Jyh-Shing Roger Jang, "Extraction of Transliteration Pairs from Parallel Corpora Using a Statistical Transliteration Model".

[7] Wei Gao, Kam-Fai Wong, and Wai Lam, "Phoneme-Based Transliteration of Foreign Names for OOV Problem", Springer-Verlag Berlin Heidelberg, 2005.

[8] Shih-Hung Wu and Yu-Te Li, "Curate a Transliteration Corpus from Transliteration/Translation Pairs", IEEE IRI 2008, July 13-15, 2008, Las Vegas, Nevada, USA.