Parallel Corpus Based Morphological Cross Reference English Language Essay



Transliteration is a mapping from one system of writing into another, word by word, or ideally letter by letter. Machine transliteration is a subfield of computational linguistics concerned with automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration deals with both grapheme-based and phoneme-based transliteration models. A corpus is generally regarded as a bank of machine-readable authentic texts that are sampled to be representative of a particular language or language variety. A parallel corpus is a corpus which contains both source texts and their transliterations in the target language. Parallel corpora can be built to serve diverse purposes, such as bilingual dictionary compilation, machine transliteration, language teaching, etc. Transliterations are used in situations where the original script is not available to write down a word in that script. Several transliteration methods have been proposed to date based on the nature of the languages considered, but those methods achieve low precision for English-Telugu transliteration, especially where both the spelling and the pronunciation of the words are considered.

In this paper, we provide a user-friendly environment for transliteration of English-Telugu text using a morphological cross reference approach, by which high precision is maintained for transliterated text retrieval. In this method, if the source word entered by the user is found in the parallel corpus, the related mapping of the target language word is used for transliteration via the grapheme-based model; otherwise, phoneme-based transliteration is used, where pronunciation is considered for the transliteration. In addition to word-by-word transliteration, this paper also deals with whole-document transliteration. Our approach provides highly efficient transliteration in terms of accuracy and time.


Information Retrieval (IR) refers to the extraction of required information with a user query - a formal statement of information need - written in one language (the source language), from a large repository of documents that may be written in the same or some other language (the target language). When the two languages have different alphabets, the query term must somehow be rendered in the orthography of the other language. The process of converting a word from one orthography into another is called transliteration. Transliteration is the practice of converting a text from one writing system into another in a systematic way. Machine transliteration is a subfield of computational linguistics that investigates the use of computer software to transliterate text or speech from one natural language to another. IR systems make it easier and faster to retrieve only the relevant data from the existing literature. State-of-the-art IR systems have come a long way from their embryonic beginnings. Nowadays, IR systems are used in search engines and in university libraries to provide access to the large number of books, journals and other documents available therein.

The necessity of netizens (people who are frequent or habitual users of the Internet) to gain knowledge from multilingual documents increases day by day owing to the rise in demand for the Internet across the world. This necessity originates the challenge of overcoming the language barrier between the user query and the documents to be searched. This ever-increasing requirement for multilingual information access, along with the lack of technical support for multilingual processing, has brought about a new branch of Information Retrieval research named Cross Language Information Retrieval (CLIR). CLIR makes use of user queries written in one language to retrieve relevant documents written in some other language. For example, a user may pose a query in English but retrieve relevant documents written in French. The key challenge of CLIR is as follows:

Is there any way by which a query written in one language, say L1, can be expressed in another language, say L2?

In this paper, we provide a user-friendly environment for transliteration of English-Telugu documents using the morphological cross reference approach to transliteration, by which high precision is maintained for transliteration in the retrieval of English-Telugu documents. A parallel corpus for the English and Telugu languages is maintained for efficient retrieval. An English query is transliterated into its respective Telugu writing system. The paper is organized as follows. Section 2 outlines previous work on transliteration by various organizations. Section 3 explains the proposed system in detail. Section 4 contains the experimental results obtained using this system, and Section 5 concludes the paper. Transliterations are used in situations where the original script is not available to write down a word in that script. Several transliteration methods have been proposed to date based on the nature of the languages considered, but those methods achieve low precision in the case of English-Telugu transliteration. Our approach has proved highly efficient in terms of accuracy and time.

Mapping is made based on the pronunciation equivalents of alphabets, both for vowels and consonants. The query is divided into parts such that the entire query is taken in the form of small sets of combinations of vowels and consonants, based on the designed algorithm. The word is divided based on the positions of consonants and vowels in the respective word.
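The division described above can be sketched as follows. The paper does not spell out its exact segmentation rules, so grouping each consonant with the vowels that follow it is an assumption made only for illustration:

```python
# Illustrative sketch: split an English word into small consonant-vowel
# units by the positions of consonants and vowels. The grouping rule
# (consonant plus following vowels) is an assumption, not the paper's
# exact algorithm.
VOWELS = set("aeiou")

def split_units(word):
    units, current = [], ""
    for ch in word.lower():
        if ch in VOWELS:
            current += ch          # attach the vowel to the pending unit
        else:
            if current:
                units.append(current)
            current = ch           # start a new consonant-led unit
    if current:
        units.append(current)
    return units

print(split_units("rama"))  # ['ra', 'ma']
```

Each unit can then be mapped independently to its Telugu phonetic equivalent.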

Transliteration of particular named entities is achieved to a large extent, especially for proper nouns. An efficient transliteration method is devised for effective English-Telugu transliteration of named entities.


Unicode Guidelines:

Computers can only interpret bits and bytes, and hence the representation of a script should also be defined in terms of bits and bytes. ASCII - the American Standard Code for Information Interchange - is a 7-bit code representing the English character set. Similarly, the Indian Standard Code for Information Interchange (ISCII) defines an 8-bit character code for Indian language scripts. However, these codes overlap, and hence computers on which the ASCII character set is enabled would not be able to interpret ISCII as a code for Indian language scripts. With an 8-bit code, one can define only 256 unique characters. To allow computers to represent any character in any language, the international standard ISO 10646 defines the Universal Character Set (UCS). UCS contains characters to represent practically all known languages in the world. ISO 10646 originally defined a 32-bit character set. Each character is assigned a 32-bit code; however, these codes vary only in the least significant 16 bits. Although ISO 10646 and Unicode started as two projects, they merged their character sets around 1991 and both are compatible with each other. In addition to the character set, the Unicode standard specifies recommendations for rendering of the scripts, handling of bi-directional texts that mix, for instance, Latin (a left-to-right writing system) and Hebrew (a right-to-left writing system), and algorithms for storage and manipulation of Unicode strings.

UTF-8 and UTF-16:

It has to be noted that Unicode is a table of codes that assigns integer numbers to characters. One still has to define its implementation, or encoding, in computers. A straightforward encoding of these integers is to store the Unicode text as sequences of 2-byte units. This encoding is referred to as UTF-16. An ASCII file can be transformed into a UTF-16 file by simply inserting a 0x00 byte in front of every ASCII byte. However, operating systems such as Unix/Linux have been written based on the ASCII (1-byte code) character set, and they expect each byte to be a character. For these reasons, UTF-16 may not be an appropriate encoding of Unicode for filenames, text files, environment variables, etc. The UTF-8 encoding is a solution for using Unicode compatibly with operating systems that work with 1-byte characters.
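The remark above - that an ASCII file becomes a UTF-16 file by prefixing each byte with 0x00 - can be checked directly. A small Python illustration (the paper itself gives no code):

```python
# Encode ASCII text as UTF-16 (big-endian, without a byte-order mark):
# every ASCII byte gains a leading 0x00 byte, doubling the size.
text = "ABC"
utf16 = text.encode("utf-16-be")
print(list(utf16))  # [0, 65, 0, 66, 0, 67]
```

With a byte-order mark (plain `"utf-16"`), two extra bytes 0xFF 0xFE (or 0xFE 0xFF) are prepended to signal the byte order.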

UTF-8 has the following properties:

Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All Unicode characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

The first byte of a multi byte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character.

All further bytes in a multi byte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

UTF-8 encoded characters may theoretically be up to six bytes long to handle the 32-bit character set; 16-bit characters, however, are at most three bytes long.

The bytes 0xFE and 0xFF are never used in the UTF-8 encoding. These bytes are used to denote byte order for UTF-16 codes.
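The properties listed above can be observed with Python's built-in UTF-8 codec. This is an illustrative check, not part of the paper:

```python
# Property 1: ASCII characters keep their single-byte encoding under UTF-8.
assert "A".encode("utf-8") == b"A"

# Properties 2-4: the Telugu letter A (U+0C05) needs three bytes; the lead
# byte lies in the multi-byte range, and every continuation byte lies in
# 0x80-0xBF, so no ASCII byte appears inside the sequence.
telugu = "\u0c05".encode("utf-8")
assert telugu == b"\xe0\xb0\x85"
assert 0xC0 <= telugu[0] <= 0xFD
assert all(0x80 <= b <= 0xBF for b in telugu[1:])
print(len(telugu))  # 3
```

A decoder that lands mid-sequence can skip continuation bytes (0x80-0xBF) until it finds a lead byte, which is the resynchronization property described above.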

Representation of Indian language Scripts:

Having known the syllabic nature of Indian language scripts, it is easy to understand the notation followed by the Unicode to represent Indian language characters. The following are the principles used by Unicode to represent the Indian language characters.

A Unicode is assigned to each consonant symbol, i.e., a consonant sound along with the inherent vowel.

Each independent vowel is represented by a Unicode.

Each Maatra is also represented by a Unicode.

Viraam has a Unicode number too.

The following are Unicode Ranges for Indian Languages:

Tamil : 0B82 to 0BFA

Oriya : 0B01 to 0B71

Gujarati : 0A81 to 0AF1

Telugu : 0C01 to 0C7F

Kannada : 0C82 to 0CF2
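Using the block ranges listed above, the script of a character can be identified from its code point. A minimal Python sketch; the function name and dictionary are illustrative, not from the paper:

```python
# Unicode block ranges for the Indian scripts listed above.
RANGES = {
    "Tamil":    (0x0B82, 0x0BFA),
    "Oriya":    (0x0B01, 0x0B71),
    "Gujarati": (0x0A81, 0x0AF1),
    "Telugu":   (0x0C01, 0x0C7F),
    "Kannada":  (0x0C82, 0x0CF2),
}

def script_of(ch):
    # Return the script whose block range contains the code point, if any.
    cp = ord(ch)
    for name, (lo, hi) in RANGES.items():
        if lo <= cp <= hi:
            return name
    return None

print(script_of("\u0c05"))  # Telugu letter A -> Telugu
```

A transliterator can use such a check to verify that its output falls entirely within the Telugu block.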

Unicode Rendering Recommendations:

It should be noted that half-forms of the consonants are not represented by the Unicode. The half-forms are essential to render Aksharas involving consonant clusters. The Unicode recommendation for rendering resolves the issue of half-forms. These rules describe the mapping between Unicode characters and the glyphs in a font. They also describe the combining and ordering of these glyphs.

Telugu Unicode Characters:

Unicode helps in mapping English characters to Telugu by converting English letters into Unicode characters.


Generally, this deals with the detailed mapping of the known source language alphabet to the target language without losing its phonetic resemblance. In this context, the mapping is done from English (the source language) to Telugu (the target language).

English has a 26-letter alphabet, which is smaller than the largest among the Dravidian-derived scripts, a 56-letter alphabet. So several English letters may share the same-sounding phonemes.

The basic combinations of the various alphabets that form their equivalents are determined through the mapping procedure mentioned above. In general, word formation depends upon combining the alphabets so that they are pronounced the same way as the target language equivalent. There may be some difficulty in forming the exact-sounding phoneme with the same alphabet as is used for pronunciation in the target language.

When conversion takes place, it may be difficult to get the exact interpretation of some words in the target language from the source language equivalents, using either whole words or phoneme basics.


Transliterations are used in situations where the original script is not available to write down a word in that script, while high precision is still required. Examples include traditional or cheap typesetting with a small character set; editions of old texts in scripts no longer used (such as Linear B); and some library catalogues. For example, the Greek language is written in the 24-letter Greek alphabet, which overlaps with, but differs from, the 26-letter version of the Roman alphabet in which English is written. Etymologies in English dictionaries often identify Greek words as ancestors of words used in English. Consequently, most such dictionaries transliterate the Greek words into Roman letters.

Transliteration should be distinguished from transcription, which is a rendition of a word in a given script based on the word's sound, rather than a process of converting one script into another. The primary aim of transliteration is to provide an alternate means of reading text using a different script. Transliteration is meant to preserve the sounds. In practice, the same word may be written differently in different scripts due to the local conventions employed for pronouncing the aksharas. In the northern scripts, the absence of the halanth is seen quite frequently, but a person will understand that what is shown is really a generic consonant. The southern scripts are explicit in their use of the halanth.

In India, English is also transliterated into the different Indian scripts, with amusing results! Finding exactly matching aksharas is difficult for some of the vowels and a few consonants. A word in English, transliterated into, say, Hindi or Bengali, would change considerably when transliterated again into another script (typically a southern script).




Today, the Internet has reached every corner of the world, and different languages are used across it to communicate. If the information on the Internet is available in users' local languages, a large number of users will benefit; transliteration provides such a facility. Transliteration is a mapping from one system of writing into another, word by word, or ideally letter by letter. Machine transliteration is an automatic method for converting words in one language into phonetically equivalent ones in another language. There has been growing interest in the use of machine transliteration to assist machine translation and information retrieval. There are two types of machine transliteration models: grapheme-based and phoneme-based.

Telugu is a Dravidian language mostly spoken in the Indian state of Andhra Pradesh. Telugu has the third largest number of native speakers in India (74 million according to the 2001 census) and is 15th in the Ethnologue list of the most-spoken languages worldwide. Unfortunately, there is no efficient transliteration mechanism from English to Telugu. For an English word, there may be a number of pronunciations in use, depending on where people are living; so, based on the phoneme (the smallest segmental unit of sound employed to form meaningful contrasts between utterances), getting an efficient transliteration is difficult. There are a number of transliteration methods from English to Telugu based on graphemes (a grapheme is the smallest unit of a written language), morphemes (in linguistics, the smallest component of a word, or other linguistic unit, that has semantic meaning) and phonemes (in spoken language, the smallest linguistically distinctive units of sound). But each of them individually fails to apply effectively to some words. As said above, grapheme- and phoneme-based methods are the best-known practices, so a combined application of the grapheme-based and phoneme-based models will lead to better results.


India is a multi-language, multi-script country where different languages have different scripts. Nowadays many people know how to speak different languages, but they are not aware of the scripts of those languages. Transliteration is a method whereby a user can write in one language and it will be converted into another language automatically. The Parallel Corpus based Morphological Cross Reference Approach is a machine transliteration model which combines both grapheme-based and phoneme-based transliteration models. Machine transliteration plays an important role in many multilingual speech and language applications. Many models designed for transliteration give importance either to the sound or to the spelling of the word. The main objective of this project is to transliterate words by considering both the spelling and the pronunciation of words, so that the user can easily write text using either the spelling of the word or its pronunciation, for which the Morphological Cross Reference Approach is applied. In addition, another important objective of this project is to transliterate English documents, where the complete text in a file will be transliterated into the Telugu language. Usage properties such as platform independence and case insensitivity (as in a dictionary) are also treated as basic goals of this project.


Transliteration is a process where text in one language is transliterated into another language. In the Parallel Corpus based Morphological Cross Reference Approach, the parallel words of one language and the equivalent Romanized mappings of the source language words are maintained, and this parallel data is retrieved during transliteration. XML is used to maintain the parallel corpus; the data from the XML page is retrieved and parsed using Java.
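The paper maintains the corpus in XML and parses it with Java. As an illustration only, here is a Python sketch of the same lookup; the element and attribute names (`entry`, `en`, `te`) and the sample entries are assumptions, since the paper's corpus schema is not given:

```python
# Hypothetical sketch of the parallel-corpus lookup described above.
import xml.etree.ElementTree as ET

CORPUS_XML = """
<corpus>
  <entry en="hyderabad" te="హైదరాబాద్"/>
  <entry en="india" te="భారతదేశం"/>
</corpus>
"""

def lookup(word, root):
    # Case-insensitive search, matching the project's stated goal.
    for entry in root.iter("entry"):
        if entry.get("en") == word.lower():
            return entry.get("te")
    return None  # not in the corpus: fall back to the phoneme-based model

root = ET.fromstring(CORPUS_XML)
print(lookup("Hyderabad", root))
```

A missing word returns `None`, which is the signal for switching to phoneme-based transliteration.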

Figure: Pictorial representation of the Parallel Corpus Based Morphological Cross Reference Approach. The user-entered English text is read, and the parallel corpus is searched for the user input. If the word is found, the Romanized mapped word is retrieved from the parallel corpus and the grapheme-based transliteration model is applied; if the word is not found, the phoneme-based transliteration model is applied. Finally, the transliterated Telugu text is displayed.
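The decision flow of the approach can be sketched as follows; `grapheme_transliterate` and `phoneme_transliterate` are stand-ins for the models described in the paper, not actual implementations:

```python
# Minimal sketch of the corpus-first decision flow. The corpus dict and
# the two model functions are illustrative stand-ins.
def grapheme_transliterate(mapped_word):
    # Stand-in: the corpus already stores the Telugu mapping.
    return mapped_word

def phoneme_transliterate(word):
    # Stand-in: a real phoneme model would derive Telugu from pronunciation.
    return "<phoneme:" + word + ">"

def transliterate(word, corpus):
    mapped = corpus.get(word.lower())          # search the parallel corpus
    if mapped is not None:                     # word found
        return grapheme_transliterate(mapped)  # grapheme-based model
    return phoneme_transliterate(word)         # word not found: phoneme model

corpus = {"india": "భారతదేశం"}
print(transliterate("India", corpus))
```

Whole-document transliteration then reduces to applying `transliterate` to each token of the input file.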

Related work:

Many applications for English-to-Telugu transliteration have been designed in the past. These applications facilitate transliteration of English into Telugu through English input, but they are not efficient for many English words.


Google Transliteration

However, though all the approaches have used different types of matching-based methods, this work differs from them in the application scenario: all of them have applied those methods to word transliteration only.

We have implemented it for both words and whole text documents, based on the grapheme and phoneme approaches.


We have performed experiments by varying the input between the two different ways of getting text transliterated. We have tested the approach on these input types to estimate the efficiency of the system.

General observations: