This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Specificity of information deals with selection of algorithm associated with application. The criterion is determined by the capability of with standing any type of attack within the boundaries defined by the application. The underlying design principle addresses the transformation process of messages into complex cryptograms (cipher text) which can withstand for intense cryptanalysis. A cryptographic algorithm and a set of cryptographic keys are the basic elements of a security system. The degree of protection depends on several factors which include the strength of the cryptosystem, its implementation in hardware or software and the possible keys that are used for encryption. It is unlikely that the secrecy of an algorithm can be maintained for an extended period of time. A policy of keeping the algorithm public is to promote the widespread use as well as comparing the levels of cryptographic strength with other algorithms. The general assumption is that the opponent knows the cryptosystem that is under use. This approach assures the level of security which is dependent on the secrecy of the key only. Thus an essential factor in the design of a cryptographic system is focused on the length of the key, whose size places an upper bound.
Security analysis identifies several attack models [WIE 2004], which include Cipher text-only attack, Chosen plain text attack, Known plain text attack and Chosen cipher text attack. At the same time, the solution of any cryptogram in general involves four basic steps viz. determination of the language used, determination of the general system used, reconstruction of specific keys, reconstruction of the plain text.
The complex properties of natural languages play an important role in the formation of message text. A natural language text, in addition to statistical properties has another fundamental property, which may loosely be referred as meaning. For a meaningful message, it is possible to shorten it without destroying the meaning. This property is referred as redundancy. Shannon addressed [SHA 1948] the redundancy of a language as its uncertainty. Every language is associated with certain amount of redundancy. Large amount of redundancy gives a way to cryptanalyst towards simple mechanisms of decipherment. The basic goal of the cryptanalysis is to determine the key. In case of cipher text-only attack, cryptanalyst is having infinite computational resources. Based on the apriori knowledge the analyst is able to rule out certain keys. But many "possible" keys may remain, out of which one is the correct key and the remaining are called spurious keys. Elimination of spurious keys, while determining correct one demands for the knowledge from the statistical behaviour of the message text.
The Computing power available today makes brute-force attacks against cryptographic systems less costly and simple. Virtually the existing key size is not offering much protection against brute-force attacks. Although systems allow electronic information to be encrypted using large keys, increase in computing power keep pushing up the size of keys. The increase in length of the key results in improved hardware and software complexity. To protect information adequately, in addition to key complexity, the type (language)of application need to be considered. A generic model applied on Latin text based applications need not be applied for other languages. An approach for providing adequate security with smaller key size, which is dependent on the complexity of the language, need to be explored. The present work is addressed towards this direction.
2. DECIPHERMENT PROBLEMS
The urge to discover secrets is deeply ingrained in human nature from olden days. For thousands of years, kings and generals have relied on efficient communication. It is the threat of enemy interception that motivated the development of codes and ciphers. The code makers continually strived to construct ever-stronger codes for defending communications, while code breakers continually invented more powerful methods for attacking them. The communication revolution changed the society, where information is represented on a machine. Security of information is an increasingly valuable service, and, so the process of encoding messages, known as encryption, plays an increasing role in everyday life. Here machine perception deals with information in bit stream where as human perception deals with information from the natural language point of view.
Cryptography plays a vital role in maintaining the privacy of electronic information against threats. Assessing the strength of the cryptographic systems is an essential step in employing cryptography for information security. Cryptanalysis plays [CAN et al. 2006] a key role in this context. The main objective of cryptanalysis is not only to recover the plain text, but also to estimate the strength of the algorithm which is useful in designing a viable cryptographic algorithm.
Substitution ciphers represent the basic building blocks of complex and secure ciphers that are most widely used today. Understanding the vulnerability of simple ciphers is important while building more complex ciphers. A large number of techniques [STA et al. 2003] are available in the literature to break substitution ciphers, each of them having advantages and disadvantages over one another.
While attacking the cipher models, the goal is to derive the secret key or cipher text under the influence of known algorithm. Different techniques were explored [CAR and MAG 2007, NAL 2006] in the literature to find the key, there by decrypting the entire cipher text. Several possible methods to break a substitution cipher include exhaustive search, simulated annealing, frequency analysis, genetic algorithm, particle swarm optimization, tabu search and relaxation algorithm etc.
The exhaustive search method is the simplest out of all algorithms used to break substitution ciphers. This technique is possible when the cryptographic systems have finite key space and allowing for all possible key combinations to be checked until the correct one is found. Michael Lucks described [LUC 1988] a method for automatic decryption of simple substitution ciphers. This method uses exhaustive key search and is controlled by constraints imposed on word patterns. The method adopts large online dictionary of words.
Ryabko suggested [RYA et al. 2005] an attack on block ciphers called gradient statistical attack. Analysis of statistical properties of block ciphers is used in the process of cryptanalysis. Applicability of this method to the RC5 cryptanalysis is analysed. Automated attack algorithms are proposed [ALA 2004] for which human intervention is largely minimized by exploring genetic algorithms. Joe Gester proposed [GOE 2003] a search algorithm based on likely hood approach. The proposed Genetic algorithm involves an iterative process of finding the fitness of the individuals in the population. If this method is not satisfactory, then attempt is made to search a smaller problem space by restricting the key space to those which are generated by a keyword.
Alabassal proposed [ALA and WAH 2004] a method to discover the key of a Feistal Cipher using Genetic Algorithm approach. The possibility of using Genetic Algorithm in key search is attractive due to its ability in reducing the complexity of the search problem, limited to 50%. Tabu search is another optimization technique, employing iterative search, characterized by the use of a flexible memory. It eliminates local minima and searches beyond the local minimum. The experimental results suggest that the genetic algorithm recovers improved characters than the other algorithms.
Tabu Search Algorithm and Genetic Algorithm frame works are applied on three types of ciphers viz. AES, Hill and Columnar Transposition, in which later one is reported to be efficient. However, the Genetic Algorithm reported with reduced performance on the Hill Cipher and AES. The reported results show that the transposition cipher is susceptible to Tabu Search as well as Genetic Algorithm attacks on weak passwords.
Laskari applied [LAS et al. 2005] the Particle Swarm Optimization (PSO) method, which is originated from evolutionary computation. It is a population-based algorithm which exploits a population of individuals, while searching promising regions of the functional space. This approach is reported with evaluation of locating the key of a simplified version of DES. Algebraic cryptanalysis is a method [ADA 2006, SIM 2009] in which cryptanalysis begins by constructing a system of polynomial equations in terms of plaintext bits, cipher text bits and key bits. The factors that play important role are the number of variables, the number of polynomials and the degrees of the polynomials while modelling the system of equations. Another factor is reported as the computing power of cryptanalyst.
Differential cryptanalysis uses probabilistic properties of block ciphers, in which cipher text pairs and their differential properties are explored. An alternative approach while exploring non-linearity in block ciphers is termed as linear cryptanalysis. Both methods are explored [SEL 2008] extensively forming the foundation for cryptanalysis. Differential cryptanalysis is originally developed to attack Fast data Encipherment Algorithm(FEAL), and later proved its worth when applied to the cryptanalysis of the Data Encryption Standard (DES). One aspect of differential cryptanalysis that appears to be overlooked is the use of several differences to attack a cipher simultaneously.
The analysis of a cryptosystem not only measures the strength under the best differential attack, but should take into account of several attacks. These simultaneous attacks reduce the search space by a reasonable factor. RC5 is a fast block cipher algorithm. Several attempts of cryptanalysis of RC5 cipher were found [BIR and KUS 1998] in the literature. The strength of the RC5 algorithm is evaluated with respect to linear and differential attacks. Blowfish is a Feistel cipher in which the round function F is a part of the private key. A differential cryptanalysis on Blowfish is possible either against to a reduced number of rounds or with the piece of information which describes the function F. Vaudenay showed [VAU 2006] that the disclosure of F allows performing a differential cryptanalysis which can recover the key with 248 chosen plaintexts against a number of rounds reduced to eight. However, neither of the technique is generally useful for practical attacks on ciphers. The primary reason for the impracticality of differential and linear cryptanalysis is that they require large amounts of chosen plain text or known plaintext.
The role of Cryptanalysis has another face of exploring the weaknesses of the system. Few approaches of cryptanalysis are reported in the literature using language characteristics while understanding the strength of cipher system. One such approach deals with frequency statistics. A method is presented [RIN et al. 2008] for de-ciphering texts in Spanish using the probability of letters in the language. This method is proposed on a mono alphabetic cipher by assigning weights to different letter alphabets of Spanish language and achieved positive results. However 100% success is far reaching due to inconsistencies in the frequency distribution. Samuel W. Hasinoff presented [HAS 2003] a system for the automatic solution of short substitution ciphers. The proposed system explored n-gram model of English and stochastic local search over all possible keys of the key space. This method resulted in the median of 94% cipher letters with an exact reverse map. Sujith Ravi et al. studied [RAV and KNI 2009-1] attacking Japanese syllable substitution cipher, with the help of natural language models. They proposed several improvements over previous probabilistic methods, while achieving improved results.
From a natural language perspective, cryptanalysis task can be viewed as unsupervised tagging problem. Language Modeling (LM) techniques are used [KNI et al. 2006] to rank proposed decipherment. This work mainly attacks on difficult cipher systems that have more characters than English and on cipher lengths that are not solved by low-order language models. Language-model perplexity is related to decipherment accuracy. Jackobsen proposed [JAC 1995] a method for cryptanalysis of substitution ciphers. In this method the initial guess of the key is refined through a number of iterations. In each step the recovered plain text is used to guess the closeness between the recovered key and the correct key. G W Hart proposed [HAR 1994] a method which works well on a small sample of text where the probability distribution of letters is far from what is expected. This method reported with better performance on longer and easier cryptograms. However an exponential time is required in the worst case. This method fails under the condition that words in the plain text are in the dictionary.
A deciphering model is proposed by Lee [LEE et al. 2006] to automate the cryptanalysis of mono alphabetic substitution ciphers while exploring enhanced frequency analysis technique. In the decipherment process of mono alphabetic substitution cipher, monogram frequencies, keyword rules and dictionary are explored one after the other. Knight et al. described [RAV and KNI 2008] a number of natural language decipherment problems that use unsupervised learning. These include letter substitution ciphers, phonetic decipherment, character code conversion and word-based ciphers with emphasis on machine translation. An efficient algorithm that accomplishes a naive application of the Expectation Maximization (EM) algorithm to break a substitution cipher is reported. Ravi and Knight explored [RAV and KNI 2009-2] low-order letter n-gram model based on integer programming to search over the key space. This proposed method is reported with a study of decipherment accuracy as a function of n-gram order and cipher length.
Shannon proposal of entropy is extended [SHA 1951] to evaluate the redundancy of a language. Entropy measures the average information produced on each letter of a text in the language where as redundancy measures the amount of constraint imposed on a text in the language because of its statistical nature letters. H Yamamoto presented [YAM 1991] a survey on different information theoretic approaches in cryptology. The survey reported Shannon's cipher system, Simmons authentication approach, wire tape channel, secret sharing communication system approaches.
Diffie and Hellman introduced [HEL 1977] another approach to achieve practical security based on computational complexity. Trap door functions and one way functions are explored. Borissov and Lee computed [BOR and LEE 2007] bounds on the theoretical measure for the strength of a system under known plain text attack. Zhang proposed [ZHA 2005] the key equivocation, which is the conditional entropy of the key given the cipher text and corresponding plain text is considered as a measure of strength of the system.
A parametric evaluation and analysis of the behaviour of the algorithms with respect to cryptographic strength is presented [PRI and TOM 1987] by Prieto. In general Unicity distance is considered as a parameter for evaluating the strength of a cipher. But when cipher text length is less than unicity distance, the predicted key has a non zero error probability for which an upper bound is proposed [JAB 1996] by Jabri. Bauer computed [BAU 2007] unicity distance for different substitution and transposition techniques using different n-gram language models. According to Bauer, for encryption method if an empirical unicity distance exists, it is expected that by suitable attacks other than exhaustion, breaking of it is easy. Sujith Ravi and K.Knight carried out [RAV and KNI 2008] empirical testing of Shannon's information theory for decipherment uncertainty for n-gram language models on English.
Cryptanalysis is useful in finding the strength of a cryptosystem. In a practical model, one can test the block cipher with different known attacks and logically assign certain security level to it. To quantify the security of a block cipher precisely and proving that it satisfies specific security requirements is a difficult task. A parametric evaluation is to be effectively incorporated in the analysis, to provide information about number of spurious solutions and a single solution to the given cipher.
3. PROBLEM STATEMENT
A general security system maps the information units onto an orthogonal plane with the help of confusion and diffusion approaches. The amount of confusion and diffusion is determined by the algorithm on one side and the key on the other side. However the process of decipherment is made possible due to the linear transformation phenomena in the encryption process which can be applied in the inverse form. Modern approaches deal with message text modelled in the form of a binary bit stream. The mapping phenomena of this bit stream do not consider the construction process of the message text which is generally carried out by a human being in the real world. It is extremely difficult to understand the transformed properties of these relations that exist in the message text.
Various efforts are made in the cryptanalysis process for adoption of basic characteristics of information units in the form of text and its associated knowledge for key recovery process. The information theoretic approach [SHA 1949] is useful while understanding the uncertainty issues of inverse transformation in the form of message characteristics as apriori knowledge. Every natural language has its own characteristics determined by the associated rules in the modelling process of message units and the relations among them.
It is necessary to evaluate the characteristic nature of message units in the form of n-gram units, their distributed patterns, redundancy nature of these units and their relative conditional probabilities. The apriori knowledge from the above evaluation can be adopted in the decipherment problems as proposed by Shannon. The present work is aimed at addressing these issues in general with a specific case study on Indic scripts emphasizing on Telugu text.
Message text of various natural languages viz. Telugu, Kannada, Hindi and English is considered as a basis for the present evaluation. Language complexity versus strength of the algorithm is addressed. As discussed earlier it is an important step to determine the key size for achieving common strength and is mostly dependent on the language complexity and its associate redundancy. Unicity distance is adopted as a measure to determine the strength of the algorithm. The uncertainty in the form of redundancy transformation of the above languages is evaluated for the computation of unicity distance versus key size. The above evaluation is compared with specific reference to Telugu, Kannada, Hindi and English. These parameters are addressed in the decipherment process of cipher text only attack. A model is proposed to perform encryption and decryption of Indic scripts. The proposed model defines meaningful units that are embedded in text documents, composed of sentences, words and primitive meaningful units in the form of character or byte stream. In case of Indic scripts this byte stream is a complex byte stream, where as in case of Latin text the byte stream is a one-to one mapping and it is a simple byte stream. So the present model addressed the specificity by taking into consideration of segmentation of words into syllables and extraction of byte stream from the syllables. They are transformed into a code point byte stream, which is further converted into bit stream undergoing the transformations similar to that of any system. The entire evaluation based on the frequency distribution of unigrams, bigrams and trigrams is also addressed in the present work.
5. INFORMATION THEORETIC APPROACH AND UNICITY
In theoretical secrecy there is always a measure [MAC 2005] of uncertainty, regardless of the method of analysis. Entropy of a natural language is a statistical parameter that measures the amount of information produced on an average for every letter in language. Entropy also indicates the uncertainty of various letters in the message. Whenever the uncertainty reaches a maximum value, then the receiver has no information about the message that is transmitted. Thus entropy, which is a measure of uncertainty, plays a vital role in cryptanalysis. On the other hand, it is possible to shorten the message without destroying the meaning. This property of a natural language is referred as redundancy, which can measure the amount of constraint imposed on a text in that language due to its statistical nature.
Entropy is evaluated using a series of approximations F0, F1, F2, .â€¦â€¦.Fn, that considers large statistics of the language. The n-gram entropy Fn measures the entropy of n consecutive letters and is given by the expression (1)
Where mi is a block of n-1 successive letters
p(mi, j) is the probability of the n-gram mi,j
pbi(j) is the conditional probability of letter j after the block mi
j is any arbitrary number following the block mi
In the above expression, Fn measures the average uncertainty of letter j when n-1 proceeding letters are known. Considering long range statistics, the entropy H is given by equation (2)
â€¦â€¦..(2) The n-gram entropies for small values of n =1,2,3 are evaluated using standard unigram, bigram and trigram probabilities. The value of F0 is equal to log2 (A) where A represents size of the alphabets in the respective language. The value of A for Telugu is 64 by eliminating unused characters usage, where as it is 26 for English with out spaces.
Using the corresponding n-gram probabilities, the entropy values of unigram(F1), bigram(F2) and trigram(F3) of Telugu text are calculated using equations (3), (4) and (5).
= 4.7 bits/letter
= 8.67 - 4.70 = 3.97 bits/letter
= 12.10 - 8.67 = 3.43 bits/ letter
As per the evaluation of Shannon, the values of F1, F2, F3 for English are 4.14, 3.56 and 3.3 respectively which are listed in Table 1.
Table1. n-gram Entropy values of Telugu,Kannad, Hindi and English
Entropy of n-gram model
Successive letters in a language need not be independent. Conditional probabilities among successive letters further reduce the entropy. In
certain cases, the structure of the language itself is going to introduce some amount of redundancy. A language has a particular value of redundancy for any N, with unequal probabilities of realistic n-gram sequences. Frequency distribution of certain sequences are maximum when compared to others resulting in redundancy. The lesser the redundancy, the larger the protection of encrypted message against cryptanalysis.
The redundancy RL is defined in equation (6)
RL= 1- HL/log2|P| â€¦â€¦..(6)
Where HL is the entropy of the language and |P| is the size of the n-gram space. RL is a measure of the fraction of multiple occurrences of characters.
Assume that p1; p2; _ _ _ ; pn be the probabilities of the set of possibilities, then the entropy H is given by equation (7)
The uncertainty is associated with the choice of the message and choice of the key. The amount of information produced when a message is chosen is given by equation (8)
The amount of information produced when a key is chosen is given by equation (9)
For a perfect secrecy system, the maximum amount of information in the message is at most log2 n (when all messages are equiprobable). So, the information can be concealed completely if the key uncertainty is at least log2 n.
The amount of information that a realization of X gives, when the occurrence of Y=y is already known is termed as conditional entropy denoted by H(X|Y=y) where X, Y are two random variables. This conditional entropy results in equivocation in cryptanalysis. Given the cipher text (c), either key (k) or plain text (m) is to be recovered. That means the conditional entropies H(k|c) or H(m|c) are to be evaluated. There are two significant equivocations, one that of the key and other that of the message. These are denoted by H(k|c) and H(m|c) respectively. It is desirable to use the equivocation as a theoretical index for measuring secrecy. The message and key equivocation are given by equations (10) and (11).
Where P(m|c), P(k|c) are the posterior probabilities of message and key respectively if cipher text c is intercepted. P(m,c) and P(k,c) are the joint probabilities of message and cipher text as well as key and cipher text respectively.
It is possible to construct secrecy systems with a finite key in which the equivocation does not approach zero as N ï‚®ï‚¥. In this case, no matter how much material is intercepted, it is not possible to obtain a unique solution to the cipher. Such systems are called ideal systems.
Consider a string of plain text symbols m1,m2, â€¦mn and cipher text c1,c2, â€¦cn with the help of a key. In an ideal case of cipher text only attack with infinite computational resources of natural language, certain keys are eliminated; except the correct key, all other keys are spurious keys.
Let RL be the redundancy of the respective language, then for a string of sufficiently large cipher text of length n, the expected number of spurious keys(assuming equi probable keys), Sn satisfies the relation (12).
Where |P| and |K| are the length of message and key spaces respectively. The value n at which the number of spurious keys Sn is zero is called unicity distance U. Substitute Sn with zero in the above expression (12), and n=n0
|P|n0RL = |K|
n0RL log2 |P|= log2|K|
n0â‰ˆ log2|k|/( RL log2 |P| )
Therefore Uâ‰ˆ H(K)/( RL log2 |P| )
U represents the average amount of cipher text needed for an opponent to uniquely determine the key. Unicity distance tends to infinity with zero redundancy, which is impractical in the real world. Evaluation of unicity distance doesn't make any deterministic prediction, rather gives probabilistic results. Larger values of unicity distance is a probabilistic measure of increased strength of the crypto system. The above model is evaluated on Telugu, Kannada , Hindi and English languages and are listed in Table 2.
Table 2. Unicity Distance of Telugu, Kannada,
Hindi and English for n-grams of n=1,2,3.
Table 3. Unicity Distance of Telugu, Kannad, Hindi and English for different key sizes
Unicity Distance in characters
Evaluation of unicity distances for four different languages, Telugu, Kannada, Hindi and English with varying key lengths is presented in Table 3. The unicity distance which is a metric for the strength of the cryptographic system is more for Telugu than English for a particular key size. But the techniques that are available in the literature for Latin texts are applied directly with the same key size on any application irrespective of its language resulting in increased hardware and software complexity. For example the unicity distance for Telugu with key size 56 bits is same as that of English with key size 128 bits. As per the above evaluation, selection of key size for a specific application can be made easy, given the language data.
6. TEXT RECOVERY USING LANGUAGE MODEL
Different approaches of cryptanalysis in the literature use language characteristics to understand the strength of cipher system. These characteristics are broadly categorized as independent language units and their relations among them. The pattern of the independent units is in the form of probabilistic (frequency) distribution, which inturn provides partial information content and also infers the redundancy nature. In substitution ciphers, the encryption process transforms message units (which are independent of the language units) and maps these units onto orthogonal plane. However this transformation cannot determine the changes that influence the frequency distribution patterns in the cipher text.
In natural language modelling, n-gram models are probabilistic models of text that use some limited amount of history, character or word dependencies, where n refers to the number of characters or words that participate in the dependency relation. A statistical language model assigns a probability to a sequence of m consecutive characters or words by means of a probability distribution. Relational patterns among information units of plain text are another parametric data that can be addressed in the model. In the present work the analysis is limited to frequency distribution patterns of n-gram based units. The entropy H(P) and Redundancy (RL) are the parameters of interest in this work.
In a conventional cryptographic system, a plain text message (m) is generated by the sender. An encryption transformation E, which depends on a secret key k, transforms the plain text m to cipher text (c) using the expression c=Ek(m). Cipher text c is then transmitted to the receiver where the decryption transformation D, which depends on secret key k ( in case of symmetric key cryptography), is used to recover the plain text m using the expression m= Dk(c). The assumption is that an opponent does not possess k and cannot recover m from c using D. (Note that algorithms D and E may be kept public). For the key k to remain secret, a secure communication channel is needed between the sender and receiver.
The information available for a cryptanalyst is a variable that can be only cipher text, the knowledge of the system (except the key), the algorithm used, the characteristics of the language and language statistics. If s represents the information available to an opponent and D1 represents the process of cryptanalysis, then the deduced information m1 is expressed as m1 = D1(s). The coincidence between m and m1 is a measure of strength of the system.
When English Text is considered, the problems are much less because the correspondence is between the transformed text and the original text. Though the key is generated randomly, since it is fixed for entire message, the mapping function transforms it into a distinct point in the orthogonal plane. On many occasions almost all characters are present for large text size. Even for a medium sized text this is true because of less number of characters that exist. More over because of one-to-one mapping, predictability is appropriate. The percentage of retrieved code points is calculated using frequency distribution. For Telugu script, the number of character codes that exist in the original text need not be complete set. Even though the mapping function takes care of one to one correspondence, in the transformation process all character codes may not exist from the original set of code points. This may lead to confusion in the crypto analysis.
The apriori distribution of the language units in the plain text is computed. These language patterns are transformed into the language patterns of cipher text using a key. The distribution is evaluated on mapped language patterns. The retrieval mechanism is applied with an assumption that the attack is a cipher text only attack. Mapping between the patterns is done using n-gram model.
In Indic scripts, words are treated as sequences of syllables (basic unit ). A syllable is formed using a canonical code structure given by (C(C))CV where C stands for consonant and V stands for vowel. A detailed analysis is carried out [PRA 2001] by Pratap on various possible combinations of the canonical structure. In case of Indic scripts there is many to one correspondence in the form of code sequences. While representing a syllable, non uniform size of code point lengths exist. Each syllable is of varying size based on the canonical structure and whose size may range from 1 to 10 bytes. Based on this principle the language units are modeled in Indic scripts.
In case of Latin script, the language units are characters, where as for Indic scripts these language units are canonical syllables. The cryptanalysis is attempted on substitution ciphers of fixed length block size. Since the canonical syllable possesses variable size, the intuitive units of syllable (code point) are adopted while modeling fixed length block.
Now n-gram models are applied on these code points. The distribution patterns of plain text are extracted by evaluating the frequency of each code point as apriori knowledge. The similar approach is adopted while computing the distribution patterns of cipher text. Mapping is carried out between units in plain text and cipher text using nearest neighborhood. The analysis is carried out on English and Telugu separately on various samples of varying size from 2000 to 75000. The results are presented in Table 4.
Table 4. Retrieved Plain Text in English and Telugu for 8-bit key
Plain Text Size in number of characters
% of character code points retrieved
As per the above observations, it is easy to infer that cryptanalysis of Telugu is much more difficult when compared with English. On an average the percentage of plain text retrieved in case of English is 39.11 where as in case of Telugu it is only 12.13%. Then the larger key size applicable to Latin text can be reduced in case of complex languages like Telugu even by providing greater level of security. The percentage of plain text retrieved is not linear with text size because a proper threshold function is required to map cipher text symbols to corresponding plain text symbols. The proposed cryptographic model is extended further with 16-bit key on English and Telugu using the above mentioned approach. The results indicate that the percentage of retrieved plain text for Telugu is observed in the range of 1-10% only. The observations on increased key size reflected the fact that the reverse mapping is further more difficult in case of Telugu.
Security issues associated with message text found its significance from the beginning. Complexity of the language will define the strength of the algorithm. To achieve a common strength, with a specific algorithm for any message irrespective of the language is not feasible. It is possible to define variation in complexity of algorithm to achieve a common strength. So, cryptanalysis from the complexity of language point of view is helpful in the specific design aspects of the crypto system. Language complexity is defined interms of language model and also the strength of the crypto system interms of correlation between the complexity and retrieval efficiency. A sample set of 2,400,000 code points is used to evaluate the behavioral aspects of frequency distribution, in syllables of Telugu text.
While evaluating the strength of the algorithm based on the language complexity, unicity distance is considered as a parameter for evaluation. Redundancy associated with language and key lengths are the parameters that are addressed in the present work. The unicity distance of Telugu for a key size of 56 bits is observed to be the same for a key size of 128 bits on English. Language complexity of Indic scripts is found with a requirement of half of the key size necessitated for English. To achieve same unicity distance, one has to formulate the length of the key based on the language chosen. Evaluation is carried out on retrieval efficiency using code point distribution for Telugu and English with varying sample set sizes from 2000 to 75000.
The maximum retrieval efficiency is observed to be 51% and 21% with English and Telugu respectively. However the minimum mapping is observed to be 27% and 1% respectively. The redundancy factor of English is observed in the above results. Probability distribution of n-gram code points is further evaluated on bigrams and trigrams reflecting the above results.
Extension of this model on the conditional probabilities of language units can be addressed as a future task. Probability distribution of the relational operators among meaningful language units is a long range work that is to be addressed.