# Several Approaches To Break Substitution Ciphers Computer Science Essay



There are several approaches to break substitution ciphers, each of them having advantages and disadvantages over one another. Substitution ciphers represent the basic building blocks of more complex and more secure ciphers that are used today. Hence understanding the vulnerability of simple ciphers is important in using and building more complex ciphers.

When attacking a cipher, one can mount a key recovery attack, in which the goal is to derive the secret key; a decryption attack, in which the goal is to decrypt the ciphertext; or key recovery from a decryption attack. Different techniques have been explored in the literature to find the key of a cipher and thereby decrypt the entire ciphertext. Possible methods to break a substitution cipher include exhaustive search, frequency analysis, simulated annealing, genetic algorithms, particle swarm optimization, tabu search, and relaxation algorithms.

The exhaustive search method is the simplest of all algorithms used to break substitution ciphers. It applies when the cryptographic system has a finite key space, allowing all possible keys to be checked until the correct one is found. This is an acceptable technique for breaking a monoalphabetic shift cipher. Exhaustive search is rarely the best first choice, since it is time consuming, but it decrypts the text with 100% accuracy.

Brute force is one way of trying to break simple substitution ciphers, but the number of possible keys that need to be checked is huge (26! ≈ 4 × 10^26 for a general monoalphabetic substitution). Thus, in practice, an exhaustive search may be impossible within a reasonable amount of time. To overcome this, faster algorithms have been developed for breaking such ciphers.
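For the shift cipher mentioned above, where the key space is only 26, exhaustive search is trivial. The sketch below illustrates the idea; the word-counting heuristic used to pick the winning shift is an illustrative assumption, not part of any specific published attack.

```python
# Exhaustive search over all 26 Caesar shifts. The scoring heuristic
# (counting common English words) is a crude, assumed stand-in for a
# real language model.

def shift_decrypt(ciphertext: str, shift: int) -> str:
    """Undo a Caesar shift of `shift` positions on lowercase letters."""
    out = []
    for ch in ciphertext:
        if ch.isalpha():
            out.append(chr((ord(ch.lower()) - ord('a') - shift) % 26 + ord('a')))
        else:
            out.append(ch)
    return ''.join(out)

def break_shift_cipher(ciphertext: str) -> tuple[int, str]:
    """Try every shift; keep the candidate containing the most common
    English words."""
    common = {'the', 'and', 'of', 'to', 'in', 'is', 'it'}
    return max(
        ((s, shift_decrypt(ciphertext, s)) for s in range(26)),
        key=lambda pair: sum(w in common for w in pair[1].split()),
    )

key, plain = break_shift_cipher("wkh nhb lv lq wkh dwwdfn")
```

With only 26 keys the search is instantaneous; for a general substitution the same loop over 26! keys is hopeless, which is exactly why the heuristic methods below exist.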

Raphael et al. presented a framework designed to describe block cipher cryptanalysis techniques compactly, regardless of their individual differences. The framework describes possible attacks on block ciphers and specifies the technical details of each type of attack and their respective strengths. It was shown how to apply the framework to describe various attacks on popular and recent block ciphers. A comparative study was done between the proposed framework and Commutative Diagram Cryptanalysis (CDC). CDC is essentially a framework for expressing certain kinds of attacks on product ciphers, whereas the proposed framework focuses on distinguishing attacks and on key-recovery attacks that exploit such distinguishers.

Biham studied the effect of multiple modes of operation under chosen-ciphertext attack when the underlying cryptosystems are DES and FEAL-8. It was shown that in many cases these modes are weaker than the corresponding multiple ECB mode. In most cases, these modes are not much more secure than a single encryption using the same cryptosystem.

For example, the triple CBC mode (CBC|CBC|CBC) encrypted using single DES, and the modes CBC|CBC|ECB, CBC|ECB|CBC, and ECB|CBC|CBC, were shown to be weaker than triple DES; their strength is comparable to that of single DES. It was concluded that strong modes of operation, invulnerable to key-recovery attacks, should not be built by combining different simpler modes or by using internal feedbacks. Instead, it was suggested to use a single mode with multiple encryption as its underlying cryptosystem. A multiple mode, or any other mode that uses internal feedbacks, can be strengthened by eliminating those feedbacks.

Automated attack algorithms have been developed for which human intervention is not necessary. These methods terminate either after a predetermined number of iterations or once a message has been successfully decrypted. One such automated attack is the genetic algorithm, which is widely used for cracking substitution ciphers.

Joe Gester proposed and implemented a simple approach based on searching the key space generated by likely keywords. The proposed genetic algorithm is an iterative process of evaluating the fitness of the individuals in the population. Genetic operators are then applied selectively to members of the population to create a new generation, and the process is repeated. Each generation is created by selecting members of the previous generation at random, weighted according to their fitness.

The proposed method uses a simple genetic algorithm to search the key space of cryptograms. If this is not satisfactory, a smaller problem space is searched by restricting the keys to those generated by a keyword. In the first approach, a fitness function based on bigram and trigram counts was used to rate the quality of each individual solution in the population. New populations were then constructed by randomly applying either crossover or mutation. With this approach, populations rapidly reach local maxima or never seem to converge to anything resembling English, even though the fitness function evaluated the fitness of candidates properly. The crossover method used was more a crossover coupled with several mutations than a simple crossover.
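The bigram-count fitness idea and the swap-style mutation operator described above can be sketched as follows. The tiny bigram table is an illustrative assumption; a real attack would use a full 26×26 table of English bigram statistics.

```python
import random
from collections import Counter

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

# Assumed reference weights for a few common English bigrams,
# for illustration only.
ENGLISH_BIGRAMS = {'th': 27, 'he': 23, 'in': 20, 'er': 18, 'an': 16}

def decrypt(ciphertext: str, key: str) -> str:
    """`key` maps ciphertext letter ALPHABET[i] to key[i]."""
    return ciphertext.translate(str.maketrans(ALPHABET, key))

def fitness(ciphertext: str, key: str) -> float:
    """Higher when the decryption's bigram counts line up with English."""
    plain = decrypt(ciphertext, key)
    counts = Counter(plain[i:i + 2] for i in range(len(plain) - 1))
    return sum(counts[bg] * w for bg, w in ENGLISH_BIGRAMS.items())

def mutate(key: str) -> str:
    """Mutation operator: swap two entries of the key."""
    i, j = random.sample(range(26), 2)
    k = list(key)
    k[i], k[j] = k[j], k[i]
    return ''.join(k)
```

A full genetic algorithm would add fitness-weighted selection and a crossover operator on top of these pieces; keeping the key a permutation under crossover is what forces the "crossover coupled with mutations" compromise noted above.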

To allow for a simplified crossover mechanism, it is necessary to allow duplicate characters. An ad hoc heuristic approach may be suitable for automating the solution of substitution ciphers.

Like other heuristic algorithms, the genetic algorithm will not always produce the exact result; it gives solutions close to the correct one. In deciphering, after applying the "best" key produced by the genetic algorithm, it is usually easy for a human to read the "decrypted" text and then make small changes to recover the correct plaintext. Experiments with the genetic algorithm suggested that a fitness of about 0.9 is enough to determine the vowel and consonant substitutions, after which visual examination by a human can be used to decrypt the entire text.

Yean Li et al. studied the effect of an optimization-heuristic cryptanalytic attack on block ciphers. A known plaintext is encrypted by the chosen cipher using a randomly chosen key of reduced length. Each candidate key generated by the heuristic is used to decrypt the known ciphertext, and the resulting plaintext is compared with the original. The fitness value of a candidate key is the percentage of character positions at which the retrieved text matches the original. The search for the correct key continues until a match, or the closest match possible within the constraints of the test environment, is found.

Tabu search is another optimization technique used for breaking substitution ciphers. Experimental results comparing it with genetic algorithms and simulated annealing suggest that the genetic algorithm recovers slightly more characters than the other two, that simulated annealing is much simpler to implement than either, and that tabu search obtains the desired result faster than the other two.

Tabu search and genetic algorithm frameworks have been applied to three types of ciphers: AES, the Hill cipher, and the columnar transposition cipher. The genetic algorithm produced results efficiently compared with tabu search, but generally did not perform well on the Hill cipher and AES. The results showed that the transposition cipher was susceptible to both tabu search and genetic algorithm attacks on weak passwords, and that the polygraphic substitution (Hill) cipher is vulnerable to both attacks as well.

Simulated annealing is another technique, similar to the genetic algorithm, used to break substitution ciphers. The main difference is that the genetic algorithm maintains a pool of possible keys at each step, while simulated annealing keeps one value at a time. Combined with a few other simplifications, this makes simulated annealing much simpler than the genetic algorithm. The genetic algorithm matches more letters correctly than simulated annealing does for any given ciphertext length, but the difference is not large, so simulated annealing remains a good technique for breaking simple ciphers.

Simulated annealing takes fewer iterations to reach the correct solution, but its total running time exceeds that of the genetic algorithm, because it spends a long time in each iteration on a detailed examination of every possible perturbation. Even though the genetic algorithm seems better in most respects, the main advantage of simulated annealing is that it is much easier to implement, so it still has an important role in cracking simple substitution ciphers.
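The single-candidate scheme described above can be sketched as a short annealing loop. `score` stands for any fitness function over keys (e.g. the bigram matching discussed earlier); the geometric cooling schedule and its constants are illustrative assumptions.

```python
import math
import random

def anneal(score, initial_key, steps=10_000, t0=10.0, cooling=0.999):
    """Simulated annealing over substitution keys. `score(key)` must be
    higher for better keys; the cooling constants are assumed values."""
    key = list(initial_key)
    best = key[:]
    temp = t0
    for _ in range(steps):
        # Propose a neighbour by swapping two key positions.
        i, j = random.sample(range(len(key)), 2)
        cand = key[:]
        cand[i], cand[j] = cand[j], cand[i]
        delta = score(cand) - score(key)
        # Always accept improvements; accept worse keys with a
        # probability that shrinks as the temperature cools.
        if delta > 0 or random.random() < math.exp(delta / temp):
            key = cand
            if score(key) > score(best):
                best = key[:]
        temp *= cooling
    return ''.join(best)
```

Early on, the high temperature lets the search escape the local maxima that plagued the genetic algorithm above; late in the schedule it degenerates into plain hill climbing.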

Particle swarm optimization is another machine-learning-based method used for breaking substitution ciphers. The algorithm starts by selecting a random population of potential solutions, each of which is called a particle. Particle swarm optimization is a good method for breaking simple substitution ciphers as long as bigrams are used to calculate the fitness of particles; using unigrams as the fitness measure does not give useful results even for the simplest ciphers.

The relaxation algorithm is another technique used to break substitution ciphers. This is a graph-based technique that relies on iterative, parallel updating of values associated with each node. The nodes of the graph, v_i, are elements of the cipher alphabet. Each node has a random variable, l_i, associated with it, representing the probabilities of the possible plaintext characters that the node may stand for. The probabilities of a node are updated based on the appearance of its two neighbours in the ciphertext and on trigram analysis of the original language.

The role of cryptanalysis is also to study a cryptographic system with an emphasis on exploring its weaknesses. The complex properties of natural languages play an important part in cryptanalysis, and different approaches in the literature use language characteristics to assess the strength of a cipher system. One such approach relies on frequency statistics, based on the assumption that each letter in the plaintext is substituted by another letter in the ciphertext.

Frequency analysis is the process of determining how often each symbol of the encrypted text occurs within the ciphertext. This information, combined with knowledge of the symbol frequencies of the language used, is used to determine which ciphertext symbol maps to which plaintext symbol. Frequency analysis is a fast way to decipher encrypted text, but it requires knowledge of the language statistics of the original text, and its disadvantage is that it relies on constant human interaction to determine the next move in the process.
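The first, automatic step of this process can be sketched as follows: rank ciphertext symbols by frequency and pair them with letters ranked by typical English frequency (the standard "etaoinshrdlu…" ordering). As noted above, the resulting guess normally needs human refinement.

```python
from collections import Counter

# English letters in decreasing order of typical frequency.
ENGLISH_BY_FREQ = 'etaoinshrdlcumwfgypbvkjxqz'

def frequency_guess(ciphertext: str) -> dict[str, str]:
    """Map each ciphertext letter to the English letter of the same
    frequency rank. Letters absent from the ciphertext stay unmapped."""
    letters = [c for c in ciphertext.lower() if c.isalpha()]
    ranked = [sym for sym, _ in Counter(letters).most_common()]
    return {sym: ENGLISH_BY_FREQ[i] for i, sym in enumerate(ranked)}
```

On short ciphertexts the rank-for-rank pairing is unreliable (sample frequencies deviate from the language's), which is exactly where the interactive refinement, or the search methods below, take over.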

Bárbara et al. presented a method for deciphering texts in Spanish using the probability of usage of letters in the language; it essentially performs cryptanalysis of a monoalphabetic cryptosystem. The method assigns weights to the letters of the Spanish alphabet according to their usage in the language, based on an analysis of symbol frequencies in Spanish plaintext. The same analysis was performed on the ciphertext. Every encrypted character was mapped to a single character of the original message and vice versa, and in this way the original text was retrieved from the ciphertext. A few characters differ, because some letters and symbols share the same frequency.

This method of deciphering cryptograms in Spanish gave positive results, but deciphering was not 100% successful, since some letters and symbols have the same frequency values.

A simple substitution cipher applies a substitution to the set of letters of the plaintext alphabet such that different ciphertext letters correspond to different plaintext letters. To encode a text by character-wise substitution, an "infinite key" was used, and each plaintext letter is replaced by a ciphertext letter by means of a one-to-one self-mapping of the set of letters; knowledge of the key is then necessary to reconstruct the plaintext. The work of Mineev et al. was concerned with a similar smoothing effect on the simple substitution cipher, resulting from contracting the alphabet by quadratic residues and non-residues in finite fields. The Russian alphabet was considered as a sample in the proposed work.

Single-letter frequency analysis is helpful for obtaining an initial key with which to perform a more powerful bigram analysis. Beyond single characters, relations between ciphertext and plaintext in terms of bigrams and trigrams also play a vital role.

Samuel W. Hasinoff presented a system for the automatic solution of short substitution ciphers. The system operates using an n-gram model of English and stochastic local search over all possible keys in the key space, and correctly decoded a median of 94% of cipher letters. The technique consists of two main components: a generic stochastic local search (SLS) method, which navigates the key space, and a scoring function, which evaluates the goodness of candidate keys. The score of a particular key is the log-likelihood under an n-gram language model of the ciphertext decrypted with that key. Here, character-level n-gram models are used; such models can also be extended to the level of words. Most practical systems employ bigrams or trigrams.
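The scoring side of such a system can be sketched as a bigram log-likelihood, summing log P(next char | current char) over the decrypted text. The floor probability standing in for smoothing, and the toy probabilities in the test, are assumptions; Hasinoff's actual model and smoothing details differ.

```python
import math

def bigram_log_likelihood(text: str, bigram_probs: dict,
                          floor: float = 1e-6) -> float:
    """Log-likelihood of `text` under a character bigram model.
    Unseen bigrams get a small assumed floor probability
    (a crude stand-in for real smoothing)."""
    return sum(
        math.log(bigram_probs.get(text[i:i + 2], floor))
        for i in range(len(text) - 1)
    )
```

A local search then proposes key changes (e.g. letter swaps) and keeps those that raise this score for the resulting decryption; higher (less negative) scores mean the text looks more like the modeled language.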

Sujith Ravi et al. studied attacks on a Japanese syllable substitution cipher. Several natural language processing techniques were used to attack the cipher; they made several novel improvements over previous probabilistic methods and reported improved results.

In general, the receiver uses the key to convert ciphertext to plaintext, but a third party who intercepts the message may guess the original plaintext by analyzing the repetition patterns of the ciphertext. From a natural language perspective, this cryptanalysis task can be viewed as a kind of unsupervised tagging problem. Language modeling (LM) techniques are used to rank proposed decipherments. This work mainly attacks difficult cipher systems that have more characters than English, and cipher lengths that are not solved by low-order language models, and relates language-model perplexity to decipherment accuracy.

The work involves processing the input file into syllables and removing typographical errors. The resulting sequence is approximately one million syllables long and contains 65 unique syllable types. The data is split into three parts: LM training data, LM smoothing data, and plaintext messages of various sizes. When a 3-gram LM was trained on varying amounts of data, the decipherment results showed that more LM data (i.e., more knowledge about the language) leads to better decipherment. With improved LM smoothing, further gains in accurate decipherment of shorter texts can be achieved, and further algorithms may lead to accurate decipherment of more complex Japanese cipher systems, including translation to other languages.

Jakobsen proposed a method for the cryptanalysis of substitution ciphers in which an initial guess of the key is refined through a number of iterations. In each step, the plaintext recovered with the current key is evaluated to determine how close the current key is to the correct one. Bigram distributions of letters in the ciphertext and plaintext suffice to solve the cipher with this method: a distribution matrix is constructed only once and then manipulated in each iteration. The method is suitable for both monoalphabetic and polyalphabetic substitution ciphers.
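The refinement loop can be sketched as greedy pairwise swapping: start from a frequency-based key, then repeatedly swap two key entries, keeping a swap only when it improves the score. Here `score` stands in for the bigram distribution-matrix comparison, and the exhaustive pairwise swap order is a simplification of the original schedule.

```python
from itertools import combinations

def refine_key(key: list, score) -> list:
    """Greedy pairwise-swap refinement of a substitution key.
    `score(key)` must be higher when the decryption under `key`
    looks more like the target language."""
    best = score(key)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(key)), 2):
            key[i], key[j] = key[j], key[i]
            s = score(key)
            if s > best:
                best = s
                improved = True
            else:
                key[i], key[j] = key[j], key[i]  # undo the swap
    return key
```

Because only the key changes between iterations, the expensive language statistics are computed once up front, which is what makes this approach fast in practice.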

G. W. Hart proposed a method for solving cryptograms that works well even in difficult cases where only a small sample of text is available and the probability distribution of letters is far from what is expected; it also performs well on longer, easier cryptograms. Exponential time is required in the worst case, but in practice it is quite fast. The method fails completely when no words of the plaintext appear in the dictionary.

A deciphering model was developed by Lee et al. to automate the cryptanalysis of monoalphabetic substitution ciphers using an enhanced frequency analysis technique. The method is a three-level hierarchical approach that applies monogram frequencies, keyword rules, and dictionary checking in turn. First, monogram frequencies are computed for all letters in the ciphertext. Second, keyword rules are used to decipher further unknown ciphertext letters. Letters still undeciphered after the second step are identified in the third step by dictionary checking. This approach was tested on two short cryptograms, and both were deciphered successfully in good computational time. The enhanced frequency analysis approach was observed to decrypt faster than Hart's approach, and the combination of keyword rules and dictionary checking mitigates the weakness of Hart's approach.

Knight et al. discussed a number of natural language decipherment problems that use unsupervised learning, including letter substitution ciphers, phonetic decipherment, character code conversion, and word-based ciphers relevant to machine translation. Basic unsupervised learning techniques often fail on the first trial, so techniques for understanding errors and increasing performance were discussed.

An efficient algorithm was implemented that accomplishes the same thing as a naive application of the EM algorithm to break a substitution cipher. Unsmoothed parameter values for an English letter-bigram model P(p) were estimated from sample data and tabulated, and a uniform P(c | p) was set up. The decipherment yielded by the EM method resulted in 68 errors. Using a 1.5-million-character data set instead of a 70,000-character data set reduced the number of errors from 68 to 64, and applying lambda interpolation smoothing to P(p) reduced the errors further to 62.

Ravi and Knight introduced another method that uses low-order letter n-gram models to solve substitution ciphers. The method is based on integer programming, which performs an optimal search over the key space and guarantees that no key is overlooked; it can be executed with standard integer programming solvers. The method was used to study how decipherment accuracy varies as a function of n-gram order and cipher length.

Fifty ciphers each of lengths 2, 4, 8, . . . , 256 were created and solved with 1-gram, 2-gram, and 3-gram language models, and the average percentage of ciphertext decoded incorrectly was recorded.

The solution obtained by integer programming is exact with respect to the objective. With a 2-gram model, the EM algorithm gave 10% error on a 414-letter cipher, whereas the integer programming method solved the same cipher with only 0.5% error. For short ciphers the improvement was much greater: on a 52-letter cipher with a 2-gram model, integer programming gave 21% error, compared with 85% for the EM method. The 1-gram model works badly in this scenario, which is in line with Bauer's [ ] observation for short texts.

The work mainly focuses on letter substitution ciphers that also include spaces, and compares decipherment of ciphers with and without spaces using different English n-gram models. For ciphers without spaces, the English models are retrained on text without spaces. With simple language models, cracking space-free ciphers is more difficult than cracking ciphers with spaces.

Entropy is a statistical parameter that measures how much information is produced, on average, per letter of text in a language, while redundancy measures the amount of constraint imposed on a text by the statistical nature of the language. Shannon proposed a new method of estimating the entropy and redundancy of a language. The method exploits the knowledge of language statistics possessed by those who speak the language, and depends on their success in predicting the next letter when the preceding text is known. Some properties of an ideal predictor were also developed.

An approach for finding n-gram entropies was developed, based on a study of 26-letter English with spaces and punctuation ignored. The n-gram entropies can be calculated from letter, digram, and trigram frequencies; the estimated values for n = 1, 2, 3 are 4.14, 3.56, and 3.3 bits respectively. Based on the frequencies of symbols in the reduced text, it is possible to set bounds on the n-gram entropy of the original language.
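The n = 1 case above is the familiar formula H1 = −Σ p_i log2 p_i over letter frequencies; applied to real English letter frequencies it gives the quoted 4.14 bits per letter. A minimal worked version:

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Estimate H1 = -sum(p * log2(p)) from the letter frequencies
    of `text`, ignoring non-letters (as in Shannon's 26-letter study)."""
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    return -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(letters).values()
    )
```

The digram and trigram entropies follow the same pattern with conditional probabilities; each added letter of context lowers the estimate, which is exactly the 4.14 → 3.56 → 3.3 trend quoted above.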

Shannon's approach deals with the basic mathematical structure of secrecy systems. His work defines theoretical secrecy: how immune a system is to cryptanalysis when the cryptanalyst has unlimited time and computing power available for the analysis of cryptograms. This connects cryptography to communication systems in which noise is present, where entropy and equivocation apply. The work also concerns practical secrecy, the level of security necessary to make the system secure against an enemy who has limited time and limited computational resources for attacking an intercepted cryptogram; this leads to methods for constructing systems that require a large amount of work to solve. Finally, an analysis of the basic weaknesses of secrecy systems was made, and certain incompatibilities among their various desirable qualities were discussed.

H. Yamamoto presented a survey of information-theoretic approaches in cryptology, covering Shannon's cipher system, Simmons' authentication approach, the wiretap channel, and secret sharing communication systems. Shannon introduced the concept of perfect secrecy, which means that even with unlimited time and computing power no information about the message should leak. To achieve this, the key space must be at least as large as the message space, which is practically infeasible.

Diffie and Hellman introduced another approach, achieving practical security based on computational complexity through trapdoor functions and one-way functions. To prevent active attacks, authenticity was introduced, and several authentication mechanisms have been developed in the literature; Simmons proposed an authentication mechanism that holds for any type of system. The conclusion drawn from this work is that the information-theoretic approach is as important as the computational-complexity approach.

Shannon's information-theoretic approach to cryptography assumes unlimited computational abilities. The work of Hellman extends Shannon's theory, developing the concepts of matching a cipher to a language and of the trade-off between local and global uncertainty. Hellman defined a model in which the messages are divided into two subsets: one of meaningful messages, each with the same a priori probability, and one of meaningless messages, assigned a priori probability 0. The Shannon-theoretic approach considered in this work is not, in general, directly applicable to the design of practical cryptographic systems, but it is useful for gaining insight into their design.

Borissov and Lee proposed bounds on a theoretical measure of the strength of a system under known-plaintext attack. Dunham [..] had proposed the key equivocation H(K|M,C), the conditional entropy of the key given the ciphertext and corresponding plaintext, as a measure of the strength of a system. Lower and upper bounds were found for the simple substitution cipher. This work indicates that key recovery in a known-plaintext attack on substitution ciphers is more difficult when the key has many fixed points.

A study was done on estimating the key equivocation of secrecy systems that use Additive-Like Instantaneous Block (ALIB) ciphers. In general, when the block length is large, the key equivocation is difficult to find. Zhang developed a simplified method for computing the key equivocation for two classes of secrecy systems with ALIB encipherers, using key equivocation rather than error probability as the criterion.

For simple substitution ciphers, bounds on the message equivocation are derived in terms of the key equivocation, and it was established that the message equivocation approaches the key equivocation quickly. It was observed that the exponential behavior of the message equivocation is determined not by the redundancy of the message source but either by the symbol probabilities that are closest in a certain sense or by the sum of the two smallest symbol probabilities.

The importance of the key appearance equivocation in evaluating and bounding the message equivocation in terms of the key equivocation has been established for general cipher systems. When evaluating the strength of a cryptographic system, the cryptanalyst is assumed to know the general set of transformations and the statistics of both the message and key sources.

Maurer presented a review of the relation between information theory and cryptography. Shannon's approach fixes a lower bound on the key size needed to achieve a particular level of security, but more recent models contradict this: with a short key it is also possible to provide perfect secrecy. Models such as wiretap and broadcast channels and privacy amplification were considered as illustrations.

Prieto performed a parametric evaluation and analysis of the behaviour of algorithms with respect to cryptographic strength. In general, unicity distance has been taken as a parameter for evaluating the strength of a cipher; in Shannon's information-theoretic approach, the unicity distance is the minimum length of ciphertext required to determine the key. When the ciphertext length is less than the unicity distance, the predicted key has a nonzero error probability, for which Jabri proposed an upper bound. This probability was observed to be inversely proportional to the logarithm of the key size and directly proportional to the redundancy of the source.
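Shannon's unicity distance is U = H(K)/D, where H(K) is the key entropy in bits and D the per-letter redundancy of the plaintext language. A worked instance for the monoalphabetic substitution cipher, using the commonly quoted English redundancy of about 3.2 bits/letter (an assumed round figure):

```python
import math

def unicity_distance(key_space_size: int, redundancy: float) -> float:
    """Shannon's U = H(K) / D, with H(K) = log2(#keys) and D the
    per-letter redundancy of the language in bits."""
    return math.log2(key_space_size) / redundancy

# Monoalphabetic substitution over 26 letters: 26! keys,
# assumed English redundancy ~3.2 bits/letter.
u = unicity_distance(math.factorial(26), 3.2)  # ~28 letters
```

This is why the short-cipher experiments above are hard: below roughly this many letters of ciphertext, even an exhaustive search cannot single out one key, because several keys yield plausible plaintext.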

Prieto proposed two further factors for evaluating the quality of an algorithm: an invulnerability factor and a quality factor. The approach was applied to five subsets of Spanish drawn from different environments of normal language, using DES in ECB and CBC modes. It allows some general conclusions to be drawn by comparing algorithms and choosing the best one in each case.

## 2.1 INTRODUCTION:

The objective of Cryptanalysis is

However, the first appearances of the idea can be found in the paper of