Compression And Decompression Technique For Secure Data Computer Science Essay


Abstract: Compression algorithms reduce the redundancy in data representation to decrease the storage required for that data. Data compression also offers an attractive approach to reducing communication costs by using the available bandwidth effectively. Dictionary-based code compression techniques are popular because they offer both a good compression ratio and fast decompression. State-of-the-art lossless data compressors are very efficient. While it is not possible to prove that they always achieve their theoretical limit (i.e., the source entropy), their effective performance for specific data types is often very close to this limit. Lossless compression researchers have developed highly sophisticated approaches, such as Huffman encoding, arithmetic encoding, the Lempel-Ziv family, Dynamic Markov Compression (DMC), Prediction by Partial Matching (PPM), and Burrows-Wheeler Transform (BWT) based algorithms. However, none of these methods has been able to reach the theoretical best-case compression ratio consistently, which suggests that an improved method is needed. The Burrows-Wheeler Transform, or BWT, transforms a block of data into a format that is extremely well suited for compression. The block-sorting algorithm developed by Burrows and Wheeler works by applying a reversible transformation to a block of input text. The transformation does not itself compress the data, but reorders it to make it easy to compress with simple algorithms such as move-to-front encoding [1]. The basic strategy adopted in this paper is to preprocess the text and transform it into an intermediate form which can be compressed with better efficiency and which exploits the natural redundancy of the language in making the transformation. A technique called efficient dictionary-based encoding is used to achieve this. It preprocesses the standard text database prior to conventional compression to improve compression efficiency and to provide security.

Keywords: Compression, decompression, star encoding, efficient dictionary-based compression and decompression, BWT


1.1 Compression

Data compression, in this context, is the art and science of representing information in a compact form. It is the process of converting an input data stream (the source stream, or the original raw data) into another data stream (the output, or compressed, stream) that has a smaller size [2]. A stream is either a file or a buffer in memory. Data compression is popular for two reasons: (1) People like to accumulate data and hate to throw anything away. No matter how big a storage device one has, sooner or later it is going to overflow; data compression seems useful because it delays this inevitability. (2) People hate to wait a long time for data transfers. When sitting at the computer, waiting for a Web page to come in or for a file to download, we naturally feel that anything longer than a few seconds is a long time to wait [8]. There are two major families of compression techniques, distinguished by whether the original source can be reconstructed exactly: lossless and lossy compression. Lossy methods achieve better compression by losing some information: when the compressed stream is decompressed, the result is not identical to the original data stream. Such methods make sense especially for compressing images, movies, or sounds; if the loss of data is small, we may not be able to tell the difference. In contrast, text files, especially files containing computer programs, may become worthless if even one bit gets modified. Such files should be compressed only by a lossless compression method. There are many known methods for data compression. They are based on different ideas, are suitable for different types of data, and produce different results, but they are all based on the same principle: they compress data by removing redundancy from the original data in the source file.
Data compression also offers an attractive approach to reducing communication costs by effectively utilizing the available bandwidth in the data links [8]. The use of compression for storing text files has become an inherent part of personal as well as commercial computing. The various compression applications available perform two functions: compression and decompression. The text document is first compressed, and the entire document is later decompressed when required.

1.2 Decompression

No compression algorithm is useful unless a means of decompression is also provided, owing to the nature of data compression. When compression techniques are discussed in general, the word compression alone actually implies both compression and decompression. In many practical cases, the efficiency of the decompression algorithm is of more concern than that of the compression algorithm. For example, movies, photos, and audio data are often compressed once by the artist, and then the same compressed files are decompressed many times by millions of viewers or listeners [2].

2. Related Work

Various sophisticated algorithms have been proposed for lossless text compression. A very promising development in this field is the Burrows-Wheeler Compression Algorithm (BWCA), introduced in 1994 by Michael Burrows and David Wheeler. The algorithm has received considerable attention because of its Lempel-Ziv-like execution speed and its compression performance, which is close to that of state-of-the-art PPM algorithms.

A preprocessing method is performed on the source text before an existing compression algorithm is applied. The transformation is designed to make the source file easier to compress. Star encoding is generally used for this type of preprocessing transformation of the source text. Star encoding works by creating a large dictionary of commonly used words expected in the input files. The dictionary must be prepared in advance and must be known to both the compressor and the decompressor. Each word in the dictionary has a star-encoded equivalent, in which as many letters as possible are replaced by the '*' character. For example, a commonly used word such as "the" might be replaced by the string "t**". The star-encoding transform simply replaces every occurrence of the word "the" in the input file with "t**". If done properly, the transformed file will contain a huge number of '*' characters [5]. This ought to make the transformed file more compressible than the original plain text. Star encoding does not provide any compression as such, but gives the input text a form that is more compressible by a later-stage compressor. However, star encoding is very weak and vulnerable to attacks: the encoded data has exactly the same number of characters as the original, but is dominated by stars.
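As an illustration, star encoding can be sketched in a few lines of Python. The three-word dictionary and the particular star codes below are hypothetical; a real implementation would use a large prepared dictionary shared by compressor and decompressor.

```python
# Illustrative sketch of star encoding (hypothetical mini-dictionary).
# Each dictionary word maps to an equivalent in which all but a few
# distinguishing letters are replaced by '*'.
STAR_DICT = {"the": "t**", "and": "a**", "that": "t***"}
REVERSE_DICT = {v: k for k, v in STAR_DICT.items()}

def star_encode(text):
    """Replace every dictionary word with its star-encoded form."""
    return " ".join(STAR_DICT.get(w, w) for w in text.split())

def star_decode(text):
    """Invert the transform using the same shared dictionary."""
    return " ".join(REVERSE_DICT.get(w, w) for w in text.split())
```

Note that each encoded word has the same length as the original; only the symbol distribution changes, which is exactly why the transform helps a later-stage compressor but offers no compression by itself.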

3. Proposed Work and Objectives

The goal of this project is to develop new transformations for lossless text compression that are fast and secure and that achieve a good compression ratio in transmission. The approach consists of exploiting the natural redundancy of the English language by encoding the text into an intermediate form before applying the backend compression algorithm, which increases the context available for compression. The encoding scheme uses dictionaries to correlate words in the text with their transformed counterparts. At the receiver's end, the file is decompressed using the same backend algorithm, and the intermediate form is converted back to text by means of the dictionary.
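The end-to-end pipeline can be sketched as follows. This is a minimal illustration only: a hypothetical three-entry word dictionary stands in for the efficient dictionary, and zlib stands in for the backend compressor (the paper's backends are BWT- or PPM-based).

```python
import zlib

# Hypothetical word dictionary shared by sender and receiver; the
# transformed codes are shorter and more repetitive than the words.
WORD_TO_CODE = {"compression": "c1", "dictionary": "d1", "transform": "t1"}
CODE_TO_WORD = {v: k for k, v in WORD_TO_CODE.items()}

def send(text):
    """Transform words to dictionary codes, then run the backend compressor."""
    transformed = " ".join(WORD_TO_CODE.get(w, w) for w in text.split())
    return zlib.compress(transformed.encode())

def receive(payload):
    """Run the backend decompressor, then map codes back to words."""
    transformed = zlib.decompress(payload).decode()
    return " ".join(CODE_TO_WORD.get(w, w) for w in transformed.split())
```

Because both ends hold the same dictionary, the intermediate form doubles as a light layer of obscurity: without the dictionary, the decompressed stream is not readable text.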

Figure 2: Block diagram. Standard text is transformed using the efficient dictionary into the transformed text, which is compressed and transferred over any medium; at the receiver, the compressed text is decompressed back to the transformed text and converted to the standard text with the same dictionary.

3.1 Formulation of Dictionary

The heart of dictionary coding is the formulation of the dictionary. A successfully built dictionary results in data compression; the opposite case may lead to data expansion. According to the ways in which dictionaries are constructed, dictionary coding techniques can be classified as static or adaptive. At present, dictionary-based compression schemes using static dictionaries are mostly ad hoc, implementation dependent and not general purpose. Most well-known dictionary algorithms are adaptive. Instead of having a completely defined dictionary when compression begins, adaptive schemes start out either with no dictionary or with a default baseline dictionary. As compression proceeds, the algorithms add new phrases to be used later as encoded tokens. The basic principle behind adaptive dictionary programs is relatively easy to follow.

3.2 Transformed Encoding

Once the dictionary is available, the encoder examines the input text and finds a string of symbols that matches an item in the dictionary. The index of that item in the dictionary is then encoded. This process of segmenting the input text into disjoint strings (whose union equals the input text) for coding is referred to as parsing. Obviously, the way to segment the input text into strings is not unique. The parsing strategy used here is greedy parsing: at each coding step, the encoder searches for the longest string of symbols in the input that matches an item in the dictionary. Greedy parsing may not be optimal, but it is simple to implement.


Dictionary = empty; Prefix = empty;
DictionaryIndex = 1;
while (characterStream is not empty)
{
    Char = next character in characterStream;
    if (Prefix + Char exists in the Dictionary)
        Prefix = Prefix + Char;
    else
    {
        if (Prefix is empty)
            CodeWordForPrefix = 0;
        else
            CodeWordForPrefix = DictionaryIndex for Prefix;
        Output: (CodeWordForPrefix, Char);
        insertInDictionary((DictionaryIndex, Prefix + Char));
        DictionaryIndex++;
        Prefix = empty;
    }
}
if (Prefix is not empty)
{
    CodeWordForPrefix = DictionaryIndex for Prefix;
    Output: (CodeWordForPrefix, );
}
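This is the classic LZ78 encoding scheme. A minimal Python rendering (illustrative; the function and variable names are our own) is:

```python
def lz78_encode(stream):
    """LZ78 encoding: emit (index, char) pairs and grow the
    dictionary with Prefix + Char, exactly as in the pseudocode."""
    dictionary = {}   # phrase -> index
    index = 1
    prefix = ""
    out = []
    for ch in stream:
        if prefix + ch in dictionary:
            prefix += ch              # extend the current match
        else:
            code = dictionary[prefix] if prefix else 0
            out.append((code, ch))    # emit codeword
            dictionary[prefix + ch] = index
            index += 1
            prefix = ""
    if prefix:                        # flush any leftover prefix
        out.append((dictionary[prefix], ""))
    return out
```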


Example: Suppose the codewords are indexed starting from 1, and the input string is ABBCBCABABCAABCAAB.

Compressed string (code words):

(0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B)

Index:            1      2      3      4      5      6      7

Each codeword consists of an integer and a character, where the character is represented by 8 bits. The number of bits n required to represent the integer part of the codeword with index i is n = max(1, ⌈log2 i⌉), since that integer part can be at most i - 1.

The total number of bits required is (1 + 8) + (1 + 8) + (2 + 8) + (2 + 8) + (3 + 8) + (3 + 8) + (3 + 8) = 71 bits.
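The bit count can be checked mechanically. The small helper below (our own naming) applies n = max(1, ⌈log2 i⌉) plus 8 character bits to each codeword:

```python
from math import ceil, log2

def codeword_bits(codewords):
    """Total bits for a list of LZ78 codewords: the integer part of the
    i-th codeword (i starting at 1) needs max(1, ceil(log2(i))) bits,
    and each character needs 8 bits."""
    total = 0
    for i, _ in enumerate(codewords, start=1):
        total += max(1, ceil(log2(i))) + 8
    return total
```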

3.3 Compression and Decompression Using the Backend Algorithm

Any efficient backend algorithm (e.g., PPM- or BWT-based) can be used to compress as well as decompress the data. The lossless Burrows-Wheeler compression algorithm has received considerable attention over recent years for both its simplicity and its effectiveness. It is based on a permutation of the input sequence, the Burrows-Wheeler transformation, which groups symbols with a similar context close together. The BWT is performed on an entire block of data at once. Most of today's familiar lossless compression algorithms operate in streaming mode, reading a single byte or a few bytes at a time, but with this transform we want to operate on the largest chunks of data possible. Since the BWT operates on data in memory, some files may be too big to process in one pass; in these cases, the file must be split up and processed a block at a time. A typical scheme of the Burrows-Wheeler Compression Algorithm (BWCA) consists of four stages:

Input → BWT → MTF → RLE → Huffman → Output



Each stage is a block transformation that consumes the input buffer and forwards its output buffer to the next stage. The stages are processed sequentially from left to right for compression; for decompression, they are processed from right to left with the respective backward transformations. For compression, the first stage is the BWT. The purpose of this stage is the reordering of the input data depending on its context; the reordering produces many runs of equal symbols in the output data of the BWT stage. To move the reordered symbols back into their original positions, a backward transformation exists which reproduces exactly the input sequence [1].
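The first two stages of this pipeline can be sketched naively in Python. This is a textbook illustration using a sentinel character and rotation sorting, not an efficient suffix-array implementation:

```python
def bwt(s):
    """Naive Burrows-Wheeler transform of a block: sort all rotations
    of s plus a unique sentinel and take the last column."""
    s = s + "\0"                      # unique end-of-block sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def move_to_front(s, alphabet):
    """MTF: emit each symbol's current position in the table, then
    move that symbol to the front. Runs of equal symbols produced by
    the BWT become runs of zeros, which RLE then compresses well."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out
```

For example, bwt("banana") groups the repeated symbols together, and MTF turns those runs into small integers dominated by zeros, setting up the RLE and Huffman stages.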

3.4 Decoding

Decoding (decompression) in the dictionary method is very simple. The decoder does not parse the input; it uses the same dictionary and decodes the sequence of codewords directly.
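A minimal LZ78 decoder (illustrative naming) shows why no dictionary needs to be transmitted: the decoder rebuilds an identical dictionary from the codewords themselves.

```python
def lz78_decode(codewords):
    """Rebuild the text from (index, char) pairs, reconstructing the
    same dictionary the encoder built, entry by entry."""
    dictionary = {0: ""}              # index 0 means the empty prefix
    out = []
    for i, (code, ch) in enumerate(codewords, start=1):
        phrase = dictionary[code] + ch
        dictionary[i] = phrase        # same entry the encoder inserted
        out.append(phrase)
    return "".join(out)
```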

4. Performance Analysis

Performance measures such as bits per character (BPC) and conversion time are compared for three cases: simple BWT, BWT with star encoding, and BWT with the proposed efficient dictionary method. The results are shown graphically and indicate that the proposed method achieves good compression as well as a good reconstruction (decompression) time.

Figure 3: Encoding

Figure 4: Decoding


5. Conclusion

Efficient dictionary-based compression techniques use no statistical models; they concentrate their memory on the strings already seen. The compression algorithm and the decompression algorithm build an identical dictionary independently, so the compressor does not have to pass the dictionary to the decompressor. The proposed scheme is a notable improvement in text data compression and adds a level of security over the existing methods, while maintaining the compression ratio with good reconstruction at the receiver's end.