The Lossy Data Compression Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

In age of mid 90's computer's became famous for its features and usability. Major concerns were raised related to the utilization of storage devices, like disks and also related to transferring data over the Internet at a faster rate. To address this situation "Claude E. Shannon" formulated a theory of data compression. In 1948 he presented this theory to reduce the consumption of costly resources, hard drive spaces or transmission bandwidth.

In his paper compression of data was achieved by removing redundant data. It makes file smaller by predicting the most frequent bytes and optimizing the storage of the same, reducing the storage size and transfer channel capacity. Hence, the compressor module was assigned two tasks 1) Predicting the probabilities of input and 2) Generating codes from those possibilities. Some data was also transformed or quantized to achieve more compression.

The need for this as mentioned is elaborated below:

If data can effectively be compressed wherever possible, significant improvements of data rate of transfer can be achieved.

In some cases, file sizes can be reduced by up to 60-70 %.

At the same time, many systems cannot accommodate purely binary data, so encoding schemes are also employed which reduce data compression effectiveness.

Many files can be combined into one compressed document making sending easier, provided combined file size is not huge, e.g. Adobe Acrobat Software.

There are two forms of data compression :

Lossy Data Compression

Lossless Data Compression

Lossy Data Compression

In many instances entire data of a file is not of utmost importance such as image/video files, where in some amount of pixel info if lost/ corrupted would not affect the visibility of the file. In such cases Lossy Commpressions technique can be used where the development cost of such technique is less and so is the importance of the quality of file. Lossy in concrete sense doesn't mean random lost of pixels, but loss of quantity such as frequency component, or perhaps loss of noise.

Lossless Data compression

In the case of images / videos certain amount of loss is accepted as it would not harm the end objectiveness. However in case of text files this is not acceptable as loss of a letter would/can change the objectiveness of the entire data. Hence Text files, then archived storage having any files image / video or text also needs to be lossless as well. Amount of compression done to obtain with lossless has a strict limit for e.g. WinZip, WinRar programs.

Comparison in simpler terms

In lossless data compression, the compressed-then-decompressed data is an exact replication of the original data. On the other hand, in lossy data compression, the decompressed data may be different from the original data. Typically, there is some distortion between the original and reproduced signal.

Data compression literature also often refers to data compression as data "encoding", and of course that means data decompression is often called "decoding".

Compression Techniques

There have been several compression techniques, evolved since the existence below are the two techniques discussed in detail:

Huffman Coding

Lempel-Ziv-Welch (LZW) Coding

Huffman Coding:

Entropy Encoding algorithm is what is called for Huffman coding in computer science and information theory, this uses lossless compression.

David A. Huffman, a Ph.D. student at MIT, presented paper on "A Method for the Construction of Minimum-Redundancy Codes" in 1952, and developed Huffman Coding, hence the name.

It uses a specific method for selecting the representation for each symbol, resulting in a "prefix code" that expresses the most common source symbols, using short strings of bits that are used for less common source symbols. Most efficient compression method of this type was designed by Huffman: no other mapping, of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. Simple binary block encoding e.g. ASCII coding, is what is equivalent to Huffman coding.

Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output).

Some of them are

n-ary Huffman coding,

Adaptive Huffman coding,

Huffman template algorithm,

Length-limited Huffman coding,

Huffman coding with unequal letter costs

Optimal alphabetic binary trees (Hu-Tucker coding),

The canonical Huffman code, Model reconstruction,

Arithmetic coding a variation can be viewed as a generalization of Huffman coding, as due to various cases where they produce the same output when every symbol has a probability of the form 1/2k; also it proves more efficient in case of small alphabets. Huffman coding nevertheless remains in widely use because of its simple algorithm and high speed compression.

It is mostly used in "back-end" compression with respect to the DEFLATE and media codec's such as JPEG, or MP3 which have a front-end model.

Lempel-Ziv-Welch (LZW) Coding

Abraham Lempel, Jacob Ziv, and Terry Welch, created Lempel-Ziv-Welch (LZW) , a universal lossless compression algorithm. Published by Welch in 1984, is an improved implementation of the "LZ78", algorithm published by Lempel and Ziv in 1978. Due to limited analysis of data it not the best, but its algorithm is very fast to implement.

At that time it was first widely used universal data compression method, which provided best compression ratio. A large English text file would go to about half its original size.

The idea of the algorithm that used by LZW was, that encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence for which there is no code yet in the dictionary. The code for the sequence (without that character) is emitted, and a new code (for the sequence with that character) is added to the dictionary.


A dictionary is initializes to all the possible input characters, corresponding to the single-character strings.

It scans for longer substring till it finds one that is not in the dictionary.

When match is found, the index for the string less the last character is fetched from it and sent to output, and the new string including the last character is added to the dictionary with the next available code.

The next starting point is the last input character, to loop the scan.

The initial parts of a message will see little compression, as the algorithm works best on data with repeated patterns.

In this way, successively longer strings are stored in the dictionary and made available for succeeding encoding as single output value. As the message grows, however, the compression ratio tends approach to the maximum.


It uses the encoded file/string as an input and outputs a new file/ string from initialized dictionary.

The same point of time it fetches the next value from the input, and adds to the dictionary the merging of the string is just the output and the first character of the string is fetched by decoding the next input value.

The decoder then proceeds to the next value that was already read before and the procedure is repeated, till the end of the file.

In this way the decoder builds up a dictionary which is similar to that used by the encoder, and uses that to decode succeeding input values. BY this the full dictionary is not needed by the decoder, only the initial part of the dictionary comprising of the single-character strings is sufficient

UNIX system circa 1986, became more or less a standard utility for LXW. For both legal and technical reasons, It has since disappeared from many distributions, but as of 2008 at least FreeBSD includes both compress and uncompress as a part of the distribution.

Advantages and Disadvantages:

It works best for repeated data in a file, for e.g. text files and monochrome images.

It is a faster compression rate than others.

All recent computer systems have the horsepower to use more efficient algorithms. LZW is a fairly old compression technique.

Royalties have to be paid to use LZW compression algorithms within applications.

It became very popular when it became part of the GIF image format. It's optionally also been used in TIFF and PDF files. DEFLATE algorithm use Acrobat by default in PDF files.

LZW compression can be used in a variety of file formats:

TIFF files

GIF files

PDF files - In recent applications LZW has been replaced by the more efficient Flat algorithm.


To keep devices small but still as powerful, compression techniques must be used. The techniques mentioned above can be used according to the need in various cases. Here are some examples for LZW Compression:

Birthday cards that sing "Happy Birthday" when you open them.

Cartridge games for console games machines.

Films on DVD-ROM.

Digital Still ad Video Cameras to increase storage.

Compression is used increasingly everywhere in electronics.

While the basics of data compression are relatively simple, the kinds of programs sold as commercial products that are extremely sophisticated. Companies make money by selling programs that perform compression, and protect their trade-secrets through patents.