# The Concept Of Entropy Computer Science Essay



We can apply the concept of entropy to data compression, to find the most efficient way to compress a piece of data. Data compression works by eliminating or minimizing redundancy in a file, making files smaller without losing any information. Every character on your computer, every letter, digit and punctuation mark, is actually stored as a sequence of bits of computer code. A simple example of compression: if part of a document contains the characters "AAAADDDDDDD", one type of compression software can rewrite this as "4A7D", saving seven characters and making that line about 64% smaller. Compression software uses algorithms, such as this simple run-length encoding, to do this.
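
A minimal sketch of that run-length idea in Python (illustrative only; real compression tools use more sophisticated algorithms) might look like this:

```python
from itertools import groupby

def run_length_encode(text):
    """Replace each run of repeated characters with "<count><character>"."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

print(run_length_encode("AAAADDDDDDD"))  # -> 4A7D
```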

Compression makes files smaller so they take up less storage space and can be transferred faster from machine to machine. Combined with archiving, it becomes a useful way to organize files and projects. Data compression also removes some of the redundancies present in an uncompressed text file, which can also contribute to data security.

Data compression also has its disadvantages. One disadvantage is reduced reliability: the redundancy that compression removes is exactly what error-detection schemes rely on.

Data compression is most useful for large archival files and for files that need to be transferred over long communication distances, for example over the internet, as it can offer dramatic savings in storage and transmission requirements.

Data compression can be split into two categories: Lossless (entropy coding) and Lossy (source coding).

## Lossy and Lossless

Some compression techniques (usually for image and multimedia files such as JPEG and MP3) lose information when the file is compressed, reducing its quality. Compressing these types of files is irreversible, so the decompressed file cannot match the original; the smaller the compressed file, the greater the quality degradation. This type of compression is therefore not used for files that must be restored to their original form. This is Lossy data compression.

When the data in the file needs to be conserved, Lossless data compression is used. This ensures that no data is lost during the compression process, so the data is the same before and after compression. Repeatedly compressing a losslessly compressed file does not keep reducing its size, as there is a lower limit on how small the file can be made: its entropy.

## Lossless Data Compression (Entropy Coding)

Lossless data compression uses a compression algorithm that allows the original data to be reconstructed exactly from the compressed data. It is used when, once decompressed, the data must be identical to the data before compression.

As previously discussed, Shannon defined the entropy of a set of discrete events with probabilities p₁, …, pₙ as:

H = −∑ᵢ pᵢ log pᵢ

and the entropy of a continuous distribution with density function p(x) as:

H = −∫ p(x) log p(x) dx

By using the entropy of a set of symbols and their probabilities, we can deduce the optimum compression we can achieve. For example, English text has an entropy of roughly 1.3 bits per character, so at an optimum level we would need only about 1.3 bits per character.
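
As a small illustration, the discrete entropy formula can be evaluated directly; the following Python sketch computes H in bits for the set of symbol probabilities used in the worked example later in this chapter:

```python
import math

def entropy(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Symbol probabilities from the Shannon-Fano / Huffman example below.
probs = [0.20, 0.18, 0.13, 0.10, 0.09, 0.08, 0.08, 0.07, 0.04, 0.03]
print(round(entropy(probs), 4))  # -> 3.1263
```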

The Shannon-Fano and Huffman compression methods are based on statistics obtained from the data. These statistics take into consideration the probability of each symbol, that is, how often it appears, and with this information we assign a binary string to each symbol. The aim is to give the most frequently occurring symbol the shortest binary code and the least frequently occurring symbol the longest. This allows the coded data to be smaller than the original data.
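
The required statistics can be gathered from the data itself; a small, purely illustrative Python sketch:

```python
from collections import Counter

def symbol_probabilities(text):
    """Estimate each symbol's probability of occurrence from its frequency."""
    counts = Counter(text)
    return {symbol: count / len(text) for symbol, count in counts.items()}

# 'A' occurs 4 of 11 times (~0.36), 'D' occurs 7 of 11 times (~0.64).
print(symbol_probabilities("AAAADDDDDDD"))
```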

### Compression of Data Using Shannon-Fano Coding

As we have already discussed the Shannon-Fano code in the previous chapter, we can skip the method for computing the code strings and move on to how the Shannon-Fano code can be effective in compression.

So consider a set of symbols with their corresponding probabilities of occurrence:

| Symbol (xᵢ) | Probability of occurrence P(xᵢ) |
| --- | --- |
| a | 0.20 |
| b | 0.18 |
| c | 0.13 |
| d | 0.10 |
| e | 0.09 |
| f | 0.08 |
| g | 0.08 |
| h | 0.07 |
| i | 0.04 |
| j | 0.03 |

Applying the Shannon-Fano compression method, we can use this order of decreasing probabilities to help compress the data. First, we divide the list of symbols, in decreasing order of probability, into two groups so that each group carries approximately half of the total probability. We then continue this division process until there is only one symbol in each group.

We will now demonstrate this method:

| Group one | Group two |
| --- | --- |
| a | d |
| b | e |
| c | f |
|  | g |
|  | h |
|  | i |
|  | j |
| Total probability: 0.51 | Total probability: 0.49 |

A tree diagram can represent the further splitting of these groups; each split appends one more bit to the code of every symbol in the group, so each symbol's binary string can be read off from the sequence of splits that isolates it.
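
A minimal Python sketch of this splitting procedure is given below. It is illustrative only; since near-equal splits can be made in more than one way, the exact code lengths (and therefore the average length) it produces may differ slightly from the hand-worked grouping above.

```python
def shannon_fano(symbols):
    """Assign Shannon-Fano codes to (symbol, probability) pairs,
    given in order of decreasing probability."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    # Find the split that makes the two groups' probabilities as equal as possible.
    best_split, best_diff, running = 1, float("inf"), 0.0
    for k in range(1, len(symbols)):
        running += symbols[k - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_split, best_diff = k, diff
    codes = {}
    # Group one gets a leading 0, group two a leading 1; then split each group again.
    for prefix, group in (("0", symbols[:best_split]), ("1", symbols[best_split:])):
        for sym, code in shannon_fano(group).items():
            codes[sym] = prefix + code
    return codes

probs = [("a", 0.20), ("b", 0.18), ("c", 0.13), ("d", 0.10), ("e", 0.09),
         ("f", 0.08), ("g", 0.08), ("h", 0.07), ("i", 0.04), ("j", 0.03)]
codes = shannon_fano(probs)
print(codes)
print(round(sum(p * len(codes[s]) for s, p in probs), 2))  # about 3.18 with this splitting rule
```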

The entropy of this set of symbols, taking logarithms to base 2 so that the result is in bits, is:

H = −(0.20 log 0.20 + 0.18 log 0.18 + 0.13 log 0.13 + 0.10 log 0.10 + 0.09 log 0.09 + 0.08 log 0.08 + 0.08 log 0.08 + 0.07 log 0.07 + 0.04 log 0.04 + 0.03 log 0.03) = 3.1262772 bits

The symbols now each have a binary string attached to them: in this grouping, a receives a 2-bit code; b, c and d receive 3-bit codes; and e, f, g, h, i and j receive 4-bit codes. With these binary strings and the probabilities of occurrence, we can calculate the average code length and see how close it is to the entropy of the data. The average length is the sum, over all symbols, of the length of each symbol's binary string multiplied by its probability of occurrence.

So the average length = (2 × 0.20) + (3 × (0.18 + 0.13 + 0.10)) + (4 × (0.09 + 0.08 + 0.08 + 0.07 + 0.04 + 0.03)) = 3.19 bits

We can see how close the average length of the code is to the entropy of the data. Using this, we can compare coding methods and judge which is the most effective by seeing which leaves the average length closest to the entropy.

### Compression of Data Using Huffman Coding

Again, as we have already discussed the Huffman code in the previous chapter, we can skip the method for computing the code strings and move on to how the Huffman code can be effective in compression.

So we consider the same set of symbols used above, with the same probabilities of occurrence.


As we can see, the Huffman compression method, like the Shannon-Fano method, uses this order of decreasing probabilities of the symbols to help compress the data. However, that is where the similarities end. In the Huffman compression method we build the binary tree from the bottom up: starting with the two least probable symbols, we repeatedly merge the two least probable symbols (or already-merged groups of symbols) into a new node whose probability is the sum of the two, until a single tree remains. Finally, we label each branch with a binary digit, and the bit sequence read from the root of the tree down to a symbol is that symbol's Huffman code.
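
A minimal Python sketch of this merging procedure (an illustrative implementation that returns only the code lengths, not the bit strings themselves) might look like this:

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Build a Huffman tree by repeatedly merging the two least probable
    nodes, and return each symbol's code length (its depth in the tree)."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(counter), {sym: 0}) for sym, p in probs]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**left, **right}.items()}  # one level deeper
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

probs = [("a", 0.20), ("b", 0.18), ("c", 0.13), ("d", 0.10), ("e", 0.09),
         ("f", 0.08), ("g", 0.08), ("h", 0.07), ("i", 0.04), ("j", 0.03)]
lengths = huffman_lengths(probs)
print(lengths)
print(round(sum(p * lengths[s] for s, p in probs), 2))  # about 3.17 bits for these probabilities
```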

As computed earlier, the entropy of this set of symbols is 3.1262772, or approximately 3.13 bits per symbol.

As we now have the binary strings attached to each symbol, derived from the Huffman tree, we can calculate the average length and see how close it is to the entropy of the data. Again, the average length is the sum, over all symbols, of the length of each symbol's binary string multiplied by its probability of occurrence.

For these probabilities the Huffman tree gives a a 2-bit code; b, c, d and e 3-bit codes; f, g and h 4-bit codes; and i and j 5-bit codes. So the average length = (2 × 0.20) + (3 × (0.18 + 0.13 + 0.10 + 0.09)) + (4 × (0.08 + 0.08 + 0.07)) + (5 × (0.04 + 0.03)) = 3.17 bits

We can see how close the average length of the data is to its entropy. Comparing the two compression methods, the Huffman compression method is the more effective, as its average code length (3.17 bits) is closer to the entropy (about 3.13 bits) than the Shannon-Fano average (3.19 bits). This is why, today, Huffman coding is used more than the Shannon-Fano method to compress data.

## Other Data Compression Algorithms

### Compression Using the Lempel-Ziv Method

The Lempel-Ziv method has been explained in the previous chapter, so we can summarise the use of the compression method. The Lempel-Ziv compression method is used primarily to compress text files. In the Lempel-Ziv compression method, the input sequence is parsed into non-overlapping blocks of varying lengths and, while doing so, we build a dictionary of the blocks we have already seen; each new block can then be encoded as a reference to a dictionary entry.
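
As an illustration, the parsing step can be sketched as a simple LZ78-style procedure (one Lempel-Ziv variant; the exact formulation discussed in the previous chapter may differ), which emits a (dictionary index, next character) pair for each new block:

```python
def lz78_parse(text):
    """Parse text into (dictionary index, next character) pairs, LZ78-style."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the current match
        else:
            output.append((dictionary[phrase], ch))    # emit (prefix index, new char)
            dictionary[phrase + ch] = len(dictionary)  # add the new block to the dictionary
            phrase = ""
    if phrase:  # flush any remaining, already-seen block
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_parse("AAAADDDDDDD"))
# -> [(0, 'A'), (1, 'A'), (1, 'D'), (0, 'D'), (4, 'D'), (5, 'D')]
```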

## Lossy Data Compression

Lossy compression is a data compression method that discards some of the data in order to achieve its goal of compression. When decompressed, the data content is therefore different from the original, though similar enough to still be useful. Lossy compression is most commonly used to compress multimedia data, especially in applications such as streaming media and internet telephony.
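
As a toy illustration of the idea (not any particular codec), coarsely quantizing a list of sample values discards precision that cannot be recovered when the data is expanded again:

```python
def quantize(samples, step=16):
    """Round samples to a coarse grid; the discarded precision is lost for good."""
    return [round(s / step) * step for s in samples]

original = [23, 25, 90, 91, 201]
print(quantize(original))  # -> [16, 32, 96, 96, 208], close to but not equal to the original
```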

## Compressing and Archiving

Compression is typically applied to a single file, and compressed formats contain only a single item.

Archiving allows files and folders to be grouped together for compression. Archive formats can contain a single item, or many items, and preserve the hierarchy of nested folders. However, most archive formats include compression as part of the archiving process.

## Expansion and Extraction

When an archive is created, it can be accessed in two different modes. One of the modes is Expansion, where the entire contents of an archive are expanded out at once. The second is the Browse mode, where the archive is accessed like a folder. The hierarchical structure can be navigated and individual files can be extracted without having to expand the entire contents of the archive. In some cases, archive content can be manipulated: items can be renamed, moved, deleted, added, etc.
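
For example, using Python's standard zipfile module on a hypothetical archive called project.zip, an archive can be browsed and a single member extracted without expanding the entire contents:

```python
import zipfile

# "project.zip" and "docs/readme.txt" are hypothetical names used for illustration.
with zipfile.ZipFile("project.zip") as archive:
    for name in archive.namelist():       # browse the archive like a folder
        print(name)
    archive.extract("docs/readme.txt")    # extract a single item
    # archive.extractall("output/")       # or expand the entire contents at once
```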

## Re-compression

Re-compression involves making files smaller by disassembling the structure of the data and then compressing the file more efficiently. When the file is expanded, the data structure for that file is reassembled.

Re-compression usually results in output which is 100% identical to the original; in the cases where it is not, no content or data is lost, but the encoding may be slightly different.
