Data Compression Coding Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract- this paper includes the brief summary of data compression. It also includes the various techniques of data compression such as Huffman method, static defined sceme, adaptive Huffman coding. It also includes the future scope of the data compression technique.


Data compression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of efficient coding and its consequences, in the form of speed of transmission and probability of error [Ingels 1971]. Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted. The purpose of this paper is to present and analyze a variety of data compression algorithms.

A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII) into a new string (of bits, for example) which contains the same information but whose length is as small as possible. Data compression has important application in the areas of data transmission and data storage. Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher, thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.



A code is a mapping of source messages (words from the source alphabet alpha) into codewords (words of the code alphabet beta). The source messages are the basic units into which the string to be represented is partitioned. These basic units may be single symbols from the source alphabet, or they may be strings of symbols. For string EXAMPLE, alpha = { a, b, c, d, e, f, g, space}. For purposes of explanation, beta will be taken to be { 0, 1 }. Codes can be categorized as block-block, block-variable, variable-block or variable-variable, where block-block indicates that the source messages and codewords are of fixed length and variable-variable codes map variable-length source messages into variable-length codewords. A block-block code for EXAMPLE is shown in Figure 1.1 and a variable-variable code is given in Figure 1.2. If the string EXAMPLE were coded using the Figure 1.1 code, the length of the coded message would be 120; using Figure 1.2 the length would be 30.

source message codeword source message codeword

a 000 aa 0

b 001 bbb 1

c 010 cccc 10

d 011 ddddd 11

e 100 eeeeee 100

f 101 fffffff 101

g 110 gggggggg 110

space 111 space 111

Figure 1.1: A block-block code

The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords. These are not discussed, as they do not provide compression. The codes featured in this survey are of the block-variable, variable-variable, and variable-block types.


In addition to the categorization of data compression schemes with respect to message and codeword lengths, these methods are classified as either static or dynamic. A static method is one in which the mapping from the set of messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble. The classic static defined-word scheme is Huffman coding.


Huffman codes are optimal prefix codes generated from a set of probabilities by a particular algorithm, the Huffman Coding Algorithm. David Huffman developed the algorithm as a student in a 12 class on information theory at MIT in 1950. The algorithm is now probably the most prevalently used component of compression algorithms, used as the back end of GZIP, JPEG and many other utilities.

The Huffman algorithm is very simple and is most easily described in terms of how it generates the prefix-code tree.

Start with a forest of trees, one for each message. Each tree contains a single vertex with Weight

Repeat until only a single tree remains

Select two trees with the lowest weight roots (and .").

Combine them into a single tree by adding a new root with weight , and making the two trees its children. It does not matter which is the left or right child, but our convention will be to put the lower weight root on the left if .

For a code of size n this algorithm will require n-1 $steps since every complete binary tree with leaves has n-1 $internal nodes, and each step creates one internal node. If we use a priority queue. With O(log n) time insertions and find-mins (e.g., a heap) the algorithm will run in o(n log n) time.

The Huffman algorithm generates an optimal prefix code.

Proof: The proof will be on induction of the number of messages in the code. In particular we will show that if the Huffman code generates an optimal prefix code for all probability distributions of n messages, then it generates an optimal prefix code for all distributions of n+1 $messages. The base case is trivial since the prefix code for 1 message is unique (i.e., the null message) and therefore optimal.

We first argue that for any set of messages S there is an optimal code for which the two minimum probability messages are siblings (have the same parent in their prefix tree). We know that the two minimum probabilities are on the lowest level of the tree (any complete binary tree has at least two leaves on its lowest level). Also, we can switch any leaves on the lowest level without affecting the average length of the code since all these codes have the same length. We therefore can just switch the two lowest probabilities so they are siblings


Now for induction we consider a set of message probabilities S of size n+1 and the corresponding tree T built by the Huffman algorithm. Call the two lowest probability nodes in the tree x .and y, which must be siblings in T because of the design of the algorithm. Consider the tree gotten by replacing x .and y with their parent, call it z, with probability (this is effectively what the Huffman algorithm does). Lets say the depth of z is d, then


_Ñ- (5)

To see that is optimal, note that there is an optimal tree in which x and y are siblings, and that wherever we place these siblings they are going to add a constant px+ py to the average length of any prefix tree on _with the pair .and replaced with their parent . By the induction hypothesis la(T') B_is minimized, since is of size and built by the Huffman algorithm, and therefore la(T) is minimized and is optimal. Since Huffman coding is optimal we know that for any probability distribution and associated Huffman code

H(S) ≤ la(C) ≤ H(S) + 1


Even though Huffman codes are optimal relative to other prefix codes, prefix codes can be quite inefficient relative to the entropy. In particular H(s) could be much less than 1 and so the extra $ in H(s)+1 could be very significant. One way to reduce the per-message overhead is to group messages. This is particularly easy if a sequence of messages are all from the same probability distribution. Consider a distribution of six possible messages. We could generate probabilities for all 36 pairs by multiplying the probabilities of each message (there will be at most 21 unique probabilities). A Huffman code can now be generated for this new probability distribution and used to code two messages at a time. Note that this technique is not taking advantage of conditional probabilities since it directly multiplies the probabilities. In general by grouping .messages the overhead of Huffman coding can be reduced from 1 bit per message to bits per message. The problem with this technique is that in practice messages are often not from the same distribution and merging messages from different distributions can be expensive because of all the possible probability combinations that might have to be generated.


The Huffman coding algorithm has some flexibility when two equal frequencies are found. The choice made in such situations will change the final code including possibly the code length of each message. Since all Huffman codes are optimal, however, it cannot change the average length.

For example, consider the following message probabilities, and codes. symbol probability code 1 code 2

























Both codings produce an average of 2.2 bits per symbol, even though the lengths are quite different in the two codes. For some applications it can be helpful to reduce the variance in the code length. The variance is defined as

With lower variance it can be easier to maintain a constant character transmission rate, or reduce the size of buffers. In the above example, code 1 clearly has a much higher variance than code 2. It turns out that a simple modification to the Huffman algorithm can be used to generate a code that has minimum variance.

Figure 2: Binary tree for Huffman code 2

In particular when choosing the two nodes to merge and there is a choice based on weight, always pick the node that was created earliest in the algorithm. Leaf nodes are assumed to be created before all internal nodes. In the example above, after d and e are joined, the pair will have the same probability as c and a (.2), but it was created afterwards, so we join c and a. Similarly we select b instead of ac to join with de since it was created earlier. This will give code 2 above, and the corresponding Huffman tree in Figure 2.


In order to discuss the relative merits of data compression techniques, a framework for comparison must be established. There are two dimensions along which each of the schemes discussed here may be measured, algorithm complexity and amount of compression. When data compression is used in a data transmission application, the goal is speed. Speed of transmission depends upon the number of bits sent, the time required for the encoder to generate the coded message, and the time required for the decoder to recover the original ensemble. In a data storage application, although the degree of compression is the primary concern, it is nonetheless necessary that the algorithm be efficient in order for the scheme to be practical. For a static scheme, there are three algorithms to analyze: the map construction algorithm, the encoding algorithm, and the decoding algorithm. For a dynamic scheme, there are just two algorithms: the encoding algorithm, and the decoding algorithm.

Several common measures of compression have been suggested: redundancy [Shannon and Weaver 1949], average message length [Huffman 1952], and compression ratio [Rubin 1976; Ruth and Kreutzer 1972]. These measures are defined below. Related to each of these measures are assumptions about the characteristics of the source. It is generally assumed in information theory that all statistical parameters of a message source are known with perfect accuracy [Gilbert 1971]. The most common model is that of a discrete memoryless source; a source whose output is a sequence of letters (or messages), each letter being a selection from some fixed alphabet a. Without loss of generality, the code alphabet is assumed to be {0,1} throughout this paper. The modifications necessary for larger code alphabets are straightforward.

When data is compressed, the goal is to reduce redundancy, leaving only the informational content. The measure of information of a source message x (in bits) is -lg p(x) [lg denotes the base 2 logarithm]. Since the length of a codeword for message a(i) must be sufficient to carry the information content of a(i), entropy imposes a lower bound on the number of bits required for the coded message. The total number of bits must be at least as large as the product of H and the length of the source ensemble. Since the value of H is generally not an integer, variable length codewords must be used if the lower bound is to be achieved.


The classic defined-word scheme was developed over 30 years ago in Huffman's well-known paper on minimum-redundancy coding [Huffman 1952]. Huffman's algorithm provided the first solution to the problem of constructing minimum-redundancy codes. Many people believe that Huffman coding cannot be improved upon, that is, that it is guaranteed to achieve the best possible compression ratio. This is only true, however, under the constraints that each source message is mapped to a unique codeword and that the compressed text is the concatenation of the codewords for the source messages. An earlier algorithm, due independently to Shannon and Fano [Shannon and Weaver 1949; Fano 1949], is not guaranteed to provide optimal codes, but approaches optimal behavior as the number of messages approaches infinity. The Huffman algorithm is also of importance because it has provided a foundation upon which other data compression techniques have built and a benchmark to which they may be compared. We classify the codes generated by the Huffman and Shannon-Fano algorithms as variable-variable and note that they include block-variable codes as a special case, depending upon how the source messages are defined.


The Shannon-Fano technique has as an advantage its simplicity. The code is constructed as follows: the source messages a(i) and their probabilities p( a(i) ) are listed in order of nonincreasing probability. This list is then divided in such a way as to form two groups of as nearly equal total probabilities as possible. Each message in the first group receives 0 as the first digit of its codeword; the messages in the second half have codewords beginning with 1. Each of these groups is then divided according to the same criterion and additional code digits are appended. The process is continued until each subset contains only one message. Clearly the Shannon-Fano algorithm yields a minimal prefix code.

a 1/2 0

b 1/4 10

c 1/8 110

d 1/16 1110

e 1/32 11110

f 1/32 11111


Arithmetic coding is a technique for coding that allows the information from the messages in a message sequence to be combined to share the same bits. The technique allows the total number of bits sent to asymptotically approach the sum of the self information of the individual messages (recall that the self information of a message is defined as

To see the significance of this, consider sending a thousand messages each having probability

0.999. Using a Huffman code, each message has to take at least 1 bit, requiring 1000 bits to be sent. On the other hand the self information of each message is bits, so the sum of this self-information over 1000 messages is only 1.4 bits. It turns out that arithmetic coding will send all the messages using only 3 bits, a factor of hundreds fewer than a Huffman coder. Of course this is an extreme case, and when all the probabilities are small, the gain will be less significant. Arithmetic coders are therefore most useful when there are large probabilities in the probability distribution.

The main idea of arithmetic coding is to represent each possible sequence of n messages by a separate interval on the number line between 0 and 1, e.g. the interval from .2 to .5. For a sequence of messages with probabilities ., the algorithm will assign the sequence to an interval of size by starting with an interval of size 1 (from 0 to 1) and narrowing the interval by a factor of on each message i'. We can bound the number of bits required to uniquely identify an interval of size s, and use this to relate the length of the representation to the self information of the messages.

In the following discussion we assume the decoder knows when a message sequence is complete either by knowing the length of the message sequence or by including a special end-of-file message. This was also implicitly assumed when sending a sequence of messages with Huffman codes since the decoder still needs to know when a message sequence is over.

We will denote the probability distributions of a message set as and we

Figure 3: An example of generating an arithmetic code assuming all messages are from the same probability distribution a=.2,b=.5 and c=.3 . The interval given by the message sequence babc is [.255,.27] define the accumulated probability for the probability distribution as


So, for example, the probabilities [.2, .5, .3] correspond to the accumulated probabilities[0,.2,.7]. Since we will often be talking about sequences of messages, each possibly from a different probability distribution, we will denote the probability distribution of the message as and the accumulated probabilities as For a particular sequence of message values, we denote the index of the + message value as We will use the shorthand for and for .Arithmetic coding assigns an interval to a sequence of messages using the following recurrences


Where is the lower bound of the interval and .is the size of the interval, i.e. the interval is given by . We assume the interval is inclusive of the lower bound, but exclusive of the upper bound. The recurrence narrows the interval on each step to some part of the previous interval. Since the interval starts in the range [0,1), it always stays within this range. An example of generating an interval for a short message sequences is illustrated in Figure 3. An important property of the intervals generated by Equation 7 is that all unique message sequences of length n will have non overlapping intervals. Specifying an interval therefore uniquely determines the message sequence.

In fact, any number within an interval uniquely determines the message sequence. The job of decoding is basically the same as encoding but instead of using the message value to narrow the interval, we use the interval to select the message value, and then narrow it. We can therefore "send" a message sequence by specifying a number within the corresponding interval.

The method of arithmetic coding was suggested by Elias, and presented by Abramson in his text on Information Theory [Abramson 1963]. Implementations of Elias' technique were developed by Rissanen, Pasco, Rubin, and, most recently, Witten et al. [Rissanen 1976; Pasco 1976; Rubin 1979; Witten et al. 1987]. We present the concept of arithmetic coding first and follow with a discussion of implementation details and performance.

In arithmetic coding a source ensemble is represented by an interval between 0 and 1 on the real number line. Each symbol of the ensemble narrows this interval. As the interval becomes smaller, the number of bits needed to specify it grows. Arithmetic coding assumes an explicit probabilistic model of the source. It is a defined-word scheme which uses the probabilities of the source messages to successively narrow the interval used to represent the ensemble. A high probability message narrows the interval less than a low probability message, so that high probability messages contribute fewer bits to the coded ensemble. The method begins with an unordered list of source messages and their probabilities. The number line is partitioned into subintervals based on cumulative probabilities. A small example will be used to illustrate the idea of arithmetic coding. Given source messages {A,B,C,D,#} with probabilities {.2, .4, .1, .2, .1}, Figure 3.9 demonstrates the initial partitioning of the number line. The symbol A corresponds to the first 1/5 of the interval [0,1); B the next 2/5; D the subinterval of size 1/5 which begins 70% of the way from the left endpoint to the right. When encoding begins, the source ensemble is represented by the entire interval [0,1). For the ensemble AADB#, the first A reduces the interval to [0,.2) and the second A to [0,.04) (the first 1/5 of the previous interval). The D further narrows the interval to [.028,.036) (1/5 of the previous size, beginning 70% of the distance from left to right). The B narrows the interval to [.0296,.0328), and the # yields a final interval of [.03248,.0328). The interval, or alternatively any number i within the interval, may now be used to represent the source ensemble.

Source Probability Cumulative Range

message probability

A .2 .2 [0,.2)

B .4 .6 [.2,.6)

C .1 .7 [.6,.7)

D .2 .9 [.7,.9)

# .1 1.0 [.9,1.0)


Adaptive Huffman coding was first conceived independently by Faller and Gallager [Faller 1973; Gallager 1978]. Knuth contributed improvements to the original algorithm [Knuth 1985] and the resulting algorithm is referred to as algorithm FGK. A more recent version of adaptive Huffman coding is described by Vitter [Vitter 1987]. All of these methods are defined-word schemes which determine the mapping from source messages to codewords based upon a running estimate of the source message probabilities. The code is adaptive, changing so as to remain optimal for the current estimates. In this way, the adaptive Huffman codes respond to locality. In essence, the encoder is "learning" the characteristics of the source. The performance of the adaptive methods can also be worse than that of the static method. Upper bounds on the redundancy of these methods are presented in this section. As discussed in the introduction, the adaptive method of Faller, Gallager and Knuth is the basis for the UNIX utility compact. The performance of compact is quite good, providing typical compression factors of 30-40%.


The basis for algorithm FGK is the Sibling Property, defined by Gallager [Gallager 1978]: A binary code tree has the sibling property if each node (except the root) has a sibling and if the nodes can be listed in order of nonincreasing weight with each node adjacent to its sibling. Gallager proves that a binary prefix code is a Huffman code if and only if the code tree has the sibling property. In algorithm FGK, both sender and receiver maintain dynamically changing Huffman code trees. The leaves of the code tree represent the source messages and the weights of the leaves represent frequency counts for the messages. At any point in time, k of the n possible source messages have occurred in the message ensemble.

Figure 4.1 -- Algorithm FGK processing the ensemble EXAMPLE (a) Tree after processing "aa bb"; 11 will be transmitted for the next b. (b) After encoding the third b; 101 will be transmitted for the next space; the tree will not change; 100 will be transmitted for the first c. (c) Tree after update following first c.

Initially, the code tree consists of a single leaf node, called the 0-node. The 0-node is a special node used to represent the n-k unused messages. For each message transmitted, both parties must increment the corresponding weight and recompute the code tree to maintain the sibling property. At the point in time when t messages have been transmitted, k of them distinct, and k < n, the tree is a legal Huffman code tree with k+1 leaves, one for each of the k messages and one for the 0-node. If the (t+1)st message is one of the k already seen, the algorithm transmits a(t+1)'s current code, increments the appropriate counter and recomputes the tree. If an unused message occurs, the 0-node is split to create a pair of leaves, one for a(t+1), and a sibling which is the new 0-node. Again the tree is recomputed. In this case, the code for the 0-node is sent; in addition, the receiver must be told which of the n-k unused messages has appeared. At each node a count of occurrences of the corresponding message is stored. Nodes are numbered indicating their position in the sibling property ordering. The updating of the tree can be done in a single traversal from the a(t+1) node to the root. This traversal must increment the count for the a(t+1) node and for each of its ancestors. Nodes may be exchanged to maintain the sibling property, but all of these exchanges involve a node on the path from a(t+1) to the root. Figure 4.2 shows the final code tree formed by this process on the ensemble EXAMPLE.

Figure 4.2 -- Tree formed by algorithm FGK for ensemble EXAMPLE.

Disregarding overhead, the number of bits transmitted by algorithm FGK for the EXAMPLE is 129. The static Huffman algorithm would transmit 117 bits in processing the same data. The overhead associated with the adaptive method is actually less than that of the static algorithm. In the adaptive case the only overhead is the n lg n bits needed to represent each of the n different source messages when they appear for the first time. (This is in fact conservative; rather than transmitting a unique code for each of the n source messages, the sender could transmit the message's position in the list of remaining messages and save a few bits in the average case.) In the static case, the source messages need to be sent as does the shape of the code tree. Figure 4.3 illustrates an example on which algorithm FGK performs better than static Huffman coding even without taking overhead into account. Algorithm FGK transmits 47 bits for this ensemble while the static Huffman code requires 53.

Figure 4.3 -- Tree formed by algorithm FGK for ensemble "e eae de eabe eae dcf".


Figure 4.4 -- FGK tree with non-level order numbering.

The adaptive Huffman algorithm of Vitter (algorithm V) incorporates two improvements over algorithm FGK. First, the number of interchanges in which a node is moved upward in the tree during a recomputation is limited to one. This number is bounded in algorithm FGK only by l/2 where l is the length of the codeword for a(t+1) when the recomputation begins. Second, Vitter's method minimizes the values of SUM{ l(i) } and MAX{l(i)} subject to the requirement of minimizing SUM{ w(i) l(i) }. The intuitive explanation of algorithm V's advantage over algorithm FGK is as follows: as in algorithm FGK, the code tree constructed by algorithm V is the Huffman code tree for the prefix of the ensemble seen so far. The adaptive methods do not assume that the relative frequencies of a prefix represent accurately the symbol probabilities over the entire message. Therefore, the fact that algorithm V guarantees a tree of minimum height (height = MAX{ l(i) } and minimum external path length (SUM{ l(i) }) implies that it is better suited for coding the next message of the ensemble, given that any of the leaves of the tree may represent that next message.

Figure 4.5 -- Algorithm V processing the ensemble "aa bbb c".

Figure 4.6 ilustrates the tree built by Vitter's method for the ensemble of Figure 4.3. Both SUM{l(i)} and MAX{l(i)} are smaller in the tree of Figure 4.6. The number of bits transmitted during the processing of the sequence is 47, the same used by algorithm FGK. However, if the transmission continues with d,b,c,f or an unused letter, the cost of algorithm V will be less than that of algorithm FGK. This again illustrates the benefit of minimizing the external path length SUM{l(i)} and the height MAX{l(i)}.

Figure 4.6 -- Tree formed by algorithm V for the ensemble of Fig. 4.3.

Vitter proves that the performance of his algorithm is bounded by S - n + 1 from below and S + t - 2n + 1 from above [Vitter 1987]. At worst then, Vitter's adaptive method may transmit one more bit per codeword than the static Huffman method. The improvements made by Vitter do not change the complexity of the algorithm; algorithm V encodes and decodes in O(l) time as does algorithm FGK.


Empirical tests of the efficiencies of the algorithms presented here are reported in [Bentley et al. 1986; Knuth 1985; Schwartz and Kallick 1964; Vitter 1987; Welch 1984]. These experiments compare the number of bits per word required and processing time is not reported. While theoretical considerations bound the performance of the various algorithms, experimental data is invaluable in providing additional insight. It is clear that the performance of each of these methods is dependent upon the characteristics of the source ensemble.

n k Static Alg. V Alg. FGK

100 96 83.0 71.1 82.4

500 96 83.0 80.8 83.5

961 97 83.5 82.3 83.7

Figure 5.1 -- Simulation results for a small text file [Vitter 1987];

n = file size in 8-bit bytes,

k = number of distinct messages.

Vitter tests the performance of algorithms V and FGK against that of static Huffman coding. Each method is run on data which includes Pascal source code, the TeX source of the author's thesis, and electronic mail files [Vitter 1987]. Figure 5.1 summarizes the results of the experiment for a small file of text. The performance of each algorithm is measured by the number of bits in the coded ensemble and overhead costs are not included. Compression achieved by each algorithm is represented by the size of the file it creates, given as a percentage of the original file size. Figure 5.2 presents data for Pascal source code. For the TeX source, the alphabet consists of 128 individual characters; for the other two file types, no more than 97 characters appear. For each experiment, when the overhead costs are taken into account, algorithm V outperforms static Huffman coding as long as the size of the message ensemble (number of characters) is no more than 10^4. Algorithm FGK displays slightly higher costs, but never more than 100.4% of the static algorithm.

n k Static Alg. V Alg. FGK

100 32 57.4 56.2 58.9

500 49 61.5 62.2 63.0

1000 57 61.3 61.8 62.4

10000 73 59.8 59.9 60.0

12067 78 59.6 59.8 59.9

Figure 5.2 -- Simulation results for Pascal source code [Vitter 1987];

n = file size in bytes,

k = number of distinct messages.

Witten et al. compare adaptive arithmetic coding with adaptive Huffman coding [Witten et al. 1987]. The version of arithmetic coding tested employs single-character adaptive frequencies and is a mildly optimized C implementation. Witten et al. compare the results provided by this version of arithmetic coding with the results achieved by the UNIX compact program (compact is based on algorithm FGK). On three large files which typify data compression applications, compression achieved by arithmetic coding is better than that provided by compact, but only slightly better (average file size is 98% of the compacted size). A file over a three-character alphabet, with very skewed symbol probabilities, is encoded by arithmetic coding in less than one bit per character; the resulting file size is 74% of the size of the file generated by compact. Witten et al. also report encoding and decoding times. The encoding time of arithmetic coding is generally half of the time required by the adaptive Huffman coding method. Decode time averages 65% of the time required by compact. Only in the case of the skewed file are the time statistics quite different. Arithmetic coding again achieves faster encoding, 67% of the time required by compact. However, compact decodes more quickly, using only 78% of the time of the arithmetic method.


Data compression is still very much an active research area. This section suggests possibilities for further study. Strategies for increasing the reliability of these codes while incurring only a moderate loss of efficiency would be of great value. This area appears to be largely unexplored. Possible approaches include embedding the entire ensemble in an error-correcting code or reserving one or more codewords to act as error flags. For adaptive methods it may be necessary for receiver and sender to verify the current code mapping periodically. For adaptive Huffman coding, Gallager suggests an "aging" scheme, whereby recent occurrences of a character contribute more to its frequency count than do earlier occurrences. This strategy introduces the notion of locality into the adaptive Huffman scheme. Cormack and Horspool describe an algorithm for approximating exponential aging [Cormack and Horspool 1984]. However, the effectiveness of this algorithm has not been established.

Both Knuth and Bentley et al. suggest the possibility of using the "cache" concept to exploit locality and minimize the effect of anomalous source messages. Preliminary empirical results indicate that this may be helpful. A problem related to the use of a cache is overhead time required for deletion. Strategies for reducing the cost of a deletion could be considered. Another possible extension to algorithm BSTW is to investigate other locality heuristics. Bentley et al. prove that intermittent-move-to-front (move-to-front after every k occurrences) is as effective as move-to-front. It should be noted that there are many other self-organizing methods yet to be considered. Horspool and Cormack describe experimental results which imply that the transpose heuristic performs as well as move-to-front, and suggest that it is also easier to implement.


Data compression is a topic of much importance and many applications. Methods of data compression have been studied for almost four decades. This paper has provided an overview of data compression methods of general utility. The algorithms have been evaluated in terms of the amount of compression they provide, algorithm efficiency, and susceptibility to error. While algorithm efficiency and susceptibility to error are relatively independent of the characteristics of the source ensemble, the amount of compression achieved depends upon the characteristics of the source to a great extent. Semantic dependent data compression techniques are special-purpose methods designed to exploit local redundancy or context information. A semantic dependent scheme can usually be viewed as a special case of one or more general-purpose algorithms. It should also be noted that algorithm BSTW is a general-purpose technique which exploits locality of reference, a type of local redundancy.

Susceptibility to error is the main drawback of each of the algorithms presented here. Although channel errors are more devastating to adaptive algorithms than to static ones, it is possible for an error to propagate without limit even in the static case. Methods of limiting the effect of an error on the effectiveness of a data compression algorithm should be investigated.