Huffman Coding

Huffman Coding Problem: Given a set of characters and their frequencies of occurrence, find a prefix code that minimizes the length of the encoded message.

  • Huffman code was introduced by David Huffman at MIT. Many variations of the traditional algorithm have since been proposed by various researchers. Huffman coding is widely used in lossless data compression.
  • It encodes the data using variable length codes chosen according to symbol frequency. Fixed length codes are not always desirable: ASCII uses 8 bits per character, and encodings such as EBCDIC likewise assign a fixed number of bits to every symbol (UTF-8, by contrast, is a variable length encoding).
  • Fixed length codes are easier to access, but they are not efficient in terms of memory utilization. The idea is to assign short codes to more frequent symbols and long codes to less frequent symbols.
  • Assume that we want to encode the following six alphabets: S = <A, B, C, D, E, F>.
  • If we use a 4-bit fixed length code for them, assume that the code for each alphabet is as follows:
  • A : 0000, B : 0101, C : 1010, D : 0011, E : 1100, F : 0110. If we want to encode the string “a b b d d e e b a c d f f a b” (15 characters), then we need a total of 15 * 4 = 60 bits. Assume that we know the probability of occurrence of each character in advance.
Alphabet                     A      B      C      D      E      F
Probability of occurrence    0.40   0.20   0.15   0.10   0.08   0.07
Fixed length code            0000   0101   1010   0011   1100   0110
Variable length code         0      101    100    111    1100   1101
  • Suppose we want to encode a string of 1000 characters; fixed length encoding will then take 1000 * 4 = 4000 bits.
  • Variable length encoding needs [(1000 * 0.40) * 1] + [(1000 * 0.20) * 3] + [(1000 * 0.15) * 3] + [(1000 * 0.10) * 3] + [(1000 * 0.08) * 4] + [(1000 * 0.07) * 4] = 400 + 600 + 450 + 300 + 320 + 280 = 2350 bits.
  • From the above example, we can see that the variable length encoding needs only about 59% of the bits used by fixed length encoding (2350 / 4000 = 0.5875). So we get roughly 41% compression, and the ratio varies according to the probabilities of the alphabets.
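The arithmetic above can be reproduced with a short snippet. Scaling the probabilities from the table to exact character counts for a 1000-character message keeps the calculation in integers; the code lengths are those of the variable length code above:

```python
# Bits needed to encode 1000 characters, using the probabilities from the
# table above scaled to exact character counts (e.g. 0.40 -> 400 of 1000).
counts = {'A': 400, 'B': 200, 'C': 150, 'D': 100, 'E': 80, 'F': 70}
var_len = {'A': 1, 'B': 3, 'C': 3, 'D': 3, 'E': 4, 'F': 4}

fixed_bits = sum(counts.values()) * 4                    # 4-bit fixed length code
var_bits = sum(counts[c] * var_len[c] for c in counts)   # weighted code lengths

print(fixed_bits)  # 4000
print(var_bits)    # 2350
```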

Prefix Code:

  • A variable length code should be such that decoding does not create any ambiguity. Take the codes A = 0, B = 01 and C = 001. If our message is ABC, then its encoding is 001001. Since A is a prefix of both B and C, the string is hard to decode: 001001 can be decoded in many ways, such as CC, ABAB, CAB or ABC.
  • To prevent such ambiguity, the code of any character must not be a prefix of any other code. Such codes are known as prefix codes. A prefix code can always be decoded without ambiguity, and Huffman's algorithm produces an optimal prefix code for the given frequencies.
  • The encoding process is very simple: the encoded message is constructed by concatenating the prefix code of each character of the message. The decoding process repeatedly traverses the Huffman tree from the root to a leaf until the encoded string is exhausted.
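The prefix property also allows decoding without an explicit tree: scan the bits and emit a symbol as soon as the accumulated bits match a codeword. A minimal sketch (the small code table here is illustrative, not one from the text):

```python
# Illustrative prefix code: no codeword is a prefix of another.
code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

def encode(msg):
    # Concatenate the prefix code of each character.
    return ''.join(code[ch] for ch in msg)

def decode(bits):
    # Greedy left-to-right scan; correct only because the code is prefix-free.
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:      # a full codeword has been read
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

print(encode('ABCD'))       # 010110111
print(decode('010110111'))  # ABCD
```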

Algorithm for Huffman Coding

Greedy approach for solving Huffman coding problem is described below:

Algorithm HUFFMAN_CODE (PQ)
// PQ is a min-priority queue, in which the priority of each character is its frequency
// PQ initially contains all n characters; dequeue returns the least frequent one

for i ← 1 to n – 1 do
    z ← CreateNode()
    x ← LeftChild[z] ← dequeue(PQ)
    y ← RightChild[z] ← dequeue(PQ)
    z.priority ← x.priority + y.priority
    enqueue(PQ, z)
end
return dequeue(PQ)
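The pseudocode above translates directly to Python if the priority queue is a min-heap (`heapq`). The sketch below is an assumption-laden illustration, not code from the text: nodes are tuples, a tie counter keeps the heap from comparing non-comparable node contents, and left arcs are labelled 0 and right arcs 1 (the text uses the opposite convention, which works equally well):

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build Huffman codes for the characters of `text`."""
    freq = Counter(text)
    # Node = (frequency, tie-breaker, symbol-or-None, left, right)
    pq = [(f, i, ch, None, None) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(pq)
    tie = len(pq)
    for _ in range(len(freq) - 1):       # n - 1 merges, as in the pseudocode
        x = heapq.heappop(pq)            # least frequent node
        y = heapq.heappop(pq)            # second least frequent node
        heapq.heappush(pq, (x[0] + y[0], tie, None, x, y))
        tie += 1
    root = heapq.heappop(pq)

    codes = {}
    def walk(node, prefix):
        _, _, ch, left, right = node
        if ch is not None:               # leaf: record the accumulated bits
            codes[ch] = prefix or '0'    # single-symbol alphabet edge case
        else:
            walk(left, prefix + '0')     # left arc labelled 0 here
            walk(right, prefix + '1')    # right arc labelled 1 here
    walk(root, '')
    return codes

print(huffman_code('AAAABBC'))  # e.g. A gets 1 bit, B and C get 2 bits each
```

Note that Huffman codes are not unique: tie-breaking among equal frequencies changes the exact bit patterns, but never the total encoded length.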

Complexity Analysis of Huffman Coding

Sorting n characters according to their frequency (or building a min-heap priority queue from them) can be achieved in O(n log2 n) time. The loop runs n – 1 times; each iteration extracts the two least frequent nodes and inserts the merged node back into the priority queue, and each such heap operation takes O(log2 n) time. So the overall time required by the Huffman code algorithm is

O(n log2 n) + O(n log2 n) = O(n log2 n).

Examples

Example: Given that for the character set S = <A, B, C, D, E> the frequency of occurrence in a text file is P = <35, 12, 8, 25, 20>, find the prefix code for each symbol.

Solution:

Step 1 : Arrange all characters in decreasing order of their frequency: S = <A, D, E, B, C> and corresponding P = <35, 25, 20, 12, 8>.


Step 2 : Merge the last two nodes, B and C (combined frequency 12 + 8 = 20), and arrange again in order: <A, D, E, (BC)> with frequencies <35, 25, 20, 20>.

Step 3 : Merge the last two nodes, E and (BC) (combined frequency 20 + 20 = 40), and arrange again in order: <(E,BC), A, D> with frequencies <40, 35, 25>.

Step 4 : Merge the last two nodes, A and D (combined frequency 35 + 25 = 60), and arrange again in order: <(AD), (E,BC)> with frequencies <60, 40>.

Step 5 : Merge the last two remaining nodes (combined frequency 60 + 40 = 100) to form the root of the Huffman tree. Label every left arc with 1 and every right arc with 0.

Visit each leaf node and read the arc labels along the path from the root to find its prefix code.

Character    Prefix Code
A            11
B            001
C            000
D            10
E            01
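As a quick check, a short snippet (using the frequencies and the code table from this example) can confirm the prefix property and the total encoded length:

```python
# Verify the Example 1 code table: prefix-free, and total encoded bits.
freq = {'A': 35, 'B': 12, 'C': 8, 'D': 25, 'E': 20}
code = {'A': '11', 'B': '001', 'C': '000', 'D': '10', 'E': '01'}

# No codeword may be a prefix of another.
vals = list(code.values())
assert all(not v.startswith(w) for v in vals for w in vals if v != w)

total_bits = sum(freq[c] * len(code[c]) for c in freq)
print(total_bits)  # 220, versus 100 * 3 = 300 bits with a fixed 3-bit code
```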

Example: Find the Huffman code for each symbol in the following text: ABCCDEBABFFBACBEBDFAAAABCDEEDCCBFEBFCAE

Solution:

For the given text, the occurrence of each character is as follows: S = <A, B, C, D, E, F>, P = <8, 9, 7, 4, 6, 5>.

Step 1 : Arrange all characters in decreasing order of their frequency.

S = <B, A, C, E, F, D>, P = <9, 8, 7, 6, 5, 4>

Step 2 : Merge the last two nodes, F and D (combined frequency 5 + 4 = 9), and arrange again in order: <B, (FD), A, C, E> with frequencies <9, 9, 8, 7, 6>.

Step 3 : Merge the last two nodes, C and E (combined frequency 7 + 6 = 13), and arrange again in order: <(CE), (FD), B, A> with frequencies <13, 9, 9, 8>.

Step 4 : Merge the last two nodes, B and A (combined frequency 9 + 8 = 17), and arrange again in order: <(BA), (CE), (FD)> with frequencies <17, 13, 9>.

Step 5 : Merge the last two nodes, (CE) and (FD) (combined frequency 13 + 9 = 22), and arrange again in order: <(CE,FD), (BA)> with frequencies <22, 17>.

Step 6 : Merge the last two remaining nodes (combined frequency 22 + 17 = 39) to form the root of the Huffman tree.

Label every left arc with 1 and every right arc with 0.

Visit each leaf node and read the arc labels along the path from the root to find its prefix code.

Character    Prefix Code
A            00
B            01
C            111
D            100
E            110
F            101
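Encoding the example text with this code table shows the saving over a fixed length code (6 distinct symbols need 3 bits each under fixed length encoding):

```python
from collections import Counter

# Encode the example text with the code table above.
text = "ABCCDEBABFFBACBEBDFAAAABCDEEDCCBFEBFCAE"
code = {'A': '00', 'B': '01', 'C': '111', 'D': '100', 'E': '110', 'F': '101'}

print(dict(Counter(text)))  # {'A': 8, 'B': 9, 'C': 7, 'D': 4, 'E': 6, 'F': 5}

encoded = ''.join(code[ch] for ch in text)
print(len(encoded))    # 100 bits with the Huffman code
print(len(text) * 3)   # 117 bits with a fixed 3-bit code
```

Note that the most frequent symbols (B and A) received the shortest codes, as expected.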

Greedy algorithms such as this one are used to find optimal or near-optimal solutions to many real-life problems.


