Huffman Coding
Huffman Coding Problem: Find prefix code for given characters occurring with certain frequency.
- Huffman code was introduced by David Huffman at MIT. Many variations have been proposed by various researchers on traditional algorithms. Huffman code has a good application in losing less data compression.
- It encodes the entropy of the data in terms of variable length code. Fix length codes are not always desirable. The ANSI uses 8 bits. As such, all other encoding methods, e.g. UTF-8, EBCDIC etc., use fixed length codes.
- Fix length codes are better to access, but it’s not efficient in terms of memory utilization. Idea is to assign sort code to more frequent symbols and long code to less frequent symbols.
- Assume that we want to encode following six alphabets: S = <A, B, C, D, E, F>.
- If we use a 4 bit fixed length code for them, and assume that the code for each alphabet is as follows :
- A : 0000, B : 0101, C : 1010, D : 0011, E : 1100, F : 0110 If we want to encode the string “a b b d d e e b a c d f f a b”, then we need a total of 15 * 4 = 60 characters. Assume that we know the probability of occurrence of each character in advance.
Alphabet | A | B | C | D | E | F |
Probability of occurrence | 0.40 | 0.20 | 0.15 | 0.10 | 0.08 | 0.07 |
Fixed length code | 0000 | 0101 | 1010 | 0011 | 1100 | 0110 |
Variable length code | 0 | 101 | 100 | 111 | 1100 | 1111 |
- Suppose we want to encode a string of 1000 characters, then fixed length encoding will take, 1000 * 4 = 4000 symbols.
- While, variable length encoding needs, = [(1000 * 0.40) * 1] + [(1000 * 0.20) * 3] + [(1000 * 0.15) *3] + [(1000 * 0.10) * 3] + [(1000 * 0.08) * 4] + [(1000 * 0.07) * 4] = 400 + 600 + 450 + 300 + 320 + 280 = 2350 symbols.
- From the above example, we can see that with variable length encoding method, we need almost 60% symbol compare to fixed length encoding method. So we are getting about 40% compression ratio, which will vary according to probability of alphabets.
Prefix Code:
- Variable length code should be such that decoding will not create any ambiguity. Let’s take code for A = 0, B = 01 and C = 001. If our message is ABC, than its encoding will be 001001. As A is prefix of B and C and AB is prefix of C, it is difficult to decode the string. 001001 can be decoded in many ways like CC, ABAB, CAB, ABC etc.
- To prevent such ambiguity, code of any character must not be prefix of any other code. Such codes are known as prefix code. Prefix code always provide optimal text data compression without any ambiguity.
- Encoding process is very simple. We can construct encoding message just by concatenating prefix code of each character of the message. Decoding process needs to traverse Huffman tree from root to leaf till encoded string is not over.
Algorithm for Huffman Coding
Greedy approach for solving Huffman coding problem is described below:
Algorithm HUFFMAN_CODE (PQ)
// PQ is the priority queue, in which priority is frequency of each character.
// PQ contains all n characters in decreasing order of their priority
for i ← 1 to n – 1 do
z ← CreatNode()
x ← LeftChild[z] ← deque(PQ)
y ← RightChild[z] ← deque(PQ)
z.priority ← x.priority + y.priority
enqueue (PQ, z)
end
return deque(Q)
Complexity Analysis of Huffman Coding
Sorting of n characters according to their frequency can be achieved in O(nlog2n) time. Least frequent two elements are added, it can be done in constant time. Merged node should be inserted in priority queue, that will take linear time O(n). So over all time required by Huffman code algorithm is
O(nlog2 n) + O(n) = O(nlog2 n).
Examples
Example: Given that for character set S = <A, B, C, D, E> occurrence in text file is P = <35, 12, 8, 25, 20>. Find prefix code for each symbol.
Solution:
Step 1 : Arrange all characters in decreasing order of their frequency. S = <A, D, E, B, C> and corresponding P = <35, 25, 20, 12, 8>
Step 2 : Merge last two nodes and arrange it again in order
Step 3 : Merge last two nodes and arrange it again in order,
Step 4 : Merge last two nodes and arrange it again in order,
Step 5 : Merge last two nodes and arrange it again in order. Label all left arc by 1 and all right arc by 0
Visit all leaf nodes and read its edge value to find its prefix code.
Character | Prefix Code |
A | 11 |
B | 001 |
C | 000 |
D | 10 |
E | 01 |
Example: Find Huffman code for each symbol in following text : ABCCDEBABFFBACBEBDFAAAABCDEEDCCBFEBFCAE
Solution:
For given text, occurrence of each character is as follow : S = <A, B, C, D, E, F>, P = <8, 9, 7, 4, 6, 5>
Step 1 : Arrange all characters in decreasing order of their frequency.
S = <A, B, C, D, E, F>, P= <8, 9, 7, 4, 6, 5>
Step 2 : Merge last two nodes and arrange it again in order
Step 3 : Merge last two nodes and arrange it again in order
Step 4 : Merge last two nodes and arrange it again in order
Step 5 : Merge last two nodes and arrange it again in order
Step 6 : Merge last two nodes and arrange it again in order.
Label all left arc by 1 and all right arc by 0
Visit all leaf nodes and read its edge value to find its prefix code.
Character | Prefix Code |
A | 00 |
B | 01 |
C | 111 |
D | 100 |
E | 110 |
F | 101 |
Some Popular Problems Solved by Greddy Algorithm
Greedy algorithms are used to find an optimal or near-optimal solution to many real-life problems. A few of them are listed below :
- Binary Knapsack Problem
- Fractional Knapsack Problem
- Job Scheduling Problem
- Activity Selection Problem
- Huffman Coding
- Optimal Storage on Tapes
- Optimal Merge Pattern
- Prim’s Algorithm
- Kruskal’s Algorithm
- Dijkstra’s Algorithm
Additional Reading: Read on Quora