Faculty of Education - Computer Science

Huffman Coding

Introduction

There are many different reasons for and ways of encoding data, and one of these ways is Huffman coding. This is used as a compression method in digital imaging and video as well as in other areas. The idea behind Huffman coding is simply to use shorter bit patterns for more common characters, and longer bit patterns for less common characters.

An Example

Say you want to encode the letters A (0.12), E (0.42), I (0.09), O (0.30), U (0.07), listed with their respective probabilities. Go through the following steps:
  1. Consider each of the letters as a symbol with its respective probability.
  2. Find the two symbols with the smallest probability and combine them into a new symbol with both letters by adding the probabilities.
    (Note: There may be a choice between two symbols with the same probability. If this is the case, either symbol can be chosen. The final tree and codes will be different, but the overall efficiency of the code will be the same)
    (Note 2: Frequency counts or other values may be used instead of probabilities)
  3. Repeat step 2 until there is only one symbol left with a probability of 1.
  4. To see the code, redraw all the symbols in the form of a tree, where each symbol contains either a single letter or splits up into two smaller symbols. Label all the left branches of the tree with a 0 and all the right branches with a 1. The code for each of the letters is the sequence of 0's and 1's that lead to it on the tree, starting from the symbol with a probability of 1.
You should get a tree like the following:

Thus the codes for each letter are:

Now using this code, any string of vowels can be written uniquely. Notice that each string of 0's and 1's can be uniquely decoded. Try encoding and decoding some strings of vowels on your own.

Activities

  1. Create Huffman trees and codes for the following sets of letters with the given probabilities:
    1. A (0.20), B (0.09), C (0.15), D (0.11), E (0.40), F (0.05)
    2. A (0.05), C (0.04), E (0.16), G (0.02), I (0.04), L (0.07), M (0.09), N (0.08), O (0.12), R (0.08), S (0.09), T (0.10), U (0.04), Y (0.02)
    3. A (0.04), B (0.02), C (0.03), D (0.03), E (0.13), F (0.01), G (0.02), H (0.02), I (0.03), J (0.01), K (0.02), L (0.06), M (0.07), N (0.06), O (0.10), P (0.02), Q (0.01), R (0.06), S (0.07), T (0.08), U (0.03), V (0.02), W (0.02), X (0.01), Y (0.02), Z (0.01)
  2. For the codes you created, try encoding some words and have a friend decode them.
  3. The following character counts were obtained from a document: D (894), E (3320), F (698), M (661), O (1749), R (1600). Create a Huffman tree and code for these characters and encode the word 'freedom'.
  4. Write a computer program which will input characters and their frequencies, create a Huffman code, and encode and decode strings.

References


© 1998 by Warren Hagey
E-mail: wchagey@undergrad.math.uwaterloo.ca