Information theory
Project detail
The test file “EE6743_grail_testfile.txt” is the script from the movie “Monty Python and the Holy Grail.” It has been modified so that it contains only 30 different characters:
lowercase a-z (ASCII 97-122 decimal),
“space (sp)” (ASCII 32 decimal),
“line feed (lf)” (ASCII 10 decimal),
“carriage return (cr)” (ASCII 13 decimal),
and “end of text (etx)” (ASCII 3 decimal).
All uppercase, punctuation, and numerals have been removed. The file is in “.txt” format, and contains 61,392 characters. It has also been compressed using zip. First, unzip the file, and note the size of the zipped file vs. the unzipped .txt file.
Write the following programs:
1. Write a program to calculate the probabilities of the 30 characters directly from the file.
2. Write a program to construct a binary Huffman code based on these probabilities.
3. Print out a neat, concise table of your Huffman code, showing the character, probability, Huffman code representation, and code length for each character.
4. Write a program to Huffman encode the data, so that the output is a sequence of ones and zeros. Then group the output bits into groups of 8 (i.e., bytes) so that the final output file is a sequence of 8-bit ASCII characters.
5. Have your program calculate the final length, in bits, of the compressed output file. Also have your program calculate the entropy of the source based on your derived probability values (assuming that source symbols are independent, which is not really true).
6. Have your program calculate the ratio of output bits per input character. This is Lavg for this compression scheme.
7. Compare Lavg to the calculated entropy of the source. Explain the differences you see.
8. Write a program to decode your encoded data. Confirm that it perfectly reconstructs the original file.
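As a starting point, steps 1 and 5 (character probabilities and first-order entropy) can be sketched in Python as follows; the function names and the sample string are illustrative only, and your program should read the actual test file instead:

```python
import math
from collections import Counter

def char_probabilities(text):
    """Return a dict mapping each character to its empirical probability."""
    counts = Counter(text)
    total = len(text)
    return {ch: n / total for ch, n in counts.items()}

def entropy(probs):
    """First-order entropy in bits/symbol, assuming independent symbols."""
    return -sum(p * math.log2(p) for p in probs.values())

# Hypothetical stand-in for the real file contents:
sample = "abab"
probs = char_probabilities(sample)
print(entropy(probs))  # two equiprobable symbols -> 1.0 bit/symbol
```

For the real file, `text = open("EE6743_grail_testfile.txt").read()` would replace the sample string, and the 30 probabilities feed directly into the Huffman construction of step 2.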
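One possible approach to steps 2 and 3 is to build the Huffman code with a priority queue, repeatedly merging the two least-probable subtrees; this is a sketch, not the only valid implementation, and the tie-breaking counter is just one way to keep the heap comparisons well-defined:

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code from {symbol: probability}.

    Returns {symbol: bitstring}. Each heap entry carries a partial code
    table; a counter breaks probability ties so dicts are never compared.
    """
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        # Prefix the two merged subtrees with 0 and 1 respectively.
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

def print_table(probs, code):
    """Print the character / probability / codeword / length table."""
    print(f"{'char':>6} {'prob':>8} {'code':>12} {'len':>4}")
    for sym in sorted(code, key=lambda s: (len(code[s]), s)):
        print(f"{sym!r:>6} {probs[sym]:8.5f} {code[sym]:>12} {len(code[sym]):4d}")
```

Because merges always pair the two lowest probabilities, the resulting code is prefix-free and has minimum average length among all binary symbol codes for the given probabilities.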
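The encode/pack/decode chain of steps 4 and 8 can be sketched as below. Note the padding assumption: packing pads the bitstream with zeros to a byte boundary, which is safe here because the etx character marks the true end of the text; the helper names are illustrative:

```python
def encode(text, code):
    """Huffman-encode text to a string of '0'/'1' characters."""
    return "".join(code[ch] for ch in text)

def pack_bits(bits):
    """Pad with zeros to a byte boundary and pack into bytes."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def decode(bits, code):
    """Decode a bitstring by matching prefix-free codewords."""
    inverse = {c: s for s, c in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)
```

For step 6, `len(encode(text, code)) / len(text)` gives the output bits per input character, i.e., Lavg; step 7 then compares this against the entropy from step 5.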
Turn in your source code, your Huffman table, evidence that the programs work, and a summary of all calculated values. Include a write-up (1 page max) discussing your results. In particular, discuss the relationship between the (approximate) entropy of the source and the average length of the code.