This is a proof-of-concept Java implementation of arithmetic coding (and Huffman coding for benchmarking). This project is submitted as my final project to ORIE 4742.
- The encoders can only encode ASCII characters (0-127) and use 128 as the end-of-file (i.e., stop) symbol.
- Decoders must be constructed with the same probability model (for AC) or the same codes/Huffman tree (for Huffman coding) as the encoder. Otherwise, the output would not make sense, and decoding is not guaranteed to terminate, since the end-of-file symbol may be encoded differently.
- A new instance of a probability model must be created for each encoder and decoder, since the probability models can have internal state that depends on the text they have read.
- For arithmetic coding, the encoded bytes represent a binary fraction 0.(bytes), with infinitely many zero bits padded at the end. For example, a single encoded byte 0xC0 (binary 11000000) represents the value 0.11000000... in binary, i.e., 0.75.
- Only basic I/O functionality is implemented: encoding or decoding a file currently reads the entire file into memory and then performs the operation, so attempting to encode or decode very large files may result in an out-of-memory error.
The encoder and decoder take a probability model that predicts the probability of each character given the text it has seen so far.
The fixed probability model always assigns the same fixed probabilities to every symbol, regardless of the text seen so far; in the benchmarks below, it assigns equal probability to each character.
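As an illustrative sketch (the interface and class names here are hypothetical, not necessarily the ones used in this repository), the probability-model abstraction and the fixed model might look like:

```java
// Hypothetical sketch of the probability-model abstraction described above;
// the actual interface and class names in this repository may differ.
interface ProbabilityModel {
    int NUM_SYMBOLS = 129; // ASCII 0-127 plus 128 as the end-of-file symbol

    /** Probability that the given symbol appears next, given the text seen so far. */
    double probability(int symbol);

    /** Tell the model that this symbol was just read, so adaptive models can update. */
    void observe(int symbol);
}

/** Fixed model: every symbol always gets probability 1/129, regardless of context. */
class FixedProbabilityModel implements ProbabilityModel {
    @Override
    public double probability(int symbol) {
        return 1.0 / NUM_SYMBOLS;
    }

    @Override
    public void observe(int symbol) {
        // No internal state: the fixed model ignores the text it has seen.
    }
}
```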
The Dirichlet model is an adaptive model that counts the frequency of each symbol it has seen so far and predicts probabilities proportional to those counts plus a pseudocount alpha: the probability assigned to symbol i is (count(i) + alpha) / (n + 129 * alpha), where count(i) is the number of times symbol i has appeared among the n characters read so far and alpha > 0 controls how quickly the model adapts.
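Continuing the sketch above (hypothetical names, assuming the `ProbabilityModel` interface from the previous snippet), the Dirichlet model might look like:

```java
/**
 * Sketch of the Dirichlet model: predicted probabilities are proportional to
 * (observed count + alpha). Names and details are illustrative, not the
 * repository's exact implementation.
 */
class DirichletModel implements ProbabilityModel {
    private final double alpha;                        // pseudocount added to every symbol
    private final int[] counts = new int[NUM_SYMBOLS]; // frequency of each symbol so far
    private int total = 0;                             // number of characters read so far

    DirichletModel(double alpha) {
        this.alpha = alpha;
    }

    @Override
    public double probability(int symbol) {
        return (counts[symbol] + alpha) / (total + alpha * NUM_SYMBOLS);
    }

    @Override
    public void observe(int symbol) {
        counts[symbol]++;
        total++;
    }
}
```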
The Bigram Dirichlet model is an adaptive model which maintains a separate Dirichlet model for each possible previous character. It counts the frequency of each symbol conditioned on the character immediately before it, and predicts the next character using the Dirichlet model associated with the most recently read character.
This aims to take advantage of the fact that in English and other languages, the letters in the text are not independent of each other. For example, in English, the next letter after q is almost always u.
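A sketch of the bigram variant, assuming the `DirichletModel` sketch above (again, names and the choice of starting context are illustrative assumptions):

```java
/**
 * Sketch of the Bigram Dirichlet model: one Dirichlet model per possible
 * previous character, so counts are conditioned on the character just read.
 */
class BigramDirichletModel implements ProbabilityModel {
    private final DirichletModel[] perContext = new DirichletModel[NUM_SYMBOLS];
    private int previous = 0; // arbitrary starting context before any character is read (an assumption)

    BigramDirichletModel(double alpha) {
        for (int i = 0; i < NUM_SYMBOLS; i++) {
            perContext[i] = new DirichletModel(alpha);
        }
    }

    @Override
    public double probability(int symbol) {
        return perContext[previous].probability(symbol); // condition on the previous character
    }

    @Override
    public void observe(int symbol) {
        perContext[previous].observe(symbol);
        previous = symbol;
    }
}
```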
The encoder and decoder support renormalization and underflow handling. For encoding, we keep track of the possible doubles that can be used to encode what we've seen so far as a range [low, high). If the range falls entirely in [0, 0.5) or [0.5, 1), we can output a 0 or 1 bit, respectively, and renormalize by rescaling the range back to [0, 1). If the range straddles 0.5 but lies entirely within [0.25, 0.75), we are in an underflow situation: we expand the range about the midpoint and remember a pending bit, which is emitted (as the opposite of the next ordinary bit) once that bit is known.
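A simplified sketch of that encoder-side renormalization (hypothetical names; the repository's encoder may track the range and pending underflow bits differently):

```java
/** Simplified sketch of encoder-side renormalization and underflow handling. */
class RangeEncoderSketch {
    private double low = 0.0, high = 1.0;   // current range [low, high)
    private int pendingBits = 0;            // deferred underflow bits
    private final StringBuilder output = new StringBuilder();

    /** Narrow the range to the sub-interval [symLow, symHigh) chosen for a symbol, then renormalize. */
    void encodeInterval(double symLow, double symHigh) {
        double width = high - low;
        high = low + width * symHigh;
        low = low + width * symLow;
        renormalize();
    }

    private void renormalize() {
        while (true) {
            if (high <= 0.5) {                         // entirely in [0, 0.5): output 0
                emit(0);
                low *= 2; high *= 2;
            } else if (low >= 0.5) {                   // entirely in [0.5, 1): output 1
                emit(1);
                low = 2 * (low - 0.5); high = 2 * (high - 0.5);
            } else if (low >= 0.25 && high <= 0.75) {  // underflow: narrow range straddling 0.5
                pendingBits++;                         // defer the bit, expand about the midpoint
                low = 2 * (low - 0.25); high = 2 * (high - 0.25);
            } else {
                return;                                // range is wide enough to encode the next symbol
            }
        }
    }

    private void emit(int bit) {
        output.append(bit);
        for (; pendingBits > 0; pendingBits--) {
            output.append(1 - bit);                    // pending underflow bits are the opposite of the emitted bit
        }
    }
}
```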
For decoding, we also keep track of a range [low, high) as well as a truncated version of the encoded bitstring, `encoded`. As more characters are decoded, we bring in more and more bits from the encoded bitstring. The binary fraction represented by the bitstring therefore lies in the range [`encoded`, `encoded` + LSB brought in), where the lower bound is the value the double would take if all later bits were 0 and the upper bound is the value if all later bits were 1. If this encoded range falls entirely inside the sub-interval of [low, high) corresponding to some character, we can decode that character and shrink [low, high) to that character's sub-interval.
If [low, high) falls entirely in [0, 0.5) or [0.5, 1) (or in the underflow interval [0.25, 0.75)), we renormalize [low, high) and rescale `encoded` accordingly, then bring in a new bit from the encoded bitstring. We keep decoding until the end-of-file symbol has been decoded.
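A matching sketch of the decoder-side bookkeeping (again simplified and with hypothetical names; only the renormalization and bit-reading logic described above is shown, and the invariant that [`encoded`, `encoded` + lsb) stays inside [low, high) is assumed to be maintained by the symbol-decoding step):

```java
/** Simplified sketch of decoder-side renormalization; names are illustrative. */
class RangeDecoderSketch {
    private double low = 0.0, high = 1.0;  // current range [low, high)
    private double encoded = 0.0;          // lower bound implied by the bits brought in so far
    private double lsb = 1.0;              // width of the uncertainty left by unread bits
    private final String bits;             // the encoded bitstring
    private int pos = 0;                   // index of the next bit to bring in

    RangeDecoderSketch(String bits) {
        this.bits = bits;
    }

    /** Bring in one more bit from the encoded bitstring (0 past the end, matching the zero padding). */
    private void readBit() {
        lsb /= 2;
        if (pos < bits.length() && bits.charAt(pos) == '1') {
            encoded += lsb;
        }
        pos++;
    }

    /** Renormalize [low, high) and rescale the encoded value with it, then bring in a new bit. */
    private void renormalize() {
        while (true) {
            if (high <= 0.5) {                         // entirely in [0, 0.5)
                low *= 2; high *= 2;
                encoded *= 2; lsb *= 2;
            } else if (low >= 0.5) {                   // entirely in [0.5, 1)
                low = 2 * (low - 0.5); high = 2 * (high - 0.5);
                encoded = 2 * (encoded - 0.5); lsb *= 2;
            } else if (low >= 0.25 && high <= 0.75) {  // underflow case
                low = 2 * (low - 0.25); high = 2 * (high - 0.25);
                encoded = 2 * (encoded - 0.25); lsb *= 2;
            } else {
                return;
            }
            readBit();                                 // keep the encoded window as tight as possible
        }
    }
}
```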
Huffman coding is a symbol code that assigns a codeword to each symbol. By our convention, the symbols are a subset of 0 to 128, inclusive, where 128 stands for the end-of-file symbol. We use a greedy algorithm to decide which codewords to assign to each symbol: we take the two least frequent symbols and assign them the longest codewords, where the last bit of the codeword is 0 and 1, respectively. We then merge the two symbols into one and repeat until there is just one symbol left.
The implementation uses a priority queue and a union-find data structure to keep track of which symbols have been merged. We construct the Huffman tree using the greedy algorithm, where the leaves represent the symbols and the path from the root to a leaf represents the codeword: we add a 0 when moving left and a 1 when moving right. In addition, we keep a codeword table, `codes`, to look up the codeword for each symbol efficiently, which is convenient for encoding.
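A sketch of the greedy construction (for brevity this version merges tree nodes directly through a priority queue rather than using the union-find structure mentioned above, and all names are illustrative):

```java
import java.util.PriorityQueue;

/** Sketch of greedy Huffman-tree construction from symbol frequencies. */
class HuffmanSketch {
    static class Node {
        final int symbol;        // -1 for internal nodes, 0-128 for leaves
        final long frequency;
        final Node left, right;

        Node(int symbol, long frequency) {
            this(symbol, frequency, null, null);
        }

        Node(int symbol, long frequency, Node left, Node right) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }
    }

    /** Build the tree by repeatedly merging the two least frequent nodes. */
    static Node buildTree(long[] frequencies) {
        PriorityQueue<Node> queue = new PriorityQueue<>((a, b) -> Long.compare(a.frequency, b.frequency));
        for (int s = 0; s < frequencies.length; s++) {
            if (frequencies[s] > 0) {
                queue.add(new Node(s, frequencies[s]));
            }
        }
        while (queue.size() > 1) {
            Node a = queue.poll();                     // least frequent: will end up on the 0 branch
            Node b = queue.poll();                     // second least frequent: will end up on the 1 branch
            queue.add(new Node(-1, a.frequency + b.frequency, a, b));
        }
        return queue.poll();                           // root of the Huffman tree
    }

    /** Walk the tree to fill the codeword table: append 0 for left, 1 for right. */
    static void fillCodes(Node node, String prefix, String[] codes) {
        if (node == null) return;
        if (node.symbol >= 0) {
            codes[node.symbol] = prefix;
            return;
        }
        fillCodes(node.left, prefix + "0", codes);
        fillCodes(node.right, prefix + "1", codes);
    }
}
```

For example, calling `fillCodes(buildTree(frequencies), "", codes)` fills `codes` so that `codes[s]` is the bitstring assigned to symbol `s`.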
For encoding, we encode character by character using the codeword table, and then add the codeword for the end-of-file symbol at the end. For decoding, since Huffman coding is a prefix code, it is uniquely decodable, and we can decode by moving down the tree according to the encoded bitstring. When we reach a leaf, we output that symbol and start again from the root. We know that the decoding is complete when we reach the end-of-file symbol.
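A sketch of the decoding walk, assuming the `Node` class from the previous snippet (a method that could live in the `HuffmanSketch` class above; not the repository's exact code):

```java
/** Decode a bitstring by walking down the Huffman tree until the end-of-file symbol is reached. */
static String decode(String bits, Node root, int eofSymbol) {
    StringBuilder out = new StringBuilder();
    Node node = root;
    for (int i = 0; i < bits.length(); i++) {
        node = (bits.charAt(i) == '0') ? node.left : node.right; // 0 = move left, 1 = move right
        if (node.symbol >= 0) {                                  // reached a leaf
            if (node.symbol == eofSymbol) {
                break;                                           // end-of-file symbol: decoding is complete
            }
            out.append((char) node.symbol);
            node = root;                                         // start again from the root
        }
    }
    return out.toString();
}
```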
For benchmarking, we used the following files. Some of the files are randomly generated while others are from the internet.
- `alice_full.txt`: A full copy of Alice's Adventures in Wonderland by Lewis Carroll, obtained from Project Gutenberg.
- `all_a.txt`: 1 million copies of the letter 'a'.
- `biased_random_50.txt`, `biased_random_99.txt`: Randomly-generated strings of 1 million ASCII characters (0-127, inclusive), with a 50% and 99% probability, respectively, of generating an 'a' and equal probabilities for all other characters. The characters are independent of each other.
- `english_words.txt`: A list of 10,000 English words separated by newlines, provided by MIT.
- `random_bits.txt`: A randomly-generated string of 1 million characters consisting of '0's and '1's, where each character is independent and equally likely to be '0' or '1'.
- `unif_random.txt`: A randomly-generated string of 1 million ASCII characters (0-127, inclusive), with equal probability for each character. The characters are independent of each other.
In addition to the Huffman, AC with Fixed Probability (with equal probability for each character), AC with Dirichlet, and AC with Bigram Dirichlet encoding schemes, we also included the size of the original file and the size of the file compressed by zip, an industry-standard compression scheme.
Here are the benchmarking results (all file sizes are in bytes):
| Scheme | alice_full | all_a | biased_random_50 | biased_random_99 | english_words | random_bits | unif_random |
|---|---|---|---|---|---|---|---|
| original | 144580 | 1000000 | 1000000 | 1000000 | 75879 | 1000000 | 1000000 |
| bigram (alpha=0.01) | 62729 | 9 | 565343 | 18880 | 31563 | 125015 | 873581 |
| bigram (alpha=1) | 64879 | 232 | 557154 | 21625 | 32796 | 125428 | 863490 |
| bigram (alpha=100) | 95295 | 12281 | 622121 | 35964 | 48858 | 146097 | 861912 |
| dirichlet (alpha=0.01) | 79329 | 8 | 552319 | 18676 | 34668 | 125009 | 858692 |
| dirichlet (alpha=1) | 79383 | 232 | 552255 | 18699 | 34778 | 125230 | 858613 |
| dirichlet (alpha=100) | 83996 | 12280 | 553894 | 28064 | 39193 | 137122 | 858790 |
| fixed_prob | 123755 | 876405 | 869581 | 876278 | 57738 | 876405 | 862693 |
| huffman | 79888 | 125001 | 553187 | 133572 | 34932 | 187411 | 860248 |
| zip | 52824 | 1154 | 652319 | 33462 | 27994 | 159144 | 876965 |
As percentage of original file size:
| Scheme | alice_full | all_a | biased_random_50 | biased_random_99 | english_words | random_bits | unif_random |
|---|---|---|---|---|---|---|---|
| original | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| bigram (alpha=0.01) | 43.39% | 0.00% | 56.53% | 1.89% | 41.60% | 12.50% | 87.36% |
| bigram (alpha=1) | 44.87% | 0.02% | 55.72% | 2.16% | 43.22% | 12.54% | 86.35% |
| bigram (alpha=100) | 65.91% | 1.23% | 62.21% | 3.60% | 64.39% | 14.61% | 86.19% |
| dirichlet (alpha=0.01) | 54.87% | 0.00% | 55.23% | 1.87% | 45.69% | 12.50% | 85.87% |
| dirichlet (alpha=1) | 54.91% | 0.02% | 55.23% | 1.87% | 45.83% | 12.52% | 85.86% |
| dirichlet (alpha=100) | 58.10% | 1.23% | 55.39% | 2.81% | 51.65% | 13.71% | 85.88% |
| fixed_prob | 85.60% | 87.64% | 86.96% | 87.63% | 76.09% | 87.64% | 86.27% |
| huffman | 55.26% | 12.50% | 55.32% | 13.36% | 46.04% | 18.74% | 86.02% |
| zip | 36.54% | 0.12% | 65.23% | 3.35% | 36.89% | 15.91% | 87.70% |
A visual representation (green = more compression, red = less compression):

With original and fixed probability AC scheme removed:

We have a few observations:
- The fixed probability arithmetic coding scheme consistently achieves a compression rate of 87.5%, which is not great compared to other methods. This is expected since each character is given the same range size (equal probability), so the scheme cannot take advantage of the distribution of the characters and the relations between characters. The 87.5% comes from the fact that we're only encoding ASCII characters 0-127, which means that for every byte, the most-significant bit is 0. Therefore, for every 8 bits, the encoding scheme only needs 7 bits to encode the information.
- Note that `english_words.txt` is a significant outlier for unknown reasons.
- For the file of uniform random and independent ASCII characters, none of the schemes (including zip) can achieve compression much better than around 87.5%. This is expected: a uniform distribution over 128 characters has log2(128) = 7 bits of entropy per character, which is 7/8 = 87.5% of each byte.
- The compressed files produced by Huffman are significantly larger than those produced by the adaptive arithmetic coding schemes for the file with all 'a's and the file with roughly 99% 'a's. This demonstrates the power of adaptive arithmetic coding: it can assign smaller and smaller probabilities to symbols that never appear, while assigning larger and larger probabilities to symbols that appear all the time. In Huffman, however, each symbol must take a minimum of 1 bit. In the extreme case of all 'a's, Huffman has to use 1,000,001 bits, one for each 'a' and one for the stop symbol, whereas the adaptive models can use just 8 or 9 bytes to represent the entire sequence of 'a's, since it has very little entropy. In these extreme cases, the adaptive arithmetic coding schemes do even better than the industry standard `zip`!
- For the Dirichlet and Bigram Dirichlet models, a smaller alpha seems to work better than a larger alpha, meaning that models that adapt more quickly to the context do better. If alpha is large, the model has to keep assigning large probabilities to symbols it has never seen, whereas if alpha is small, the probability of unseen symbols diminishes very quickly.
- For natural English text (Alice's Adventures in Wonderland), the Bigram Dirichlet models with small alpha do better than the Dirichlet models with small alpha. This is because, as noted above, the letters in English text are not drawn independently at random, and the Bigram Dirichlet model can learn the correlations between adjacent letters while the Dirichlet model cannot. This also holds, to a lesser extent, for the list of English words. For the randomly-generated files, since the characters are generated independently, the Bigram Dirichlet models do not perform better than the corresponding Dirichlet models; in fact, they have more overhead, since they must learn a separate frequency table for each preceding character.
- David J.C. MacKay's book *Information Theory, Inference, and Learning Algorithms*
- Wikipedia page on arithmetic coding: https://en.wikipedia.org/wiki/Arithmetic_coding
- Wikipedia page on Huffman coding: https://en.wikipedia.org/wiki/Huffman_coding
- https://www.youtube.com/watch?v=RFWJM8JMXBs
- https://michaeldipperstein.github.io/arithmetic.html
- I asked ChatGPT a few high-level conceptual questions and to look up documentation, but didn't use it to generate large chunks of Java code.
- I used Alice's Adventures in Wonderland as a sample text. It is available from Project Gutenberg: https://www.gutenberg.org/ebooks/11. I replaced non-ASCII characters with ASCII characters.
- The 10,000 English words for benchmarking are from https://www.mit.edu/~ecprice/wordlist.10000.