View Full Version : Huffman Codes
NEW COMPETITION!
Introduction
Huffman coding is way to generate an optimal variable-width representation of characters. Compare this to ASCII, which is a fixed-width representation. Variable width codes are much more space efficient than fixed with codes; in order to be optimal, you obviously want the most frequently occurring characters to have the smallest widths..
Huffman's algorithm takes the characters and uses their frequencies to build a binary tree (not a BST), which can then be traversed to determine the code for each character. I don't have much time to go into it, but there are many, many websites that are full of information. If you don't know what huffman codes are, you can do a google "Huffman Codes" this first two hits:
http://www.rasip.fer.hr/research/compress/algorithms/fund/huffman/
http://www.cs.uidaho.edu/~karenv/cs213/cs213.useful.pages/huffman.html
The Task
We'll keep it pretty basic, to start with. Write a program that does the following:
(1) reads a text file
(2) determines the frequency of each character in the file
(3) generates a huffman code to maximize space efficency for that file.
(4) prints out a table with the encoding for that file. i.e something that looks like:
a -> 0
b -> 10
c -> 110
A table for a file consisting of only two characters, say 'a' and 'b', would look like:
a -> 1
b -> 0
(note the 1's and 0's are obviously reversable)
I believe that since we're all using Huffman's algorithm, we should all get the same optimal prefix code (allowing for the above mentioned reversal of 1's and 0's); there may be multiple optimal prefix codes for any given set of frequencies, though.
Extensions
I plan to ask you to write compressed files (using the huffman codes you generated) and read them. Perhaps we can come up with a standard header, so any one person's app can read a binary file that another person wrote.
If I haven't been clear enough about
the task, or you need more explanation of what huffman codes are, let me know.
Oh, btw, huffman coding is a greedy algorithm, so it's fast. :D I just learned about greedy algorithms in my algorithms class.
Dru Lee Parsec
10-03-2002, 05:27 PM
Oh man, I wish I still had my code for all my projects from college. We did that very task as one of my final projects in our algorithms class.
I was a C++ coder back then. It would be fun to look at my own code from so many years ago and see if I can understand it.
jemfinch
10-04-2002, 02:14 AM
So when you encode a file with variable-length encodings, how do you determine where one variable length character begins and the other ends?
(I'm jumping the gun a little, obviously, that won't be needed until you ask us to write the actual compressed file.)
Jeremy
Originally posted by jemfinch
So when you encode a file with variable-length encodings, how do you determine where one variable length character begins and the other ends?
Assuming you have the tree it was encoded with, you just read bits and follow the tree until you hit an endpoint. The second link kmj posted illustrates it pretty well.
sans-hubris
10-04-2002, 02:49 AM
I'll look through my old code. I may still have this program.
Originally posted by jemfinch
So when you encode a file with variable-length encodings, how do you determine where one variable length character begins and the other ends?
(I'm jumping the gun a little, obviously, that won't be needed until you ask us to write the actual compressed file.)
Jeremy
prefix codes. (http://www.nist.gov/dads/HTML/prefixcode.html)
These guarantee that while reading a stream of bits, as soon as you recognize a character in that stream, you have that character. I.e, given the first code I mentioned above,
11010100000110
must be
cbbaaaac
(unless I messed up somewhere) there's no other way to interpret that string, given the code.
Alright; I have the code generation part done in python. When I get some time, I'll put up some example files so that we have a common basis for comparison.
Here's a sample text file and the output from my code generator:
http://kmj.homelinux.org/~kmj/sample.txt
http://kmj.homelinux.org/~kmj/sample.out
It'd be cool if someone culd verify my algorithm with their own output. :D
Oh, and "Heil Smoochie!!!"
jemfinch
10-05-2002, 01:17 PM
kmj: your output looks good:
- printKmjStyle ss
' ' (124) 110
'e' (75) 000
'a' (64) 1110
't' (62) 1011
'o' (61) 1001
'i' (61) 1010
's' (50) 0111
'n' (49) 0110
'r' (45) 0100
'l' (29) 10001
'h' (24) 01010
'p' (20) 00101
'd' (20) 00110
'm' (19) 111111
'c' (19) 00100
'\n' (16) 111101
'f' (14) 100001
'y' (13) 010111
'u' (12) 010110
'g' (11) 001110
',' (8) 1111001
'v' (7) 1000000
'.' (7) 1000001
'w' (6) 0011111
'k' (5) 11111001
'b' (5) 11111010
'T' (5) 11111011
'x' (3) 00111100
'q' (3) 00111101
'z' (2) 111100011
':' (2) 111100000
'\"' (2) 111100001
'j' (1) 1111000100
'S' (1) 1111000101
'O' (1) 1111100010
'M' (1) 1111100011
'F' (1) 1111100000
'B' (1) 1111100001
val it = () : unit
-
The numbers aren't exactly the same, but they're all the same length, which is what's important. Interestingly, though, I get a length of 849 characters for sample.txt. Also, how do you get 219 bytes compressed?
I'll post my code (it's in SML) when I figure out how to make it a standalone program :)
Jeremy
jemfinch
10-05-2002, 01:39 PM
I'll attach my SML code as a file, so people who don't want to look at it don't have to.
Not all the files are included, of course -- if you want to see the definitions in String2, List2, or My, just ask, I'll post them too.
No, I won't post the code by attachment, because it "doesn't have the right extension." To the complaint forum I go...
Jeremy
Interesting; your output reports 16 newlines, mine is 14. That must be the discrepency. Did you C&P it?
D'oh. The compressed size messed up! Thanks Jemfinch. I've updated that file.
Here's my version, in python: (completely uncommented :p)
jemfinch
10-05-2002, 06:11 PM
Ah, good. Our "compressed sizes" are exactly the same (which means that we either both implemented the correct algorithm or we both implemented the same incorrect one :))
I'll see if I can post my code now...nope.
Jeremy
jemfinch
10-05-2002, 07:03 PM
kmj: I took a look at your code. Interestingly enough, my SML code is actually 10 lines shorter than your Python code, which I found to be quite surprising. (And then I did a "perl -ne 'print if !m/^\s*$/' huffman.py | wc -l" and realized we had the exact same number of lines with text :))
I hope you don't mind me picking on PriorityQueue a bit, but I noticed a few things about it that might help some people to note...
I don't know if PriorityQueue should be a subclass of list -- there are definite operations on list that aren't valid for a PriorityQueue. For instance, list addition -- given that implementation, adding two priority queues wouldn't maintain the second queue's priorities. Similarly, slicing, indexing, sorting, extending, and counting really don't have any meaning on priority queues, but in an implementatation that derives from list, they're defined. Additionally, a subclass should be usable whereever its superclass is usable )that is, it should be covariant with its superclass), but PriorityQueue modifies the signature of both insert and pop, making it unusable where lists are used. PriorityQueue can (and probably should) use an underlying list implementation, but since it isn't by nature a list, supporting all the operations that lists support in the way they support them, it shouldn't subclass list.
I'm sure you know that the standard idiom for insertion into a list is aList.insert(index, item) instead of aList[index:index] = [item] -- just chalk that up as another reason why PriorityQueue shouldn't subclass list :)
You haven't run into the bug yet, and may not ever in this case due to the nature of list.__init__, but you should (almost) never use mutable default arguments. The default argument is only evaluated once, so any modifications you make to that argument will be visible in anything that uses that default argument later on. So instead of "data=[]", you should have "data=None" and then, in the function, have an "if data is None: data = []".
While we're on the topic of default arguments, you might consider making the built-in function "cmp" a default for PriorityQueue's cmp_func argument.
The "view" method is really just a specialized version of a standard fold (reduce, in Python terms) on a data structure. You could write it more generally like this:
def foldl(self, f, acc):
for elt in self:
acc = f(elt, acc)
return acc
And so view becomes a special case of such behavior, being equivalent to "foldl(lambda x, acc: acc + x.view(), '')".
In Python, however, such an idiom really isn't all that common. Much more common is to simply define a standard iterator (an __iter__ method) for the class and let the user use map/filter/reduce/for loops on the class. Then view looks like "reduce(lambda x, y: x.view() + y.view(), Q)".
Another issue with view is that it introduces some very serious coupling between the PriorityQueue and what it contains. For view to be useful, whatever is contained by priority queue would need to define a view() method, which isn't very common in the Python world. It would probably be better to rename "view" on both PriorityQueue and Node to "__str__", to fit into the standard names in the Python world.
Now, after saying all that...
This PriorityQueue isn't really a PriorityQueue. It's really just a sorted list. A real priority queue should take a priority as an argument on insert, and then insert the element into the underlying data structure (usually a heap, which, like your implementation, is usually based on an array (in Python, a list)) based on that priority. The "proper" way to do sorted insertions into a list is of course a binary search, but that's a lot harder to get coded correctly than a linear search. Fortunately, Python did the work for you already with its bisect module and its insort function.
In 2.3, a priority_queue/heap class will be included by default.
Jeremy
jemfinch
10-06-2002, 12:11 PM
Ok, here's my SML version :)
Jeremy
jeremy,
Thanks for your comments. I've never actually implemented, or even used a priority queue before. I'm sure you're right that a "real" priority queue wouldn't not be implemented the same way this is, or have the same interface; I mainly implemented this as a convenience for myself, so that's why I inherited from list and did't go as far as to implement a "real" priority queue. Perhaps I should have chosen a different name, to prevent confusion.
Regarding the "view" and coupling issues, I'm not to worried about that. As you can see, it's not even used in my code, and that was just something I implemented while I was writing the code, in order to test it.
Again, thanks for the comments.
P.S - You may want to ignore that print_tree function; I just realized it was there... In later versions I had gotten rid of that, and I don't know if it works.
Halide
10-07-2002, 10:19 AM
Originally posted by kmj
Here's my version, in python: (completely uncommented :p)
cool :)
/me likes to look at 'advanced' code that he has a minimal understanding of... :P
hehe; Halide; this stuff is easy! you can do it no problem.
Halide
10-07-2002, 09:33 PM
which halide are you referring to?
[edit] hey look, i got it to run ;) (after strike told me to install 2.2 :P)
http://halide.cc/images/snapshot5.png
sedarious
10-10-2002, 10:43 AM
is this competition to be judged? and is there a "final submission" date?
No; and no. :) The most judging you'll get is a peer review.
sedarious
10-10-2002, 11:07 AM
Thats cool. Just thought I would ask. I haven't coded is asm for some time and thought this would be a good challenge to get back into the swing of things. Should have something in after the long weekend.
Strike
10-10-2002, 02:06 PM
I'm working on it, but I'm also trying to do it with some cool new stuff that I learn, so it's taking a while. Plus, I've got a lot of other things going on as well. So I may just have to dredge this thread up from the dead after a while.
stuka
10-15-2002, 12:51 PM
Had a thought this morning - wouldn't Huffman codes be a useful technique for cryptography? (Are they already being used?)
Well, for a given language, there are a limited number of optimal trees, so trying each tree would solve that. That is, if the code were based on frequencies in languages... now, if the code were based on the frequencies of characters in the message itself, it would obviously have to be sent to the recipient also.. know what I'm saying? So, while I don't know either way whether or not they'd be useful in cryptography, I don't immediately see their applicability. Do you have any specific ideas?
stuka
10-15-2002, 01:08 PM
Well, I was thinking that rather than using an optimal tree for a given language, you used one from an exotic text (or in a different language with the same or very similar alphabet), and took the resulting bit string and broke it up into 8 or 16 bit groups and sent the message as ASCII or Unicode, it wouldn't be terribly easy to break, but trivial to decode. I'm sure that it could be broken, of course, but for non-national security issues, it seems like the method should be relatively safe.
stuka
10-15-2002, 01:51 PM
I kinda thought so, but didn't want to go off half-cocked, so I thought I'd see what y'all thought.
Strike
10-15-2002, 03:48 PM
Stuka: but then you have to worry about "key exchange". The key in this instance would be the text of choice. Though for nonessential stuff, yeah I can see this being useful.
stuka
10-15-2002, 04:10 PM
Strike: yeah, I knew about that, but you have to worry about key exchange in ANY key based protocol - though admittedly PKI has an elegant (though technically, but not practically, breakable) method.
bwkaz
10-15-2002, 09:33 PM
So do like the SSL protocol does -- use PKI for key exchange, then some symmetric-key algorithm (your Huffman code) for the actual messages.
stuka
10-16-2002, 03:16 PM
Oooh...another fun thought - code up a Factory class that churns out Huffman encoders based on input texts...
/me loves Design Patterns :D
jemfinch
10-16-2002, 04:00 PM
Originally posted by Stuka
Oooh...another fun thought - code up a Factory class that churns out Huffman encoders based on input texts...
/me loves Design Patterns :D
To me and my functional mind, that seems to be just two functions:
makeEncoder : string -> encoder
encode : encoder -> string -> string
/me loves functional programming!
Jeremy
stuka
10-16-2002, 04:18 PM
hmm...well, in a well done OOP design, it would look like:
encoder = factory.create_encoder(string);
encoder.encode(string2);
Still 2 functions.....
jemfinch
10-16-2002, 10:49 PM
One might also make an argument for making the constructor for an Encoder take a string:
encoder = Encoder(string)
encoder.encode(string2)
Jeremy
stuka
10-17-2002, 11:28 AM
Yes, but that wouldn't be an example of the Factory pattern, now would it? ;)
jemfinch
10-17-2002, 04:21 PM
Originally posted by Stuka
Yes, but that wouldn't be an example of the Factory pattern, now would it? ;)
Actually, it wouldn't. For it to be a true factory, it would have to return instances of different types (classes) of encoders. The Encoder constructor as detailed above produces only one encoder: a huffman encoder. As per this (http://exciton.cs.oberlin.edu/javaresources/DesignPatterns/FactoryPattern.htm) link (which is the same place you linked me to in order to show me the Command pattern awhile back) Factories by definition must produce differerent classes. If that was a Factory pattern, then every class constructor (and practically every pure function) in existence would qualify as a Factory, and the utility of having a pattern named "Factory" would be completely lost.
Now, for a little comment on design patterns, since I can't help but editorialize: too many people emphasize design patterns far too much. Design patterns, IMO, are simply a vocabulary of OO design. They're not the end-all-be-all of programming philosophy, nor are they really even intended to facilitate actual programming. They exist to make communication between programmers go more smoothly. For a long time people in the procedural programming community have had their own vocabulary to facilitate communication between programmers, and the people in the functional programming community have had their own (including such gems as "phantom types" and "contravariance" and such). These Design Patterns are the Object-Oriented programming community developing their own vocabulary to facilitate communication about OOP. That's one reason accuracy is really important in the discussion of design patterns -- if an OO programmer can't be sure what another OO programmer means when he says "Factory Pattern," then design patterns' whole raison d'etre is gone.
Jeremy
jemfinch
10-17-2002, 04:25 PM
Oh...and I just realized I read your "Yes, but that wouldn't be an example of Factory Pattern, now would it?" statement wrong. I missed the "wouldn't" and read "would." Sorry about that.
Actually, the reason I posted the example with the constructor was to show that the problem wasn't complex enough to warrant the Factory Pattern, IMO :)
Jeremy
stuka
10-18-2002, 10:06 AM
Actually, the Factory thought wasn't really to solve the problem. And you are correct, as I listed it, it really wouldn't qualify as a factory - now, if the the encoder I mentioned above was a subclass of an abstract factory that happened to create Huffman encoders, then what I wrote was correct (admittedly incomplete, but correct ;) ).
As for your editorializing, I think you are correct in part...while Design Patterns are, as you mention, a vocabulary for discussing design, they also form a knowledge base - and as in any discipline, familiarity with already developed designs is a darn good way to learn more and avoid the ever-dreaded wheel reinvention. Design patterns in object oriented programming are a way of reusing design - which, if the design is ggod, serves much the same purpose as code reuse.
jemfinch
10-18-2002, 08:15 PM
Originally posted by Stuka
As for your editorializing, I think you are correct in part...while Design Patterns are, as you mention, a vocabulary for discussing design, they also form a knowledge base - and as in any discipline, familiarity with already developed designs is a darn good way to learn more and avoid the ever-dreaded wheel reinvention. Design patterns in object oriented programming are a way of reusing design - which, if the design is good, serves much the same purpose as code reuse.
I can't disagree :)
Jeremy
jschen
10-21-2002, 05:34 PM
Actually I'm doing this for a class right now and was wondering if anyone could help. I'm doing this in C++. I'm supposed to implement a priority queue using a min heap.
I'm having trouble creating a heap and keeping the frequency and character together. Do I need to use nodes? And how would I do this? Any help would be appreciated since this is giving me lots of grief. I understand how the hufmann encoding works its just writing the code.
stuka
10-21-2002, 06:01 PM
Well, I've never had the pleasure of writing a heap, but you would definitely need a node for this. Being C++, I'd have a node class that contained the character and frequency. I'd also have a set of overloaded operators (==, <, >, <=, >=) that compared the nodes, so that the heap code itself could be templatized for generic nodes.
jemfinch
10-21-2002, 08:52 PM
Actually, heaps are most often implemented with arrays, not binary nodes. The standard algorithm should be easy enough to find with google.
Jeremy
Strike
10-21-2002, 09:05 PM
http://www.nist.gov/dads/HTML/heap.html
yay DADS
stuka
10-21-2002, 11:40 PM
Fair enough...but it's gotta be an array of some sort of node-like structure...and if you're doing any sorting/comparing, everything I said is valid.
jschen
10-22-2002, 12:01 PM
Can anyone help me create the heap/nodes? I haven't programmed with pointers and nodes forever...IM at ValidzWrk if you feel like you have the time.
jschen
10-22-2002, 01:54 PM
This is what I did before I noticed I needed to use a heap for my priority queue. I read in my file and have the characters in one array and the frequencies in another array. How would I make these into nodes? or should I backtrack and when I read in the file then make them nodes and how would I go about doing that?
stuka
10-22-2002, 02:02 PM
Use a node class or struct, with members that are a character and a frequency. You could store these nodes in an array. It shouldn't take much adaptation of your code to do this.
inkedmn
10-22-2002, 02:07 PM
Originally posted by jschen
Can anyone help me create the heap/nodes? I haven't programmed with pointers and nodes forever...IM at ValidzWrk if you feel like you have the time.
i'm sure none of us would feel right cheating you out of a valuable learning experience... :)
jemfinch
10-22-2002, 02:14 PM
You really don't need nodes. Most heaps are just represented as arrays of sortable elements. The heap property is maintained by using placing the element in the last position on insert and then running a "sift_up" function on the heap.
Now, for the huffman tree, you'll need a tree with nodes for sure. I have a pretty simple one in my SML source:
datatype huffman_node = SingleChar of char
| Node of huffman_node * huffman_node
So basically, a node can be either a single character, or a struct of pointers to two other nodes.
Jeremy
stuka
10-22-2002, 02:32 PM
Jeremy: okay, he doesn't need 'nodes' for a heap. However, whether he uses an array or a tree behind the scenes, the best way to do it would be to 'bundle' the character and frequency data in a struct or class. Then you can reuse that object in either situation, providing a better, more modular design.
<edit>/me forgot to be extremely careful of semantics in any jemfinch-related discussion :P</edit>
sedarious
10-22-2002, 03:22 PM
Is there some standard header structure/size that we should use to in our .out file so it can be decoded by anyone elses huffman decoder/encoder?
sedarious: we never got that far.. Tell ya what, once you submit the first part we'll come up with a standard header. :)
jschen
10-22-2002, 09:15 PM
Can someone tell me what's wrnog with my heap code?
Used a book to do it but doesn't work correct?
Thanks
sedarious
10-23-2002, 01:37 PM
Here is the x86 code. It takes a file specified in the command line and then outs stuff to the screen. This is designed to work under windows. It should work under most pure DOS platforms also. It assumes the PSP is in the same segment at the program (which is usually okay). An execution example would be
huffman sample.txt
here is the binary coding that it outputs...
:011111
:001
":000010011
,:0000101
.:0000111
::000010010
B:0000100011
F:0000100010
M:0000100001
O:0000100000
S:0000001111
T:00000010
a:0001
b:00000001
c:000001
d:11010
e:111
f:011110
g:110001
h:10011
i:0110
j:0000001110
k:00000000
l:01110
m:11011
n:1010
o:0101
p:11001
q:10010101
r:1011
s:1000
t:0100
u:110000
v:0000110
w:1001011
x:10010100
y:100100
z:000000110
you can either download the zip or go to these links for the code
http://zero-entropy.net/misc/huffman.asm
http://zero-entropy.net/misc/io.inc
vBulletin® v3.7.0, Copyright ©2000-2008, Jelsoft Enterprises Ltd.