View Full Version : Anagrams! (yet another "as many languages as possible" thread :))
jemfinch
07-01-2002, 12:50 PM
Write a program to generate all the anagrams in a given dictionary. Output should look like this:
...
banana: <none>
...
post: pots stop tops
...
zebra: <none>
That is, anagram lists should be sorted by their lexicographically least anagram, and the lists themselves should be sorted in alphabetical order. Words should be listed only once in the results; "pots," "stop," and "tops" would not be listed on their own lines with their anagrams. In short, a user should be able to search the results for a given word and have it show up only once.
You can download the dictionary I used from here (http://www.freebsd.org/cgi/cvsweb.cgi/~checkout~/src/share/dict/web2?rev=1.10&content-type=text/plain). First, you'll want to "normalize" it by making all words lowercase and removing all duplicates. Here's an untested Python script to do so:
#!/usr/bin/env python
import sys
d = {}
for line in sys.stdin.readlines():
d[line[:-1].lower()] = 1
l = d.keys()
l.sort()
for word in l:
print word
That's untested, but should work; it takes its input as stdin and outputs on stdout, like a standard *nix filter. I have a reference normalizer at home written in SML that can output the "reference dictionary" for the project.
There is a secret to doing this, as anyone who's read Jon Bentley's "Programming Pearls" knows. I'm not going to post the secret, but I'm sure it'll be easily read from people's posts as the solutions are posted. Rest assured, however, the problem is most definitely not computationally infeasible. My SML reference implementation takes about 22-26 seconds to process the entire dictionary. In languages that are geared toward text processing, if your code is much longer than 20-30 lines, you're probably not doing it the right way.
I'm really just curious to see two things: what this problem looks like in other languages, and how they compare speedwise to my languages of choice :) I'll post my SML reference implementation later when people have started posting their solutions. I'm good for SML, O'Caml, and Python, but as I'm sure others will be posting O'Caml and Python solutions just as good as my own, I probably won't bother posting (well, writing and then posting :)) those two implementations.
Let's see what people come up with!
Jeremy
GnuVince
07-01-2002, 02:03 PM
Maybe you should've tested:
[vince@vincent: ~/prog/ocaml/contest/anagram]% ./jemfinch.py
banana
post
spin
zebra
['banana', 'post', 'spin', 'zebra']
['banana', 'post', 'spin', 'zebra']
['banana', 'post', 'spin', 'zebra']
['banana', 'post', 'spin', 'zebra']
[vince@vincent: ~/prog/ocaml/contest/anagram]%
Also, your FreeBSD.org link is broken.
jemfinch
07-01-2002, 02:16 PM
It's hard to test when you don't have an interpreter and you're not using your own computer :)
FreeBSD's cvsweb seems to be messing up a bit -- maybe it'll be working later. The process can, of course, be done with any decent-sized dictionary of words.
Jeremy
GnuVince
07-01-2002, 02:33 PM
Link removed.
stuka
07-01-2002, 03:09 PM
Well, heck, I'd almost feel like I was cheating if I wrote this in C++ now...:(
GnuVince
07-01-2002, 03:15 PM
Hrmmm... I almost got it I think, but I've got a nasty annoying bug:
[vince@vincent: ~/prog/ocaml/contest/anagram]% ./vince
tops
stop
pots
spot
tops stop pots spot
tops stop pots
tops stop
tops
jemfinch
07-01-2002, 03:58 PM
GnuVince: There's a reason I didn't post the efficient way to do this in my original post. It's because people learn more by searching and thinking about answers than by having them spoonfed to them.
It's not hard to find the information you found, it's just that now people are going to read your solution before writing their answers. You've stolen a whole lot of people's learning oppurtunities.
Jeremy
GnuVince
07-01-2002, 05:29 PM
My solution in O'Caml:
let print_string_list l =
List.iter (fun s -> print_string (s ^ " ")) l;
print_char '\n'
let explode s =
let rec doit l size = match size with
-1 -> l
| _ -> doit (s.[size]::l) (size-1)
in
doit [] ((String.length s) - 1)
let get_sig s = List.sort compare (explode s)
let rec insort el lst = match lst with
[] -> [el]
| h::t -> if el > h then h::(insort el t) else el::lst
let _ =
let h = Hashtbl.create 450000 in
try
while true do
let word = String.lowercase (input_line stdin) in
let sign = get_sig word in
try
Hashtbl.replace h (sign) (insort word (Hashtbl.find h sign))
with Not_found ->
Hashtbl.add h (sign) ([word])
done
with End_of_file -> ();
Hashtbl.iter (fun k v -> if (List.length v > 1) then print_string_list v) h
Takes 35 seconds with a dict of 45000 entries. One thing though, the output is not ordered. I'll have to work on that.
Edit: added sorting.
jemfinch
07-02-2002, 01:07 PM
The link to the dictionary seems to work now (I didn't change it, it just seems that FreeBSD's CVS web interface isn't broken now.)
GnuVince: Try running your program on the normalized dictionary I was using, I'm curious how long it takes. Also, what computer are you running your program on? (And if I posted the SML/NJ version of my program, would you be able to tell me how long it took?)
Here's just a stylistic suggestion:
let explode = function
| "" -> []
| s ->
let rec aux acc = function
| 0 -> s.[0] :: acc
| n -> aux (s.[n] :: acc) (pred n)
in
aux [] (String.length s - 1)
I like that better for explode. In SML, it would look like this:
fun explode "" = []
| explode s = let
fun aux (acc, 0) = (String.sub (s, 0)) :: acc
| aux (acc, n) = aux ((String.sub (s, n)) :: acc) (n-1)
in
aux ([], String.size s - 1)
Do add sorting to the program so you can really benchmark with the other available programs. What you might do also is to compare the efficiency of different data structures for your dictionary.
Here's a function to sort two lists of sortable items, written in SML: I'll leave the translation to O'Caml as an exercise for you. It's of type:
sortList : (('a * 'a) -> order) -> ('a list * 'a list) -> order
fun sortList cmp ([], []) = EQUAL
| sortList cmp ([], _) = LESSER
| sortList cmp (_, []) = GREATER
| sortList cmp (hd :: tl, hd' :: tl') =
(case cmp (hd, hd') of
EQUAL => sortList cmp (tl, tl')
| x => x)
Of course, all code is untested. I have nothing on this computer.
Rather than sort each individual anagram list at the end, you might find it easier and/or more efficient to maintain order by inserting the elements in proper order:
fun insort gt (elt, []) = [elt]
| insort (elt, l as (hd :: tl)) = if elt > hd then hd :: (insort (elt, tl)) else elt :: l
Where are the other solutions? Where are the Python coders with thier solutions? Where are the Perl coders with their one-liners? Where are the C coders with their fast-as-lightning-but-you-better-not-have-words-bigger-than-MAXWORDLENGTH solutions? Where are the C++ coders showing us the wonders of the STL? Where are Ruby coders showing us...well, showing us something. Where's Dru Lee Parsec with his Java GUI version? Where are the Lisp and Scheme coders with their parentheses? Where are the SmallTalk programmers doing theirs fully-object oriented? Where are the prolog coders doing theirs with constraints?
I want more versions!
:)
Jeremy
stuka
07-02-2002, 02:39 PM
Jeremy - I'll work on this tonight after my midterm - in C++, with the wonders of the STL! ;)
GnuVince
07-02-2002, 03:57 PM
Jeremy: I have a Athlon 1GHz with 384 megs of RAM running Debian. And I need to fix a few things I think:
./vince < dict2 2980.45s user 14.07s system 99% cpu 50:05.29 total
If you did it in 30 seconds, there's definitly something wrong
GnuVince
07-02-2002, 07:48 PM
Hrmmm... I made another version using tuples (string * ref string list), but it's even slower.
Dru Lee Parsec
07-02-2002, 08:17 PM
I'm busy designing database tables and documenting them at work and building cabinets and guitars at home. I think I'm just too tired to take on another project right now.
GnuVince
07-02-2002, 10:19 PM
ARGH!!!!!!!!!!! How can I make it go faster?!?!??!?!
iDxMan
07-02-2002, 10:26 PM
I'm taking a break from work so I'll poke at this for a second..
I used the following to normalize the dict file
$ sort dict | tr 'A-Z' 'a-z' >dict.new
jemfinch
07-02-2002, 11:10 PM
iDxMan: I don't know what the default behavior of sort is, but you'll want to remove all non-unique entries. Some entries only differ by capitalization (which also means you'll want to do the "tr" command before the sort.)
GnuVince: I don't know exactly what your performance problem is (it doesn't seem to be adding an extra order of complexity anywhere) but I can make a few suggestions:
First, instead of using a try...with block, instead use Hashtable.mem. Sometimes exception blocks mess with tail recursion (which "while" really probably is implemented as)
You'll probably find your algorithms easier to understand (at least complexity-wise) if you re-implement everything you can recursively, e.g., your "while true" loop.
Oh! I think I found it!
You're using the exploded string as your hash key. Hashing lists takes a lot more time than hashing strings. "implode" (another standard SML function) your list of characters back down to a string and use that as your hashtable key.
Let me know how it goes.
Stuka: Write it well -- I'm really curious how C++ with the STL compares codewise and speedwise with these other languages :)
Jeremy
GnuVince
07-03-2002, 12:03 AM
Thank you Jeremy! Here's my solution:
(* Function to print a string list *)
let print_string_list l =
List.iter (fun s -> print_string (s ^ " ")) l;
print_char '\n'
(* Transforms a string into a char list *)
let explode s =
let rec doit l size = match size with
-1 -> l
| _ -> doit (s.[size]::l) (size-1)
in
doit [] ((String.length s) - 1)
let implode l =
let result = String.create (List.length l) in
let rec imp i = function
[] -> result
| c :: l -> result.[i] <- c; imp (i + 1) l
in
imp 0 l
let rec uniq l = match l with
[] -> []
| h::t -> if List.mem h t then uniq t else h::uniq t
let list_values h =
let vals = ref [] in
Hashtbl.iter (fun k v -> if (List.length v > 1) then vals := (uniq v)::!vals) h;
List.sort compare !vals
(* Get an ordered char list *)
let get_sig s = implode (List.sort compare (explode s))
(* Sorted insert *)
let rec insort el lst = match lst with
[] -> [el]
| h::t -> if el > h then h::(insort el t) else el::lst
let _ =
let h = Hashtbl.create 450000 in
(* Read a word, get its signature, and add it to the hash table *)
try
while true do
let word = String.lowercase (input_line stdin) in
let sign = get_sig word in
try
Hashtbl.replace h (sign) (insort word (Hashtbl.find h sign))
with Not_found ->
Hashtbl.add h (sign) ([word])
done
with End_of_file -> ();
(* Print lists of anagrams (2 and more) *)
List.iter (fun l ->
if (List.length l > 1) then print_string_list l) (list_values h)
./vince2 < dict2 3.30s user 0.10s system 90% cpu 3.766 total
ChefNinja
07-03-2002, 12:14 AM
Bah, I'm still learning :)
/me waits for a simpler challenge, so he can actually participate
GnuVince
07-03-2002, 12:22 AM
ChefNinja: you could check "Programming Challenge!" by Dru Lee Parsec. Simple, yet fun!
ChefNinja
07-03-2002, 03:08 AM
I think I already have, I've read through pretty much all of the posts on this board... to an extent. :o
I'll go take another look at it though :D I already responded to the hex dumper one :P
iDxMan
07-03-2002, 01:14 PM
Originally posted by jemfinch
iDxMan: I don't know what the default behavior of sort is, but you'll want to remove all non-unique entries. Some entries only differ by capitalization (which also means you'll want to do the "tr" command before the sort.)
Good point. A better version would be something like :
cat dict|tr 'A-Z' 'a-z' |sort|uniq >dict.out2
I'm almost done with the generator (in perl), but I need to catch up on work so I'll post it later.
-r
jemfinch
07-03-2002, 01:49 PM
Originally posted by iDxMan
cat dict|tr 'A-Z' 'a-z' |sort|uniq >dict.out2
Ah, from the gratuitous-use-of-cat department :)
Just use tr, there's no need for cat:
tr 'A-Z' 'a-z' < dict | ...
I can't test from this computer, but you might not even need the angle bracket -- remember, almost all *nix commands take as an argument the filename to process.
Jeremy
iDxMan
07-03-2002, 02:17 PM
Ah, from the gratuitous-use-of-cat department
Of course, but the kitty needs attention too!
stuka
07-05-2002, 06:11 PM
Hmm...here's my first effort. The speed isn't bad, and the words are coming out just once (I think, it's a long file to test, and I'm @ work on a Win32 box w/o grep...), but I haven't got it sorted properly. It's sorting by the 'key' value...it'll give me a performance hit to sort it out, but I guess that's the joy of requirements, eh? Anyway, here it is - anagram.h#include <fstream>
#include <map>
#include <iostream>
#include <algorithm>
#include <string>
class AnagramFinder{
public:
AnagramFinder(std::string filename);
void printAnagramList();
private:
bool loaded;
std::multimap<std::string, std::string> anagramHash;
protected:
static void printAnagram(std::pair<std::string, std::string>);
};
and anagram.cpp#include "anagram.h"
AnagramFinder::AnagramFinder(std::string filename){
std::string input, temp;
std::ifstream infile(filename.c_str());
if (infile){
while (infile){
std::getline(infile, input);
if (input.empty()) continue;
temp = input; //Needed to set the correct string
//size for following algorithms
//First normalize the input
std::transform(input.begin(), input.end(), temp.begin(), tolower);
//Copy it back
input = temp;
//Now create the 'handle' in temp
std::sort(temp.begin(), temp.end());
anagramHash.insert(std::pair<std::string, std::string>(temp, input));
}
loaded = true;
}
else loaded = false;
}
void AnagramFinder::printAnagramList(){
std::for_each(anagramHash.begin(), anagramHash.end(), printAnagram);
std::cout << std::endl;
}
void AnagramFinder::printAnagram(std::pair<std::string, std::string> anagram_key){
static std::string curKey = "";
static int count = 0;
if (anagram_key.first != curKey){
if (0 == count && curKey != "") std::cout << "none";
curKey = anagram_key.first;
count = 0;
std::cout << std::endl << anagram_key.second << ": ";
}
else{
std::cout << anagram_key.second << "\t";
count++;
}
}
int main(int argc, char* argv[]){
if (argc < 2){
std::cout << "Usage: anagram <dictionary_file>." << std::endl;
}
AnagramFinder* af = new AnagramFinder(argv[1]);
af->printAnagramList();
return 0;
}
<edit> Disabled smilies...std::pair, heh</edit>
<edit2>Removed extraneous output</edit>
jemfinch
07-23-2002, 09:57 PM
I'll be posting a Python version here soon, since I'll probably write an anagram module for my soon-to-be-released IRC bot.
Jeremy
vBulletin® v3.7.0, Copyright ©2000-2009, Jelsoft Enterprises Ltd.