PDA

View Full Version : Cryptanalysis challenge


Strike
05-22-2002, 06:21 PM
This challenge is largely in part a result of inkedmn's challenge (in fact, I used the resulting program I got to get some of the data below). Basically, this challenge is to figure out how to decrypt the attached file. It has been encrypted using a simple one-time substitution cipher. For simplicity's sake, the substitution only occurs on letters and not numbers or punctuation. Also, substituting a lowercase letter produces the same substitution as the uppercase letter, only the output case matches the input. So, "a" might map to "g" which means that "A" maps to "G" as well - reducing the work you have to do by at least half.

For those of you who are unfamiliar with substitution ciphers, here is a brief explanation of what they are. Basically, a pre-determined "table" of substitutions is constructed and then applied to each letter of a message (the plaintext) to produce the encrypted message (the ciphertext). For example, ROT13 is a substitution cipher (it's also a special class of substitution cipher known as a Caesar cipher, which all ROTn ciphers are). Its substitution table looks like:

a->n, b->o, c->p, ..., n->a, o->b, p->c, ... etc.

So, your job is to PROGRAMMATICALLY figure out what the substitution table for this encrypted text is. I specify "programmatically" explicitly because: a) it saves you the time of looking at each of the 26! or so outputs to figure out which one is right, b) it makes the challenge more interesting. And, if you want, since this seems to be the trend in all programming challenges here - time how long it takes to get this table.

Hints:
1. The plaintext is indeed in plain English.
2. 5 of the occurrence percentages of the PLAINTEXT (this practically gives you 5 of the entries) are:
'e' = 8.4%
'p' = 2.4%
'l' = 3.3%
'A' = 0.04% (corrected)
'G' = 0.07% (corrected)
3. I was able to type in the translation table (in the form "<letters to be substituted>", "<letters to be output>") in about ... 10-15 seconds. In other words, it's not an entirely random table.
4. No letter maps to itself.

So, these are the requirements for a program to be accepted:
- must correctly determine (and output, in your style of choice) the correct substitution table (can do just the lowercase letters since doing both upper and lowercase would be redundant)

And, ... that's it!

Style points for:
1. listing the top N substitution tables (like top 3, 5, or 10)
2. producing the decrypted output [for the top N tables, if you wish]
3. fastest time finding the correct table
4. figuring out what is so special about the translation table (in hint 3 above)

Go to it!


---edit----
previewing apparently doesn't carry over the path to the attached file ... fixed now

Bradmont
05-22-2002, 06:25 PM
I smell a DMCA violation... :D

Strike
05-22-2002, 06:27 PM
:D For fun, I might try my old sourceforge project - www.sf.net/projects/subcipher on this, and see how it works :) (warning to all who follow the link: I think there are some menu items in there which don't even work and never got much past the "intended to implement" phase)

GnuVince
05-22-2002, 07:11 PM
I will do this one instead of Ink's: I can't figure out how to sort that hash table :-/

Maybe we shouldn't post answers here, jsut give a link, because the key will surely figure in the code.

Dru Lee Parsec
05-22-2002, 07:12 PM
Dude! Inkedman and I were discussing this very challenge idea via AIM just yesterday. We were going to have his challenge wrap up, and then have another challenge to write a program that does a single substitution encryption, and then one more challenge that decrypts it.

You beat us to it. :) I guess the great ones really do think alike.

Oh by the way, I would forget about how long it takes to decrypt the message. If anyone here really tries to do this challenge then getting it right will be it's own reward.

This is a sweet idea. (but you still beat us to it you bastard!) ;)

Strike
05-22-2002, 07:32 PM
Right, do not post the answer here right away, I would say. The answer can easily be confirmed simply by checking the percentages I gave or by producing the output yourself - once you read the plaintext, you'll now you got it.

Posting the CODE, however, is very encouraged. I haven't even written a solutIon yet, though it's planned. I'd love to see some of the ideas you all have, and possible optimizations.

DLP: sorry to have stolen your thunder :) Glad you approve of the challenge though ;)

GnuVince
05-22-2002, 07:36 PM
ARGH!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! G** D*** F*****' code that still requires the use of F***** hash tables :((((((((( I hate Hash tables in O'Caml, they suck more than Korean prostitues!!!!! They're like a ton of rock: unflexible.

Dru Lee Parsec
05-22-2002, 08:14 PM
DLP: sorry to have stolen your thunder Glad you approve of the challenge though

Naw, it just shows what a cool idea this is. And besides, I mis-spoke myself. I believe the proper usage of the term is "swarthy bastard". ;)

<pirate voice>
Arrrrg, 'Tis true he's a swarthy bastard but I'd think twice before choosing to draw a sword upon such a man. It could turn out to be your last thought on this earth.
</pirate voice>

Strike
05-22-2002, 09:45 PM
Yeah, actually my plan was going to be to have a multi-level cipher of a plaintext in a nonsense language whose letter occurences I would generate. Then the challenge would have been to find the most PROBABLE plaintext based on the letter frequencies. This has the drawback of no "eureka" factor when the cipher is solved. You never really know for sure that you have solved it, just that the resulting text you got would most liely be it. So, I chose the simpler route, deciding that even if we do the other one, this would be a good start.

setuid_w00t
05-23-2002, 03:28 AM
What programming language do people typically use to do these challenges?

I'm going to give it a try in C. I haven't coded any C in a while, so it's going to be pretty tough. Right now I'm working on a way to figure out what the 5 hint letters are. After that....I don't really know what the next step is.

inkedmn
05-23-2002, 04:29 AM
Originally posted by setuid_w00t
What programming language do people typically use to do these challenges?


that's the thing, all sorts of languages are used :)

just pick your favorite :)

good luck!

GnuVince
05-23-2002, 08:22 AM
I use O'Caml.

Strike
05-23-2002, 08:30 AM
I have another hint in store in case some people are having a difficult time.

setuid_w00t: yeah, just use whatever language you want - any language can do this. Also, for the hint, just remember that for every letter occurrence in the ciphertext, that corresponds to a letter occurence in the plaintext and vice versa.

Strike
05-23-2002, 11:57 AM
I just realized that there's no real simple way to check if you got the right table (since you can get tons of correct tables with those 5 frequencies being right), short of using an English dictionary of some sort to see if most/all of the words are in there. As such, I MD5summed the plaintext and provide it as a mechanism for checking the solution:


[ddipaolo@quinn ..rypto-challenge]% md5sum plaintext.txt
edd31d430534018f0e7f1a9f15f304f3 plaintext.txt


and just to check your ciphertext, here's its md5sum as well:

[ddipaolo@quinn ..rypto-challenge]% md5sum ciphertext.txt
fb57bc48807895dbfc800e07edd17f17 ciphertext.txt

Dru Lee Parsec
05-23-2002, 12:41 PM
What programming language do people typically use to do these challenges?

One of the great things about these challenges is that we get to see the same problem solved in lots of different languages Use whatever language you're most familier with. OR, if you want to learn a new language thenuse these challenges as a task to use to learn it.

The main reason for doing this are: Sharing knowledge and having fun. Go for it!. :goodjob:

kmj
05-23-2002, 12:43 PM
re md5sums: good idea. I was thinking it wouldn't be easily to programmatically verify you had actually found the correct translation.

Strike
05-23-2002, 01:05 PM
Originally posted by kmj
re md5sums: good idea. I was thinking it wouldn't be easily to programmatically verify you had actually found the correct translation.
Thanks. I figure md5sums are pretty universal in terms of checking the content of one file against another and that it doesn't really make it much more difficult to do in any language. If there is a better way to do it, please tell me.

I suppose I could rot13 the plaintext and post it for people to diff against, but then it's easy to get the plaintext itself and I want it to be a surprise :)

Strike
05-23-2002, 05:55 PM
oh wow, big oops on two of those frequencies (sorry guys, I'm trying here...)

'A' = 0.04%
'G' = 0.07%

off by an order of magnitude :-/ I apologize

GnuVince
05-24-2002, 12:03 AM
How many possibilites are there? 26^25?

Bradmont
05-24-2002, 12:22 AM
Originally posted by GnuVince
How many possibilites are there? 26^25?
26!

Strike
05-24-2002, 12:35 AM
well, just using the hint that upper and lower map to the same char (and its upper/lower versions matching), then it has 26! combinations - not as big as 26^25, but still plenty big

But, using the hints I gave (no letter maps to itself) eliminates 1/26 of them, bringing it to 25!. However, you can get at least 2 characters right away from the percentages (maybe 3), and you can narrow down the other 2 or 3 to a set of maybe 5 or 6 possibilities each. So we now have

1*1*1*6*6*(21*20*19*...) = 36*21! = 1839273918181539840000 = about 2 billion billion possibilites

as a high estimate.

And if you tack on this new hint:
* It's not a ROTn cipher
you can eliminate a few more (not very many, just 26, so almost none by comparison)

Of course, if you are clever, you can use other statistical studies to figure out a small subset of likely matches for certain letters and construct tables based on that. Heck, you might even use the hint about how fast I could type the table in creating your list of tables, stipulating that no two subsequent letters can be separated by more than 2 keys on the keyboard. Plenty of room for cleverness to try and shrink this down to a manageable level. Personally, I haven't even gotten it yet. I've gotten everything but the table generation function finished, though.

Since this number of combinations is far greater than what I thought it'd be, I might post 5 or 10 more plaintext frequencies. Posting ten more that are distinct (that give away what the letter is), would shrink the solution space to 1 billion solutions approximately, something far more manageable (since decrypting the text on my slow machine using Python took only about 1/10 of a second, though that still comes out to a millenium of tries without using more cleverness to shrink the solution space).

kmj
05-24-2002, 10:11 AM
heh; factorial is a subset of exponential time. :) Woohoo my complexity class was worth something!

edit: even though that statement is pretty obvious, I guess.

Dru Lee Parsec
05-29-2002, 11:47 AM
[Bump]

I got a lot of work done on my GUI decrypter last night. But I still have one major bug to fix before I can really test it.

So far I have a tabbed pane with 4 panes.
The 1st pane has instructions,

The 2nd sets up a hash based on plain text. I use the gpl.txt file that Inkedmn supplied although there is a standard File Chooser dialog box so you can select any file you want.

The 3rd sets up a hash based on the cipertext.txt file (but you can also select any file via a File Chooser again)

The 4th pane matches the plain text Hash to the encrypted Hash and does a decryption of the ciper and displays it. Also, there will be a set of radio buttons, one for each letter of the alphabet. By selecting 2 of these radio buttons you will swap those to letters in the encryption hash.

It's getting there.

Strike
05-29-2002, 12:06 PM
Cool, I've kind of put this on pause, personally. If anyone wants the solution table, I can give it to them - it's entirely uninteresting, since you are supposed to come up with a way to deduce it from the clues anyway.

Dru Lee Parsec
05-29-2002, 05:56 PM
I'm working from the point of view that I'm not doing any deduction at all until I first get as close as I can via pure histogram matching. I want to see how close I can get solely via a software solution.