Message Hash: 1893fdecd74ee028cbbe8b544451c662de039c491002b7b54d959b905562abcb
Message ID: <9305241859.AA11273@toad.com>
Reply To: N/A
UTC Datetime: 1993-05-24 18:59:53 UTC
Raw Date: Mon, 24 May 93 11:59:53 PDT
From: firstname.lastname@example.org Date: Mon, 24 May 93 11:59:53 PDT To: email@example.com Subject: Steganography and Steganalysis Message-ID: <9305241859.AA11273@toad.com> MIME-Version: 1.0 Content-Type: text/plain Summary: Steganography is essential for private communication since well-encrypted messages stand out too easily and no "solidarity" of sophisticated cryptography users is likely to make such messages less obvious any time soon. By "steganography" I mean inserting a hidden message into ordinary text in such a way that even if the algorithm for inserting the hidden message is public, only the intended receiver can read the hidden message or even show that a hidden message exists. I list several types of measures of "normal" English text that may be useful for steganalysis and then I present calculations suggesting that English has a steganographic capacity of about 10 percent. Note: This is my "newbie" post to cypherpunks. It asks many questions because there is a lot that I do not know, but I hope it also has several thought-provoking ideas. I am mostly trying to elicit feedback from those who are more knowledgeable about cryptology-related matters by providing them with some problems that are both useful and mentally stimulating. Failed PGP Social Program In his introduction to PGP, Phil Zimmerman compares plaintext messages to mail sent on postcards and encrypted messages to mail in sealed envelopes. Currently, using envelopes does not arouse suspicion because almost everyone uses envelopes, but using encryption does arouse suspicion because almost nobody uses encryption. Zimmerman's proposed solution is for almost everyone to use encryption routinely, so that encrypted messages will be the norm. I do not believe that this will succeed, at least not in the way Zimmerman hopes. Even though PGP is highly regarded, free, and fairly readily accessible, no "solidarity" of PGP users will arise unless email with PGP encryption becomes transparently convenient to use and also does not invite civil lawsuits or criminal charges. (An RSAREF version of PGP would help, though.) The kinds of encryption that *will* become readily available, easy-to-use, and legally hassle-free will be the crippled kinds of encryption. Encryption that is not crippled always will be suspect, perhaps illegal. By using sufficiently intelligent steganographic techniques, however, we will not need any "solidarity" from other people at all. If our "envelopes" look like "postcards," they will not arouse the stormtroopers. Steganography and Steganalysis A few people have experimented with inserting messages into image files. But most of our email traffic is text, so I am most interested in steganographic techniques for normal English prose. Furthermore, we need to have a reasonably high efficiency for inserting the hidden message while not contorting the text too far from normal. Peter Wayner's Mimic functions for producing a baseball game commentary are notable. (No, I still haven't done the C conversion of the Think Pascal version I received almost two years ago. But I haven't forgotten!) I am not certain how efficiently his program encodes the hidden message, but I do want the resulting text to be less conspicuous. Imagine thousands of messages per day consisting of similar sounding commentary on the Whappers and the Blogs! That's too obvious. Gus Simmons [CRYPTO83] has described subliminal messages, which certainly are suitably innocuous, but unfortunately far too low bandwidth. A good steganographic system should insert encrypted messages into English text so unobtrusively that nobody but the intended receiver can show that a hidden message exists, even if the algorithm for the steganographic system is made public. (Perhaps I should call this "stealthography"?) The examples of steganography described in [KAHN] all fail this test. Similarly, so do silly kinds of "steganography" such as the following "SECRET": So how have you been doing? Everything is fine here. Can we visit soon? Remember when we went white-water rafting? Everyone got soaked! That would be fun to do again! This is silly not only because the hidden message is not encrypted but also because anyone who knows the insertion algorithm can readily discover that a hidden message does indeed exist. To create a good cryptographic system, one must first do cryptanalysis. Similarly, I suggest that to create a good steganographic system, one should first do steganalysis. For that reason, the next section of this message focuses on potential tools for steganalysis. Perhaps people more knowledgeable about steganalysis will tell how best to make use of these, and other, tools for steganalysis. Disclaimer: I admit that my knowledge of steganalysis is limited. Perhaps at this point I should just ask what I should read to learn more about this, but I suspect that the public literature is sparse and scattered. For example, we have the words "encryption" and "decryption", but what do we call the corresponding words for steganography: steganization and desteganization? If we don't even have good terminology for the process, I suspect that we do not have much well-organized literature on it, either. What follows is my best guess concerning steganographic issues. The first goal of steganalysis is to determine that a hidden message is likely. The second goal is extracting that hidden message and the third goal is decrypting that hidden message. To be able to infer that a hidden message is likely, we need measures that distinguish normal from unusual English text. Measures of Normal English Text What is normal English text? In general, this is unsolvable, and not even well-defined. It depends on the context, author, subject, etc. Nevertheless, I can think of several kinds measures that are likely to be useful and I hope that other people can suggest more. (1) letter frequency Letter frequency is just the first order Markov model for English. Shannon showed how 2nd order, 3rd order, etc. Markov models enable increasingly English-like output from a memoryless source. How much deviation from these standard frequencies is normal? What other kinds of letter frequency-related statistics might be useful? For example, if you measure the number of characters between each occurrence of a particular character, what type of distribution of intervals should you get? (An exponential distribution? A Poisson distribution? An Erlang distribution?) (2) word frequency Shannon also constructed 1st order, 2nd order, etc. Markov approximations to English using words rather than characters as the elements. How much variation should we expect from these approximations in ordinary English? Zipf's Law [WELSH, p. 97] states that the word frequency for a language obeys the formula: p(n) = A / n where A is a constant chosen so that: SUM p(n) = 1 n For example, in English, the most frequently used words are, in order, "the", "of", "and", and "to". According to Zipf's Law, the word "the" should be used about twice about as often as the word "of" and about four times as often as the word "to". Mandelbrot suggested a more complex formula: p(n) = A / (n + V)^(1/D) where V and D are independent parameters. I suppose that the intelligence agencies have even more sophisticated models. (3) compressibility According to [WELSH, p. 96], Shannon's experiments measured the entropy of English (over a 26 letter alphabet plus a space) as only 0.6 to 1.3 bits per character. Since normal English text has both upper and lower case, digits, and other characters, perhaps a better value for normal English is about 2.5 bits per character. (If so, then shouldn't compression programs be able to achieve about a factor of 8 / 2.5 > 3 compression?) Is "dense" writing less compressible than "fluff"? Apparently so, since measurements of the redundancy of various English texts [WELSH, p. 100] show significant differences. Since well-encrypted messages are incompressible, will a message that hides an encrypted message be less compressible than normal English text? (4) grammar, style, and readability Grammar checkers can distinguish normal sentences from text such as: "Distinguish normal can grammar checkers text sentences from." that may satisfy other statistics for normal English text. But what is an ordinary distribution of legal grammars of English sentences? Also, how does one allow for the different conventions in formal, written English vs. conversational English vs. slang vs. email/USENET netspeak vs. special sublanguages such as computer languages or mathematics? Bear in mind that netspeak has several distinguishing features. For example, email addresses of the form firstname.lastname@example.org, quoted text with a ">" in column 1, and smilies are typical net conventions. Mail headers and signatures (especially PGP signatures) have a special structure, too. Can a grammar checker help to distinguish normal text from text that may have a hidden message? What useful clues may style and (Kincaid, Coleman-Liau, Flesch, etc.) readability scores give? An interesting experiment would be to compare automated readability scores with the compressibility of the text. (5) semantic continuity and logic Do the sentences in a paragraph relate somehow to each other, or are they separate, independent constructions? How can that be measured automatically? (6) message context Does the content of the message look normal in its context? (For example, a baseball play-by-play would look out-of-place in sci.med.) How can that be measured automatically? (7) obvious Some people are known suspects, no matter how innocuous-looking their messages are. All their messages are suspect. (8) other measures What other measures might be useful for detecting the likely presence of a hidden message? The distribution of number of words in a sentence? The distribution of number of sentences (or words) in a paragraph? What programs and/or databases are readily available for making these measures? Steganographic Capacity of English Text If the public English text is N characters long, how long can a perfectly hidden message within that public text be? I think that it can be about N/10 characters long, for a steganographic capacity of 10%. I will show two ways to hide information in the public text: (1) the grammatical structure of the sentence and (2) the word choice in the sentence. (These are not the only methods, but they may be the two best methods.) Do you recall back in school when you "diagrammed" sentences in your English class? That was actually imposing a parenthesization on the sentence. For example, the sentence: The tall boy ate the big pie. becomes: (The (tall boy)) (ate (the (big pie))) The number of possible parenthesizations of a sentence of N words is related to the number of ways to match N pairs of parentheses. The number of matchings is the Nth Catalan number: C(2N, N) N-2 X(N) = -------- >= 2 [AHU, p. 73] N + 1 where C(2N, N) is the number of combinations of 2N objects, taken N at a time, which is (2N)!/(N!^2). The number of parenthesizations is the N-1st Catalan number. If all parenthesizations were equally likely, then the parenthesization of a sentence of N words would give greater than (N-1)-2 = N-3 bits of information for 1 - 3/N bits per word. (Of course, not all parenthesizations are equally likely. But X(N) is also much larger than 2^(N-2), so for now I'll assume that those two roughly cancel out.) Since the average word length in English is about 4 characters [WELSH, p. 101], or 5 characters counting a separating space, and each ASCII character has 8 bits, we get a steganographic efficiency of (1 - 3/N) / 40. (Notice that I am ignoring punctuation in my count of characters in English text. Since this count is just a rough approximation anyway, the effect of punctuation should get lost in the noise.) Another way to hide information in the public text is with the choice of words. Since English has a large vocabulary, I think that almost always we can get one bit of information per word, just from the word choice alone. (Unusual words should not be used often, though, since normal English text does not use them often.) For example, we might XOR all the bits of all the characters of the word and use its parity. Can we get two bits per word? Probably most of the time. Suppose that we try to get two bits per word from our word choice but succeed only with probability p. The channel capacity of a BSC is: 1 + p log p + (1-p) log (1-p) which is: 1 - H(p) By Shannon's noiseless coding theorem, we should be able to achieve an error correcting coding that approaches this capacity. (Use of that encoding unfortunately may alter the statistics of the hidden message sufficiently to expose the use of steganography, however.) For what values of p will it be worthwhile to insert an uncertain two bits per word rather than a (nearly) certain one bit? Since H(0.11) = 0.5 (approximately), p had better be .89 or higher. If p is .95, then H(p) = .29 (approximately), giving 1.4 bits / word rather than just 1 bit / word. I doubt that we can get better than 1.4 bits per word with this method and still have normal looking English, though, because of Zipf's Law. The normal frequencies of the four words "the", "of", "and", and "to" are high, totalling at least 10%, so the public text has to include many of them, whether we want their particular parity bit patterns or not. We can improve the efficiency by attempting two bits of information only for the long words and attempting only one bit for the short words. Maybe we should attempt to achieve |K/5| bits for words of K characters, where "|x|" means "x rounded down to the next integer". Or maybe we should not try to hide any bits at all in the extremely short words. I don't have enough information about typical English to analyze that. What is the total steganographic efficiency we achieve by exploiting both the grammatical structure and the word choice? My estimates total: ( (1 - 3/N) + 1.4 ) / 40 = 0.06 - 3/(40N) Just to get a number, let's assume that N = 10 words per sentence. That gives us 0.0525, which I'll round down to 0.05. That actually gives us much better than 5%, though, because the hidden message is first compressed and then encrypted. If compression halves the length of the hidden message, we get effectively a 10% efficiency for the Steganographic capacity of English. This estimate will decrease by whatever amount typical English parenthesization departs from uniform over all possibilities but it will increase by improved exploitation of word choice and, especially, by improved compression. Of course, the effectiveness of this camouflage depends on the sophistication of one's model of English text. Perhaps normal English has enough variation that a good, but not perfect, model of English will yield public text that is indistinguishable from normal text, even to the more resourceful eavesdroppers. Kevin Q. Brown INTERNET email@example.com or firstname.lastname@example.org AHU - The Design and Analysis of Computer Algorithms, Aho, Hopcroft, and Ullman, Addison-Wesley, 1974. KAHN - The Codebreakers, David Kahn, Macmillan, 1967. WELSH - Codes and Cryptography, Dominic Welsh, Claredon Press, 1988.
Return to May 1993
Return to ““Perry E. Metzger” <email@example.com>”