1993-10-14 - Re: Generating random numbers from english text

Header Data

From: nobody@alumni.cco.caltech.edu
To: cypherpunks@toad.com
Message Hash: f975c476221f61058ac3776bc687ed63981a2dffa8bca55597d33ceaa7945d4a
Message ID: <9310141748.AA11655@alumni.cco.caltech.edu>
Reply To: N/A
UTC Datetime: 1993-10-14 17:52:16 UTC
Raw Date: Thu, 14 Oct 93 10:52:16 PDT

Raw message

From: nobody@alumni.cco.caltech.edu
Date: Thu, 14 Oct 93 10:52:16 PDT
To: cypherpunks@toad.com
Subject: Re: Generating random numbers from english text
Message-ID: <9310141748.AA11655@alumni.cco.caltech.edu>
MIME-Version: 1.0
Content-Type: text/plain

On Thu, 14 Oct 93 09:29:16 PDT., baldwin@LAT.COM (Bob Baldwin) writes:

<  	With the current random number discussion going on, I thought
<  I would point out one convenient way to generate random numbers in
<  the situation where you do not need to generate them frequently.  I believe
<  that Claude Shannon (Dr. Information Theory) proposed this:
<  1. Ask a person to speak about any topic for a paragraph or two.
<     Instruct them to generate original sentences, not just repeats of
<     some passage of a written work.  Write down what they say.
<  2. The english language has an entropy of 1.0 to 1.5 bits per letter,
<     so it is safe to extract one bit per character.  If you read Shannon's
<     work and its modern interpretation, you will understand that this
<     bit per character is truely random.  It is uncorrelated with all
<     the other bits.  The reason for avoiding a pre-existing written
<     work (poem, article, story, etc), is to avoid a brute force search
<     of the source language space.

   can you provide a citation?  i recently read a paper Shannon wrote [1] on
   a study of randomness in english text, and this is definitely not the 
   impression i got from it.

   in this study Shannon demonstrates that english-speaking humans are very
   capable of predicting the next letter of an unknown text given the letters
   up to that point (he also shows this for the reverse direction; that is,
   starting at the end of the text and working forward, subjects were almost
   as capable of predicting correctly as they were starting from the front of
   the text).  this seems to imply that there is some non-random relationship
   (statistical or otherwise) among the letters in a text.  might this
   redundancy be carried over into any encoding (MD5 hash, e.g.) of the text?

   theoretically, at least, this may compare to Vigne're ciphers; the key
   is based on some text other than the plaintext.

<  	The modern version of this algorithm, is to compute the
<  MD5 digest of a long text string.  For a 128 bit digest, you need
<  at least 128 characters of source language.  For a larger random number,
<  you can concatenate multiple MD5 values from multiple pieces of source text.

   could someone here describe how this compares to other pseudo-random sources?
   how well would this function as a seed to some other pseudo-random number
   generation process?

[1]  Shannon, Claude E., "Prediction and Entropy of Printed English," Bell
     System Technical Journal, vol. XXX, No. 1, Jan. 1951, pp. 50-64.