1994-01-25 - Re: Randomness of a bit string

Header Data

From: tcmay@netcom.com (Timothy C. May)
To: cypherpunks@toad.com
Message Hash: 5dc0e66ea548fe74f9320e2cf8c293b02576c0bcbff17ef2538f2a0fb69f94ec
Message ID: <199401250859.AAA20830@mail.netcom.com>
Reply To: <199401241857.KAA06412@mail.netcom.com>
UTC Datetime: 1994-01-25 09:06:44 UTC
Raw Date: Tue, 25 Jan 94 01:06:44 PST

Raw message

From: tcmay@netcom.com (Timothy C. May)
Date: Tue, 25 Jan 94 01:06:44 PST
To: cypherpunks@toad.com
Subject: Re: Randomness of a bit string
In-Reply-To: <199401241857.KAA06412@mail.netcom.com>
Message-ID: <199401250859.AAA20830@mail.netcom.com>
MIME-Version: 1.0
Content-Type: text/plain


There seems to be a misinterpretation of the point I was making about
randomness and how no number (or sequence) can be _proved_ to be random.

> This has some fascinating tie-ins to "cryptoregular" strings, which
> are strings which appear to be "regular" (a variant of randomness,
> meaning all digits are equally represented...high entropy) but which,
> with the right transformation, suddenly lose their regularity. 
...
> Basic definition: A random string has no shorter description than
> itself. That is, it is incompressible.
...

> A fascinating discovery by Chaitin and others (Kolmogorov, Solomnoff,
> Martin-Lof, Levin all worked in this area) is that one can never prove
> a given sequence or string is "random." As in some diabolically clever
> IQ test, an apparently random sequence may have some shorter
> description, or compression, that means it does not fit this
> definition of randomness.

The point here is for a number or sequence which is _given_, just
presented, as in:

"Is the sequence 100010001010110010101 random?"

Or,

"Is the number 9045886804 random?"

Variants of this question come up all the time, as in predicting the
next term of a sequence, trying to determine if a sequence of
characters is likely to be just noise or is instead likely to be a
message, and in issues of whether data is maximally compressed or can
be compressed still further.

These numbers are "found objects" in the sense that one generally has
no idea what "model" or "theory" generated them.

Someone looking at the first example, 100010001010110010101, might
subject it to all kinds of tests:

-visual inspection to see if it's some "obvious" number (such as
"1010101010101010" would be, or "01011101110111" might be)

-statistical tests, to see if it deviates "significantly" from the
expected pattern of random numbers (regular distribution of digits, of
pairs, triples, quadruples, etc.). The usual arsenal of entropy
measurements, chi-square tests, null hypothesis testing, etc.

-other tests to see if the number is related to other known numbers,
which could be things like the day of the year, the digits of pi, the
phone number of whoever generated the number, etc.

-other tests and guesses that cryptanalysts and puzzle-solvers are
familiar with

A plausible result for someone to announce, after such a series of
tests, is "I can't find any patterns, and the distribution of digits
falls within expected ranges. We've compared the number against the
suspect's various numbers and can find no linkages. It looks pretty
random to us." (By "random" he essentially means "like the result of a
sequence of coin tosses." Fair coin, of course.)

But can he ever say "I can prove the number is random"? No. There's
always some chance an even-cleverer puzzle solver will find the
pattern, the key that unlocks the randomness. For example, most
ciphertexts pass nearly all statistical tests for randomness, "look"
random, and even _act_ like random numbers (recall the Blum-Blum-Shub
pseudorandom number generator and how good it is). But simple
application of the key turns the seemingly random
"100010001010110010101" into "ATTACK."

Let's look at the second example. Is the number "9045886804" random or
not? And can we _prove_ it's random? (If you're worred that these
numbers are somehow too small, don't worry. The same reasoning applies
to any number or sequence one might encounter, including short numbers
and multi-page numbers or sequences (such as PGP might generate)).

The cryptanalyst or problem-solver looks for the patterns, the
statistical distributions and entropies, and _any other_ links he can
think of. That is, his "models" for the generator of this number are
not known to him, but he may make some guesses based on the owner of
the number, the score in the SuperBowl, the age of Bill Clinton, etc.
That is, he'll look to see if the number is some sort of simple cipher
or transpostion based on one of the "unrandom" numbers around him.

To cut to the chase, can he ever "prove" the number is random? Can he
even claim that the generator of the number "must have" used a process
that is commonly used to generate numbers with a good approximation to
a random process (flippin coins, alpha counts, etc.)?

Suppose he declares to his boss, Admiral Inman, that he has "proved"
the number is "random." Inman says to him: "This post was written by
this trouble-maker Tim May, who even gives his phone number in every
post he writes. What happens if we reverse the digits of his number?
408-688-5409 turns into 9045886804! Some "random" number! Clean out
your desk tonight."

Now is it kosher to take the "theory" of my phone number and allow it
to be included in the analysis of wheter a number is random or not? Of
course it is! In the real world, this is what we mean by randomness
and predictabilty, whether we can find patterns and structure. And
this is what cryptanalysts really do, and what good password-guessing
programs do: they take account owner information such as name,
spouse's name, pet's name, birthdate, and any other information they
can scrounge about an account owner and then run permuations and hope
for the best. Some percentage of the time, the passwords are
"guessed," meaning that they were not very random at all.

(This was the point I was making about famous numbers (like "729"),
paradoxes (there are no "uninteresting" numbers, because the smallest
"uninteresting" number is automatically interesting, and in fact is
has a short description), and the number listed in Chaitin's book. I
hope this explanation here makes it a bit clearer.)

In this real world of trying to break cyphers, all is fair. All models
may be considered, though not all models can be (e.g., one would not
try applying the phone number of Chester Umbizi in Nairobi, Kenya at
random!).

No number can be proved to have no shorter description than itself.
And as various shorter descriptions are found, with whatevr effort it
takes, it cannot be proved that the description is the shortest that
will ever be found. It may be strongly susepected that no shorter
description exists. In fact, most numbers are incompressible, but a
simple counting argument, in any theory. (For example, of the
100-binary-digits, not many of them have 50-digit compressions, and
even fewer have 10-digit descriptions. Work out the numbers.)


So, if someone tells you they've "proved" a particular number is
random, just smile.

--Tim May

--




-- 
..........................................................................
Timothy C. May         | Crypto Anarchy: encryption, digital money,  
tcmay@netcom.com       | anonymous networks, digital pseudonyms, zero
408-688-5409           | knowledge, reputations, information markets, 
W.A.S.T.E.: Aptos, CA  | black markets, collapse of governments.
Higher Power:2**859433 | Public Key: PGP and MailSafe available.









Thread