1993-08-27 - REMAIL: Attacks on remailers

Header Data

From: hughes@ah.com (Eric Hughes)
To: cypherpunks@toad.com
Message Hash: 7efa6aa6db3e901b579d14ce0ca029b750f10a671a6350549c5241dd2b009b02
Message ID: <9308271714.AA11676@ah.com>
Reply To: <9308271052.AA07922@achilles.ctd.anl.gov>
UTC Datetime: 1993-08-27 17:22:51 UTC
Raw Date: Fri, 27 Aug 93 10:22:51 PDT

Raw message

From: hughes@ah.com (Eric Hughes)
Date: Fri, 27 Aug 93 10:22:51 PDT
To: cypherpunks@toad.com
Subject: REMAIL: Attacks on remailers
In-Reply-To: <9308271052.AA07922@achilles.ctd.anl.gov>
Message-ID: <9308271714.AA11676@ah.com>
MIME-Version: 1.0
Content-Type: text/plain


Attack (7) is made by an opponent who monitors all network traffic,
but has no access to the insides of the remailer nodes.

The defense, however, is more subtle than the one proposed.

>(7): Look at all messages coming out of the first remailer, and
>follow them into their 2nd remailers; take all messages from those and
>follow them on, and so on.  This will eventually lead to a number of
>destinations, one of which must have been the destination of the original
>message.  Over a period of time, look for correlations between destinations
>and sources.

Let us assume that these remailers have the basic characteristics of
mixes: encryption rewriting, size quantization, and message
reordering.  Furthermore, let us assume that the defense of using
'large' chains of 'popular' remailers is being used.
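
To fix ideas, here is a minimal sketch of a node with those three
properties.  (The sketch is mine, in Python; the names, the quantum,
and the threshold are illustrative, and the rewrite step merely
stands in for the public-key decryption a real mix performs.)

    import os, random

    QUANTUM = 1024     # fixed output size (size quantization)
    THRESHOLD = 10     # gather-and-rearrange batch size

    pool = []

    def quantize(msg):
        # Pad every message to the quantum so size carries no signal.
        assert len(msg) <= QUANTUM
        return msg + os.urandom(QUANTUM - len(msg))

    def rewrite(msg):
        # Encryption rewriting: a real mix strips its layer of
        # public-key encryption here, so that bytes in cannot be
        # matched to bytes out.  This placeholder only re-pads.
        return quantize(msg)

    def accept(msg, forward):
        # Message reordering: hold messages until the threshold
        # is reached, then flush the batch in a random order.
        pool.append(rewrite(msg))
        if len(pool) >= THRESHOLD:
            random.shuffle(pool)
            for m in pool:
                forward(m)
            del pool[:]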

>With enough
>mixing at each stage of the chain, the number of possible destinations
>will become astronomically large, 

The number of possible destinations should increase exponentially
with each hop.  If gather-and-rearrange mixing is done, then the
number is the product of the rearrangement thresholds of the
remailers in the chain.  If a radioactive decay model for reordering
is used, then it is the expected value of the number of destinations
which grows exponentially; that is, the number of possible
destinations (those with non-zero probability) grows faster than the
expected value of the number of destinations.  Both are exponential,
but one has a larger base than the other.
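
As a back-of-envelope check (a sketch; the hop count, thresholds,
and decay figure are chosen only for illustration), the candidate
set under gather-and-rearrange mixing is just the product of the
thresholds along the chain, while under a decay model the
expectations of the per-hop mixing factors multiply:

    from math import prod

    def candidates(thresholds):
        # Each hop multiplies the set of possible destinations
        # by its rearrangement threshold.
        return prod(thresholds)

    print(candidates([10] * 7))        # 10**7: ten million

    def expected_candidates(mean_mix, hops):
        # Decay model: the per-hop mixing factor is a random
        # variable M; for independent hops E[product] = E[M]**hops.
        # Still exponential, but with a smaller base than the
        # count of non-zero-probability destinations.
        return mean_mix ** hops

    print(expected_candidates(8.0, 7))   # ~2.1 million expected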

What is more important than the reordering algorithm is that the
expected number of destinations grows exponentially with the number
of hops.  There will be correlations, but we hope that with a linear
increase in cost we can get rid of them.

>making correlations statistically impossible

What is the nature of the remailer path, however, for which we have an
assurance that the correlations are too difficult to carry out?  Or to
ask a simpler question for a simpler environment where we assume all
remailers are equal, just how long does the path have to be?

We know that by making the paths "long enough" we can prevent
correlations from becoming significant.  The question is: how do we
find out what is long enough?

>such an attack (PROLONGED monitoring of all
>remailers) would be very difficult to perform, esp. with use of
>remailer-remailer socket connections.

The fact that it would be difficult is an issue for the practice,
not for the theory.  The extremely high cost, however, could be
justified for 'national security' reasons against a few targets, or
to break the system completely open looking for 'tax evaders.'

If our theory is good against an arbitrarily strong opponent, then the
system can withstand sustained attack.  If the existence of the system
is seen as sufficiently threatening, for any number of different
threats, we should plan for a sustained attack.  We need to know what
the limits of the capability are and not just guess.

I've been thinking about an invariant for communications systems proof
against traffic analysis which I call 'privacy diffusion'.  The
privacy diffusion is a probability distribution over possible
recipients.  One characteristic of the privacy diffusion is the
expected value of the number of different recipients.  This is a good
first measure, but I suspect it won't be enough.

The expected number of recipients is multiplicative in the diffusion
per node, as described above.  If different downstream nodes have
different mixing thresholds, they'll need to be weighted.  Since the
system is multiplicative, the weighting should be by geometric mean,
i.e. a downstream node with probability 1/10 should multiply by the
10th root of its own threshold.  One can see that if all the
downstream nodes have equal likelihood and identical thresholds,
this formula degenerates into the simple one above.
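
A small numerical check of that weighting rule (the function and the
figures are mine, for illustration only):

    from math import prod

    def node_multiplier(branches):
        # branches: (probability, threshold) pairs for the nodes
        # reachable downstream of this hop.  Each threshold is
        # raised to the probability of taking that branch, i.e.
        # a weighted geometric mean.
        return prod(t ** p for p, t in branches)

    # Equal likelihood, identical thresholds: degenerates to the
    # plain threshold, as claimed.
    print(node_multiplier([(0.25, 10)] * 4))        # 10.0

    # A branch taken with probability 1/10 contributes the 10th
    # root of its threshold.
    print(node_multiplier([(0.1, 10), (0.9, 20)]))  # ~18.7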

In fact, there is a simple closed form expression for this value,
namely e^(-E[ln p]), the exponential of the negated expected log
probability.  This is exactly e^H, where H is the entropy of the
probability distribution.  e^H is also the expected size of the search
space, were we looking for encryption keys.
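
Numerically (a sketch; the skewed distribution is invented to make
the point), e^H falls straight out of the recipient distribution,
and for a uniform chain it reproduces the product of thresholds:

    from math import exp, log

    def effective_recipients(probs):
        # e^H = e^(-E[ln p]), the entropy-based effective size
        # of the recipient set.
        return exp(-sum(p * log(p) for p in probs if p > 0))

    # Three hops of threshold 10, all paths equally likely:
    print(effective_recipients([0.001] * 1000))    # ~1000.0

    # A skewed distribution shrinks the effective search space
    # even though 1000 recipients remain possible:
    skewed = [0.5] + [0.5 / 999] * 999
    print(effective_recipients(skewed))            # ~63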

On the other hand, this situation is unlike a key search space in
that not every value is equally likely and the priors are not
independent.  In a phrase, not everybody talks to everyone else, but
everyone who talks talks to someone else.  We can make a baseline
model of a communications graph with probabilities on it.  (This
doesn't take into account state, e.g. that the messages of a
conversation tend to alternate.)  Most edges on this graph will have
p=0, i.e. these two
people have never communicated.  Let us remove these null edges.  What
we are left with is a sparse graph with lots of clustering (friends of
friends).  In this situation, if our message could have gone to ten
million people (say, 7 hops each with threshold 10), it is more likely
that it went to one of twenty or fifty.
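
A toy calculation of that collapse (the numbers are invented; the
twenty to fifty correspondents are taken as thirty here):

    from math import exp, log

    def effective_recipients(probs):    # e^H, as sketched above
        return exp(-sum(p * log(p) for p in probs if p > 0))

    # Null edges (p = 0) contribute nothing to the entropy, so
    # only the thirty live edges matter, however many of the ten
    # million chain candidates there are.
    prior = [1.0 / 30] * 30          # the live edges; all else is 0
    print(effective_recipients(prior))   # 30.0, not 10**7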

Even if you don't know what the graph looks like, you'll know that it
is sparse, and you'll have some idea of what the characteristic
distributions are.  This is exactly the equivalent of studying letter,
digram, and higher order statistics for English and other natural
languages.  The statistics gathered as to the prior distribution will
appear in the observed output unless one has some good idea of how to
'confuse and diffuse' them.

I am pushing an analogy here between cracking codes and cracking
traffic patterns.  I am pretty sure that there are more parallels than
meet the eye.  The appearance of the entropy in the expected number of
recipients may be only the tip of a much larger correlation.

traffic			cipher
=======			======
statistics of		letter frequencies
interconnection		of the plaintext

observed messages	ciphertext

path through		key
remailers

mixing algorithms	encryption

null messages		padding

This whole mix system needs a lot more thought before we'll have an
assurance that it will be secure against sustained attack.

-------------------------------------------------------

On the lighter side, I couldn't resist this next one.

>(8): Correlate messages being sent from person A with messages being
>received a certain time later by person B.  

>Defense: [...] The sender's traffic pattern is
>then constant and no information can be gained from it.

And for the receiver, just subscribe to cypherpunks under several
different aliases.

Eric




