1996-06-29 - anonymous mailing lists

Header Data

From: JMKELSEY@delphi.com
To: cypherpunks@toad.com
Message Hash: 7cb305a7bfd0bcca0060cbfcf3000f38a7b4c2d88927063cbd9b35a13bdc9a96
Message ID: <01I6GF62YWG291VYF3@delphi.com>
Reply To: N/A
UTC Datetime: 1996-06-29 04:09:15 UTC
Raw Date: Sat, 29 Jun 1996 12:09:15 +0800

Raw message

From: JMKELSEY@delphi.com
Date: Sat, 29 Jun 1996 12:09:15 +0800
To: cypherpunks@toad.com
Subject: anonymous mailing lists
Message-ID: <01I6GF62YWG291VYF3@delphi.com>
MIME-Version: 1.0
Content-Type: text/plain


-----BEGIN PGP SIGNED MESSAGE-----

[ To: Cypherpunks ## 06/28/96 04:34 pm ##
  Subject: anonymous mailing lists ]

>Date: Fri, 14 Jun 1996 02:15:03 +0000 (GMT)
>From: Ecafe Mixmaster Remailer <mixmaster@remail.ecafe.org>
>Subject: Hackerpunks and C2

>The proposal for a Hackerpunks nym based mailing list is
>interesting, however, there are some concerns regarding the
>susceptibility of the list to traffic analysis.

I was thinking about attacks that can be carried out on remailers in
general, and came up with something that is potentially pretty
nasty, especially for anonymous mailing lists and people who post a
lot of stuff anonymously using ``nyms.''

Let's imagine an anonymous remailer network as a ``black box'' which
functions perfectly.  Messages (broken into equal-sized packets and
strongly encrypted) are sent into the network by the sender, and at
some later time, they come out at the receiver.  Let's assume there
is no possible way for an attacker to trace a message through this
network.  Now, we still have to deal with two more issues--mail goes
into the network, and comes out of the network.  At those two
points, there is trafic analysis available.  Specifically, we can
see how much data goes over the line.  (Naturally, if it's
encrypted, we can't tell how much of it is real data and how much is
padding.)

Generally, when we're attacking this system, we're trying to figure
out either the sender or the receiver of a message (or a sequence of
messages), based on what we can observe coming into and going out of
the network of remailers.  There are basically five scenarios:

1.   The sender wants to know who the receiver of his message is.
2.   The receiver wants to know who the sender of his message is.
3.   An outsider wants to identify the sender of a message.
4.   An outsider wants to identify the receiver of a message.
5.   One receiver of a message wants to know who the other receivers
     of the message are.  (This is the case for anonymous lists.)

Now, there are a couple of different ways these attacks can be
carried out.  Usually, I've seen people talk about ``tracing''
attacks, in which a message is traced from one side of the network
to another, without any clear idea of who might be on the other
side.  However, I think a more realistic situation is to imagine the
attacker trying to test the hypothesis that some person is on the
other end of the anonymous transmission.  If relatively few people
regularly send or receiver anonymous e-mail, then this is practical
for many kinds of test.  It's even more practical when we're dealing
with relatively small populations of interested people in some
technical subject.  (This is conceptually similar to the
``dictionary attack'' on passphrases.)  Basically, what we're
looking at, in that case, is some test which (with some reasonably
high probability) determines whether some person is the sender or
receiver of a given message or stream of messages.

This leads to some interesting insights.

1.   In reasonably large text messages, it's probably easy to test
hypotheses about senders.  There are metrics that can more-or-less
identify the writer of a piece of prose.  While it's no doubt
possible to defeat this kind of analysis for some things (i.e.,
blackmail notes or rigidly-defined messages in a cryptographic
protocol), I suspect that this is very hard to defeat for a mailing
list where the objective is to discuss serious technical issues.
(This kind of analysis also causes headaches for people trying to do
strong steganography in text.)

2.   If an attacker (i.e., the NSA) logs the total volume of all
traffic in and out of the remailer network, and to whom each message
came from or went to, then that attacker can probably mount some
very powerful hypothesis-testing attacks.

It's these attacks I want to discuss.

If Alice sends a message to Bob through the remailer network, two
things must happen to prevent it from being trivially traceable.

1.   The message has to change size.  If the message is already
encrypted, then compression isn't much of an option--so what's left
is padding it out by a random amount.  The amount of padding per
message is probably a uniformly distributed random variable.

2.   The message has to be delayed somewhat.  The delay is probably
also a uniformly distributed random variable, or possibly the result
of adding N such variables, where N is the number of chained
remailers.

For a single mailing, this is probably not much of a threat. There
will be enough ``noise'' in the delay and padding that most
transmissions will be masked.  However, consider the situation of a
mailing-list.  Alice and Bob are both recipients of the list. Alice
wants to decide whether Bob is receiving the list.  Let D be a delay
such that, if Alice received her copy at time T, 90% of the other
list members received their copy between T-D and T+D.  Now, Alice
looks at Bob's anonymous e-mail volume during that time span vs. at
all other times.  If he's receiving the same stuff she is, then
there should be an increase within that span of time, on average.

The random distribution of the arrival time will mask individual
transmissions, but with many messages, it probably will not.  (This
is conceptually similar to the situation in Paul Kocher's timing
attacks--adding some random pauses doesn't hurt the attack as much
as most people expect it to, because those random pauses, summed up
over many messages, become a normally distributed random variable.)

The average amount of anonymous e-mail Bob gets per day doesn't have
much effect, nor do occasional worst-cases. The only ways I can see
to prevent this attack are either to ensure that Bob gets a constant
rate of information from the anonymous remailer network, or to make
the arrival time span so large that other randomness in the sample
makes the change in volume undetectable.  In general, I don't think
this second one will work without accepting incredible delays on
messages.

This can also be adapted to tracing back anonymous posters to
newsgroups and mailing lists, when they use a consistent nym.
(They could also be traced by textual analysis.)  In this case, the
attacker starts by posting some anonymous messages (not using a
nym--he doesn't need one), to get some statistics on what the
average delay is, and also what the average amount of padding is. He
may do this for several different ways of putting things
together--he's got almost unlimited time to gather acceptable data.

At this point, he observes in/out traffic logs for each hypothesized
sender during a wide timespan before the arrival of the post at its
destination.  He compares activity inside that span with activity
outside, over a large number of posts.  If there is a correlation,
then he's got the e-mail address of the nym.

There are ways to get around this second attack, at least to some
extent.  However, I don't think it's wise to count on even very good
remailer networks (i.e., the Mixmaster stuff) to protect your
anonymity in this situation.  (However, note that I'm thinking in
terms of a very well funded, determined adversary.  It's probably
not too bad to count on it to protect your anonymity from
technically unsophisticated attackers--but I wouldn't recommend
using it for things that (say) the FBI or NSA might get very
interested in.)

I think the best defense against this will be something like this:
Each user sets a quota of how much trafic he will take in and send
out per day.  Once per day, he goes through an interaction in which
he downloads and uploads that much stuff, whether there's any of it
for him or not.  (Naturally, this won't be detectable from looking
at the transmission, timing the interaction, etc.)  This makes any
volume variations per day disappear.  Unfortunately, it also limits
the user's total inflow and outflow, which means he'll have to set
it to something larger than the maximum he ever expects to get.  (It
would be possible to have occasional overflow onto the next day's
downloads, but not too often, or the user would fall further behind,
on average, each day.)  The size of these quotas will still leak
some information, though not enough for the kinds of attacks I
discussed above.

Note:  Please respond via e-mail as well as or instead of posting,
as I get CP-LITE instead of the whole list.

   --John Kelsey, jmkelsey@delphi.com / kelsey@counterpane.com
 PGP 2.6 fingerprint = 4FE2 F421 100F BB0A 03D1 FE06 A435 7E36

-----BEGIN PGP SIGNATURE-----
Version: 2.6.2

iQCVAwUBMdRWR0Hx57Ag8goBAQHCowQA71WBKkx1yonS0dEpy3pe7lgvSJPkpLUk
zLjm0KeFoP+HGQBep48iILRYBlbGy5czcxNCU4zhE6+c4PWwvD+BpaGGccWWkyRi
0l/rdo5L5/1KgnpCAQJ/HNyRH0fO2NNOHvGB3m7I0H3lfmfOlNed8oIIjPFDVB23
60wpMZ9S93w=
=HC1g
-----END PGP SIGNATURE-----





Thread