1993-11-08 - Re: The “Nymalizer” and Shannon’s Information Theory

Header Data

From: cfrye@ciis.mitre.org (Curtis D. Frye)
To: cypherpunks@toad.com
Message Hash: afca44284cd7371a82fa4e66a810e48ba1152b4694e3b8a3ff4055c3a29e5651
Message ID: <9311082202.AA01748@ciis.mitre.org>
Reply To: N/A
UTC Datetime: 1993-11-08 21:58:24 UTC
Raw Date: Mon, 8 Nov 93 13:58:24 PST

Raw message

From: cfrye@ciis.mitre.org (Curtis D. Frye)
Date: Mon, 8 Nov 93 13:58:24 PST
To: cypherpunks@toad.com
Subject: Re: The "Nymalizer" and Shannon's Information Theory
Message-ID: <9311082202.AA01748@ciis.mitre.org>
MIME-Version: 1.0
Content-Type: text/plain


Extending off Tim's comments:

>I want to briefly mention another way of looking at this issue, and
>will use Curtis' comments to start:
>
>> For the past few years I've looked at this issue (author identification
>> through text content analysis) a bit from a psycholinguistic point of view.
>...
>> A 1983 paper (which I also do not have the cite handy for) by Dr. Murray
>> Miron of Syracuse University gave his equations for analyzing two texts (of
>> roughly similar lengths) and establishing a probability that the two
>> writings were produced by the same individual.  In his paper, Dr. Miron
>...
>> The same idea applies here - for CUSUM or similar analysis to be valid, an
>> analyst needs large volumes of messages where one of the authors is known
>
>One can view this problem in terms of Shannon's theorem about the
>transmission of a message in the presence of noise:
>
>* Signal -- the identity of the poster (true name, pseudonym,
>whatever)
>- characteristic usage of words, of punctuation, and whatnot (see)

Absolutely true, though the question is whether or not this analysis
provides a *unique* style signature for a given individual.  Note how
quickly jargon is passed around and how quickly you, in conversation,
acquire new phrases from your surroundings.

>- even the ideologies expressed (which LD incorrectly used to
>conclude Jamie Dinkelacker and I "must" be the same person)

This is where I begin to disagree.  Suppose one or two individuals wanted
to conduct a debate on a subject and steer the discussion to a
predetermined issue space.  By creating multiple identities (as Peter and
Val did in _Ender's Game_ by Orson Scott Card, *shameless plug:  GREAT
BOOK*), one could achieve this or another goal by varying the ideologies
expressed in the writing and foil this style of analysis.  Also, there is a
question about the reliability of context-sensitive text analysis engines. 
If the engine is knowledge base-driven, it is unreliable outside its realm
of expertise; if it's a neural net-based package, the training collection
is very important.
>
>* Noise -- variations in spelling, usage, etc.

The Net doesn't place a lot of emphasis on spelling and most posts/messages
aren't spell-checked, increasing the noise in the system.

>- many people use similar constructions and whatnot (like this)

Again, the "unique signature" problem.  Perhaps one could reliably determine
regional background (New England, the Deep South of the US, etc.) from this,
though I'm not sure how much further the analysis could be
extended.

>
>Now Shannon's theorem, which can be applied here if some care is taken
>(that is, don't apply it too simplistically or too mechanistically),
>says that no matter how much noise is present, one can extract the
>signal if one samples enough.
>
>(Caveats: for a stationary sequence, etc., whereas one's writings may
>change with time, with the topic at hand, etc.)

I believe these caveats are a significant barrier to establishing
a unique signature.
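
To make the sampling argument and its stationarity caveat concrete, here is
a minimal sketch in Python.  It assumes a toy model that is not from Tim's
post or from any real analysis package: each message yields one noisy
measurement of a single style feature (say, the rate of some favorite
word), and the function name, noise level, and drift rate are all
hypothetical.

```python
import random
import statistics


def estimate_feature(true_value, noise_sd, n_messages,
                     drift_per_message=0.0, seed=0):
    """Average n noisy per-message measurements of one style feature.

    drift_per_message models a non-stationary writer whose habit shifts
    a little with every message (an assumption made for illustration).
    """
    rng = random.Random(seed)
    samples = [
        true_value + drift_per_message * i + rng.gauss(0.0, noise_sd)
        for i in range(n_messages)
    ]
    return statistics.fmean(samples)


if __name__ == "__main__":
    # Compare a steady writer with a slowly drifting one as samples grow.
    for n in (10, 100, 1000, 10000):
        steady = estimate_feature(0.05, 0.02, n)
        drifting = estimate_feature(0.05, 0.02, n, drift_per_message=1e-5)
        print(n, round(steady, 4), round(drifting, 4))
```

For the steady writer the average settles toward the true value as the
number of messages grows.  For the drifting writer, more samples move the
estimate further from the starting value, which is the sense in which the
caveats limit what extra sampling can buy.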

>
>But to Shannon's basic view one must also add _interference_, whether
>deliberate (spoofing) or not. If I try to emulate the style of S.
>Boxx, for example, by writing in the form "I am becoming INCREASINGLY
>DISGUSTED by the blatant disregard for the Cypherpunks CAUSE and ...",
>then this "interference" could greatly complicate the signal
>extraction.
>
>In fact, more obscure correlations would have to be looked at, ones
>which might require many more messages to analyze...possibly more
>message samples than exist.

There is also the question of whether or not current text analysis programs
are capable of making these distinctions/correlations and whether or not
these distinctions can be communicated to the program overseers once they
are established.  

My basic position is that text analysis packages are probably not advanced
enough to reliably analyze more or less extemporaneous utterances
transmitted in the form of email or posts.  However, I second Tim's call
for a text analysis of various messages, beginning with those of single
authors, to check the reliability of the system.  I'm just an amateur and of
no real help technically, but there's bound to be someone in one of the
CS/AI departments around the world who could provide us with a reasonable
text analysis engine.
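
As a rough idea of what the single-author reliability check might look
like, here is a small Python sketch.  It is not Dr. Miron's equations and
not CUSUM; it swaps in a much simpler technique, comparing the relative
frequencies of a few common function words by cosine similarity.  The word
list and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Small illustrative word list (an assumption); a real study would use more.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "but")


def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(text_a, text_b):
    """Similarity of two texts' function-word profiles, from 0.0 to 1.0."""
    return cosine(profile(text_a), profile(text_b))
```

One would split a known author's messages into two halves, score
similarity(first_half, second_half), and compare that with the score
between two different known authors.  If the two numbers are not clearly
separated, the method cannot be trusted on unknown authors either.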


Curtis D. Frye
cfrye@ciis.mitre.org
"If you think I speak for MITRE, I'll tell you how much they
 pay me and make you feel foolish."






