From: cfrye@ciis.mitre.org (Curtis D. Frye)
To: cypherpunks@toad.com
Message Hash: 1b1f55d33d038f3df5d7313394240947cb158f42416188faae9592832353d60d
Message ID: <9311081958.AA00500@ciis.mitre.org>
Reply To: N/A
UTC Datetime: 1993-11-08 19:53:04 UTC
Raw Date: Mon, 8 Nov 93 11:53:04 PST
Subject: Re: ID of anonymous posters via word analysis?
For the past few years I've looked at this issue (author identification
through text content analysis) a bit from a psycholinguistic point of view.
According to an occasional electronic digest coordinated by a woman from
the UK named Blackwell (I apologize that I don't remember her name or have
her email address handy), a technique that sums the probabilities of
various word occurrences (CUSUM) has come under fire recently and, if I
remember correctly, is not accepted in UK courts.
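
For readers who have not seen a cumulative-sum chart, here is a minimal sketch in Python. It is not the procedure discussed in the digest, which is not reproduced in this message. It assumes the usual meaning of the name, a running sum of deviations from the mean, and it assumes sentence length as the measured feature. The function names are hypothetical.

```python
import re

def sentence_lengths(text):
    """Split on ., ! and ? and return the word count of each non-empty sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def cusum(values):
    """Running sum of each value's deviation from the mean of all values."""
    mean = sum(values) / len(values)
    running = 0.0
    series = []
    for v in values:
        running += v - mean
        series.append(running)
    return series

# Illustrative text only; a real analysis would need far more sentences.
sample = "I went home. It was late and the road was dark. Nobody saw me."
lengths = sentence_lengths(sample)
series = cusum(lengths)
print(lengths)
print(series)
```

By construction the series ends at (approximately) zero, so what an analyst would compare between two texts is the shape of the curve, not its endpoint.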
A 1983 paper (for which I also do not have the citation handy) by Dr. Murray
Miron of Syracuse University gave his equations for analyzing two texts (of
roughly similar lengths) and establishing a probability that the two
writings were produced by the same individual. In his paper, Dr. Miron
related the story of a trial where he was summoned as an expert witness and
was not allowed to testify as to whether an extortion note was authored by
the defendant based on analysis of the note vis-à-vis a known letter from
the defendant. The jury nevertheless found the defendant guilty, based on
an identical misspelling of a word in each message. Dr. Miron noted that
the verdict agreed with the overall findings of the computer analysis, but
that it rested on a single coincident misspelling that could happen (with
relatively high probability) in any two random messages.
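
A rough way to see why one shared misspelling is weak evidence: assume each unrelated writer makes the same misspelling independently with probability p. The chance that at least one of n such writers matches is then 1 - (1 - p)^n. The sketch below uses made-up values of p and n. They are assumptions for illustration, not figures from Dr. Miron's paper or from the trial.

```python
def chance_of_shared_misspelling(p, n_other_writers):
    """Probability that at least one of n unrelated writers makes the same
    misspelling, assuming each does so independently with probability p."""
    return 1.0 - (1.0 - p) ** n_other_writers

# Hypothetical rate: 5% of writers misspell the word in the same way.
p = 0.05
for n in (1, 10, 100):
    print(n, round(chance_of_shared_misspelling(p, n), 3))
```

Even at a modest rate, the chance of a coincidental match grows quickly with the number of people who could plausibly have written the note.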
The same idea applies here: for CUSUM or similar analysis to be valid, an
analyst needs large volumes of messages where one of the authors is known
(an anonymous ID counts), and the documents compared must be of similar
lengths. One note a while back indicated that matching anonymous IDs could
be done by tracing misspellings and uncommon word usage. That is definitely
not true without a large base of known messages from both IDs and a high
score on an evaluation function as described in the literature.
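
As a stand-in for such an evaluation function, here is a minimal Python sketch that scores two texts by the cosine similarity of their function-word frequencies. This is a generic technique swapped in for illustration. It is not Dr. Miron's set of equations, which are not given in this message. The word list and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of common English function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

text_a = "the cat sat in the hat"
text_b = "the dog is in the yard"
score = cosine_similarity(profile(text_a), profile(text_b))
print(round(score, 3))
```

On samples this short, two unrelated sentences score high simply because both use "the" and "in", which is why a large base of known messages is needed before any score means much.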
Curtis D. Frye
cfrye@ciis.mitre.org
"If you think I speak for MITRE, I'll tell you how much they
pay me and make you feel foolish."