1995-12-05 - Stylometry

Header Data

From: “Dana W. Albrecht” <dwa@corsair.com>
To: cypherpunks@toad.com
Message Hash: 98c1f8b3c246c57d3d93b18bd949a98c5efb6b8b51ac83eca53257abfb4578f3
Message ID: <199512050028.QAA04676@elmos.corsair.com>
Reply To: N/A
UTC Datetime: 1995-12-05 00:34:46 UTC
Raw Date: Mon, 4 Dec 95 16:34:46 PST

Raw message

From: "Dana W. Albrecht" <dwa@corsair.com>
Date: Mon, 4 Dec 95 16:34:46 PST
To: cypherpunks@toad.com
Subject: Stylometry
Message-ID: <199512050028.QAA04676@elmos.corsair.com>
MIME-Version: 1.0
Content-Type: text/plain



I recently came across an interesting book.  Detailed information follows.
It would seem (to me) to have interesting implications for anonymous
remailers.

Does anyone on the list have any knowledge of this subject?  I've seen
it hinted at, but never systematically explored.

In particular, does anyone know how it has advanced since the book was
written (1978)?  Additional references?

Dana W. Albrecht
dwa@corsair.com


------------------------------------------------------------------------------

Morton, A. Q. (Andrew Queen)
  Literary detection : how to prove authorship and fraud in literature and
  documents / A. Q. Morton.  [Epping, Eng.] : Bowker, c1978.  xiii, 221 p. ; 25 cm.
 
LC CALL NUMBER: PN171.F6 M64
 
SUBJECTS:
  Authorship, Disputed.
  Language and languages--Style.
  Linguistics--Statistical methods.
 
DEWEY DEC:  801/.959
 
NOTES:
  Includes index.
  Bibliography: p. 221.
 
ISBN:  0859350622 : L10.50
LCCN:  79-310591 r85

------------------------------------------------------------------------------

Contents

    List of Tables                                                      vii
    List of Figures                                                     xi
    Preface                                                             xiii

SECTION I    THE THEORY OF STYLOMETRY

    1.  The Problems of Identification and Recognition                    3
    2.  The First Steps                                                  19
    3.  Statistics and Stylometry                                        29
    4.  Statistics as Description                                        40
    5.  The Second Stage in Statistical Description                      51
    6.  Like or Unlike?  The Statistics of Comparisons                   71
    7.  The Rules of the Game                                            75

SECTION II   THE FEATURES OF LANGUAGE
             WHICH ARE OF PARTICULAR
             INTEREST IN STYLOMETRY

    8.  The Writer in his Works                                          95
    9.  The Inflected Language                                          108
          (i) Positional Measurements and Word Mobility                 109
         (ii) Isotropic Distributions                                   114
        (iii) Anisotropic Distributions                                 121
   10.  The Uninflected Language                                        130
   11.  The Occurrence of Proportional Word Pairs                       147

SECTION III  APPLICATIONS

   12.  Introductory                                                    153
   13.  The Homeric Problem                                             158
   14.  The Authorship of the Pauline Epistles                          165
   15.  The Shakespeare Problems                                        184
   16.  The Inimitable Jane                                             189
   17.  A Word from Baker Street                                        192
   18.  Let Justice be Done                                             195

   CONCLUSION                                                           208

   Appendix                                                             211
   Glossary                                                             215
   Bibliography                                                         221
   Index                                                                223

------------------------------------------------------------------------------

(From page 7)

   The main subject of this book is one special aspect of identification,
the determination of the authorship of texts.  Since the development of
photography it has been a simple matter to determine who wrote or who
typed out a text.  It is even possible to demonstrate which instrument was
used in the writing or typing.  But such physical comparisons do not
indicate who composed the text or altered it from its original form.  To
enable this to be done a descriptive science known as stylometry is
needed.  Stylometry is the science which describes and measures the
personal elements in literary or extempore utterances, so that it can be
said that one particular person is responsible for the composition rather
than any other person who might have been speaking or writing at that
time on the same subject for similar reasons.  Stylometry deals not with
the meaning of what is said or written but how it is being said or written.
Stylometry does not deal with the evidential value of statements.  It does
not ask whether this or that particular statement is true or reasonable,
but applies itself to the question, 'In whose words are these sentiments
expressed?'

------------------------------------------------------------------------------

Conclusion

Looking back, the development of stylometry is easy to see. De
Morgan was the first to point out the pattern of argument which should
be used in stylometry: statistics would describe samples and sampling
differences would become the measure of similarity or difference.  But to
suggest that something might well be true and to show that it is true are
two different propositions and it was a long time before anyone actually
developed a statistical test of authorship.  It should have been done by
Udny Yule in his book, _The Statistical Study of Literary Vocabulary_,
but he made an unfortunate error in calculating the standard errors of
sentence length distributions with the result that it was not until W.C.
Wake corrected the error and continued his study of sentence length
distributions of Greek authors in 1946 and 1957 that a reliable test was
established.
   With the pattern of argument confirmed, attention then turned to
what should be counted and analysed.  Like all his colleagues, the author
spent some time looking at those features of style which literary critics
had noted and used as the basis for their judgements.  This was making
stylometry the conversion of stylistic description into quantitative terms;
it was using the accumulated experience of scholars as it had been
expressed in traditional forms.  This proved to be an unexciting quest.  In
some cases, for example the suggestion of Sir Kenneth Dover that the
number of finite verbs used by a writer of Greek prose might be an
indicator of authorship, it turned out to be valid but required samples
impracticably large for any New Testament application, and in others it
soon became clear that the observations had no firm foundation in any
objective data.  It was the realisation that in Greek writing position was
of prime importance that gave stylometry its first general theory.  That such
a theory was justified was confirmed when a dramatic plea for help with a
police statement written in modern English posed a problem which was
rapidly solved by an adaptation of positional methods to the constraints
of an uninflected language.  In Greek where word movement is free, look
at word movement and position; in English where word movement is
restricted, look at immediate context.
   There is so much material available that routine applications of
stylometry will present few problems.  What remain intractable are
problems of mixed texts where one writer has been revised by another or
other situations in which the homogeneity of the text is in doubt.
   Immediate progress seems likely to be made in two directions.  Both
concern the efficiency of methods rather than the further development of
methods.  A simple way to increase the separation of two authors is to
combine a number of tests in multi-variate statistics.  Properly done this
will generate figures which show vast differences where such exist,
although the differences are diffused throughout a number of statistics
and are nowhere to be seen as clearly as the measure of their combined
effect.
   The difficulty is that multi-variate statistics can conceal the underlying
features and in some instances lead to confusion or misapprehension.
One such set of statistics showed the differences between brands of
cigarettes and showed very large differences.  But a study of the statistics
which were combined in the analysis revealed that the largest differences
concerned the packing and the printing on the packets.  While this might
be useful for anyone designing a machine to select brands and sort them
automatically, it was much less useful for any smoker who wanted to know
about the quality of the cigarette.
   The other development which is easy to forecast is the formation of
profiles of individual writers so that quick reference and resolution of
problems will be possible.  One way of doing this is to start with a set of
collocations.  It might be that after "and" a writer is very fond of using "the"
and hardly ever uses "so."  This can be made a test of how often "the" after "and"
occurs compared to "so" after "and".  The combination of a few such tests
based upon the personal maxima and minima of an author will soon
provide a measure by which he can be detected in a large number of
candidates.
  The ultimate aim has been set by the information theory experts who,
many years ago, calculated that in any 200 words, written or spoken,
there was enough information to enable their author to be picked out of
the human race.  This is like saying that every cubic mile of sea water
contains twenty tons of gold; it may be there but getting it out is not easy.
But the aim must be to be able to say of any couple of hundred words, it is
or is not the sole production of the person who produced this other
sample.  It may seem that we are a long way from being able to do so, but
how much nearer we have come in the last twenty years.  Who will say
that the next twenty years will not produce the desired result?

------------------------------------------------------------------------------
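
As a rough illustration of the collocation test described in the Conclusion
excerpt above (how often "the" follows "and" compared with how often "so"
does), here is a minimal Python sketch.  The function names, the crude
tokenizer, and the sample sentence are all assumptions made for illustration;
none of them come from Morton's book, and a real test would also need a
statistical comparison between samples.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    This is a deliberately crude tokenizer (an assumption of this sketch);
    it ignores sentence boundaries and punctuation.
    """
    return re.findall(r"[a-z']+", text.lower())


def collocation_counts(text, keyword):
    """Count the words that immediately follow `keyword` in `text`."""
    words = tokenize(text)
    followers = Counter()
    for current, following in zip(words, words[1:]):
        if current == keyword:
            followers[following] += 1
    return followers


def collocation_rate(text, keyword, follower):
    """Fraction of `keyword` occurrences that are followed by `follower`."""
    followers = collocation_counts(text, keyword)
    total = sum(followers.values())
    return followers[follower] / total if total else 0.0


# Hypothetical sample text, invented for this example.
sample = (
    "He came and the dog barked, and the cat ran, "
    "and so the day ended and the night began."
)

rate_the = collocation_rate(sample, "and", "the")
rate_so = collocation_rate(sample, "and", "so")
print(f'"the" after "and": {rate_the:.2f}')
print(f'"so" after "and":  {rate_so:.2f}')
```

Computing such rates for several keyword pairs over a known sample and a
disputed sample gives a small profile of each; how to judge whether two
profiles differ significantly is a separate statistical question that this
sketch does not address.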






