From: Randall Farmer <rfarmer@HiWAAY.net>
To: 3umoelle@informatik.uni-hamburg.de
Message Hash: c340893f89ba8301a21c21d0594ad8749daa41b0a5221afedf98bb401d0bb684
Message ID: <Pine.OSF.3.96.971118211117.5311A-100000@fly.HiWAAY.net>
Reply To: N/A
UTC Datetime: 1997-11-19 04:11:17 UTC
Raw Date: Wed, 19 Nov 1997 12:11:17 +0800
From: Randall Farmer <rfarmer@HiWAAY.net>
Date: Wed, 19 Nov 1997 12:11:17 +0800
To: 3umoelle@informatik.uni-hamburg.de
Subject: Re: Stylometry
Message-ID: <Pine.OSF.3.96.971118211117.5311A-100000@fly.HiWAAY.net>
MIME-Version: 1.0
Content-Type: text/plain
Here (here being at the bottom of the message :) is the code for the stylometry
program. Note that I specified that the stylometry also involved a calculator
-- that's because the shell script only processes your data to get the numbers
you need out; the tough part is still up to you.
After it runs, you have
A. a file, ./counts, containing wordcounts like so. (The first line is the
ever-present quirk, which occurs because I have yet to master sed.)
1689
550 THE
344 AND
316 TO
...
and B. Output to the screen, like so:
[wc/uwc]
1738 12561 77775 <-- Lines/words/bytes for the original file
2557 5113 31226 <-- First part is the number of *different*
words used in the document, ignore the rest.
[word counts] <-- A juicy excerpt from the counts file
550 THE
344 AND
316 TO
271 A
195 OF
[punc frequency: comma/period/hyphen/quote/semi]
584 <-- Number of commas
1536 <-- Periods
79 <-- Dashes
315 <-- Double-quote marks
10 <-- Semicolons
[and/or/but as sentence-splitters]
24 <-- Occurrences of "and," (including comma -- that's the point)
12 <-- "or,"
7 <-- "but,"
There are too many things you can calculate from this output for me to
enumerate (although the ratios of words to periods, commas, semicolons, and
conjunctions as sentence splitters are rather useful...compare two or three of
a known author's documents to find his/her characteristics, then compare that
to your unknown and see if you've got a match).
[Note that the whole sed mess is supposed to be one line]
#!/bin/sh
# prep: Prepares a text for analysis
sed "y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/;s/[^A-Z']/ /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;s/ / /g;y/ /\n/;"<$1|sort|uniq -c|sort -rn>./counts
echo [wc/uwc]
wc<$1
wc<./counts
echo [word counts]
grep -wie "the" -e "and" -e "to" -e "a" -e "of" < counts
echo [punc frequency: comma/period/hyphen/quote/semi]
grep -c ","<$1
grep -c "."<$1
grep -c "-"<$1
grep -c \"<$1
grep -c "\;"<$1
echo [and/or/but as sentence-splitters]
grep -c "and,"<$1
grep -c "or,"<$1
grep -c "but,"<$1
---------------------------------------------------------------------------
Randall Farmer
rfarmer@hiwaay.net
http://hiwaay.net/~rfarmer
Return to November 1997
Return to “Randall Farmer <rfarmer@HiWAAY.net>”
1997-11-19 (Wed, 19 Nov 1997 12:11:17 +0800) - Re: Stylometry - Randall Farmer <rfarmer@HiWAAY.net>