From: rishab@dxm.ernet.in
To: cypherpunks@toad.com
Message Hash: caba46f73f535e160b8fbd3aca60982621d216cfd00edd8e2e891292bb9bdc2f
Message ID: <gate.q6FmXc1w165w@dxm.ernet.in>
Reply To: N/A
UTC Datetime: 1994-12-19 21:29:58 UTC
Raw Date: Mon, 19 Dec 94 13:29:58 PST
Subject: NSA's text search algorithm
"Ian Farquhar" <ianf@sydney.sgi.com>:
> I always imagined that the development of [NSA's text scanning]
> algorithm itself predated email, and started back with cable and
> telex traffic.
Statistical text scanning is ancient, but has probably not been used at the
scale, or with the efficiency, that the NSA would require for net traffic.
> > Earlier this year, the agency began soliciting collaborations from
> > business to develop commercial applications of their technique.
>
> Has anyone got any further information about how this algorithm works?
> It sounds like Rishab has somewhat better info than was publicly
> available months ago when we last discussed this particular NSA
> "technology transfer".
Actually my 'info' about NSA's technique was mainly deduction, put together
with some (limited) specs on Architext (http://www.atext.com,
graham@atext.com). If you read NSA's note carefully, you can easily rule out
NLP ("independent of...language") and sophisticated neural nets ("very fast").
The Economist story I mentioned in my last post (on the fact that I beat
them to the story!) goes into some detail on BT and Cornell's programs that
summarize textual matter. These are apparently successful (included is a
pretty good example of a computer-generated summary of the article), but
also quite different from NSA's.
BT uses basic NLP to skip articles, conjunctions etc. (making it
language-dependent), and stems words (removing -ing, -ed, -s etc., which is
obviously language-dependent too; NSA, by contrast, denies using stemming,
dictionaries etc.), before building a statistical table of word frequencies,
which is used to determine the subject of a sentence or the similarities
between texts.
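To make the BT-style pipeline above concrete, here is a minimal sketch in
Python. It is not BT's actual code: the stopword list, the suffix list and
all function names are assumptions made up for illustration, and cosine
similarity is just one plausible way to compare two frequency tables.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny lists; a real system would use far larger ones.
STOPWORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "is", "are"}
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Crude suffix stripping (English only); not a real stemmer."""
    for suffix in SUFFIXES:
        # Only strip if at least three letters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def frequency_table(text):
    """Drop stopwords, stem what is left, and count the resulting terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOPWORDS)


def cosine_similarity(a, b):
    """Similarity of two frequency tables: 0.0 (no shared terms) up to 1.0."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both the stopword list and the suffix list are English-specific, which is
exactly why this kind of approach is language-dependent.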
Cornell can search "gigabytes of data ... in a few seconds [for] a
subject" or similarity to an example text. It can figure out which
sentences are 'important' (by comparing frequency tables).
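One guess at what "comparing frequency tables" could mean for picking
important sentences is sketched below. This is not Cornell's algorithm; the
scoring rule (average document-wide frequency of a sentence's words) and the
names rank_sentences and words are assumptions for illustration only.

```python
import re
from collections import Counter


def words(text):
    """Lowercase alphabetic tokens; no stemming or stopword removal here."""
    return re.findall(r"[a-z]+", text.lower())


def rank_sentences(document):
    """Return the sentences of `document`, highest-scoring first.

    A sentence's score is the mean document-wide frequency of its words,
    so sentences built from the document's recurring terms rank higher.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    doc_freq = Counter(words(document))
    scored = []
    for sentence in sentences:
        tokens = words(sentence)
        if not tokens:
            continue
        score = sum(doc_freq[t] for t in tokens) / len(tokens)
        scored.append((score, sentence))
    # Stable sort: ties keep their original document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]
```

Taking the first few entries of the ranked list gives a rough extractive
summary, with no grammar involved at all.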
I suspect NSA's is much more pattern-oriented, as its USP is document
clustering; maybe it uses some NN at some level. Of course you don't really
need to know grammar to filter out articles and pronouns; you could do that
statistically too.
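As a rough illustration of filtering function words statistically, the
sketch below treats any word that turns up in most documents of a collection
as a probable article, pronoun or similar. Nothing here comes from NSA's
note; the function name and the 0.8 threshold are assumptions of mine.

```python
import re
from collections import Counter


def statistical_stopwords(documents, min_doc_fraction=0.8):
    """Guess function words without a dictionary or grammar.

    A word is flagged if it occurs in at least `min_doc_fraction` of the
    documents, on the theory that words found almost everywhere carry
    little information about any one document's subject.
    """
    doc_counts = Counter()
    for doc in documents:
        # Count each word once per document, however often it repeats.
        doc_counts.update(set(re.findall(r"\w+", doc.lower())))
    threshold = min_doc_fraction * len(documents)
    return {word for word, n in doc_counts.items() if n >= threshold}
```

Because it only counts where words occur, the same code works on text in any
language that separates words with spaces or punctuation.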
Rishab
-----------------------------------------------------------------------------
Rishab Aiyer Ghosh "In between the breaths is
rishab@dxm.ernet.in the space where we live"
rishab@arbornet.org - Lawrence Durrell
Voice/Fax/Data +91 11 6853410
Voicemail +91 11 3760335 H 34C Saket, New Delhi 110017, INDIA