1995-10-11 - Re: spam detector algorithm?

Header Data

From: Jonathan Litt <littlitt@MIT.EDU>
To: cypherpunks@toad.com (Cypherpunks Mailing List)
Message Hash: 45f4de1c3c49335eeccdde2f649072c3c33885bca202cecf1ff21f023bcf73f9
Message ID: <199510111420.KAA00919@hazelwood.mit.edu>
Reply To: <199510110303.XAA12811@thor.cs.umass.edu>
UTC Datetime: 1995-10-11 14:20:27 UTC
Raw Date: Wed, 11 Oct 95 07:20:27 PDT

Raw message

From: Jonathan Litt <littlitt@MIT.EDU>
Date: Wed, 11 Oct 95 07:20:27 PDT
To: cypherpunks@toad.com (Cypherpunks Mailing List)
Subject: Re: spam detector algorithm?
In-Reply-To: <199510110303.XAA12811@thor.cs.umass.edu>
Message-ID: <199510111420.KAA00919@hazelwood.mit.edu>
MIME-Version: 1.0
Content-Type: text/plain



  futplex@pseudonym.com writes:
   > Subject: Re: spam detector algorithm?
   > Date: Tue, 10 Oct 1995 23:03:45 -0400 (EDT)
   >
   > Greg Broiles writes:
   > [many details elided...]
   >> Any thoughts about this? Interesting? Stupid? Like I said, my
   >> math is weak.  My intention is to try to cobble up a 2d version
   >> of this to see how it runs but I thought I'd see if anyone can
   >> point out why it can't work, or if it's useful enough that
   >> someone with a better math background than I've got wants to
   >> take this idea somewhere better.
   >
   > It sounds like you are liable to start reinventing parts of the
   > field of information retrieval. The automatic construction and
   > comparison of vectors of document parameters, as you suggested in
   > the part I omitted, is one approach that has met with some
   > success. (The common problem is, given a set of query attributes
   > or a model document, to find relevant documents matching the
   > query or similar to the model document. A variety of relevance
   > measures has been considered.)
   >
   > I can't give you any specific pointers, but I advise you to check out 
   > existing implementations of these and other techniques for information
   > retrieval before you spend too much time writing new code.

Check out SMART, which was originally developed by Gerard Salton at
Cornell. (He is one of the pioneers of IR.) The current release is
maintained by Chris Buckley (chrisb@balder.chrisb.com). Check out:

ftp://ftp.cs.cornell.edu/pub/smart

If you don't feel like installing the whole thing but are interested
in testing it out on some spam, then I could run some tests for you.

Here are some literary references for SMART:

@article{SB88-weight,
   author = {Gerard Salton and Chris Buckley},
   journal = ipm,
   number = {5},
   pages = {513-523},
   title = {Term-Weighting Approaches in Automatic Text Retrieval},
   volume = {24},
   year = {1988}
}

@inproceedings{BSA-trec1,
        author = {Chris~Buckley and Gerard~Salton and James~Allan},
        title = {Automatic Retrieval With Locality Information
                        Using {SMART}},
        booktitle = {Proceedings of the First Text
                        REtrieval Conference (TREC-1)},
        editor = {D. K. Harman},
        publisher = {NIST Special Publication 500-207},
        month = {March},
        year = {1993},
        pages = {59--72}
}

-jon





Thread