1996-04-20 - Re: Dictionary searching code

Header Data

From: frantz@netcom.com (Bill Frantz)
To: Adam Shostack <cypherpunks@toad.com (Cypherpunks Mailing List)
Message Hash: 31ed6dcd54bda8d30adeae55f0902e0c8b80578e1f17449979a53955839abb41
Message ID: <199604200639.XAA24430@netcom9.netcom.com>
Reply To: N/A
UTC Datetime: 1996-04-20 09:58:25 UTC
Raw Date: Sat, 20 Apr 1996 17:58:25 +0800

Raw message

From: frantz@netcom.com (Bill Frantz)
Date: Sat, 20 Apr 1996 17:58:25 +0800
To: Adam Shostack <cypherpunks@toad.com (Cypherpunks Mailing List)
Subject: Re: Dictionary searching code
Message-ID: <199604200639.XAA24430@netcom9.netcom.com>
MIME-Version: 1.0
Content-Type: text/plain


At  8:02 PM 4/19/96 -0500, Adam Shostack wrote:
>        Does anyone have some code that will search a dictionary, and
>tell me *quickly* if an arbitrary chunk of text is in the dictionary?
>Pre-indexing steps are fine, as is using big chunks of disk for hash
>tables.  The point of course, is to check arbitrary possible plaintext
>that a test decryption produces.

This application sounds perfect for Bloom filters.  The basic idea of a
Bloom filter is to build a database by taking each word in the dictionary
and hashing with N different hashes.  The hashes do not need to by
cryptographically secure, but they do need to be good hashes, XOR doesn't
make it.  You use those hashes as bit offsets in a giant bit map, which is
the database.  When building the database you turn on the bits at each of
these N locations.

When accessing the database, you hash the chunk of text with the same
hashes, and then test the bits in the database at those offsets.  If any of
the bits are zero, then the chunk of text is not in the dictionary.

The failure mode is to say something is in the dictionary when it isn't. 
If half the bits in the database are on, then the probability of failure is
2**(-N), so if N==10, then the failure rate is 1 in 1024.  If empirically,
you get a higher failure rate, check the quality of your hashes.

For cryptanalysis, I might pick a higher N and eyeball check for failures. 
Say you want 1 in a million failure rate, and have an 80,000 word
dictionary.  You need a 20 * 80,000 * 2 bit database, which is 400,000
bytes.

Regards - Bill


------------------------------------------------------------------------
Bill Frantz       | The CDA means  | Periwinkle  --  Computer Consulting
(408)356-8506     | lost jobs and  | 16345 Englewood Ave.
frantz@netcom.com | dead teenagers | Los Gatos, CA 95032, USA







Thread