1998-08-06 - Re: text analysis

Header Data

From: jim@mentat.com (Jim Gillogly)
To: cypherpunks@toad.com
Message Hash: 1038aeb9fb2cce0854373000247c0e4293113d604ac317ee716532780f5d59a0
Message ID: <9808061301.AA27007@mentat.com>
Reply To: N/A
UTC Datetime: 1998-08-06 13:03:18 UTC
Raw Date: Thu, 6 Aug 1998 06:03:18 -0700 (PDT)

Raw message

From: jim@mentat.com (Jim Gillogly)
Date: Thu, 6 Aug 1998 06:03:18 -0700 (PDT)
To: cypherpunks@toad.com
Subject: Re: text analysis
Message-ID: <9808061301.AA27007@mentat.com>
MIME-Version: 1.0
Content-Type: text/plain

CyberPsychotic writes:
>  So I was playing with some crypto-analysis algorythms, and some steps of
> them involve such thing as finding out the frequency of some character or
> set of characters. So I wonder what would be the optimal (speed, resources
> etc)way of coding this. While playing with single character frequency, I
> guess the best way would be having unsigned int 256 elements array (I
> refer to C coding) and each time, I find certain character, i just
> increase the element of the array with this char offset. This seems very
> neat to me (except the thing that some chars could be never found in
> text). Anyways, when things come to 2 characters set, i have to get 1024
> character set, and so on, which looks quite unreasonable to me to allocate
> memory for elements, which probably will be never found in text... I was
> thinking of other solution and came to two way connected lists (correct
> term?)  things, i.e. : i have some structure like: 

The term is "linked list".  If you want to collect all the arbitrarily
long repeats you may have to resign yourself to a two-pass algorithm.
For small cryptograms like the ones I usually deal with, it's
sufficient to use an N^2 algorithm, looking at each starting point and
each possible offset to see whether a string of at least k characters
is duplicated at that offset.  For a longer file, off the top of my
head, you might start with a set of pointers for where each possible
digraph starts; you can use a linked list, storing all the starting
locations for each digraph (a two-dimensional array of pointers, total
size 64K).  On your second pass you can do an N^2 search within those
reduced sets to see which repeated digraphs can be extended.  It will
help a lot if your target ciphertext fits in memory.  I make no claim
that this is optimal.

	Jim Gillogly