1998-08-06 - Re: text analysis

Header Data

From: Anonymous <nobody@replay.com>
To: cypherpunks@toad.com
Message Hash: 65f2d7685e23eb1e3cd2dcfb5d518e0957326efe698014f5d98fb865686ea780
Message ID: <199808061501.RAA26380@replay.com>
Reply To: N/A
UTC Datetime: 1998-08-06 15:01:10 UTC
Raw Date: Thu, 6 Aug 1998 08:01:10 -0700 (PDT)

Raw message

From: Anonymous <nobody@replay.com>
Date: Thu, 6 Aug 1998 08:01:10 -0700 (PDT)
To: cypherpunks@toad.com
Subject: Re: text analysis
Message-ID: <199808061501.RAA26380@replay.com>
MIME-Version: 1.0
Content-Type: text/plain

A good data structure is a tree:

                   ----->  2nd letter
                  /                    ------> 3rd letter
                 /                    /
      1st letter ------->  2nd letter -------> 3rd letter
                 \                    \
                  \                    ------> 3rd letter
                   ----->  2nd letter
                                      \
                                       -----> 3rd letter

Start with an array indexed by the 26 letters.  Each array entry contains
a count of that letter's frequence and a pointer to a subsidary array.
The pointer starts out null.

When you find a new digraph (like the first time you find a letter after
'g') you allocate a new structure to represent the letters following 'g'.
This will again consist of 26 entries like the first one.

Likewise when you find a new trigraph, you allocate a new third-level
array of 26 entries.

This will be a compromise between space and speed.  Most digraphs don't
exist so you will save considerable space over the brute force way of
allocating an array of 26^n spaces to count n-graphs.  However it is
a bit wasteful in that it allocates a full 26 entries each time even
though only a small fraction will be used.

The last layer does not need to have the pointers allocated per letter
since you won't be looking beyond that (e.g. if you are only interested
in counting trigraphs, the third layer above can consist of just an
array of counts).

struct entry {
	int count;
	struct entry *next;
} top;

Handle letter, given array of previous letters.  prev[0] is current
letter, prev[1] is 1st previous letter, prev[2] is 2nd previous letter,
etc.  Size of prev is n entries.  Letters are normalized to range 0-25.
Untested code, hopefully it will give you the idea.

handle (int prev[], int n)
{
	struct entry *cur;

	cur = top;
	for (i=0; i<n; ++i) {
		cur[prev[i]].count += 1;
		if (i==n-1)
			break;
		if (cur[prev[i]].next == NULL)
			cur[prev[i]].next = calloc (26, sizeof(struct entry));
		cur = cur[prev[i]].next;
	}
}

This actually counts the entries backwards, so that "the" is entered under
top['e'-'a'+1]->next['h'-'a'+1]->next['t'-'a'+1].  You can correct for
this when you print them out.  Or you can store the letters in the prev
array in the opposite order, putting the current letter in prev[n-1],
then they will be in the correct order.

Thread

Return to August 1998
Return to “Anonymous <nobody@replay.com>”
1998-08-06 (Thu, 6 Aug 1998 08:01:10 -0700 (PDT)) - Re: text analysis - Anonymous <nobody@replay.com>

cryptoanarchy.wiki

1998-08-06 - Re: text analysis

Header Data

Raw message

Thread