1992-10-26 - entropy, with code

Header Data

From: Eric Hughes <hughes@soda.berkeley.edu>
To: cypherpunks@toad.com
Message Hash: 5f88f40e56048dc684e34fc31f79e7a1a73f66a8cecc1c82fe0173e912f60644
Message ID: <9210261958.AA28999@soda.berkeley.edu>
Reply To: N/A
UTC Datetime: 1992-10-26 19:58:58 UTC
Raw Date: Mon, 26 Oct 92 12:58:58 PDT

Raw message

From: Eric Hughes <hughes@soda.berkeley.edu>
Date: Mon, 26 Oct 92 12:58:58 PDT
To: cypherpunks@toad.com
Subject: entropy, with code
Message-ID: <9210261958.AA28999@soda.berkeley.edu>
MIME-Version: 1.0
Content-Type: text/plain



The entropy-calculating code is at the end of this message.  I took
the opportunity to calculate some sample entropies:

entropy.c	5.283772	the source code
entropy.asc	6.052222	entropy.c, encrypted and armored
entropy.as2	6.012493	entropy.asc, with the wrappers removed
entropy.pgp	7.830532	entropy.c, encrypted alone
entropy.obj	6.112890	the object file
entropy.exe	6.947111	the executable
randseed.bin	4.584963	from pgp, 24 bytes long
pubring.pgp	7.754017	my public key ring

The entropy of the source code is at the high end of the range for
English, which is not surprising given the number of symbols in
ordinary C code.  The entropy increases for the object file and the
executable, each of which has less overt structure.

The entropy of the encrypted and ASCII-armored source code is within
1% of 6 bits, just as predicted: radix-64 armor draws each character
from a 64-symbol alphabet, so its per-character entropy can be at most
log_2 64 = 6 bits.  And with the wrappers removed, it's even closer!
The binary version of the encrypted message has the highest entropy of
all these tests.

In randseed.bin, the entropy is much less than 8.  But the length of
the file is 24 bytes and log_2 24 = 4.584963, indicating that there
are no duplicate bytes in the file.  Hence the warning:  entropy
calculations with small samples _will_ be misleading.
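
To see where the randseed.bin figure comes from: with 24 counts of one
each, every v_i \log v_i term in the identity used in the code below
vanishes, since \log 1 = 0, and the whole expression collapses to
\log_2 A = \log_2 24.  A standalone check of that arithmetic (a sketch,
not part of the measurements above):

#include <stdio.h>
#include <math.h>

/* smallsample.c -- 24 distinct bytes, one count each, give entropy
 * log_2 24 exactly, no matter which 24 bytes they are.
 */
int
main( void )
{
	long v[ 24 ] ;
	double h ;
	long sum ;
	int j ;

	for ( j = 0 ; j < 24 ; ++j ) v[ j ] = 1 ;

	sum = 0 ;
	for ( j = 0 ; j < 24 ; ++j ) sum += v[ j ] ;	/* A = 24 */

	h = 0 ;
	for ( j = 0 ; j < 24 ; ++j ) {
		h -= v[ j ] * log( (double) v[ j ] ) ;	/* each term: log 1 = 0 */
	}
	h /= sum ;
	h += log( (double) sum ) ;			/* H = log A */
	h /= log( 2.0 ) ;				/* convert to base 2 */

	printf( "%lf\n", h ) ;				/* prints 4.584963 */
	return 0 ;
}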

Note that the entropy subroutine can be used to calculate the entropy
of any distribution of counts.  This will allow all you code-writing
cypherpunks to measure bit entropies, digram entropies, etc.; a sketch
of the digram case follows.
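
A digram count needs 256*256 bins, so the frequency table grows to
256K bytes and the count parameter must be wider than a 16-bit int.
A minimal sketch along those lines (untested; assumes a flat memory
model rather than 16-bit DOS, and carries a widened copy of the
entropy routine):

/* digram.c -- Calculate digram (byte pair) entropies of standard input.
 * The 256*256 table of longs is 256K bytes, hence the flat-memory
 * assumption; the count parameter is long since 65536 overflows a
 * 16-bit int.
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NUMBER_OF_DIGRAMS	( 256L * 256L )

double
entropy( long *v, long n )
{
	double h ;
	long sum, j ;

	sum = 0 ;
	for ( j = 0 ; j < n ; ++j ) {
		sum += v[ j ] ;
	}

	h = 0 ;
	for ( j = 0 ; j < n ; ++j ) {
		if ( v[ j ] == 0 ) continue ;
		h -= v[ j ] * log( (double) v[ j ] ) ;
	}
	h /= sum ;
	h += log( (double) sum ) ;

	return h / log( 2.0 ) ;
}

int
main( void )
{
	long *pair_freq ;
	int c, prev ;

	pair_freq = (long *) calloc( NUMBER_OF_DIGRAMS, sizeof( long ) ) ;
	if ( pair_freq == NULL ) return 1 ;

	/* count each adjacent pair of bytes as one digram */
	prev = EOF ;
	while ( EOF != ( c = getchar() ) ) {
		if ( prev != EOF ) {
			++ pair_freq[ prev * 256 + c ] ;
		}
		prev = c ;
	}

	printf( "%lf\n", entropy( pair_freq, NUMBER_OF_DIGRAMS ) ) ;
	free( pair_freq ) ;
	return 0 ;
}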

Eric

----------------------------------------------------------------

/* entropy.c -- Calculate monogram entropies of standard input.
 */

#include <stdio.h>
#include <math.h>

/* This next define is to counteract the foolish Microsoft C 7.00
 * runtime habit of treating \x1A as EOF when stdin is in text mode
 */
#define STUPID_NEWLINE_TRANSLATE	1

#ifdef STUPID_NEWLINE_TRANSLATE
#include <io.h>
#include <fcntl.h>
#endif

/*--------------------------------------*/
#define NUMBER_OF_BYTES		256
long byte_freq[ NUMBER_OF_BYTES ] ;

double entropy( long *, int ) ;

int
main( void )
{
	int c, verbose = 0 ;
	unsigned int j ;

#ifdef STUPID_NEWLINE_TRANSLATE
	_setmode( _fileno( stdin ), _O_BINARY ) ;
#endif

	for ( j = 0 ; j < NUMBER_OF_BYTES ; ++j ) {
		byte_freq[ j ] = 0 ;
	}

	while ( EOF != ( c = getchar() ) ) {
		++ byte_freq[ (unsigned int) c ] ;
	}


	if ( verbose ) {
		for ( j = 0 ; j < NUMBER_OF_BYTES ; ++j ) {
			printf( "%3d=%-3d ", j, byte_freq[ j ] ) ;
			if ( j % 8 == 7 ) printf( "\n" ) ;
		}
	}

	printf( "%lf\n", entropy( byte_freq, NUMBER_OF_BYTES ) ) ;
}

/*--------------------------------------*/
/* Calculates the entropy of the distribution given in list v of n elements.
 * The list v gives counts.  v is summed, and frequencies are assigned 
 * as v[i]/sum(v).
 *
 * Uses the following definitions and identities:
 *   A = \sum_{i=0}^{n-1} v_i
 *   p_i = v_i / A
 *   H = \sum_{i} - p_i \log p_i
 *     = \log A - (1/A) \sum_{i} v_i \log v_i
 *   \lim_{x \rightarrow 0} x \log x = 0
 */

double
entropy( long *v, int n )
{
	double h ;
	long sum ;
	int j ;

	/* first sum the array
	 */
	sum = 0 ;
	for ( j = 0 ; j < n ; ++j ) {
		sum += v[ j ] ;
	}

	/* next calculate the entropy function
	 */
	h = 0 ;
	for ( j = 0 ; j < n ; ++ j ) {

		/* If the frequency is zero, the entropy contribution is zero
		 */
		if ( v[ j ] == 0 ) continue ;

		h -= v[ j ] * log( v[ j ] ) ;
	}
	h /= sum ;
	h += log( sum ) ;

	/* Now adjust the base of the logarithm to base 2
	 */
	h /= log( 2 ) ;

	return h ;
}
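
(A usage note: with Microsoft C 7.00, which the source targets,
something like "cl entropy.c" should build it, and "entropy <
entropy.c" should then print a figure like the first line of the
table above.)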





