1993-04-28 - Program to measure entropy

Header Data

From: Peter Meyer <meyer@mcc.com>
To: cypherpunks@toad.com
Message Hash: 07001a37e75c6a1d4ca59f8c0df0475ad2f3a93dbd6018e8df956e7b28799e26
Message ID: <19930428212444.5.MEYER@OGHMA.MCC.COM>
Reply To: N/A
UTC Datetime: 1993-04-28 21:25:07 UTC
Raw Date: Wed, 28 Apr 93 14:25:07 PDT

Raw message

From: Peter Meyer <meyer@mcc.com>
Date: Wed, 28 Apr 93 14:25:07 PDT
To: cypherpunks@toad.com
Subject: Program to measure entropy
Message-ID: <19930428212444.5.MEYER@OGHMA.MCC.COM>
MIME-Version: 1.0
Content-Type: text/plain



Cypherpunks write code, so here's some (at the end, anyway).

Someone asked awhile back (just before the deluge of postings on the
Wiretap Chip swamped my announcement of the release of new versions of
our Dolphin Encrypt encryption software) about (something like) how to
tell whether a file consists of something like English text or just
(apparent) garbage.  Here's one way, a program to calculate the entropy
(and the relative entropy) of the set of bytes in a file.

First the documentation (extracted from Appendix III in the manual for
the Dolphin Encryption Library):

Information theorists have attempted to formalize and to quantify the
notion of randomness, also called entropy.  The usual definition of
entropy in a string of letters from some alphabet is due to Claude
Shannon (who formulated this concept in the 1950s).  Let S be a string
of letters from some alphabet A = { a(0), a(1), ..., a(k-1) } of k
letters, and let p(i) be the probability (that is, the relative
frequency) of occurrence of a(i) in the string S, then the entropy E of
the string S may be defined as:

                           k-1
                  E  =  - Sigma  ( p(i) * ln ( p(i) ) )
                          i = 0

where ln is the natural logarithm.  It can be shown that this value is
maximized when all letters occur in S with equal frequency (in this
case E = ln(k)), and is minimized when one letter occurs all the time
(in this case E = 0).

Since E ranges between 0 and ln(k), we may obtain a modified entropy
value E', which we call relative entropy, which ranges between 0 and 1
by dividing E by ln(k) thus:  E' = E / ln(k).

The program ENTROPY1.EXE calculates the relative entropy of the bytes
in a given file.  For a DOS text file consisting of English text the
relative entropy value is typically in the range 0.48 - .68.  The
relative entropy values for most non-random files, including .OBJ,
.COM and .EXE files, usually fall in the range 0.50 through 0.95.

Files consisting of bytes generated by pseudo-random-number generators
typically have relative entropy values in the range 0.970 - 0.999.
Thus a file with a relative entropy value of at least .98 looks (at
least according to this test) very much like a file consisting of
random bytes.  ENTROPY.EXE can thus be used to test whether a file
appears to consist of random bytes or something like natural language.

The ENTROPY1.EXE program takes two parameters on the command line, a
file specification (wildcard characters are not allowed in this
version) and (optionally) a byte space size, e.g. ENTROPY1 FILE.TXT 150.

The program produces results such as:

           File           Size        Entropy   Rel. entropy    Diff. bytes

     HAMLET.TXT           1459       3.037405       0.547756             42
       PTRS.TXT           3683       3.415741       0.615984            108
     CHAP04.TXT          51162       3.339292       0.602198            100

      FILE1.RND           1762       5.473655       0.987102            255
      FILE2.RND           3400       5.503647       0.992511            256
      FILE3.RND          29225       5.541324       0.999305            256

     HAMLET.ENC           1762       5.478605       0.987995            256
       PTRS.ENC           3400       5.501231       0.992075            256
     CHAP04.ENC          29225       5.540785       0.999208            256

       NULLFILE          20000       0.000000       0.000000              1

The file called NULLFILE consists of 20,000 null (zero) bytes, and has a
relative entropy value of zero (as do all files which contain only a
single byte value).  Note that the relative entropy values for the .ENC
files (encrypted using Dolphin Encrypt) are about .99, as are those for
the .RND files (created by using a pseudo-random-number generator
similar to Microsoft's rand() function) of the same size.

The last column gives the number of different bytes found in the file.
This may be less than the size of the byte space for the file.  If the
size of the byte space is less than 256, as is the case with text
files, then the space size parameter may be included in the command
line, as in  ENTROPY1 HAMLET.TXT 108.  In this case the program
produces results such as:

           File           Size        Entropy   Rel. entropy    Diff.bytes  

     HAMLET.TXT           1459       3.037405       0.648723             42
       PTRS.TXT           3683       3.415741       0.729527            108
     CHAP04.TXT          51162       3.339292       0.713199            100

Thus decreasing the value for the byte space increases the entropy
measure.  Relative entropy tends to be larger for larger files.

Now the C source code:

/*  ENTROPY1.C
 *  Written by Peter Meyer, last revised 1993-04-27.
 *  Calculates the relative entropy of the bytes in a file
 *  defined as the negative of the sum for each byte of the product of
 *  the relative probability of that byte times the natural log
 *  of that byte, divided by the natural log of the number of
 *  different bytes occurring in the file; values can range from 0 to 1.
 */

#include <STDIO.H>          /*  Microsoft header files  */
#include <STDLIB.H>
#include <MEMORY.H>
#include <MATH.H>

unsigned long n[256];
double p[256];

unsigned char *usage =
    "\nUse: ENTROPY1 filespec [space_size]"
    "\nspace_size = number of possible bytes (default = 256)\n";

void measure_entropy(unsigned char *filename, unsigned long *total,
    double *entropy, double *relative_entropy, unsigned int*num_diff_bytes,
    unsigned int *space_size, int *err_flag);

/*-----------------------------*/
void main(int argc, char *argv[])
{
int err_flag;
unsigned int num_diff_bytes, space_size;
unsigned long total;
double entropy, relative_entropy;

if ( argc == 1 )
    {
    printf(usage);
    exit(0);
    }

if ( argc == 2 )
    space_size = 256;
else
    {
    space_size = (unsigned int)atoi(argv[2]);
    if ( space_size == 0 || space_size > 256 )
        {
        printf("\nInvalid space size.\n");
        exit(1);
        }
    }

measure_entropy(argv[1],&total,&entropy,&relative_entropy,&num_diff_bytes,
    &space_size,&err_flag);
switch ( err_flag )
    {
    case 0:     /*  no error  */
        printf("Space size = %u\n",space_size);
        printf("\n%15s%15s%15s%15s%15s",
            "File","Size","Entropy","Rel. entropy","Diff. bytes");
        printf("\n%15s%15lu",argv[1],total);
        printf("%15.6f%15.6f%15d\n",entropy,relative_entropy,num_diff_bytes);
        exit(0);
    case -1:
        printf("\nCannot open file %s.\n",argv[1]);
        exit(2);
    case -2:
        printf("\n%15s is inconsistent with space size %d.\n",
            argv[1],space_size);
        exit(3);
    }
}

/*-----------------------------------------*/
void measure_entropy(unsigned char *filename,
                     unsigned long *total,
                     double *entropy,
                     double *relative_entropy,
                     unsigned int *num_diff_bytes,
                     unsigned int *space_size,
                     int *err_flag)
{
int j;
FILE *file;

*err_flag = 0;
file = fopen(filename,"rb");
if ( file == NULL )
    {
    *err_flag = -1;
    return;
    }

/*  zero the frequency array  */
memset(n,0,256*sizeof(unsigned long));

/*  count the byte values  */
while ( !feof(file) )
    n[fgetc(file)]++;

/*  get the number of bytes and the number of different byte values  */
*num_diff_bytes = 0;
*total = 0L;
for ( j=0; j<256; j++ )
    {
    *num_diff_bytes += ( n[j] != 0 );
    *total += n[j];
    }

if ( *num_diff_bytes > *space_size )
    {
    *err_flag = -2;
    fclose(file);
    return;
    }

/*  calculate the probabilities  */
for ( j=0; j<256; j++ )
    p[j] = ((double)n[j])/(*total);

/*  calculate the entropy  */
*entropy = 0.0;
for ( j=0; j<256; j++ )
    {
    if ( p[j] )
        *entropy += p[j]*log(p[j]);
    }

*entropy = -1.0*(*entropy);

/*  calculate the relative entropy  */
*relative_entropy = *entropy/log(*space_size);

fclose(file);
}

If anyone wants the MS-DOS executable version of this program then send
me (meyer@mcc.com) a snailmail address and I'll send it to you on the
Dolphin Encrypt demonstration disk.

-- Peter Meyer





Thread