1994-08-06 - Re: fast 386 DES code figures

Header Data

From: Phil Karn <karn@unix.ka9q.ampr.org>
To: mpd@netcom.com
Message Hash: bad3cda46983f16e907e51487abc9f8a49a08ee133f4edc4f10bbfbe8ea1bd43
Message ID: <199408061700.KAA00742@unix.ka9q.ampr.org>
Reply To: <199408061608.JAA27681@netcom12.netcom.com>
UTC Datetime: 1994-08-06 17:00:19 UTC
Raw Date: Sat, 6 Aug 94 10:00:19 PDT

Raw message

From: Phil Karn <karn@unix.ka9q.ampr.org>
Date: Sat, 6 Aug 94 10:00:19 PDT
To: mpd@netcom.com
Subject: Re: fast 386 DES code figures
In-Reply-To: <199408061608.JAA27681@netcom12.netcom.com>
Message-ID: <199408061700.KAA00742@unix.ka9q.ampr.org>
MIME-Version: 1.0
Content-Type: text/plain


>Since 2k is exactly what is needed for a precomputed table which
>combines the S-boxes and the wirecrossing, I will assume this is
>the approach you used.

Yup, it is. I could look up more than 6 bits (i.e., more than 1 S-box)
at a time, but this really starts to eat RAM.

>The important trick is to reorder the S-boxes so that you do
>lookups on the odd numbered ones and the even numbered ones
>separately.  (1,3,5,7,2,4,6,8) works nicely.  This permits the

This is another trick from Outerbridge's code that I picked up. As you
say, it does make a difference. It's especially nice in 386 assembler
since I can do the key XOR E(R) AND mask in 32-bit operations, then
pick off the 4 resulting bytes individually to do the SP box
indexing. This trick took me from about 1.85 megabits/sec to the 2.45
megabit/sec figure I gave earlier.

>If you store the upper two bits of lookup table addressing in the
>precomputed key schedule and shift both it and the right hand
>block left two bits, all explicit table indexing vanishes and you
>can accumulate the result of a lookup with a single indexed OR
>instruction.

I'm doing this too, if I understand you correctly.  By left-adjusting
each subkey in the key schedule (i.e., shifting the 6 bits left 2
bits), I can pre-adjust for the x4 offset I need to index the SP
table, which has 4-byte elements. This saves two 32-bit shifts per
round.

BTW, some of the code (including Outerbridge's in Schneier)
accumulates the 8 intermediate SP results by ORing into a temporary,
then XORs the temporary into the output data block. This is
unnecessary; each table lookup can be XORed directly into the output
block. Since XOR and OR take the same time, this avoids a temporary
and an extra operation.

At the moment I'm really down in the noise. I've discovered that
286/386/486 specific instructions like ROR EAX,31 execute slightly
faster (2 clock cycles) on the 486 than the equivalent 8086
instruction ROL EAX,1 (3 clock cycles), even though the faster
instruction is more bytes.  Unexpected timings occur for several other
486 instruction sequences as well, such as LODS[BW] (5 clocks), which
is much slower than writing out the equivalent MOV/INC (or ADD)
sequence longhand (1 clock each).  I guess code size is unimportant as
long as everything lands in the cache.


Phil







Thread