From: Ray Cromwell <rjc@clark.net>
To: weidai@eskimo.com (Wei Dai)
Message Hash: 9892bc6491aa3265f829e21de793f7e6b9417fb34b4f33436e2defe324b1fe48
Message ID: <199509080048.UAA19561@clark.net>
Reply To: <Pine.SUN.3.91.950907131638.6760C-100000@eskimo.com>
UTC Datetime: 1995-09-08 00:49:09 UTC
Raw Date: Thu, 7 Sep 95 17:49:09 PDT
From: Ray Cromwell <rjc@clark.net>
Date: Thu, 7 Sep 95 17:49:09 PDT
To: weidai@eskimo.com (Wei Dai)
Subject: Re: fast modular reduction
In-Reply-To: <Pine.SUN.3.91.950907131638.6760C-100000@eskimo.com>
Message-ID: <199509080048.UAA19561@clark.net>
MIME-Version: 1.0
Content-Type: text/plain
>
> The numbers you give are a bit off. Assuming a 32-bit machine,
> n=64 implies a 2048-bit modulus, and a 4096-bit number to be reduced.
> Also, Karatsuba should use 1/3 (2*64^1.58 / 64^2) the multiplications
> rather than 1/5.
The n=64 implies two 2048-bit numbers are being multiplied. The 2048-bit
number comes from the fact that in a typical crypto app, modexp
will be reducing numbers as large as the modulus squared which runs
2048-bits for a 1024-bit modulus. The reciprocal is 1 block
bigger than the number to be reduced. Hence, you are dealing with
multiplying about two 2048-bit numbers. But since we only care
about the "fractional" part of the result, we can safely throw
away half the computation and only compute half the Karatsuba
recursion tree. (the number before the decimal point is the
quotient) Then, to determine the final remainder, we simply
multiply by the modulus again, throwing away non-significant
computation again. There is a normal n^2 method for reducing
via reciprocal that only uses 1/4 the number of ops as the obvious
technique.
Your right about the 1/3 vs 1/5, I dunno where the 5 came from, must
have been a typo in my calcs. The problem with Karatsuba is that it's
hard to implement efficiently. Temporary ints should be kept to
a minimum and be preallocated. The combine step requires 1 store,
and 5 additions, of multiprecision integers. The split step requires
no copying if you use pointer manipulation, and instead of shifting,
don't add in place, but add "with shift" to the destination. Most
of the implementations I've seen do too much copying and shifting.
Given that some modern processors have efficient hardware multiply,
it might not be worth all the trouble to trade mults for adds. If
a processor has an efficient hardware FFT, it might even be worthwhile
to use the FFT multiply method.
Do you have a ref for the Montgomery method? I'm unfamilar with
the name, I wonder if it's something I've seen before under
a different label.
Check out Schonhage's book "Fast Algorithms" They've implemented
all the asymtotic algorithms efficiently and gathered
performance data. I corresponded with Schonhage's grad student
and he told me that Karatsuba wins for n>=8, which I find difficult
to see, when it takes about n=32 for my own implementation (not
optimized) to break even.
-Ray
Return to September 1995
Return to “Wei Dai <weidai@eskimo.com>”