python, modulo, number-theory, exponentiation, modular-arithmetic

Python: speed up pow(base,exp,mod) for fixed exp and mod, or with vectorization


The bottleneck of my code is repeated calls to pow(base, exponent, modulus) for very large integers (about 100 to 256 bits; numpy does not support integers that large). However, my exponent and modulus are always the same. Could I somehow use this to speed up the computation with a custom function? I tried defining the function seen below (written for a general modulus and exponent).

However, even if I hardcode every operation without the while-loop and if-statements for a fixed exponent and modulus, it is slower than pow.

def modular_pow(base, exponent, modulus):
    # Right-to-left binary exponentiation (square-and-multiply).
    result = 1
    base = base % modulus
    while exponent > 0:
        if exponent % 2 == 1:               # low bit set: fold base in
            result = (result * base) % modulus
        exponent = exponent >> 1            # consume one exponent bit
        base = (base * base) % modulus      # square for the next bit
    return result

Another option would be to somehow "vectorize" it. I have to compute pow for about 100000000 different base values. While these values often change between runs of my script (so a lookup table will not be useful), I will know all of them the moment I hit run (I could compute them all at once).

Any ideas? I got some speedup by using mpz datatypes from gmpy2, but it is still too slow.


Solution

  • Good news, bad news. Good news is that when the modulus m is fixed, there are ways to speed computing a*b % m. Search for "Barrett reduction" and "Montgomery reduction". They work, in different ways, by precomputing constants related to m such that % m can be computed via multiplication and shifting, without needing division.

    The bad news: to find the remainder, both ways require (in addition to cheaper operations) two multiplications. So they don't pay overall unless multiplication is much cheaper than division.

    For that reason, they're generally slower unless the modulus is "truly" large - your "about 100 to 256 bits" is still on the small side by modern standards, just a few times wider than native 64-bit machine ints. Things like speedy FFT-based multiplication require much larger integers before they pay off.

    CPython's built-in modular pow is already using a "binary" scheme, along the lines of what you coded in Python, but fancier (if the exponent is "large enough", the built-in pow views it as being in base 32, consuming 5 exponent bits per loop iteration).
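
    To make that concrete, here's a left-to-right fixed-window sketch that consumes 5 exponent bits per table lookup. It's an illustration of the idea only - CPython's actual implementation differs in detail, and the names here are mine:

    def window_pow(base, exponent, modulus, w=5):
        # Precompute base**d % modulus for every w-bit digit d.
        table = [1] * (1 << w)
        for d in range(1, 1 << w):
            table[d] = table[d - 1] * base % modulus
        # Walk the exponent w bits at a time, most significant first.
        result = 1
        nbits = (exponent.bit_length() + w - 1) // w * w
        for shift in range(nbits - w, -1, -w):
            for _ in range(w):               # w squarings per window
                result = result * result % modulus
            d = (exponent >> shift) & ((1 << w) - 1)
            if d:                            # one multiply per nonzero digit
                result = result * table[d] % modulus
        return result

    assert window_pow(345, 10**30, 2347) == pow(345, 10**30, 2347)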

In a quick stab at implementing Montgomery reduction in Python, replacing the modular multiplies in your code with their Montgomery spellings, modular_pow() didn't get faster than the built-in until the modulus grew to tens of thousands of bits. For inputs around 256 bits, it was about 3x slower.

    That's a mixed bag: the Python code didn't exploit the "base 32" tricks, which can give substantial benefits. But for large enough inputs, CPython uses faster-than-naïve Karatsuba multiplication, which the division-free Montgomery spelling can benefit from (CPython's int division has no speedup tricks regardless of input sizes, and CPython's built-in modular pow always uses division to find remainders).
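
    For the curious, here's a minimal pure-Python sketch of Montgomery reduction along those lines (my own illustration with my own helper names, not the timed code). It requires an odd modulus, and pow(x, -1, n) for the modular inverse needs Python 3.8+:

    def montgomery_setup(m):
        # Pick R = 2**k > m; m must be odd so the inverse exists.
        k = m.bit_length()
        R = 1 << k
        m_prime = -pow(m, -1, R) % R    # m * m_prime == -1 (mod R)
        return k, R, m_prime

    def redc(T, m, k, R, m_prime):
        # Montgomery reduction: T * R**-1 mod m via multiplies,
        # masks, and shifts - no division by m.
        t = (T + (T * m_prime & (R - 1)) * m) >> k
        return t - m if t >= m else t

    def mont_pow(base, exponent, m):
        # modular_pow() with the multiplies in Montgomery spelling.
        k, R, m_prime = montgomery_setup(m)
        result = R % m              # Montgomery form of 1
        base = base * R % m         # convert base to Montgomery form
        while exponent > 0:
            if exponent & 1:
                result = redc(result * base, m, k, R, m_prime)
            base = redc(base * base, m, k, R, m_prime)
            exponent >>= 1
        return redc(result, m, k, R, m_prime)   # convert back out

    m = (1 << 255) - 19             # an odd ~255-bit modulus
    assert mont_pow(12345, 65537, m) == pow(12345, 65537, m)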

So, short course: there's nothing obvious I know of that you can do in CPython to speed up a single instance of pow(a, b, c). It's possible some C-coded crypto library out there has something suitable, but none that I know of.

    But the other good news is that your problem is "embarrassingly parallel". If you have N processors, you can give each of them 100000000/N of your inputs, and they can all run full speed in parallel. That would give a speedup of about a factor of N.

    But the bad news is that your integers really aren't "large" (they're small enough that I bet you can still compute thousands of modular pows per second with the built-in pow), and interprocess communication costs may wipe out the benefits of doing N computations in parallel. That all depends on exactly how you get your inputs and what you want to do with the results.
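
    If that trade-off looks favorable for you, here's a minimal multiprocessing sketch, assuming the bases arrive in one big iterable and a list of results is all you need back (the exponent, modulus, and names are placeholders of mine):

    import multiprocessing as mp

    EXPONENT = 65537              # placeholder: your fixed exponent
    MODULUS = (1 << 255) - 19     # placeholder: your fixed modulus

    def pow_fixed(base):
        return pow(base, EXPONENT, MODULUS)

    def modexp_all(bases):
        # A large chunksize ships inputs/results in big batches,
        # cutting the interprocess-communication overhead.
        with mp.Pool() as pool:
            return pool.map(pow_fixed, bases, chunksize=100_000)

    if __name__ == "__main__":
        results = modexp_all(range(2, 1_000_002))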

    FOLLOW UP

    The "Handbook of Applied Cryptography" (HAC), chapter 14, essentially spells out the state of the art for gonzo modular exponentiation algorithms.

    Looking into the code, GMP already implements every trick they have. This includes the things I mentioned (Montgomery reduction, and using a power-of-2 base higher than 2 to chew up more exponent bits per loop iteration). And others I didn't mention (e.g., GMP has a special internal routine for squaring, which saves cycles over a general product of possibly unequal integers). In all, it's a small mountain of implementation code.

    I expect that's why you're not getting more answers: GMP is already doing, at worst, close to the best anyone has ever figured out how to do. The speedup for you isn't truly dramatic because, as already noted, the integers you're using are actually on the small side.

    So if you need to pursue this, using GMP is likely the fastest you're going to get. As noted, multiprocessing is an obvious way to get a theoretical N-fold speedup with N processors, but as also noted you haven't said anything about the context (where these inputs come from or what you need to do with the outputs). So it's impossible to guess whether that might pay off for you. The more interprocess communication you need, the more that damages potential multiprocessing speedups.
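
    For reference, the gmpy2 spelling (powmod() is gmpy2's counterpart of the built-in three-argument pow; converting the fixed exponent and modulus to mpz once up front avoids repeating that conversion on every call):

    from gmpy2 import mpz, powmod

    e = mpz(65537)                # placeholder fixed exponent
    m = mpz((1 << 255) - 19)      # placeholder ~255-bit modulus
    assert powmod(mpz(12345), e, m) == pow(12345, 65537, (1 << 255) - 19)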

    Note: what you're doing is exactly what, e.g., RSA public-key cryptosystems do, although they generally use larger integers. That is, your "base" is their "message", and a public (or private) RSA key consists of a fixed exponent and fixed modulus. Only the base (message or encrypted bits) varies across en/de-cryption instances. For a given key, the exponent and modulus are always the same.

Many world-class mathematicians have studied the problem, and world-class hackers coded the algorithms for peak speed. That's why you should give up hope that there's a faster way HAC just forgot to mention ;-)

    Speculative

Drawing the connection to RSA reminded me: RSA decryption, in practice, does not proceed in "the obvious" way. Instead, the holder of the private key knows the prime factorization of the key's modulus (in RSA, the modulus is the product of two distinct, but secret, large primes), and that can be used to significantly speed exponentiation with respect to that modulus.

So (I can't guess) if the way you get your modulus instances lets you compute their prime factorizations efficiently, that can be used to get significant speedups when they're composite.

Not so much for a prime modulus, though. About the only potentially valuable trick then is that, for p prime and a not divisible by p,

    pow(a, b, p) == pow(a, b % (p-1), p)
    

    That can save unbounded time if b can be much larger than p. It works because, by Fermat's little theorem,

    pow(a, p-1, p) == 1
    

    for p prime and a not divisible by p. For example,

    >>> p
    2347
    >>> assert all(p % i != 0 for i in range(2, p))  # that is, p is prime
    >>> pow(345, 1000000000000000000000000000000, p)
    301
    >>> 1000000000000000000000000000000 % (p-1)
    1198
    >>> pow(345, 1198, p) # same thing, but much faster
    301
    

    For a composite modulus, much the same is done for each of its prime power factors, and then the results are pasted together via the Chinese remainder theorem.

    If you think your problem can be worked to exploit that, search for "modular exponentiation Chinese remainder" to find a number of good expositions.
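
    As a concrete sketch, for a modulus with known factorization m = p * q (p, q distinct primes, base divisible by neither - the same structure RSA decryption exploits; the names here are illustrative):

    def crt_pow(base, exponent, p, q):
        # Exponentiate modulo each prime, with Fermat-reduced exponents...
        rp = pow(base % p, exponent % (p - 1), p)
        rq = pow(base % q, exponent % (q - 1), q)
        # ...then paste the pieces together via CRT (Garner's form).
        h = (rp - rq) * pow(q, -1, p) % p    # pow(x, -1, n): Python 3.8+
        return rq + q * h

    p, q = 10**9 + 7, 10**9 + 9    # two well-known primes
    assert crt_pow(12345, 10**30, p, q) == pow(12345, 10**30, p * q)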