cperformanceloopsmoduluslong-long

Accelerating performance of loop performing unsigned long long modulo operation


I need to perform a many operations of finding remainders of the division unsigned long long number by the 16-bit modulus:

unsigned long long largeNumber;
long residues[100];
unsigned long modules[100];
intiModules(modules); //set different 16-bit values

for(int i = 0; i < 100; i++){
     residues[i] = largeNumber % modules[i];
}

How I can accelerate this loop?

The iteration count is not large (32-128), but this loop performed very often so its speed is critical.


Solution

  • Division by a constant (and there are only 65536 of them) can be performed by multiplication of the reciprocal followed/preceded by some fine-tuning. Since this method is accurate for a limited range, one can use some techniques to reduce the 64-bit operand to a much smaller value (which is still congruent to the original value):

    // pseudo code -- not c
    a = 0x1234567890abcdefULL;
    a = 0x1234 << 48 + 0x5678 << 32 + 0x90ab << 16 + 0xcdef;
    
    a % N === ((0x1234 * (2^48 % N) +     // === means 'is congruent'
               (0x5678 * (2^32 % N)) +    // ^ means exponentation
               (0x90ab * (2^16 % N)) + 
               (0xcdef * 1)) % N;
    

    The intermediate value can be calculated with (small) multiplications only and the final remainder (%N) has potential to be calculated with reciprocal multiplication.