delphi x86-64 basm

Checking parameters of multiplication by constant in 64 bit


For my BigInteger code, conversion to a decimal string turned out to be slow for very large BigIntegers. So now I use a recursive divide-and-conquer algorithm, which still needs 2'30" to convert the currently largest known prime to a decimal string of more than 22 million digits (but only 135 ms to turn it into a hexadecimal string).
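
To illustrate the idea (just a sketch, not my actual BigInteger code), the same divide-and-conquer scheme looks roughly like this in C for a plain UInt64: split at a power of ten near the middle of the digit count, convert both halves recursively, and zero-pad the low half.

#include <stdint.h>
#include <stdio.h>

/* Sketch only: divide-and-conquer conversion of a uint64_t to decimal.
   The real BigInteger version splits at a big power of ten instead. */
static void to_dec(uint64_t x, int min_digits, char *out)
{
    int digits = 0;
    for (uint64_t t = x; t != 0 || digits == 0; t /= 10) digits++;
    if (digits < min_digits) digits = min_digits;   // zero-padding for the low half

    if (digits == 1) {                              // base case: one digit
        out[0] = (char)('0' + x);
        out[1] = '\0';
        return;
    }

    int lo_digits = digits / 2;
    uint64_t pow10 = 1;
    for (int i = 0; i < lo_digits; i++) pow10 *= 10;

    to_dec(x / pow10, digits - lo_digits, out);               // high half
    to_dec(x % pow10, lo_digits, out + (digits - lo_digits)); // low half, padded
}

int main(void)
{
    char buf[32];
    to_dec(18446744073709551615ull, 1, buf);
    printf("%s\n", buf);                            // 18446744073709551615
    return 0;
}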

I still want to reduce the time, so I need a routine that can divide a NativeUInt (i.e. UInt32 on 32 bit platforms, UInt64 on 64 bit platforms) by 100 very fast. So I use multiplication by a constant. This works fine in 32 bit code, but I am not 100% sure it is correct in 64 bit code.

So my question: is there a way to check the reliability of the results of multiplication by constant for unsigned 64 bit values? I checked the 32 bit values by simply trying with all values of UInt32 (0..$FFFFFFFF). This took approx. 3 minutes. Checking all UInt64s would take much longer than my lifetime. Is there a way to check if the parameters used (constant, post-shift) are reliable?
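
For reference, that exhaustive 32-bit check looks roughly like this (sketched in C rather than Delphi; the constants are the checked 32-bit ones from the code below):

#include <stdint.h>
#include <stdio.h>

/* Sketch of the exhaustive 32-bit check: compare (x * M) shifted right by
   32 + 5 bits against a real division by 100, for every UInt32 value. */
int main(void)
{
    const uint64_t M = 0x1FFFFFFFFFull / 100 + 1;   // Div100Const, 32-bit case
    const int postshift = 5;                        // Div100PostShift
    for (uint64_t x = 0; x <= 0xFFFFFFFFull; x++) {
        uint32_t q = (uint32_t)((x * M) >> 32 >> postshift);
        if (q != (uint32_t)(x / 100)) {
            printf("mismatch at %llu\n", (unsigned long long)x);
            return 1;
        }
    }
    printf("all UInt32 values OK\n");
    return 0;
}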

I noticed that DivMod100() always failed for a value like $4000004B if the chosen parameters were wrong (but close). Are there special values or ranges to check for 64 bit, so I don't have to check all values?

My current code:

const
{$IF DEFINED(WIN32)}
  // Checked
  Div100Const = UInt32(UInt64($1FFFFFFFFF) div 100 + 1);
  Div100PostShift = 5;
{$ELSEIF DEFINED(WIN64)}
  // Unchecked!!
  Div100Const = $A3D70A3D70A3D71; 
  // UInt64(UInt128($3 FFFF FFFF FFFF FFFF) div 100 + 1); 
  // UInt128 is fictive type.
  Div100PostShift = 2;
{$IFEND}

// Calculates X div 100 using multiplication by a constant, taking the
// high part of the 64 bit (or 128 bit) result and shifting
// right. The remainder is calculated as X - quotient * 100;
// This was tested to work safely and quickly for all values of UInt32.
function DivMod100(var X: NativeUInt): NativeUInt;
{$IFDEF WIN32}
asm
        // EAX = address of X, X is UInt32 here.
        PUSH    EBX
        MOV     EDX,Div100Const
        MOV     ECX,EAX
        MOV     EAX,[ECX]
        MOV     EBX,EAX
        MUL     EDX
        SHR     EDX,Div100PostShift
        MOV     [ECX],EDX       // Quotient

        // Slightly faster than MUL

        LEA     EDX,[EDX + 4*EDX] // EDX := EDX * 5;
        LEA     EDX,[EDX + 4*EDX] // EDX := EDX * 5;
        SHL     EDX,2             // EDX := EDX * 4; 5*5*4 = 100.

        MOV     EAX,EBX
        SUB     EAX,EDX         // Remainder
        POP     EBX
end;
{$ELSE WIN64}
asm
        .NOFRAME

        // RCX is address of X, X is UInt64 here.
        MOV     RAX,[RCX]
        MOV     R8,RAX
        XOR     RDX,RDX
        MOV     R9,Div100Const
        MUL     R9
        SHR     RDX,Div100PostShift
        MOV     [RCX],RDX      // Quotient

        // Faster than LEA and SHL

        MOV     RAX,RDX
        MOV     R9D,100
        MUL     R9
        SUB     R8,RAX
        MOV     RAX,R8         // Remainder
end;
{$ENDIF WIN32}

Solution

  • As usual when writing optimized code, use compiler output for hints / starting points. It's reasonable to assume any optimization it makes is correct in the general case; wrong-code compiler bugs are rare.

    gcc implements unsigned 64bit divmod with a constant of 0x28f5c28f5c28f5c3. I haven't looked in detail into generating constants for division, but there are algorithms for generating them that will give known-good results (so exhaustive testing isn't needed).
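
    One classic check of this kind (sketched here in C; this is not gcc's actual code) is the sufficient condition behind Granlund/Montgomery-style reciprocal division: the pair (m, s) is safe for every N-bit dividend whenever 2^(N+s) <= m*d <= 2^(N+s) + 2^s. Testing that needs only one wide multiplication:

    #include <stdint.h>
    #include <stdio.h>

    /* If 2^(N+s) <= m*d <= 2^(N+s) + 2^s, then (x*m) >> (N+s) == x/d for all
       0 <= x < 2^N.  Passing this test proves the parameters correct without
       exhaustive testing; failing it merely leaves them unproven.
       Uses the gcc/clang __uint128_t extension for the 64-bit case. */
    static int passes_check(unsigned N, uint64_t m, uint64_t d, unsigned s)
    {
        __uint128_t target = (__uint128_t)1 << (N + s);   // 2^(N+s)
        __uint128_t md     = (__uint128_t)m * d;
        return md >= target && md - target <= ((__uint128_t)1 << s);
    }

    int main(void)
    {
        // 32-bit parameters from the question: m = $1FFFFFFFFF div 100 + 1, shift 5
        uint64_t m32 = 0x1FFFFFFFFFull / 100 + 1;
        printf("32 bit: %s\n", passes_check(32, m32, 100, 5) ? "proven safe" : "not proven safe");

        // 64-bit candidate from the question: m = $0A3D70A3D70A3D71, shift 2
        uint64_t m64 = 0x0A3D70A3D70A3D71ull;
        printf("64 bit: %s\n", passes_check(64, m64, 100, 2) ? "proven safe" : "not proven safe");
        return 0;
    }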

    The code gcc generates actually has some important differences from yours: it uses a different constant, and uses it differently.

    What it is actually doing: divide by 4 first, so it can use a constant that only works for dividing by 25 when the dividend is small enough. This also avoids needing an add at all later on. (A portable C restatement of this follows the asm listing below.)

    #include <stdint.h>
    
    // remainder, quotient member order; putting the quotient first would cost an extra mov
    struct divmod { uint64_t remainder, quotient; }
    div_by_100(uint64_t x) {
        struct divmod retval = { x % 100, x / 100 };
        return retval;
    }
    

    compiles to (gcc 5.3 -O3 -mtune=haswell):

        movabs  rdx, 2951479051793528259
        mov     rax, rdi            ; Function arg starts in RDI (SysV ABI)
        shr     rax, 2
        mul     rdx
        shr     rdx, 2
        lea     rax, [rdx+rdx*4]    ; multiply by 5
        lea     rax, [rax+rax*4]    ; multiply by another 5
        sal     rax, 2              ; imul rax, rdx, 100 is better here (Intel SnB).
        sub     rdi, rax
        mov     rax, rdi
        ret
    ; return values in rdx:rax (remainder in rax, quotient in rdx)
    

    Use the compiler explorer's "binary" option to see the constant in hex, since disassembler output shows it that way, unlike gcc's asm source output (which prints it in decimal, as above).
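
    For reference, here is what that asm computes, written back as portable C (a sketch that relies on the gcc/clang __uint128_t extension): divide by 4 first, then multiply by the constant to divide by 25.

    #include <stdint.h>
    #include <stdio.h>

    /* Portable restatement of the gcc output above: x/100 evaluated as
       (x/4)/25.  The pre-shift keeps the dividend below 2^62, which is what
       makes the divide-by-25 constant safe. */
    static uint64_t div100_like_gcc(uint64_t x)
    {
        const uint64_t M25 = 0x28F5C28F5C28F5C3ull;                     // the movabs constant
        uint64_t hi = (uint64_t)(((__uint128_t)(x >> 2) * M25) >> 64);  // mul: high 64 bits
        return hi >> 2;                                                 // the final shr rdx, 2
    }

    int main(void)
    {
        uint64_t x = 0xFFFFFFFFFFFFFFFFull;
        uint64_t q = div100_like_gcc(x);
        printf("%llu %llu\n", (unsigned long long)q,
               (unsigned long long)(x - q * 100));   // quotient and remainder
        return 0;
    }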


    The multiply-by-100 part.

    gcc uses the above sequence of lea/lea/shl, the same as your 32-bit code. Your 64-bit code uses a mov imm/mul sequence instead.

    The comments in your code each claim that the version used there is faster. If so, it's because of some subtle instruction alignment or another secondary effect: on Intel SnB-family, it's the same number of uops (3) and the same critical-path latency (the mov imm is off the critical path, and mul is 3 cycles).

    clang uses what I think is the best bet (imul rax, rdx, 100). I thought of it before I saw that clang chose it, not that that matters. That's 1 fused-domain uop (which can execute only on p1), still with 3c latency. So if you're latency-bound using this routine for multi-precision arithmetic, it probably won't help, but it is the best choice. (If you're latency-bound, inlining your code into a loop instead of passing one of the parameters through memory could save a lot of cycles.)

    imul works because you're only using the low 64b of the result. There's no 2 or 3 operand form of mul because the low half of the result is the same regardless of signed or unsigned interpretation of the inputs.
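
    A quick way to see that (a standalone sketch, again using __uint128_t): compute the full product under both interpretations and compare the two halves.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t a = 0xDEADBEEFCAFEBABEull;   // "negative" if read as signed
        uint64_t b = 100;
        __uint128_t pu = (__uint128_t)a * b;                  // unsigned 64x64->128
        __int128    ps = (__int128)(int64_t)a * (int64_t)b;   // signed 64x64->128
        // The low 64 bits agree; the high 64 bits do not:
        printf("low  halves equal: %d\n", (uint64_t)pu == (uint64_t)ps);
        printf("high halves equal: %d\n",
               (uint64_t)(pu >> 64) == (uint64_t)((__uint128_t)ps >> 64));
        return 0;
    }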

    BTW, clang with -march=native uses mulx for the 64x64->128, instead of mul, but doesn't gain anything by it. According to Agner Fog's tables, it's one cycle worse latency than mul.


    AMD has worse than 3c latency for imul r,r,i (esp. the 64b version), which is maybe why gcc avoids it. IDK how much work gcc maintainers put into tweaking costs so settings like -mtune=haswell work well, but a lot of code isn't compiled with any -mtune setting (even one implied by -march), so I'm not surprised when gcc makes choices that were optimal for older CPUs, or for AMD.

    clang still uses imul r64, r64, imm with -mtune=bdver1 (Bulldozer), which saves m-ops but at a cost of 1c latency more than using lea/lea/shl. (lea with a scale>1 is 2c latency on Bulldozer).