For my BigInteger code, output turned out to be slow for very large BigIntegers. So now I use a recursive divide-and-conquer algorithm, which still needs 2'30" to convert the currently largest known prime to a decimal string of more than 22 million digits (but only 135 ms to turn it into a hexadecimal string).
I still want to reduce the time, so I need a routine that can divide a NativeUInt (i.e. UInt32 on 32 bit platforms, UInt64 on 64 bit platforms) by 100 very fast. So I use multiplication by constant. This works fine in 32 bit code, but I am not 100% sure for 64 bit.
So my question: is there a way to check the reliability of the results of multiplication by constant for unsigned 64 bit values? I checked the 32 bit values by simply trying with all values of UInt32 (0..$FFFFFFFF). This took approx. 3 minutes. Checking all UInt64s would take much longer than my lifetime. Is there a way to check if the parameters used (constant, post-shift) are reliable?
I noticed that DivMod100()
always failed for a value like $4000004B
if the chosen parameters were wrong (but close). Are there special values or ranges to check for 64 bit, so I don't have to check all values?
My current code:
const
{$IF DEFINED(WIN32)}
// Checked
Div100Const = UInt32(UInt64($1FFFFFFFFF) div 100 + 1);
Div100PostShift = 5;
{$ELSEIF DEFINED(WIN64)}
// Unchecked!!
Div100Const = $A3D70A3D70A3D71;
// UInt64(UInt128($3 FFFF FFFF FFFF FFFF) div 100 + 1);
// UInt128 is fictive type.
Div100PostShift = 2;
{$IFEND}
// Calculates X div 100 using multiplication by a constant, taking the
// high part of the 64 bit (or 128 bit) result and shifting
// right. The remainder is calculated as X - quotient * 100;
// This was tested to work safely and quickly for all values of UInt32.
function DivMod100(var X: NativeUInt): NativeUInt;
{$IFDEF WIN32}
asm
// EAX = address of X, X is UInt32 here.
PUSH EBX
MOV EDX,Div100Const
MOV ECX,EAX
MOV EAX,[ECX]
MOV EBX,EAX
MUL EDX
SHR EDX,Div100PostShift
MOV [ECX],EDX // Quotient
// Slightly faster than MUL
LEA EDX,[EDX + 4*EDX] // EDX := EDX * 5;
LEA EDX,[EDX + 4*EDX] // EDX := EDX * 5;
SHL EDX,2 // EDX := EDX * 4; 5*5*4 = 100.
MOV EAX,EBX
SUB EAX,EDX // Remainder
POP EBX
end;
{$ELSE WIN64}
asm
.NOFRAME
// RCX is address of X, X is UInt64 here.
MOV RAX,[RCX]
MOV R8,RAX
XOR RDX,RDX
MOV R9,Div100Const
MUL R9
SHR RDX,Div100PostShift
MOV [RCX],RDX // Quotient
// Faster than LEA and SHL
MOV RAX,RDX
MOV R9D,100
MUL R9
SUB R8,RAX
MOV RAX,R8 // Remainder
end;
{$ENDIF WIN32}
As usual when writing optimized code, use compiler output for hints / starting points. It's safe to assume any optimization it makes is safe in the general case. Wrong-code compiler bugs are rare.
gcc implements unsigned 64bit divmod with a constant of 0x28f5c28f5c28f5c3
. I haven't looked in detail into generating constants for division, but there are algorithms for generating them that will give known-good results (so exhaustive testing isn't needed).
The code actually has a few important differences: it uses the constant differently from the OP's constant.
See the comments for an analysis of what this is is actually doing: divide by 4 first, so it can use a constant that only works for dividing by 25 when the dividend is small enough. This also avoids needing an add at all, later on.
#include <stdint.h>
// rem, quot ordering takes one extra instruction
struct divmod { uint64_t quotient, remainder; }
div_by_100(uint64_t x) {
struct divmod retval = { x%100, x/100 };
return retval;
}
compiles to (gcc 5.3 -O3 -mtune=haswell
):
movabs rdx, 2951479051793528259
mov rax, rdi ; Function arg starts in RDI (SysV ABI)
shr rax, 2
mul rdx
shr rdx, 2
lea rax, [rdx+rdx*4] ; multiply by 5
lea rax, [rax+rax*4] ; multiply by another 5
sal rax, 2 ; imul rax, rdx, 100 is better here (Intel SnB).
sub rdi, rax
mov rax, rdi
ret
; return values in rdx:rax
Use the "binary" option to see the constant in hex, since disassembler output does it that way, unlike gcc's asm source output.
gcc uses the above sequence of lea/lea/shl, the same as in your question. Your answer is using a mov imm
/mul
sequence.
Your comments each say the version they chose is faster. If so, it's because of some subtle instruction alignment or other secondary effect: On Intel SnB-family, it's the same number of uops (3), and the same critical-path latency (mov imm
is off the critical path, and mul
is 3 cycles).
clang uses what I think is the best bet (imul rax, rdx, 100
). I thought of it before I saw that clang chose it, not that that matters. That's 1 fused-domain uop (which can execute on p0 only), still with 3c latency. So if you're latency-bound using this routine for multi-precision, it probably won't help, but it is the best choice. (If you're latency-bound, inlining your code into a loop instead of passing one of the parameters through memory could save a lot of cycles.)
imul
works because you're only using the low 64b of the result. There's no 2 or 3 operand form of mul
because the low half of the result is the same regardless of signed or unsigned interpretation of the inputs.
BTW, clang with -march=native
uses mulx
for the 64x64->128, instead of mul
, but doesn't gain anything by it. According to Agner Fog's tables, it's one cycle worse latency than mul
.
AMD has worse than 3c latency for imul r,r,i
(esp. the 64b version), which is maybe why gcc avoids it. IDK how much work gcc maintainers put into tweaking costs so settings like -mtune=haswell
work well, but a lot of code isn't compiled with any -mtune
setting (even one implied by -march
), so I'm not surprised when gcc makes choices that were optimal for older CPUs, or for AMD.
clang still uses imul r64, r64, imm
with -mtune=bdver1
(Bulldozer), which saves m-ops but at a cost of 1c latency more than using lea/lea/shl. (lea with a scale>1 is 2c latency on Bulldozer).