I am writing a code library in x86-64 assembly language to provide all conventional bitwise, shift, logical, compare, arithmetic and math functions for s0128, s0256, s0512, s1024, s2048, and s4096 signed-integer types and f0128, f0256, f0512, f1024, f2048, and f4096 floating-point types.
Now I am writing some type conversion routines, and have run into something that should be trivial but takes a lot more instructions than I would expect. I feel like I must be missing something (some instructions) to make this easier, but so far no luck.
The low 128 bits of the s0256 result are simply a copy of the s0128 input argument, and all the bits in the upper 128 bits of the s0256 result must be set to the most-significant bit of the s0128 input argument.
Simple, huh? But here is the best I can figure so far to convert s0128 to s0256. Ignore the first 4 lines (they're just argument error checks) and the last 2 lines (returning from the function with no error (rax == 0)). The 5 lines in the middle are the algorithm in question. Try to avoid [conditional] jump instructions.
.text
.align 64
big_m63:
.quad -63, -63 # two shift counts for vpshaq instruction
big_s0256_eq_s0128: # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
orq %rdi, %rdi # is arg0 a valid address ???
jz error_argument_invalid # nope
orq %rsi, %rsi # is arg1 a valid address ???
jz error_argument_invalid # nope
vmovapd (%rsi), %xmm0 # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
vmovhlps %xmm0, %xmm0, %xmm1 # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
vpshaq big_m63, %xmm1, %xmm1 # ymm1 = arg1.sign : arg1.sign : 0 : 0
vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm0 = arg1.ls64 : arg1.ms64 : sign : sign
vmovapd %ymm0, (%rdi) # arg0 = arg1 (sign-extended to 256-bits)
xorq %rax, %rax # rax = 0 == no error
ret # return from function
This routine is also non-optimal in that every instruction requires the result of the previous instruction, which prevents parallel execution of any instructions.
Is there a better instruction to right-shift with sign extension? I cannot find an instruction like vpshaq that accepts an immediate byte to specify the shift count, though I don't know why (many SIMD instructions have immediate 8-bit operands for various purposes). Also, Intel does not support vpshaq. Oops!
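For what it's worth, the much newer AVX-512VL extension (far beyond the baseline AVX I'm targeting) apparently adds exactly this: an EVEX-encoded arithmetic right shift of packed quadwords that accepts an immediate shift count, which would remove both the shift-count constant and the Intel/AMD portability problem. A sketch, only usable on AVX-512 hardware:
vpsraq $63, %xmm1, %xmm1 # xmm1 = arg1.sign : arg1.sign (requires AVX-512F + AVX-512VL)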
But look! StephenCanon has a brilliant solution to this problem below! Awesome! That solution has one more instruction than the above, but the vpxor instruction can be placed after the first vmovapd instruction and should effectively take no more cycles than the 5-instruction version above. Bravo!
For completeness and easy comparison, here is the code with the latest StephenCanon enhancement:
.text
.align 64
big_s0256_eq_s0128: # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
orq %rdi, %rdi # is arg0 a valid address ???
jz error_argument_invalid # nope
orq %rsi, %rsi # is arg1 a valid address ???
jz error_argument_invalid # nope
vmovapd (%rsi), %xmm0 # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
vpxor %xmm2, %xmm2, %xmm2 # ymm2 = 0 : 0 : 0 : 0
vmovhlps %xmm0, %xmm0, %xmm1 # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
vpcmpgtq %xmm1, %xmm2, %xmm1 # ymm1 = arg1.sign : arg1.sign : 0 : 0
vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm0 = arg1.ls64 : arg1.ms64 : sign : sign
vmovapd %ymm0, (%rdi) # arg0 = arg1 (sign-extended to 256-bits)
xorq %rax, %rax # rax = 0 == no error
ret # return from function
I'm not certain, but not needing to read those two 64-bit shift-counts from memory might also speed the code up slightly. Nice.
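For anyone trying this routine out, a hypothetical caller might look like the following (the label names are invented for the example; the calling convention is the System V AMD64 one implied by the code: %rdi = s0256* destination, %rsi = const s0128* source). Note that the aligned moves (vmovapd) will fault unless the source is 16-byte aligned and the destination is 32-byte aligned:
.bss
.align 32
dst256: .skip 32 # s0256 result buffer (32-byte aligned for the vmovapd %ymm0 store)
.align 16
src128: .skip 16 # s0128 input buffer (16-byte aligned for the vmovapd load)
.text
test_caller:
leaq dst256(%rip), %rdi # arg0 = &dst256
leaq src128(%rip), %rsi # arg1 = &src128
call big_s0256_eq_s0128 # dst256 = sign-extended src128
ret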
You're over-complicating things. Once you have the sign in rax, just do two 64b stores from there instead of trying to assemble the result in ymm0. One less instruction and a much shorter dependency chain.
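Concretely, that looks something like this (a sketch, untested, using the same argument registers as your routine; the register choices are just one option):
movq 8(%rsi), %rax # rax = arg1.ms64 (holds the sign bit)
movq (%rsi), %rdx # rdx = arg1.ls64
movq %rdx, (%rdi) # arg0 bits 0..63 = arg1.ls64
movq %rax, 8(%rdi) # arg0 bits 64..127 = arg1.ms64
sarq $63, %rax # rax = 0 or -1 = the sign, ready to store
movq %rax, 16(%rdi) # two 64-bit stores of the sign fill...
movq %rax, 24(%rdi) # ...complete the upper 128 bits of arg0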
As the destination type gets larger, of course, it makes sense to use wider stores (AVX). With AVX2 you can use vpbroadcastq to do the splat more efficiently, but it looks like you're targeting baseline AVX?
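For example, the splat might look something like this (a rough sketch, not tuned; building the sign word with a GPR sar and storing the two halves separately are just one way to do it):
vmovdqa (%rsi), %xmm0 # xmm0 = arg1.ls64 : arg1.ms64
movq 8(%rsi), %rax # rax = arg1.ms64
sarq $63, %rax # rax = 0 or -1 (the sign fill)
vmovq %rax, %xmm1 # xmm1 = sign : 0
vpbroadcastq %xmm1, %xmm1 # xmm1 = sign : sign (the AVX2 splat)
vmovdqa %xmm0, (%rdi) # low 128 bits of the s0256
vmovdqa %xmm1, 16(%rdi) # upper 128 bits = sign extension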
I should also note that once you get to ~512b integers, for most algorithms the cost of super-linear operations like multiplication so completely dominates the running time that squeezing every last cycle out of operations like sign extension rapidly starts to lose value. It's a good exercise, but ultimately not the most productive use of your time once your implementations are "good enough".
After further thought, I have the following suggestion:
vmovhlps %xmm0, %xmm0, %xmm1 // could use a permute instead to stay in integer domain.
vpxor %xmm2, %xmm2, %xmm2
vpcmpgtq %xmm1, %xmm2, %xmm2 // generate sign-extension without shift
This has the virtues of (a) not requiring a constant load and (b) working on both Intel and AMD. The xor to generate zero looks like an extra instruction, but in practice this zeroing idiom doesn’t even require an execute slot on recent processors.
FWIW, if targeting AVX2, I might write it like this:
vmovdqa (%rsi), %xmm0 // { x0, x1, 0, 0 }
vpermq $0x5f, %ymm0, %ymm1 // { 0, 0, x1, x1 }
vpxor %ymm2, %ymm2, %ymm2 // { 0, 0, 0, 0 }
vpcmpgtq %ymm1, %ymm2, %ymm2 // { 0, 0, s, s } s = sign extension
vpor %ymm2, %ymm0, %ymm0 // { x0, x1, s, s }
vmovdqa %ymm0, (%rdi)
Unfortunately, I don’t think that vpermq is available on AMD.