Hello, I'm working on yet another arbitrary-precision integer library. I wanted to implement multiplication, but I got stuck when _m_pmulhw in <mmintrin.h> just didn't work. There is very little documentation on MMX instructions. When I test it by multiplying two UINT64_MAXs, it just gives me gibberish.
#include <bitset>
#include <cstdint>
#include <iostream>
#include <mmintrin.h>

uint_fast64_t mulH(const uint_fast64_t &a, const uint_fast64_t &b) {
    return (uint_fast64_t)_m_pmulhw((__m64)a,(__m64)b);
}
uint_fast64_t mulL(const uint_fast64_t &a, const uint_fast64_t &b) {
    return (uint_fast64_t)_m_pmullw((__m64)a,(__m64)b);
}
int main() {
    uint64_t a = UINT64_MAX;
    uint64_t b = UINT64_MAX;
    std::cout << std::bitset<64>(mulH(a,b)) << std::bitset<64>(mulL(a,b));
}
output: 00000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000100000000000000010000000000000001
I don't know why it's not working. I have an A6-4400M APU, and coreinfo's output is:
MMX * Supports MMX instruction set
So I think I can say the instruction set is supported. If anyone can give me some tips on how to make this work, thanks.
Compiler: gcc
IDE: visual studio code
I think you misunderstood what _m_pmulhw does. It's actually very clearly documented in Intel's Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_m_pmulhw&expand=4340. The corresponding instruction is pmulhw, which is also clearly documented in e.g. Felix Cloutier's x86 instruction reference: https://www.felixcloutier.com/x86/pmulhw
It multiplies four pairs of signed 16-bit integers which are packed inside the two operands, and then produces the high half of each of the four products (Packed Multiply High - Word). This means that, for inputs 0x12345678abcdef01 and 0x9876543210fedcba, it would multiply 0x1234 * 0x9876, 0x5678 * 0x5432, 0xabcd * 0x10fe, and 0xef01 * 0xdcba, and pack the high 16 bits of each 32-bit result into the output.
For your example, you're multiplying 0xffff * 0xffff four times, each time producing the 32-bit result 0x00000001 (-1 * -1, since this is a signed 16-bit multiply). You therefore get 0x0000000000000000 from _m_pmulhw (the high 16 bits of each product) and 0x0001000100010001 from _m_pmullw (the low 16 bits of each product), which is exactly what you see in the bitset output.
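To make the lane-by-lane behaviour concrete, here is a minimal scalar sketch that models what pmulhw and pmullw compute, without using any intrinsics. The names packed_mul16, mulhi_lanes, mullo_lanes and check_examples are made up for this illustration and are not part of any library. It also assumes the usual two's complement conversion from uint16_t to int16_t, which GCC provides.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of pmulhw / pmullw (illustrative only, hypothetical names):
// treat each 64-bit operand as four signed 16-bit lanes, multiply lane by
// lane, and keep either the high or the low 16 bits of each 32-bit product.
inline std::uint64_t packed_mul16(std::uint64_t a, std::uint64_t b, bool high_half) {
    std::uint64_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = 16 * lane;
        // Extract the lane and reinterpret it as a signed 16-bit value.
        const auto x = static_cast<std::int16_t>(static_cast<std::uint16_t>(a >> shift));
        const auto y = static_cast<std::int16_t>(static_cast<std::uint16_t>(b >> shift));
        // Signed 16x16 -> 32-bit product, viewed as raw bits.
        const auto product = static_cast<std::uint32_t>(std::int32_t{x} * std::int32_t{y});
        const std::uint32_t half = high_half ? (product >> 16) : (product & 0xffffu);
        result |= static_cast<std::uint64_t>(half) << shift;
    }
    return result;
}

inline std::uint64_t mulhi_lanes(std::uint64_t a, std::uint64_t b) { return packed_mul16(a, b, true); }
inline std::uint64_t mullo_lanes(std::uint64_t a, std::uint64_t b) { return packed_mul16(a, b, false); }

// Reproduces the values from the question: every lane is -1 * -1 = 1.
inline void check_examples() {
    assert(mulhi_lanes(UINT64_MAX, UINT64_MAX) == 0x0000000000000000ULL);
    assert(mullo_lanes(UINT64_MAX, UINT64_MAX) == 0x0001000100010001ULL);
}
```

Since no carries cross a lane boundary, these two results cannot be stitched together into a 64x64 product.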
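If you're looking for a multiply with a 128-bit result, there isn't actually a widely available intrinsic for that (except _mulx_u64, but that uses the newer mulx instruction from the BMI2 extension, which isn't that widespread). Microsoft has the non-standard _umul128 intrinsic (and _mul128 for signed operands), but on other platforms you can just use an __int128 type (or the local equivalent) to get a 64x64=>128 bit multiply. With gcc on a 64-bit target, unsigned __int128 is available as a compiler extension.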
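As a rough sketch of the __int128 route, assuming GCC or Clang on a 64-bit target (unsigned __int128 is a compiler extension, not standard C++), something like the following returns both halves of the full product. U128Parts and mul_64x64 are names invented for this example.

```cpp
#include <cstdint>

// Hypothetical helper type: the two 64-bit halves of a 128-bit product.
struct U128Parts {
    std::uint64_t high;
    std::uint64_t low;
};

// Full 64x64 => 128-bit unsigned multiply using the GCC/Clang
// unsigned __int128 extension (assumed to be available on this target).
inline U128Parts mul_64x64(std::uint64_t a, std::uint64_t b) {
    const unsigned __int128 product = static_cast<unsigned __int128>(a) * b;
    return { static_cast<std::uint64_t>(product >> 64),
             static_cast<std::uint64_t>(product) };
}
```

For a = b = UINT64_MAX this gives high = 0xfffffffffffffffe and low = 1, since (2^64 - 1)^2 = 2^128 - 2^65 + 1. On x86-64 the compiler will typically turn this into a single mul instruction.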
Also, I'd seriously recommend using the SSE instruction sets (and their AVX successors) rather than the older MMX set; these instructions are faster in most cases and let you operate on much wider vector types (128-bit with SSE, 256-bit with AVX2, which is standard now, and 512-bit with AVX-512 where available), which can provide a significant speed boost.