c++ c performance low-level-code

Fastest way to spread 4 bytes into 8 bytes (32bit -> 64bit)


Assume you have a 32-bit unsigned integer whose bytes are laid out like this: a b c d. What is the fastest way to spread these bytes into a 64-bit unsigned integer in this fashion: 0 a 0 b 0 c 0 d? The target is the x86-64 architecture. I would like to know the fastest approach that does not use special intrinsics, although an intrinsic-based solution would also be interesting. (I say 'fastest', but compact solutions with reasonable performance are also welcome.)

Edit for people who want context: this seems like a really easy task, just shifting some bytes around, yet it requires more instructions than you'd think (check this godbolt with optimizations). So I wonder whether anyone knows a way to solve the problem with fewer instructions.
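For reference, a straightforward byte-by-byte version of the transformation might look like the sketch below. This is not the code behind the godbolt link, which is not reproduced here; the function name `spread_bytes_naive` is made up for illustration, and the `constexpr` loop assumes C++14 or later.

```cpp
#include <cstdint>

// Hypothetical baseline: move each input byte i (0 = least significant)
// to byte position 2*i of the 64-bit result, leaving zero bytes in between.
constexpr std::uint64_t spread_bytes_naive(std::uint32_t v) {
    std::uint64_t out = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t byte = (v >> (8 * i)) & 0xffu;  // extract byte i
        out |= byte << (16 * i);                      // place it at byte 2*i
    }
    return out;
}
```

Any faster variant can be checked against a baseline like this one.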


Solution

  • uint64_t x = ...;
    // 0 0 0 0 a b c d
    x |= x << 16;
    // 0 0 a b ? ? c d  (the ? bytes are leftovers, masked off in the next step)
    x = ((x << 8) & 0x00ff000000ff0000) | (x & 0x000000ff000000ff);
    // 0 a 0 b 0 c 0 d
    

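    And for completeness, modern x86 processors with the BMI2 extension can do this with a single instruction, pdep (exposed through <immintrin.h>). It is fast on Intel since Haswell and on AMD since Zen 3, but slow on earlier AMD cores:

    x = _pdep_u64(x, 0x00ff00ff00ff00ff);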
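Both approaches above can be wrapped into functions roughly as follows. This is a minimal sketch: the names `spread_bytes` and `spread_bytes_pdep` are made up for illustration, the `constexpr` body assumes C++14 or later, and the `__BMI2__` guard assumes GCC or Clang, which define that macro when built with `-mbmi2` or a suitable `-march`.

```cpp
#include <cstdint>

// Portable version: two shift/mask steps, no intrinsics.
constexpr std::uint64_t spread_bytes(std::uint32_t v) {
    std::uint64_t x = v;                        // 0 0 0 0 a b c d
    x |= x << 16;                               // 0 0 a b ? ? c d
    return ((x << 8) & 0x00ff000000ff0000ULL)   // 0 a 0 0 0 c 0 0
         | (x & 0x000000ff000000ffULL);         // 0 0 0 b 0 0 0 d
}

// Compile-time check of the byte layout.
static_assert(spread_bytes(0xAABBCCDDu) == 0x00AA00BB00CC00DDULL,
              "bytes should land in every other position");

#if defined(__BMI2__)
#include <immintrin.h>

// BMI2 version: pdep deposits the low bits of the source into the
// set-bit positions of the mask, from least to most significant.
inline std::uint64_t spread_bytes_pdep(std::uint32_t v) {
    return _pdep_u64(v, 0x00ff00ff00ff00ffULL);
}
#endif
```

The explicit parentheses in the portable version do not change the result, since `<<` binds tighter than `&` and `&` tighter than `|`, but they avoid compiler precedence warnings. Which of the two is faster depends on the CPU, so measure on the target hardware.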