Tags: c++11, pointers, x86-64, canonical-form

Pointers to static variables must respect canonical form?


Assuming I have the following example:

#include <cstdint>

struct Dummy {
    uint64_t m{0llu};

    template < class T > static uint64_t UniqueID() noexcept {
        static const uint64_t uid = 0xBEA57;
        return reinterpret_cast< uint64_t >(&uid);
    }
    template < class T > static uint64_t BuildID() noexcept {
        static const uint64_t id = UniqueID< T >()
               // dummy bits for the sake of example (whole last byte is used)
               | (1llu << 60llu) | (1llu << 61llu) | (1llu << 63llu);
        return id;
    }
    // Copy bits 48 through 55 over to bits 56 through 63 to keep canonical form.
    uint64_t GetUID() const noexcept {
        return ((m & ~(0xFFllu << 56llu)) | ((m & (0xFFllu << 48llu)) << 8llu));
    }
    uint64_t GetPayload() const noexcept {
        return *reinterpret_cast< uint64_t * >(GetUID());
    }
};

template < class T > inline Dummy DummyID() noexcept {
    return Dummy{Dummy::BuildID< T >()};
}

Note that the resulting pointer is known to be the address of a static variable in the program.

When I call GetUID(), do I need to make sure that bit 47 is repeated through bit 63?

Or can I just AND with a mask of the lower 48 bits and ignore this rule?

I was unable to find any information about this, and I assume that those upper 16 bits are likely to always be 0.

This example is strictly limited to the x86-64 architecture (including the x32 ABI).


Solution

  • In user-space code for mainstream x86-64 OSes, you can normally assume that the upper bits of any valid address are zero.

    AFAIK, all the mainstream x86-64 OSes use a high-half kernel design where user-space addresses are always in the lower canonical range.

    If you wanted this code to work in kernel code, too, you would want to sign-extend with x <<= 16; x >>= 16; using signed int64_t x.
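    A minimal sketch of that, taking x <<= 16; x >>= 16 literally (the helper name is mine; a variant that avoids the signed left shift follows further below):

        #include <cstdint>

        // Re-canonicalize a tagged pointer by sign-extending from bit 47.
        // Works for both user-space (upper bits all 0) and kernel
        // (upper bits all 1) addresses.
        inline void* untag_signextend(uint64_t tagged) noexcept {
            int64_t x = static_cast<int64_t>(tagged);
            x <<= 16;   // push the 16 tag bits out the top
            x >>= 16;   // arithmetic shift copies bit 47 into bits 63:48
            return reinterpret_cast<void*>(x);
        }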


    If the compiler can't keep 0x0000FFFFFFFFFFFF = (1ULL<<48)-1 around in a register across multiple uses, 2 shifts might be more efficient anyway. (mov r64, imm64 to create that wide constant is a 10-byte instruction that can sometimes be slow to decode or fetch from the uop cache.) But if you're compiling with -march=haswell or newer, then BMI2 is available, so the compiler can do mov eax, 48 / bzhi rsi, rdi, rax. Either way, though, one AND or BZHI is only 1 cycle of critical-path latency for the pointer, vs. 2 for 2 shifts. Unfortunately BZHI isn't available with an immediate operand. (x86 bitfield instructions mostly suck compared to ARM or PowerPC.)
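    For the user-space-only case, zero-extension is a single AND (again just a sketch with a made-up helper name):

        #include <cstdint>

        // Assumes a user-space address: bits 63:48 of the real pointer are 0,
        // so clearing the tag bits restores the canonical address.
        inline void* untag_zeroextend(uint64_t tagged) noexcept {
            constexpr uint64_t kAddrMask = (1ULL << 48) - 1;  // low 48 bits
            return reinterpret_cast<void*>(tagged & kAddrMask);
        }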

    Your current method of extracting bits [55:48] and using them to replace the current bits [63:56] is probably slower, because the compiler has to mask out the old high byte and then OR in the new high byte. That's already at least 2 cycles of latency, so you might as well just shift, or mask, which can be faster.

    x86 has crap bitfield instructions so that was never a good plan. Unfortunately ISO C++ doesn't provide any guaranteed arithmetic right shift, but on all actual x86-64 compilers, >> on a signed integer is a 2's complement arithmetic shift. If you want to be really careful about avoiding UB, do the left shift on an unsigned type to avoid signed integer overflow.
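    That UB-careful variant could look like this (a sketch; the conversion of the shifted value back to int64_t is implementation-defined before C++20, but is 2's complement on all mainstream compilers):

        #include <cstdint>

        inline int64_t recanonicalize(uint64_t tagged) noexcept {
            // Left shift on the unsigned type avoids signed-overflow UB;
            // only the arithmetic right shift happens on the signed type.
            return static_cast<int64_t>(tagged << 16) >> 16;
        }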

    int64_t is guaranteed to be a 2's complement type with no padding if it exists.

    I think int64_t is actually a better choice than intptr_t, because if you have 32-bit pointers, e.g. the Linux x32 ABI (32-bit pointers in x86-64 long mode), your code might still Just Work, and casting a uint64_t to a pointer type will simply discard the upper bits. So it doesn't matter what you did to them, and zero-extension first will hopefully optimize away.

    So your uint64_t member would just end up storing a pointer in the low 32 and your tag bits in the high 32, somewhat inefficiently but still working. Maybe check sizeof(void*) in a template to select an implementation?
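    One possible shape for that dispatch (hypothetical names, just a sketch):

        #include <cstddef>
        #include <cstdint>

        template <std::size_t PtrSize = sizeof(void*)>
        struct Untag;  // select an implementation by pointer width

        template <>
        struct Untag<8> {                 // 64-bit pointers: re-canonicalize
            static void* Get(uint64_t m) noexcept {
                return reinterpret_cast<void*>(
                    static_cast<int64_t>(m << 16) >> 16);
            }
        };

        template <>
        struct Untag<4> {                 // 32-bit pointers (e.g. x32): truncate
            static void* Get(uint64_t m) noexcept {
                return reinterpret_cast<void*>(
                    static_cast<uintptr_t>(static_cast<uint32_t>(m)));
            }
        };

    GetPayload() could then defer to something like Untag<>::Get(m) and get the right behavior on either pointer width.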


    Future proofing

    x86-64 CPUs with 5-level page tables for 57-bit canonical addresses are probably coming at some point soonish, to allow use of large memory mapped non-volatile storage like Optane / 3DXPoint NVDIMMs.

    Intel has already published a proposal for a PML5 extension https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf (see https://en.wikipedia.org/wiki/Intel_5-level_paging for a summary). There's already support for it in the Linux kernel so it's ready for the appearance of actual HW.

    (I can't find out if it's expected in Ice Lake or not.)

    See also Why in 64bit the virtual address are 4 bits short (48bit long) compared with the physical address (52 bit long)? for more about where the 48-bit virtual address limit comes from.


    So you can still use the high 7 bits for tagged pointers and maintain compat with PML5.

    If you assume user-space, then you can use the top 8 bits and zero-extend, because you're assuming the 57th bit (bit 56) = 0.
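    A sketch of that user-space + PML5 variant (helper name is mine):

        #include <cstdint>

        // Use the top 8 bits (63:56) as the tag, and assume a user-space
        // address under PML5, i.e. bit 56 == 0, so zero-extension is enough.
        inline void* untag_user57(uint64_t tagged) noexcept {
            return reinterpret_cast<void*>(tagged & ((1ULL << 56) - 1));
        }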

    Redoing sign- (or zero-) extension of the low bits was already optimal; we're just changing it to a different width so that it only re-extends the bits we disturb. And we're disturbing few enough high bits that it should be future-proof even on systems that enable PML5 mode and use wide virtual addresses.

    On a system with 48-bit virtual addresses, broadcasting bit 56 to the upper 7 still works, because bit 56 = bit 47. And if you don't disturb those lower bits, they don't need to be re-written.
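    In code, the PML5-safe sign-extension just shifts by 7 instead of 16 (same caveats as the 48-bit version above):

        #include <cstdint>

        // Tag lives in bits 63:57; broadcast bit 56 back over them.
        // Correct on both 48-bit and 57-bit canonical systems.
        inline int64_t recanonicalize57(uint64_t tagged) noexcept {
            return static_cast<int64_t>(tagged << 7) >> 7;
        }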


    And BTW, your GetUID() returns an integer. It's not clear why you need that to return the static address.

    And BTW, it may be cheaper for it to return &uid (just a RIP-relative LEA) than to load + re-canonicalize your m member value. Move static const uint64_t uid = 0xBEA57; out to a static member variable instead of declaring it inside one member function.
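
    One way to hoist it while keeping one distinct address per T (the UniqueTag name is hypothetical, a sketch of the suggestion rather than the original code):

        #include <cstdint>

        template <class T>
        struct UniqueTag {
            static const uint64_t uid;   // one distinct address per T
        };
        template <class T>
        const uint64_t UniqueTag<T>::uid = 0xBEA57;

        // Taking &UniqueTag<T>::uid compiles to a single RIP-relative LEA;
        // no load of m and no re-canonicalization needed.
        template <class T>
        const uint64_t* GetUIDPtr() noexcept { return &UniqueTag<T>::uid; }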