Tags: c, caching, assembly, disassembly, prefetch

What happens if an invalid address is prefetched?


Simple MWE:

#include <stdlib.h>
#include <xmmintrin.h>
int* ptr = (int*)malloc(64 * sizeof(int));
_mm_prefetch((const char*)(ptr + 64), _MM_HINT_T0);
  1. Is this defined or undefined behavior?
  2. Can this raise a signal and abort the program run?

I'm asking because I see such prefetching in compiler-generated code, where a loop prefetches without checking the address (held in rbx):

400e73:       49 83 c5 40             add    r13,0x40
400e77:       62 f1 f9 08 28 03       vmovapd zmm0,ZMMWORD PTR [rbx]
400e7d:       4d 3b ec                cmp    r13,r12
400e80:       62 d1 f9 08 eb 4d ff    vporq  zmm1,zmm0,ZMMWORD PTR [r13-0x40]
400e87:       90                      nop
400e88:       62 d1 78 08 29 4d ff    vmovaps ZMMWORD PTR [r13-0x40],zmm1
400e8f:       72 03                   jb     400e94 <main+0x244>
400e91:       49 89 c5                mov    r13,rax
400e94:       62 f1 78 08 18 53 1d    vprefetch1 [rbx+0x740]
400e9b:       ff c1                   inc    ecx
400e9d:       62 f1 78 08 18 4b 02    vprefetch0 [rbx+0x80]
400ea4:       48 83 c3 40             add    rbx,0x40
400ea8:       81 f9 00 00 10 00       cmp    ecx,0x100000
400eae:       72 c3                   jb     400e73 <main+0x223>

Solution

  • First of all, the compiler doing it and you doing it are very different things in theory. Just because the two look equivalent doesn't make them so: the compiler is allowed to use any dirty hack that works, whether or not it is expressible or defined in fully standard C.

    Of course prefetching doesn't generate signals*; it would be nearly useless if it did. It can be very slow for some invalid pointers, especially on older CPUs (see e.g. "The problem with prefetch"), but that's an old article and the penalty doesn't seem to be as bad these days: on Intel Rocket Lake, for example, prefetching invalid pointers is no big deal. Even when it avoids that performance pitfall, explicit prefetching isn't free and doesn't necessarily help (many normal access patterns are already covered by automatic hardware prefetching, for example). So the compiler can safely use it, but it shouldn't use it indiscriminately.

    Now, using pointer arithmetic to create out-of-bounds pointers (except just past the end) is UB in theory, but it's the kind of UB that will mostly work anyway (with flat memory it's just an addition; the only way it could fail is if the compiler goes out of its way to detect it, which means it would have to reason about dynamic sizes). Obviously the above case must be supported by compilers claiming to support SSE intrinsics, otherwise you couldn't reasonably use prefetching, as demonstrated by this answer (and there are a bunch more guarantees they must make on top of the Standard).

    Regarding "UB? This is hardware, not a language standard" from the comments: that would be the case if you wrote assembly code. If you're writing C, anything that happens "in C" is at least in principle under the jurisdiction of the C standard. That includes creating a pointer that is potentially invalid from C's point of view, e.g. one further beyond the end of an object than "one past the end", which may or may not also be invalid from the point of view of the hardware. What the prefetch intrinsic does is not under the jurisdiction of the C standard, but the (potential) problem occurs before the intrinsic is ever reached. If there were a prefetch intrinsic with an offset parameter, this problem could be side-stepped, but neither Intel-style prefetch intrinsics nor GCC-style __builtin_prefetch take an offset.
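If you want hand-written C to stay inside what the standard defines, one option is to clamp the prefetch target so the pointer never goes further than one past the end. The following is only a minimal sketch under that assumption: `sum_with_prefetch` and its `dist` parameter are hypothetical names invented for illustration, and it assumes an x86 compiler that provides `<xmmintrin.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: sums n ints while prefetching `dist` elements ahead.
 * The prefetch pointer is clamped to a + n (one past the end), which is
 * the furthest pointer standard C allows us to form from this array. */
long sum_with_prefetch(const int *a, size_t n, size_t dist)
{
    assert(a != NULL || n == 0);

    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        /* n - i >= 1 here, so the subtraction cannot wrap around. */
        size_t ahead = (n - i > dist) ? i + dist : n;

        /* a + ahead is either a valid element or one past the end;
         * the prefetch itself is only a hint and never dereferences in C. */
        _mm_prefetch((const char *)(a + ahead), _MM_HINT_T0);

        total += a[i];
    }
    return total;
}
```

The clamp costs a compare per iteration, which is presumably why compilers, which are not bound by the same rules, skip it in generated code like the loop in the question.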


    * from Intel's SDM asm manual (https://www.felixcloutier.com/x86/prefetchh):

    The PREFETCHh instruction is merely a hint and does not affect program behavior.

    A signal would affect program behavior, so one cannot be generated.

    Another piece of evidence from the manual is that the list of possible exceptions includes only #UD (illegal instruction), raised if a LOCK prefix is used. Notably absent are #PF page faults and #GP(0) general-protection faults on non-canonical 64-bit addresses (those not correctly sign-extended from 48 or 57 bits). Those are the hardware exceptions you would get from normal loads on bad addresses.

    Real Intel CPUs such as Skylake don't fault on either of those types of bad address with prefetch instructions. (So you can't trigger hard / soft page faults with software prefetch either; soft vs. hard vs. invalid is only determined after the hardware #PF exception is taken. The CPU will abandon the prefetch after at most a page-walk if there's no valid mapping for the page.)
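To see this for yourself, something like the following sketch can be compiled and run on an x86 machine. It assumes a compiler that provides `<xmmintrin.h>`; `prefetch_bad_addresses` is a hypothetical name, and the specific addresses are only illustrative guesses at "bad" locations (converting an integer to a pointer is implementation-defined, but behaves as expected on mainstream flat-memory compilers).

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical demo: issue prefetches to addresses that should not be
 * usable for normal loads. Returns 1 if execution gets past all of them. */
int prefetch_bad_addresses(void)
{
    /* Null pointer: a normal load from here would fault. */
    const char *null_ptr = NULL;
    _mm_prefetch(null_ptr, _MM_HINT_T0);

    /* Assumption: this low address is unmapped in a typical user process.
     * If it happens to be mapped, the prefetch is simply harmless. */
    const char *low_addr = (const char *)(uintptr_t)0x1000;
    _mm_prefetch(low_addr, _MM_HINT_T1);

#if UINTPTR_MAX > 0xFFFFFFFFu
    /* 64-bit only: bit 62 set with bit 63 clear is not a sign-extended
     * 48-bit or 57-bit address, i.e. it is non-canonical. */
    const char *non_canonical = (const char *)((uintptr_t)1 << 62);
    _mm_prefetch(non_canonical, _MM_HINT_T0);
#endif

    return 1;
}
```

Replacing any of the `_mm_prefetch` calls with an actual read through the same pointer would be expected to crash the process, which is the contrast the manual's exception list describes.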