
Temporality of ST64B and MOVDIR64B


x86_64 has an instruction movdir64b, which to my understanding is a non-temporal copy (well, at least the store is) of 64 bytes (a cache line). AArch64 seems to have a similar instruction st64b, which does an atomic store of the same size. However, the official ARMv9 documentation is not clear about whether st64b, too, is a non-temporal store.

Intel's instruction-set reference documentation for movdir64b is much more detailed, but I'm not far along enough in my studies to fully understand what each memory type protocol represents.

From what I could deduce so far, the x86_64 instruction movntdq is roughly equivalent to stnp, and is write-combining. From that, it seems as if movdir64b is like four of those in one atomic store, hence my guess about st64b.

This is almost certainly an oversimplification of what's really going on (and could be wrong or inaccurate, of course), but it's what I could deduce so far.

Could st64b be used as if it were an atomic sequence of four stnp instructions as a non-temporal write of a cache line in this way?


Solution

  • The ST64B/ST64BV/ST64BV0 instructions are intended to efficiently add work items to a work queue of an I/O device that supports this interface. When the target address is mapped to an I/O device, the store is translated as a non-posted write transaction, which means that there has to be a completion message that includes a status code as described in the documentation. The ST64B instruction simply discards the status code while the other two store it in the register specified by the Xs operand.

    If you look at the pseudocode, these instructions require the target address to be in uncacheable memory:

    if acctype == AccType_ATOMICLS64 && memattrs.memtype == MemType_Normal then
        if memattrs.inner.attrs != MemAttr_NC || memattrs.outer.attrs != MemAttr_NC then
            fault.statuscode = Fault_Exclusive;
            return (fault, AddressDescriptor UNKNOWN);
    

    Otherwise, the resulting status code is 0xFFFFFFFF_FFFFFFFF, which, as described in the documentation, indicates that the target address doesn't support atomic 64-byte stores. Note that this is different from the status code 1, which represents failure and can occur for a number of reasons, for example when the work queue of the target device is full.

    My understanding from the pseudocode is that these instructions can be used on normal memory as well as device memory, as long as the target address is in uncacheable memory. You should verify experimentally whether they really work on normal memory by examining the status code.

    These instructions are completely different from ARM's STNP and x86's MOVNTDQ. Their x86 counterparts are MOVDIR64B, ENQCMD, and ENQCMDS, although there are major differences between the ARM and x86 ones. The "mental equivalence" you're drawing between these instructions is reasonable in terms of purpose, but not in terms of behavior.
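To make the experiment suggested above concrete, here is a minimal C sketch. It restates the quoted memory-attribute check and the status codes in plain C. Every name in it (`ls64_access_faults`, `ls64_classify_status`, `ls64_try_store`, and the enums) is hypothetical and invented for illustration. Treating a status of 0 as success is an assumption, since only 1 and 0xFFFFFFFF_FFFFFFFF are discussed above. The guarded part assumes the ACLE `__arm_st64bv` intrinsic and `data512_t` type from `<arm_acle.h>`, which should be available when the compiler defines `__ARM_FEATURE_LS64`; check your toolchain's documentation before relying on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the attributes named in the quoted pseudocode.
 * These enums are illustrative only, not an Arm-defined API. */
typedef enum { MEMTYPE_NORMAL, MEMTYPE_DEVICE } mem_type;
typedef enum { ATTR_NC, ATTR_WT, ATTR_WB } cache_attr;

/* Mirrors the quoted check: a 64-byte atomic access to Normal memory
 * faults unless both inner and outer attributes are Non-cacheable. */
static inline bool ls64_access_faults(mem_type type, cache_attr inner,
                                      cache_attr outer)
{
    return type == MEMTYPE_NORMAL && (inner != ATTR_NC || outer != ATTR_NC);
}

typedef enum { LS64_OK, LS64_FAILED, LS64_UNSUPPORTED, LS64_OTHER } ls64_result;

/* Interprets the value that ST64BV/ST64BV0 leave in Xs. */
static inline ls64_result ls64_classify_status(uint64_t status)
{
    if (status == UINT64_MAX)
        return LS64_UNSUPPORTED; /* target lacks atomic 64-byte store support */
    if (status == 1)
        return LS64_FAILED;      /* e.g. the device's work queue is full */
    if (status == 0)
        return LS64_OK;          /* ASSUMPTION: 0 means the store was accepted */
    return LS64_OTHER;           /* anything else: consult the device's docs */
}

#if defined(__aarch64__) && defined(__ARM_FEATURE_LS64)
#include <arm_acle.h>

/* Issues one ST64BV to dst (which must be 64-byte aligned and mapped as
 * uncacheable or device memory) and classifies the returned status. */
static inline ls64_result ls64_try_store(void *dst, const uint64_t words[8])
{
    data512_t payload;
    for (int i = 0; i < 8; i++)
        payload.val[i] = words[i];
    return ls64_classify_status(__arm_st64bv(dst, payload));
}
#endif
```

On hardware with FEAT_LS64_V, calling `ls64_try_store` on a non-cacheable normal-memory mapping and seeing whether the result is `LS64_UNSUPPORTED` would answer the open question about normal memory. Setting up such a mapping is platform-specific and not shown here.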