I have a function that streams out structured data. The data are Vec4/Vec3/Vec2/float structures, so the maximum size is 16 bytes per structure. It may happen that the stream is read starting inside a structure. Simple solution: load the structure, build a store mask, and decrease the destination data pointer by however many bytes into the structure the call wants to start reading.
Imagine the current item type is Vec2 and we are 4 bytes into this structure:
```
xmm0 = 00000000-00000000-dadadada-dadadada
xmm1 = 00000000-00000000-ffffffff-00000000
result_data_ptr = 13450000
-> RDI = 1344fffc
maskmovdqu xmm0, xmm1
```

=> The result is a page-fault exception.
Is there any way to detect that this page fault will happen? The memory of the previous page won't even be touched ...
`maskmovdqu` doesn't do fault suppression, unlike AVX `vmaskmovps` or AVX-512 masked stores. Those would solve your problem, although still maybe not in the most efficient way.
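For the exact case in the question, an AVX masked store sidesteps the fault: masked-off elements are guaranteed not to fault, even when they fall in an unmapped page. A minimal sketch with C intrinsics (assumes AVX, e.g. compile with `-mavx`; the function name is made up for illustration):

```c
#include <immintrin.h>

// Store only the second float of a Vec2 that we're 4 bytes into.
// The store "starts" at result_data_ptr - 4, but lane 0 is masked
// off, so the 4 bytes in the previous page are never touched and
// cannot fault (unlike maskmovdqu).
void store_vec2_from_offset4(float *result_data_ptr, __m128 data)
{
    __m128i mask = _mm_set_epi32(0, 0, -1, 0);          // enable lane 1 only
    _mm_maskstore_ps(result_data_ptr - 1, mask, data);  // vmaskmovps
}
```

`vmaskmovps` masks at dword granularity, which matches your float-based structures; AVX-512BW `vmovdqu8` with a `k` mask would give byte granularity, also with fault suppression.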
As documented in Intel's ISA reference manual, with an all-zero mask (so nothing is stored to memory), exceptions associated with addressing memory and page faults *may still be signaled (implementation dependent)*.
With a non-zero mask, I assume it's guaranteed to page-fault if the 16 bytes include any non-writeable pages. Or maybe on some implementations the mask suppresses faults even when some storing does happen (zeros in the unmapped page, but non-zero mask bytes elsewhere).
It's not a fast instruction anyway on real CPUs. `maskmovdqu` might have been good sometimes on single-core Pentium 4 (or not, IDK), and/or its MMX predecessor (`maskmovq`) was maybe useful on in-order Pentium. Masked cache-bypassing stores are much less useful on modern CPUs where L3 is the normal backstop and caches are large. Perhaps more importantly, there's now more machinery between a single core and the memory controller(s), because everything has to work correctly even if another core reloads this memory at some point, so a partial-line write is maybe even less efficient.
It's generally a terrible choice if you really are only storing 8 or 12 bytes total (basically the same as an NT store that doesn't write a full line), especially if you're using multiple narrow stores to grab pieces of data and put them into one contiguous stream. I would not assume that multiple overlapping `maskmovdqu` stores will result in a single efficient store of a whole cache line once you eventually finish one, even if the masks mean no byte is actually written twice.
L1d cache is excellent for buffering multiple small writes to a cache line before it's eventually done; use normal stores for that, unless you can do a few NT stores nearly back-to-back.
To store the top 8 bytes of an XMM register, use `movhps`.
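For example, via the corresponding intrinsic (a one-line sketch; the function name is illustrative):

```c
#include <immintrin.h>

// Store the upper 8 bytes of v with a single 8-byte store; compilers
// emit movhps (or the equivalent movhpd) for _mm_storeh_pi.
void store_high_qword(void *dst, __m128 v)
{
    _mm_storeh_pi((__m64 *)dst, v);
}
```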
Writing into cache also makes it fine to do overlapping stores, like with `movdqu`. So you can concatenate a few 12-byte objects by shuffling each to the bottom of an XMM register (or loading them that way in the first place), then do `movdqu` stores to `[rdi]`, `[rdi+12]`, `[rdi+24]`, etc. The 4-byte overlap is totally fine; coalescing in the store buffer may absorb it before it even commits to L1d cache, and if not, L1d cache is still pretty fast.
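A minimal sketch of that pattern with intrinsics (the padded `Vec3` type and names are assumptions for illustration; note the final store writes 4 bytes past the packed output, so `dst` needs that much slack or the last element needs scalar handling):

```c
#include <immintrin.h>
#include <stddef.h>

typedef struct { float x, y, z; float pad; } Vec3;  // 16-byte source stride

// Compact n padded Vec3s into a contiguous 12-byte-stride stream
// using overlapping 16-byte movdqu stores.
void pack_vec3s(char *dst, const Vec3 *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        // Full 16-byte load: x,y,z plus 4 don't-care bytes on top.
        __m128i v = _mm_loadu_si128((const __m128i *)&src[i]);
        // Overlapping store at 12-byte stride; each store's top 4
        // garbage bytes are overwritten by the next iteration's store.
        _mm_storeu_si128((__m128i *)(dst + 12 * i), v);
    }
}
```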
At the start of writing a large array, if you don't know the alignment, you can do an unaligned `movdqu` store of the first 16 bytes of your output, then do the first 16-byte-aligned store, possibly overlapping with that. If your total output size is always >= 16 bytes, this strategy doesn't need a lot of branching to let you do aligned stores for most of it. At the end, you can do the same thing with a final potentially-unaligned vector that might partially overlap the last aligned vector. (Or if the array is aligned, there's no overlap and the final vector is aligned too; `movdqu` is just as fast as `movdqa` if the address happens to be aligned, on modern CPUs.)
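A sketch of the whole strategy as a simple copy loop (hypothetical helper; assumes `len >= 16`):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Copy len bytes (len >= 16): unaligned head store, aligned body
// stores, and a final unaligned store that may overlap the body.
void copy_stream(char *dst, const char *src, size_t len)
{
    // Head: one potentially-unaligned 16-byte store.
    _mm_storeu_si128((__m128i *)dst,
                     _mm_loadu_si128((const __m128i *)src));

    // Body: start at the first 16-byte boundary past dst; these
    // aligned stores may overlap the head store by up to 15 bytes.
    size_t i = 16 - ((uintptr_t)dst & 15);
    for (; i + 16 <= len; i += 16)
        _mm_store_si128((__m128i *)(dst + i),             // movdqa
                        _mm_loadu_si128((const __m128i *)(src + i)));

    // Tail: final potentially-unaligned, potentially-overlapping store.
    _mm_storeu_si128((__m128i *)(dst + len - 16),
                     _mm_loadu_si128((const __m128i *)(src + len - 16)));
}
```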