I know that we can do something like this to move a character into an xmm register:
movaps xmm1, xword [.__0x20]
align 16
.__0x20 db 0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20
but since this involves a memory access, I want to know if there is a better way? (Also, I'm talking about SSE2, not other SIMD extensions...)
I want each byte of the xmm1 register to be 0x20, not just one byte.
(Editor's note: this is called a broadcast or splat; it's what the _mm_set1_epi8(0x20) intrinsic does.)
With only SSE2, loading the full pattern from memory is often your best bet. In your NASM source you can use times 16 db 0x20 for easy maintainability.
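For example (the section and label names here are illustrative):

section .rodata
align 16
spaces16: times 16 db 0x20      ; same 16 bytes, easier to maintain

section .text
    movaps  xmm1, [rel spaces16]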
With SSE3 you can do an 8-byte broadcast load with movddup. With AVX you can do a 4-byte broadcast load with vbroadcastss. These broadcast loads are very good on modern CPUs, running on just the load port with no shuffle uop; i.e. they're exactly as cheap as movaps on CPUs that support them, except for a byte or two more code size. Same for vbroadcastf128 to YMM registers.
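A sketch of both (label name is made up; movddup needs SSE3, vbroadcastss needs AVX):

section .rodata
align 8
spaces8: dq 0x2020202020202020  ; 8 bytes of 0x20

section .text
    movddup      xmm0, [rel spaces8]   ; SSE3: broadcast the qword to both halves
    vbroadcastss xmm0, [rel spaces8]   ; AVX: broadcast-load a dword (only reads 4 bytes)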
Most compilers don't seem to realize this and will do constant-propagation through _mm_set1, even when that results in a 32-byte constant instead of 4 bytes, and even when it's just a mov-style load ahead of a loop rather than being folded into a memory operand for an ALU instruction. (Folding is still possible with a broadcast load when AVX512 is available.) Clang does sometimes take advantage of broadcast loads for simple constants.
AVX2 adds vpbroadcastb/w/d/q, but only the dword and qword forms are pure load uops on Intel CPUs. Byte and word broadcast loads need an ALU shuffle uop, so for constant byte patterns you probably want to just broadcast-load a dword that repeats the byte 4 times. (Unless it's an element from a big lookup table; then compress the table by using a byte or word broadcast load, or a pmovsx sign-extending load, or whatever.) AMD's Zen family appears to handle byte and word broadcasts from memory without any uops for the FP/SIMD ports, according to measurements at https://uops.info/
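So for a repeated-byte constant with AVX2, something like this (illustrative label) stays a pure load uop on Intel:

section .rodata
spaces4: dd 0x20202020          ; one byte repeated 4 times

section .text
    vpbroadcastd xmm0, [rel spaces4]   ; dword broadcast-load: load port only on Intel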
AVX512 adds vpbroadcastb/w/d/q from an integer register, so you could mov eax, 0x20202020 / vpbroadcastd xmm0, eax if you have AVX512VL.
With only SSE2 it takes at least 2 instructions, including an ALU shuffle, and may not be worth it unless it helps pack more constants into a single cache line:
movd   xmm0, [const_4B]   ; load the 4-byte pattern
pshufd xmm0, xmm0, 0      ; broadcast dword 0 to all 4 dwords
Or, avoiding a memory constant:
mov    eax, 0x20202020
movd   xmm0, eax
pshufd xmm0, xmm0, 0        ; or shufps xmm0, xmm0, 0 for FP constants
;vpbroadcastd x/y/zmm0, eax ; AVX-512
For a runtime-variable byte value in a GPR, you can zero-extend it and use imul eax, ecx, 0x01010101 to broadcast it to 4 bytes, or multiply by 0x0101010101010101 for a 64-bit broadcast. Use that either as setup for pshufd in an XMM register (if you can't use SSSE3 pshufb with an XMM register you zeroed with pxor), or if you want to work purely in integer registers.
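A sketch, assuming the byte arrives in cl:

    movzx  eax, cl              ; zero-extend the byte
    imul   eax, eax, 0x01010101 ; broadcast it to all 4 bytes of eax
    movd   xmm0, eax
    pshufd xmm0, xmm0, 0        ; broadcast the dword to the whole register

; or, with SSSE3:
    movd   xmm0, ecx
    pxor   xmm1, xmm1           ; all-zero shuffle-control vector
    pshufb xmm0, xmm1           ; broadcast byte 0 to all 16 bytes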
Without pshufb, another option is punpcklbw xmm0,xmm0 / punpcklwd xmm0,xmm0 to broadcast a byte to a dword, setting up for pshufd or shufps. That has lower latency (2 cycles on most CPUs) than imul (3 cycles), but costs more uops (worse throughput), so it's probably a worse choice in many cases, especially as setup for a loop that needs to load other data.
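For example, with the byte in the low byte of ecx:

    movd      xmm0, ecx
    punpcklbw xmm0, xmm0        ; b -> b b in the low word
    punpcklwd xmm0, xmm0        ; -> b b b b in the low dword
    pshufd    xmm0, xmm0, 0     ; broadcast that dword everywhere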
Some repeating constants can be generated on the fly in a couple instructions, starting with all-ones from pcmpeqd xmm0,xmm0
. See What are the best instruction sequences to generate vector constants on the fly? and Agner Fog's asm optimization guide.
This particular pattern does not appear to be easy to generate: it's a byte pattern (not word, dword, or qword), and SSE shifts are only available with word granularity at best. However, if we know the bits shifted across byte boundaries are 0, it's fine. e.g.
pcmpeqd xmm0, xmm0 ; set1( -1 )
pabsb xmm0, xmm0 ; set1_epi8(1) SSSE3
pslld xmm0, 5 ; set1_epi8(1<<5)
; or with only SSE2, something even less efficient like shift / packsswb / shift
This is unlikely to be worth it unless you really want to avoid the possibility of a cache miss for the constant. On average a load will usually come out ahead.