I had some free time, so I decided to study AVX-512 and try writing some assembly with it. I'm trying to optimize a piece of code that processes large amounts of data using AVX-512 instructions. The goal is to make full use of the vector registers and minimize the number of processor cycles.
The problem is this: I want to use masking to process only some of the elements in the zmm0 and zmm1 registers, depending on a certain condition. However, masked AVX-512 instructions (such as vaddps) require the mask to be in one of the k registers (k1-k7; k0 cannot be used as a write mask):
vmovups zmm0, [rsi]      ; zmm0 <= float[16]
vmovups zmm1, [rsi+64]   ; zmm1 <= next float[16]
; some code here
vaddps zmm0, zmm0, zmm11
vmovups [rdi], zmm0      ; store 16 results
add rsi, 128             ; advance source pointer (two vectors read)
add rdi, 64              ; advance destination pointer (one vector written)
At the same time, the condition by which I want to mask the data is obtained by comparing the other two zmm registers.
So here's the question:
Is there an efficient way to generate a mask in a k register from a comparison of values in zmm registers, and then use it for selective data processing with AVX-512 instructions? Or is there another way to achieve the desired result with AVX-512 without resorting to masks?
I remember that there is vpcmpd, which compares the values of vector registers, and supposedly you can do something like k1 = zmm0 > zmm1 plus k2 = zmm0 < zmm2, but honestly I have no idea how efficient this is. I tried it, but for lack of knowledge I abandoned the idea.
To summarize the discussion from the comments: You were right to assume that the vcmp family of instructions, such as vcmpps, is the proper way to do this (vpcmpd is the integer counterpart). Masked AVX-512 instructions are generally fast. When possible, prefer the zero-masking form, such as vaddps zmm1{k1}{z}, zmm2, zmm3, over the merge-masking form, such as vaddps zmm1{k1}, zmm3, zmm0, to avoid a dependency on the previous contents of the destination register.
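To make the compare-then-mask pattern concrete, here is a minimal sketch in C with intrinsics rather than assembly, so that it can be compiled and checked against a plain loop. Everything named here is an assumption for illustration: `add_where_less`, its parameters, and the condition `a[i] < b[i]` are made up, and the code assumes GCC or Clang on x86-64 (it falls back to the scalar loop elsewhere or when the CPU lacks AVX-512F). `_mm512_cmp_ps_mask` corresponds to vcmpps with a k-register destination, and `_mm512_maskz_add_ps` to the zero-masking vaddps.

```c
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_PATH 1
#endif

/* Scalar reference: out[i] = a[i] + c[i] where a[i] < b[i], else 0. */
static void add_where_less_scalar(float *out, const float *a, const float *b,
                                  const float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (a[i] < b[i]) ? a[i] + c[i] : 0.0f;
}

#ifdef HAVE_AVX512_PATH
/* The target attribute lets this compile without -mavx512f. */
__attribute__((target("avx512f")))
static void add_where_less_avx512(float *out, const float *a, const float *b,
                                  const float *c, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        __m512 vc = _mm512_loadu_ps(c + i);
        /* Mask bit j is set where va[j] < vb[j] (predicate 1, _CMP_LT_OS). */
        __mmask16 k = _mm512_cmp_ps_mask(va, vb, _CMP_LT_OS);
        /* Zero-masking add: lanes with a clear mask bit become 0.0f. */
        __m512 sum = _mm512_maskz_add_ps(k, va, vc);
        _mm512_storeu_ps(out + i, sum);
    }
    /* Leftover elements (fewer than 16) go through the scalar loop. */
    add_where_less_scalar(out + i, a + i, b + i, c + i, n - i);
}
#endif

/* Hypothetical entry point: picks the AVX-512 path only if the CPU has it. */
void add_where_less(float *out, const float *a, const float *b,
                    const float *c, size_t n)
{
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        add_where_less_avx512(out, a, b, c, n);
        return;
    }
#endif
    add_where_less_scalar(out, a, b, c, n);
}
```

If the unselected lanes should keep their old value instead of becoming zero, the merging form would be `_mm512_mask_add_ps(va, k, va, vc)`, at the cost of the extra input dependency mentioned above.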
One thing to look out for with mask registers is that some of the instructions for computing them are rather slow. For example, kadd has a latency of 4 cycles on Intel according to uops.info, while kand has a latency of only 1 cycle but still a throughput of only one instruction per cycle.
However, you can often avoid combining masks that way. vcmp itself accepts an input mask, and the output mask will be zero wherever the input mask was zero, which amounts to a logical AND. For example, the condition zmm1 < zmm2 && zmm2 < zmm3 can be written as
vcmpps k1, zmm1, zmm2, 1 ; _CMP_LT_OS
vcmpps k1{k1}, zmm2, zmm3, 1
We cannot form an OR connection that way, but we can still avoid using two mask registers. For example, zmm1 < zmm2 || zmm2 < zmm3 is the same as ! (! (zmm1 < zmm2) && ! (zmm2 < zmm3)) according to De Morgan's laws:
vcmpps k1, zmm1, zmm2, 5 ; _CMP_NLT_US
vcmpps k1{k1}, zmm2, zmm3, 5
knotw k1, k1
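As a sanity check of the De Morgan rewrite, including lanes that hold NaN, here is a small scalar model in C of what the three instructions compute for a single lane. This is only a model under stated assumptions, not vector code: the helper names are made up, and the two predicates are modelled as plain C comparisons (predicate 1 is true only for an ordered a < b, predicate 5 is its exact negation and is therefore true when either operand is NaN).

```c
#include <math.h>
#include <stdbool.h>

/* Predicate 1 (_CMP_LT_OS): true only if both are ordered and a < b. */
static bool cmp_lt(float a, float b)  { return a < b; }

/* Predicate 5 (_CMP_NLT_US): true if a >= b or either operand is NaN. */
static bool cmp_nlt(float a, float b) { return !(a < b); }

/* One lane of the three-instruction sequence. */
bool or_via_demorgan(float x1, float x2, float x3)
{
    bool k = cmp_nlt(x1, x2);     /* vcmpps k1, zmm1, zmm2, 5       */
    k = k && cmp_nlt(x2, x3);     /* vcmpps k1{k1}, zmm2, zmm3, 5   */
    return !k;                    /* knotw  k1, k1                  */
}

/* The condition we actually want. */
bool or_direct(float x1, float x2, float x3)
{
    return cmp_lt(x1, x2) || cmp_lt(x2, x3);
}

/* Compares both forms over a small grid that includes NaN. */
int count_mismatches(void)
{
    const float vals[] = { -1.0f, 0.0f, 1.0f, 2.0f, NAN };
    const int n = (int)(sizeof vals / sizeof vals[0]);
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                if (or_via_demorgan(vals[i], vals[j], vals[k]) !=
                    or_direct(vals[i], vals[j], vals[k]))
                    ++mismatches;
    return mismatches;
}
```

The point of using the negated predicate 5 rather than "greater or equal" is visible in the NaN rows: a NaN lane fails a < b, so it has to pass the negated compare for the final knotw to give the same answer as the direct OR.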
On the other hand, using two masks and merging them via korw would remove the input dependency from one vcmp to the other, potentially increasing the instruction-level parallelism.