assembly, x86, x86-64, micro-optimization, radix-sort

How to optimize writes to memory in a hot loop


I have a loop in my code where I spend most of the CPU time:

%%write_to_intermediate_head:
    ; uint64_t count_index = (arr[i] >> (8 * cur_digit)) & (RADIX - 1);
    ; We will use rsi to hold the digit we're currently examining
    mov rsi, [rdx + rdi * 8] ; Load arr[i] into rsi
    mov r9, %1 ; Load current digit we're sorting on
    shl r9, 3  ; Multiply by 8
    shrx rsi, rsi, r9
    and rsi, 255

    ; intermediate_arr[--digit_counts[count_index]] = arr[i];
    ; rdi is the loop counter i
    ; rsi is the count_index
    ; rcx is the digit_counts array
    ; rax is the intermediate_arr
    dec qword [rcx + rsi * 8]
    mov r9, [rcx + rsi * 8] ; --digit_counts[count_index]

    mov rsi, [rdx + rdi * 8] ; arr[i]
    mov [rax + r9 * 8], rsi

    dec rdi
    jnz %%write_to_intermediate_head

The variables digit_counts, arr, and intermediate_arr are all in memory (heap and BSS). The AMD profiler shows that many cycles are spent reading from and writing to these memory locations. Is there any way to optimize this?


Solution

  • Do your counts truly need to be qwords, or could you use a narrower type to cut your cache footprint in half with 32-bit counts (or even further with something narrower)? If you're getting cache misses, that's going to mean much more time spent waiting for loads/stores if OoO exec can't hide that latency.
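
    A minimal sketch of that change, keeping your current structure but assuming the counts fit in 32 bits (256 dword counters span 1 KiB, i.e. 16 cache lines, instead of 2 KiB):

        dec   dword [rcx + rsi*4]       ; --digit_counts[count_index], 4-byte elements
        mov   r9d,  [rcx + rsi*4]       ; dword reload; writing R9D zero-extends into R9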

    I guess copying the data around is going to be most of the memory bandwidth / cache misses, though. This looks like Radix Sort, and the amount of metadata to manage is smallish compared to the data. (But keeping at least that metadata hot in cache can help, making it more resistant to eviction by all the other data you're throwing around.)


    No matter what you do, the access pattern of Radix Sort is inherently not very cache-friendly, although it's not terrible. You're scattering writes across 256 possible destinations, along with updating the pointers. But those 256 destinations are sequential streams so they can hit in L1d cache if you're lucky.

    Hopefully those destinations aren't multiples of 4k apart (initially or most of the time), otherwise they'll alias the same line in L1d cache and cause conflict misses. (i.e. force eviction of another partially-written cache line that you're soon going to write to.)


    You have some redundant loads / stores which may be a bottleneck for load/store execution units, but if that's not the bottleneck then cache will absorb them just fine. This section is mostly about tuning the loop to use fewer uops, improving things in the no-cache-miss best case, and giving OoO exec less latency to hide.

    Using a memory-destination dec and then reloading the dec's store seems obviously inefficient in terms of total back-end load/store operations, and latency for OoO exec to hide. (Although on AMD, dec mem is still a single uop for the front-end, vs. 3 on Intel; https://uops.info/ and https://agner.org/optimize/).
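
    A sketch of the load / modify / store version (counts kept as qwords here for comparison): one load and one store per counter update, and the decremented value is already in a register for the scatter store:

        mov   r9, [rcx + rsi*8]         ; digit_counts[count_index]
        dec   r9
        mov   [rcx + rsi*8], r9         ; store --digit_counts[count_index]
        ; R9 now holds the new count, ready to index intermediate_arr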

    Similarly, aren't you loading [rdx + rdi * 8] ; arr[i] twice, with the same RDI? SHRX can copy-and-shift, so keeping that load result around for later doesn't even cost you any extra uops. (You could also use a simple non-indexed addressing mode for arr[i], by doing a pointer-increment like add rdi, 8 and cmp rdi, endp / jne, where endp is something you calculated earlier with lea endp, [rdx + size*8]. Looping forward over an array can be better for some HW prefetchers.)
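
    For example (an untested sketch, with R10 as a scratch register and the shift count hoisted into R8 as described below), one load can feed both the digit extraction and the copy:

        mov   r10, [rdx + rdi*8]        ; load arr[i] exactly once
        shrx  rsi, r10, r8              ; copy-and-shift: R10 still holds arr[i]
        and   rsi, 255                  ; count_index (see below for cheaper zero-extension)
        ...                             ; counter update leaves the new count in R9
        mov   [rax + r9*8], r10         ; store arr[i] without reloading it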

    x86-64 has 15 registers, so if you need more for this inner loop, save/restore some call-preserved registers (like RBX or RBP) at the top/bottom of the function. Or spill some outer-loop stuff to memory if necessary.
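
    A minimal sketch of that save/restore pattern (restore in reverse order, and keep the stack 16-byte aligned if you make any calls):

        push  rbx                       ; call-preserved, now free for the hot loop
        push  rbp
        ...                             ; function body / inner loop
        pop   rbp
        pop   rbx
        ret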

    mov r9, %1 looks loop-invariant, so hoist that shl r9, 3 calculation out of the loop, and don't overwrite R9 inside the inner loop.
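
    i.e. something like this before the loop (R8 chosen here as an example, since the loop body currently writes R9 when it reloads the count; any register the loop never writes works):

        mov   r8, %1                    ; current digit, loop-invariant
        shl   r8, 3                     ; 8 * cur_digit, ready to use as a shift count
        ; ... inside the loop, R8 is only ever read:
        shrx  rsi, r10, r8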


    You do need to zero-extend the byte you extracted, but and rsi, 255 is not as efficient as movzx eax, sil. (Or better, pick a register like ECX whose low byte can be accessed without a REX prefix.) AMD can't do mov-elimination on movzx the way Intel can, though, so on AMD this only saves code size; it also removes a cycle of latency if you ever run this on Intel CPUs.
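
    For comparison (encodings assume 64-bit mode, with the digit sitting in the low byte of RSI after the shift):

        and   rsi, 255                  ; 7 bytes: REX.W + opcode + ModRM + imm32
        movzx esi, sil                  ; 4 bytes: a REX prefix is needed to address SIL
        movzx ecx, cl                   ; 3 bytes: CL encodable without REX (if the digit were in ECX)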

    Or better, AMD Zen has single-uop BMI1 bextr r64,r64,r64, so prepare a start/length pair in the low 2 bytes of a register. As discussed, that's loop-invariant, i.e. before the loop: mov ecx, %k1 / shl cl, 3 / mov ch, 0x8. (AMD doesn't have partial-register stalls, just false dependencies; in this case it's a true dependency, since we do want to merge into the full register.) If that's inline-asm syntax, %k1 specifies the 32-bit version of the register. Or if %1 is a memory operand, you're loading it anyway, and hoisting that load out of the loop saves yet another load per iteration!

    (Intel has 2-uop bextr, presumably shift and bzhi uops.)
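
    A sketch of that (NASM macro syntax, with arr[i] in R10 as above; the control word is built in ECX, so if RCX is still holding digit_counts you'd move that pointer elsewhere first):

        ; before the loop: start bit in CL, field length in CH
        mov   ecx, %1                   ; cur_digit, loop-invariant
        shl   cl, 3                     ; start = 8 * cur_digit
        mov   ch, 8                     ; length = 8 bits
        ; inside the loop:
        bextr rsi, r10, rcx             ; (arr[i] >> (8*cur_digit)) & 255, one uop on Zen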

    Or if you really want to load twice, movzx esi, byte [rdi + rdx], where RDI is a pointer to arr[i] that you increment or decrement, and RDX is the byte offset of the current digit within the qword. But BEXTR is probably a better bet.
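
    Putting the pieces above together, the inner loop might end up looking something like this (untested sketch; register choices are illustrative only, with digit_counts moved from RCX to R11 so RCX can hold the BEXTR control, and your backward dec rdi / jnz structure kept as-is):

        ; loop-invariant setup
        mov   r11, rcx                  ; r11 = digit_counts
        mov   ecx, %1                   ; cur_digit
        shl   cl, 3                     ; BEXTR start bit
        mov   ch, 8                     ; BEXTR field length
    %%write_to_intermediate_head:
        mov   r10, [rdx + rdi*8]        ; arr[i], loaded once
        bextr rsi, r10, rcx             ; count_index
        mov   r9, [r11 + rsi*8]         ; digit_counts[count_index]
        dec   r9
        mov   [r11 + rsi*8], r9         ; --digit_counts[count_index]
        mov   [rax + r9*8], r10         ; intermediate_arr[r9] = arr[i]
        dec   rdi
        jnz   %%write_to_intermediate_head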