Tags: performance, x86, x86-64, benchmarking, inline-assembly

How can I accurately benchmark unaligned access speed on x86_64?


In an answer, I stated that unaligned access has had almost the same speed as aligned access for a long time now (on x86/x86_64). I didn't have any numbers to back up this statement, so I've created a benchmark for it.

Do you see any flaws in this benchmark? Can you improve on it (I mean, increase the GB/sec so it reflects the truth better)?

#include <sys/time.h>
#include <stdio.h>

// 40 independent 4-byte loads per 160-byte chunk; nothing consumes the results,
// so this measures pure load throughput, not latency.
template <int N>
__attribute__((noinline))
void loop32(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("mov     (%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x04(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x08(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x0c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x10(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x14(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x18(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x1c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x20(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x24(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x28(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x2c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x30(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x34(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x38(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x3c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x40(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x44(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x48(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x4c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x50(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x54(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x58(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x5c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x60(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x64(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x68(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x6c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x70(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x74(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x78(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x7c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x80(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x84(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x88(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x8c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x90(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x94(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x98(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x9c(%0), %%eax" : : "r"(v) :"eax");
        v += 160;
    }
}

// As loop32, but with 8-byte loads (20 per 160-byte chunk).
template <int N>
__attribute__((noinline))
void loop64(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("mov     (%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x08(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x10(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x18(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x20(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x28(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x30(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x38(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x40(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x48(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x50(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x58(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x60(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x68(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x70(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x78(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x80(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x88(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x90(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x98(%0), %%rax" : : "r"(v) :"rax");
        v += 160;
    }
}

// 16-byte SSE loads with movaps, which requires 16-byte-aligned addresses.
template <int N>
__attribute__((noinline))
void loop128a(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("movaps     (%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x10(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x20(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x30(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x40(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x50(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x60(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x70(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x80(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x90(%0), %%xmm0" : : "r"(v) :"xmm0");
        v += 160;
    }
}

// 16-byte SSE loads with movups, which tolerates unaligned addresses.
template <int N>
__attribute__((noinline))
void loop128u(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("movups     (%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x10(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x20(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x30(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x40(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x50(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x60(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x70(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x80(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x90(%0), %%xmm0" : : "r"(v) :"xmm0");
        v += 160;
    }
}

// Wall-clock time in microseconds.
long long int t() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (long long int)tv.tv_sec*1000000 + tv.tv_usec;
}

int main() {
    const int ITER = 10;
    const int N = 1600000000;

    // Over-allocate, then round the pointer up to a 16-byte boundary.
    // (The unrounded pointer is discarded, so the allocation is never freed;
    // harmless in a benchmark.)
    char *data = reinterpret_cast<char *>(((reinterpret_cast<unsigned long long>(new char[N+32])+15)&~15));
    for (int i=0; i<N+16; i++) data[i] = 0;   // touch the memory so the pages are actually mapped

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop32<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop32<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop32<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop32<N>(data+1);
        }
        long long int t4 = t();

        printf(" 32-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf(" 32-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop64<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop64<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop64<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop64<N>(data+1);
        }
        long long int t4 = t();

        printf(" 64-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf(" 64-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop128a<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop128u<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop128a<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop128u<N>(data+1);
        }
        long long int t4 = t();

        printf("128-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf("128-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }
}

Solution

  • Timing method. I probably would have set it up so the test was selected by a command-line argument, so I could time it with perf stat ./unaligned-test, and get perf counter results instead of just wall-clock times for each test. That way, I wouldn't have to care about turbo / power-saving, since I could measure in core clock cycles. (Not the same thing as gettimeofday / rdtsc reference cycles unless you disable turbo and other frequency-variation. Even then, the CPU frequency doesn't always match the TSC, but at fixed CPU frequency, wall-clock-equivalent timers such as rdtsc are usable.)
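
    A minimal sketch of that idea (the run() helper and the mode strings are my invention, not from the question): dispatch on argv so each variant can be timed separately with perf stat ./unaligned-test 32u, measuring core clock cycles instead of wall-clock time:

        // Assumes the loop32/loop64/loop128a/loop128u templates and the buffer
        // setup from the question are in scope.
        #include <cstring>
        #include <cstdio>
        int run(const char *mode, const char *data) {
            const int ITER = 10, N = 1600000000;
            if (!strcmp(mode, "32a"))  { for (int i=0; i<ITER; i++) loop32<N>(data);     return 0; }
            if (!strcmp(mode, "32u"))  { for (int i=0; i<ITER; i++) loop32<N>(data+1);   return 0; }
            if (!strcmp(mode, "128a")) { for (int i=0; i<ITER; i++) loop128a<N>(data);   return 0; }
            if (!strcmp(mode, "128u")) { for (int i=0; i<ITER; i++) loop128u<N>(data+1); return 0; }
            std::fprintf(stderr, "unknown mode %s\n", mode);
            return 1;
        }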


    You're only testing throughput, not latency, because none of the loads are dependent.
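
    For contrast, a latency test chains the loads: each load's result is the next load's address, so they can't overlap. A minimal sketch (my illustration; assumes the buffer has been pre-filled so every 8-byte slot holds the address of the next slot to visit):

        const char *chase(const char *p, long iters) {
            for (long i = 0; i < iters; i++)
                __asm__ volatile ("mov (%0), %0" : "+r"(p));  // load result feeds the next address
            return p;  // returned so the chain can't be optimized away
        }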

    Your cache numbers will be worse than your memory numbers, but you might not realize why: in the cached case you may be bottlenecking on the number of split-load registers that handle loads/stores crossing a cache-line boundary. For a sequential read, the outer levels of cache still just see a sequence of requests for whole cache lines; only the execution units getting data from L1D have to care about alignment. To test misalignment for the non-cached case, you could do scattered loads, so cache-line splits would need to bring two cache lines into L1.
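
    A sketch of such a scattered-split test (my construction; the stride constant is arbitrary): every 4-byte load starts 2 bytes before a cache-line boundary, and a large odd stride between line indices keeps consecutive accesses out of the same lines:

        #include <cstring>
        #include <cstddef>
        // Assumes buf spans `lines` 64-byte cache lines, is much bigger than L3,
        // and has at least 2 bytes of slack past the end (like the question's +32).
        unsigned scattered_splits(const char *buf, size_t lines) {
            unsigned sink = 0;
            size_t idx = 0;
            for (size_t i = 0; i < lines; i++) {
                unsigned tmp;
                std::memcpy(&tmp, buf + idx*64 + 62, sizeof(tmp));  // always a line split
                sink += tmp;
                idx = (idx + 12345) % lines;  // odd stride: visits every line when `lines` is a power of 2
            }
            return sink;  // returned so the loads aren't optimized away
        }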

    Cache lines are 64 bytes wide¹, so you're always testing a mix of cache-line splits and within-a-cache-line accesses. Testing always-split loads would bottleneck harder on the split-load microarchitectural resources. (Actually, depending on your CPU, the cache-fetch width might be narrower than the line size. Recent Intel CPUs can fetch any unaligned chunk from inside a cache line, but that's because they have special hardware to make that fast. Other CPUs may only be at their fastest when fetching within a naturally aligned 16-byte chunk or something. @BeeOnRope says that AMD CPUs may care about 16-byte and 32-byte boundaries. See also https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#memory-related-limits)

    You're not testing store → load forwarding at all. For existing tests, and a nice way to visualize results for different alignments, see this stuffedcow.net blog post: Store-to-Load Forwarding and Memory Disambiguation in x86 Processors.
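
    If you want to roll your own, the core of such a test is a dependent store/reload pair. A minimal sketch (mine, not from the blog post): vary the reload's offset and width relative to the store to map out which combinations forward cheaply and which stall:

        #include <cstdint>
        inline uint64_t store_reload(char *p, uint64_t v) {
            uint64_t r;
            __asm__ volatile ("mov %1, (%2)\n\t"  // 8-byte store
                              "mov 2(%2), %0"     // reload overlapping it at offset +2
                              : "=r"(r) : "r"(v), "r"(p) : "memory");
            return r;
        }
        // Feed the result back in (v = store_reload(p, v);) so every trip
        // through the store buffer is on the critical path.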

    Passing data through memory is an important use case, and misalignment + cache-line splits can interfere with store-forwarding on some CPUs. To properly test this, make sure you test different misalignments (a sweep like the sketch below covers them all), not just 1:15 (vector) or 1:3 (integer). (You currently only test a +1 offset relative to 16B alignment.)
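
    A minimal sketch of such a sweep, reusing the question's own loop64 and t() (the 0..63 range is my choice, covering every position within a cache line):

        // With loop64's sequential 8-byte loads, any offset that isn't a multiple
        // of 8 makes one load per cache line a split; off % 8 == 4 gives the even
        // 4:4 split mentioned below.
        for (int off = 0; off < 64; off++) {
            long long int t0 = t();
            for (int i = 0; i < ITER; i++) loop64<N>(data + off);
            long long int t1 = t();
            printf("offset %2d: %8.4f GB/sec\n", off, (double)N*ITER/(t1-t0)/1000);
        }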

    I forget if it's just for store-forwarding, or for regular loads, but there may be less penalty when a load is split evenly across a cache-line boundary (an 8:8 vector, and maybe also 4:4 or 2:2 integer splits). You should test this. (I might be thinking of P4 lddqu, or Core 2 movdqu.)

    Intel's optimization manual has big tables of misalignment vs. store-forwarding from a wide store to narrow reloads that are fully contained in it. On some CPUs, this works in more cases when the wide store was naturally aligned, even if it doesn't cross any cache-line boundaries. (Maybe on SnB/IvB, since they use a banked L1 cache with 16B banks, and splits across those can affect store forwarding. I didn't re-check the manual, but if you really want to test this experimentally, that's something you should be looking for.)


    Which reminds me, misaligned loads are more likely to provoke cache-bank conflicts on SnB/IvB (because one load can touch two banks). But you won't see this loading from a single stream, because accessing the same bank in the same line twice in one cycle is fine. It's only accessing the same bank in different lines that can't happen in the same cycle. (e.g., when two memory accesses are a multiple of 128 bytes apart.)
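
    A sketch of a loop that could provoke this (my construction): two independent loads whose addresses are a multiple of 128 bytes apart have identical within-line offsets, so on SnB/IvB they map to the same bank in different lines:

        // Two loads per iteration, 128 bytes apart: same bank, different lines
        // on SnB/IvB, so they can't both execute in the same cycle. Compare
        // against a +68 offset, which lands in a different bank.
        for (long i = 0; i < iters; i++)
            __asm__ ("mov    (%0), %%eax\n\t"
                     "mov 128(%0), %%edx"
                     : : "r"(p) : "eax", "edx");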

    You don't make any attempt to test 4k page-splits. They are slower than regular cache-line splits, because they also need two TLB checks. (Skylake improved them from a ~100 cycle penalty to a ~5 cycle penalty beyond the normal load-use latency, though.)
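
    A sketch of a 4k-split probe (mine; error checking omitted): make an 8-byte load straddle a page boundary, then chase it as above to measure latency:

        #include <stdlib.h>
        #include <string.h>
        char *buf;
        posix_memalign(reinterpret_cast<void **>(&buf), 4096, 2 * 4096);  // page-aligned
        char *split = buf + 4096 - 4;          // an 8-byte load here spans both pages
        memcpy(split, &split, sizeof(split));  // the slot points to itself
        const char *p = split;
        for (long i = 0; i < iters; i++)
            __asm__ volatile ("mov (%0), %0" : "+r"(p));  // dependent 4k-split loads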

    You fail to test movups on aligned addresses, so you wouldn't detect that movups is slower than movaps on Core 2 and earlier, even when the memory is aligned at runtime. (I think unaligned mov loads of up to 8 bytes were fine even on Core 2, as long as they didn't cross a cache-line boundary. IDK how old a CPU you'd have to look at to find a problem with non-vector loads within a cache line. It would be a 32-bit-only CPU, but you could still test 8-byte loads with MMX or SSE, or even x87. P5 Pentium and later guarantee that aligned 8-byte loads/stores are atomic, but P6 and newer guarantee that cached 8-byte loads/stores are atomic as long as no cache-line boundary is crossed. Unlike AMD, where 8-byte boundaries matter for atomicity guarantees even in cacheable memory. See Why is integer assignment on a naturally aligned variable atomic on x86?)
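
    Fixing the first of those gaps in the question's harness is a one-liner, using its existing loop128u: time the unaligned-capable instruction on an address that's aligned at runtime, as a fourth data point next to the existing three:

        for (int i = 0; i < ITER; i++) loop128u<N>(data);  // movups, but on an aligned address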

    Go look at Agner Fog's material to learn more about how unaligned loads can be slower, and cook up tests to exercise those cases. Actually, Agner may not be the best resource for that, since his microarchitecture guide mostly focuses on getting uops through the pipeline; he has just a brief mention of the cost of cache-line splits, nothing in-depth about throughput vs. latency.

    See also: Cacheline splits, take two, from Dark Shikari's blog (x264 lead developer), talking about unaligned load strategies on Core2: it was worth it to check for alignment and use a different strategy for the block.


    Footnote 1: 64B cache lines are a safe assumption these days. Pentium 3 and earlier had 32B lines. P4 had 64B lines, but they were often transferred in 128B-aligned pairs. I thought I remembered reading that P4 actually had 128B lines in L2 or L3, but maybe that was just a distortion of 64B lines transferred in pairs. 7-CPU definitely says 64B lines in both levels of cache for a 130nm P4.

    Modern Intel CPUs have an adjacent-line L2 "spatial" prefetcher that similarly tends to pull in the other half of a 128-byte-aligned pair, which can increase false sharing in some cases. Should the cache padding size of x86-64 be 128 bytes? shows an experiment that demonstrates this.


    See also uarch-bench results for Skylake. Apparently someone has already written a tester that checks every possible misalignment relative to a cache-line boundary.


    My testing on a Skylake desktop (i7-6700k):

    Addressing mode affects load-use latency, exactly as Intel documents in their optimization manual. I tested with integer mov rax, [rax+...], and with movzx/sx (in that case using the loaded value as an index, since it's too narrow to be a pointer).

    ;;;  Linux x86-64 NASM/YASM source.  Assemble into a static binary
    ;; public domain, originally written by peter@cordes.ca.
    ;; Share and enjoy.  If it breaks, you get to keep both pieces.
    
    ;;; This kind of grew while I was testing and thinking of things to test
    ;;; I left in some of the comments, but took out most of them and summarized the results outside this code block
    ;;; When I thought of something new to test, I'd edit, save, and up-arrow my assemble-and-run shell command
    ;;; Then edit the result into a comment in the source.
    
    section .bss
    
    ALIGN   2 * 1<<20   ; 2MB = 4096*512.  Uses hugepages in .bss but not in .data.  I checked in /proc/<pid>/smaps
    buf:    resb 16 * 1<<20
    
    section .text
    global _start
    _start:
        mov     esi, 128
    
    ;   mov             edx, 64*123 + 8
    ;   mov             edx, 64*123 + 0
    ;   mov             edx, 64*64 + 0
        xor             edx,edx
       ;; RAX points into buf, 16B into the last 4k page of a 2M hugepage
    
        mov             eax, buf + (2<<20)*0 + 4096*511 + 64*0 + 16
        mov             ecx, 25000000
    
    %define ADDR(x)  x                     ; SKL: 4c
    ;%define ADDR(x)  x + rdx              ; SKL: 5c
    ;%define ADDR(x)  128+60 + x + rdx*2   ; SKL: 11c cache-line split
    ;%define ADDR(x)  x-8                 ; SKL: 5c
    ;%define ADDR(x)  x-7                 ; SKL: 12c for 4k-split (even if it's in the middle of a hugepage)
    ; ... many more things and a block of other result-recording comments taken out
    
    %define dst rax        ; dst = rax makes each load feed the next (latency test);
                           ; a different destination makes the loads independent (throughput)
    
            mov             [ADDR(rax)], dst    ; seed the chain: [rax] holds its own address
    align 32
    .loop:
            mov             dst, [ADDR(rax)]
            mov             dst, [ADDR(rax)]
            mov             dst, [ADDR(rax)]
            mov             dst, [ADDR(rax)]
        dec         ecx
        jnz .loop
    
            xor edi,edi
            mov eax,231
        syscall
    

    Then run with

    asm-link load-use-latency.asm && disas load-use-latency && 
        perf stat -etask-clock,cycles,L1-dcache-loads,instructions,branches -r4 ./load-use-latency
    
    + yasm -felf64 -Worphan-labels -gdwarf2 load-use-latency.asm
    + ld -o load-use-latency load-use-latency.o
     (disassembly output so my terminal history has the asm with the perf results)
    
     Performance counter stats for './load-use-latency' (4 runs):
    
         91.422838      task-clock:u (msec)       #    0.990 CPUs utilized            ( +-  0.09% )
       400,105,802      cycles:u                  #    4.376 GHz                      ( +-  0.00% )
       100,000,013      L1-dcache-loads:u         # 1093.819 M/sec                    ( +-  0.00% )
       150,000,039      instructions:u            #    0.37  insn per cycle           ( +-  0.00% )
        25,000,031      branches:u                #  273.455 M/sec                    ( +-  0.00% )
    
       0.092365514 seconds time elapsed                                          ( +-  0.52% )
    

    In this case, I was testing mov rax, [rax], naturally aligned, so cycles = 4 * L1-dcache-loads. 4c latency. I didn't disable turbo or anything like that. Since nothing is going off-core, core clock cycles are the best way to measure.
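
    (Check against the perf output above: 400,105,802 cycles / 100,000,013 loads ≈ 4.00 cycles per load.)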

    So: hugepages don't help avoid page-split penalties (at least not when both pages are hot in the TLB). A cache-line split makes addressing mode irrelevant, but "fast" addressing modes have 1c lower latency for normal and page-split loads.

    4k-split handling is fantastically better than before; see @harold's numbers, where Haswell has ~32c latency for a 4k-split. (And older CPUs may be even worse than that; I thought pre-SKL it was supposed to be a ~100 cycle penalty.)

    Throughput (regardless of addressing mode), measured by using a destination other than rax so the loads are independent:

      • no split: 0.5c
      • CL-split: 1c
      • 4k-split: ~3.8 to 3.9c (much better than pre-Skylake CPUs)

    Same throughput/latency for movzx/movsx (including WORD splits), as expected because they're handled in the load port (unlike some AMD CPUs, where there's also an ALU uop).

    Uops dependent on cache-line split loads get replayed from the RS (Reservation Station). Counters for uops_dispatched_port.port_2 + port_3 = 2x number of mov rdi, [rdi], in another test using basically the same loop. (This was a dependent-load case, not throughput limited.) The CPU can't detect a split load until after AGU produces a linear address.
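
    For reference, such counts can be read with something like (event names as perf spells them on Skylake): perf stat -e uops_dispatched_port.port_2,uops_dispatched_port.port_3 ./load-use-latency.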

    I previously thought split loads themselves got replayed, but that was based on this pointer-chasing test where every load is dependent on a previous load. If we put an imul rdi, rdi, 1 in the loop, we'd get extra port 1 ALU counts for it getting replayed, not the loads.

    A split load only has to dispatch once, but I'm not sure if it later borrows a cycle in the same load port to access the other cache line (and combine it with the first part, saved in a split register inside that load port), or to initiate a demand load for the other line if it's not present in L1d.

    Whatever the details, throughput of cache-line-split loads is lower than non-splits even if you avoid replays of loads. (We didn't test pointer chasing with that anyway.)

    See also Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up? for more about uop replays. (But note that's for uops dependent on a load, not the load uop itself. In that Q&A, the dependent uops are also mostly loads.)

    A cache-miss load doesn't itself need to be replayed to "accept" the incoming data when it's ready, only dependent uops. See the chat discussion on Are load ops deallocated from the RS when they dispatch, complete or some other time?. This NASM test case (https://godbolt.org/z/HJF3BN) on i7-6700k shows the same number of load uops dispatched regardless of L1d hits or L3 hits, but the number of ALU uops dispatched (not counting loop overhead) goes from 1 per load to ~8.75 per load. The scheduler aggressively schedules uops consuming the data to dispatch in the cycle when load data might arrive from L2 cache (and then very aggressively after that, it seems), instead of waiting one extra cycle to see if it did or not.

    We haven't tested how aggressive replay is when there's other independent but younger work that could be done on the same port whose inputs are definitely ready.


    SKL has two hardware page-walk units, which is probably related to the massive improvement in 4k-split performance. Even when there are no TLB misses, presumably older CPUs had to account for the fact that there might be.

    It's interesting that the 4k-split throughput is non-integer. I think my measurements had enough precision and repeatability to say this. Remember this is with every load being a 4k-split, and no other work going on (except for being inside a small dec/jnz loop). If you ever have this in real code, you're doing something really wrong.

    I don't have any solid guesses at why it might be non-integer, but clearly there's a lot that has to happen microarchitecturally for a 4k-split. It's still a cache-line split, and it has to check the TLB twice.