Consider the following code:
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char** argv) {
    int buf_size = 1024*1024*1024;      /* 1 GiB */
    char* buffer  = malloc(buf_size);
    char* buffer2 = malloc(buf_size);
    for (int i = 0; i < 10; i++) {
        int fd = open(argv[1], O_DIRECT | O_RDONLY);
        read(fd, buffer, buf_size);
        memcpy(buffer2, buffer, buf_size);
        close(fd);
    }
    free(buffer);
    free(buffer2);
    return 0;
}
I get the following result using perf stat when I run the program on a 1 GiB input file:
# perf stat -B -e l2_request_g1.all_no_prefetch:k,l2_request_g1.l2_hw_pf:k,cache-references:k ./main sample.txt
Performance counter stats for './main sample.txt':
651,263,793 l2_request_g1.all_no_prefetch:k
600,476,712 l2_request_g1.l2_hw_pf:k
1,251,740,542 cache-references:k
When I comment out the read(fd, buffer, buf_size) call, I get the following:
36,037,824 l2_request_g1.all_no_prefetch:k
33,416,410 l2_request_g1.l2_hw_pf:k
69,454,244 cache-references:k
Looking at the cache line size, I get the following (the same for index 0-3):
# cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
64
Transparent HugePage Support (THP) is enabled:
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
I've checked that huge pages are allocated at runtime (one way to verify this is sketched after the calculation below). Putting hardware prefetch accesses aside, it seems to me that read is responsible for more than 3 GiB of cache references:
64 x (651,263,793 - 36,037,824) / (1024^3 x 10) = 3.6 GiB
I'm now wondering how reading a 1 GiB file generates 3.6 GiB of memory traffic.
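One way to check that at runtime (a hypothetical helper, not part of the benchmark program above) is to total the AnonHugePages fields in /proc/self/smaps after the buffers have been touched:

#include <stdio.h>

/* Returns how many KiB of this process's anonymous memory are backed by
   transparent huge pages, by summing the AnonHugePages lines in
   /proc/self/smaps; -1 if the file can't be opened. */
static long anon_hugepages_kb(void) {
    FILE* f = fopen("/proc/self/smaps", "r");
    if (!f)
        return -1;
    char line[256];
    long total_kb = 0, kb;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
            total_kb += kb;
    fclose(f);
    return total_kb;
}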
[Update] More Info About the System:
This is running on a double-socket server powered by AMD EPYC 7H12 64-core processors. The Linux kernel version is 6.8.0-41, and the distribution is Ubuntu 24.04.1 LTS. I compile the code using the following command:
# gcc -D_GNU_SOURCE main.c -o main
The filesystem is ZFS:
# df -Th
Filesystem Type Size Used Avail Use% Mounted on
...
home zfs x.yT xyzG x.yT xy% /home
When I remove O_DIRECT, I get the following results (which are not significantly different from when it's included):
650,395,869 l2_request_g1.all_no_prefetch:k
599,548,912 l2_request_g1.l2_hw_pf:k
1,249,944,793 cache-references:k
Finally, if I replace malloc with valloc, I get the following results (again, not much different from the original values):
651,092,248 l2_request_g1.all_no_prefetch:k
558,542,553 l2_request_g1.l2_hw_pf:k
1,209,634,821 cache-references:k
You're using ZFS, but your Linux kernel almost certainly doesn't support O_DIRECT on ZFS. https://www.phoronix.com/news/OpenZFS-Direct-IO says OpenZFS only merged support for it into its mainline five days ago, so unless distro kernels have picked up that patch (which dates back to 2020) earlier, O_DIRECT is probably just being silently ignored.
That probably explains your result of L2 traffic about 4x the size of your read. Two copies (ZFS's ARC to the pagecache, and the pagecache to user space), each reading and writing the whole data. Or even just one copy_to_user, if it isn't avoiding MESI RFOs (Read For Ownership), would have to read the destination into cache before updating it with the newly stored values, so the total traffic is 3x the copy size. The extra 0.6 of a copy could be from the initial copy into the pagecache, plus other L2 traffic that happens while your program runs.
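As a rough illustration of the RFO point (a standalone sketch, not anything the kernel or ZFS literally runs): a plain copy costs about 3x its size in traffic because each destination line is read into cache before being overwritten, while non-temporal stores skip that read and bring it down to about 2x. This assumes x86-64 with AVX (compile with something like gcc -O2 -mavx) and sizes that are multiples of 32 bytes:

#include <immintrin.h>
#include <stdlib.h>
#include <string.h>

/* Plain copy: stores that miss in cache trigger an RFO read of the
   destination line before it's overwritten and later written back,
   so traffic is ~3x the copy size (src read + dst RFO + dst write-back). */
static void copy_plain(char* dst, const char* src, size_t n) {
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), v);
    }
}

/* Non-temporal copy: streaming stores write whole lines without reading
   them first, so traffic is ~2x the copy size.  dst must be 32-byte
   aligned for _mm256_stream_si256. */
static void copy_nt(char* dst, const char* src, size_t n) {
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_stream_si256((__m256i*)(dst + i), v);
    }
    _mm_sfence();   /* order NT stores before later loads/stores */
}

int main(void) {
    size_t n = 64 * 1024 * 1024;          /* 64 MiB */
    char* src = aligned_alloc(32, n);
    char* dst = aligned_alloc(32, n);
    if (!src || !dst) return 1;
    memset(src, 1, n);
    copy_plain(dst, src, n);              /* ~3x n of memory traffic */
    copy_nt(dst, src, n);                 /* ~2x n of memory traffic */
    free(src);
    free(dst);
    return 0;
}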
There are potentially also extra reads for ZFS to verify checksums of the data (not just metadata). Hopefully they cache-block that somewhat so those reads get L1d or at least L2 hits, but IDK. That verify only has to happen after reading from actual disk, though, and with O_DIRECT being fully ignored the data probably just stays hot in the pagecache and/or the ARC. IDK whether any of that checksumming happens in a kernel thread rather than in your own process, where perf stat (without -a) would count it.
Filesystems like XFS and ext4 definitely support O_DIRECT. You will need valloc or aligned_alloc: for big allocations, glibc malloc uses mmap to get new pages and uses the first 16 bytes of that mapping for its bookkeeping metadata, so big allocations are misaligned for every alignment of 32 and larger, including the page size.
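A minimal sketch of what that could look like on a filesystem that honours O_DIRECT, assuming a 512-byte or 4096-byte logical block size (4096 is used below) and compiled with -D_GNU_SOURCE as in the question; the error handling and the read loop are my additions, not from the question's code:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char** argv) {
    size_t buf_size = 1024*1024*1024;
    char* buffer;

    /* posix_memalign returns page-aligned memory, unlike glibc malloc,
       which offsets huge allocations by 16 bytes. */
    if (posix_memalign((void**)&buffer, 4096, buf_size) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    /* If a filesystem can't do direct I/O at all, open() normally fails
       with EINVAL instead of silently ignoring the flag. */
    int fd = open(argv[1], O_DIRECT | O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* read() may return less than requested, so loop until EOF or error;
       with O_DIRECT, the buffer, count, and file offset must stay
       block-aligned. */
    size_t total = 0;
    ssize_t n;
    while (total < buf_size &&
           (n = read(fd, buffer + total, buf_size - total)) > 0)
        total += (size_t)n;

    close(fd);
    free(buffer);
    return 0;
}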
FSes that support compression (like BTRFS) also can't do O_DIRECT for compressed files, and ZFS / BTRFS checksum data, which they have to verify at some point. XFS only checksums metadata.
DMA shouldn't be touching L2 except perhaps to evict cache lines it's overwriting, and it can happen while your process isn't current on a CPU core because it's blocked on I/O and asleep. So you'd actually expect no counts due to that I/O if O_DIRECT worked, unless you used system-wide mode (perf stat -a). And maybe only if you counted events for DRAM or L3. Or, with some of the data hot in L2 from memcpy, that would have to be evicted before the next DMA.
x86 DMA is cache-coherent (early x86 CPUs didn't have caches, and requiring software to invalidate before DMA would have broken backwards compatibility once caches were added). Intel Xeons can even DMA directly into L3, instead of just writing back and invalidating any cached data. I don't know if AMD Zen does anything similar. With each core complex (CCX) having its own L3, it would have to know which L3 to target to be most useful.