Consider the following code:
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char** argv) {
    int buf_size = 1024*1024*1024;      /* 1 GiB */
    char* buffer  = malloc(buf_size);
    char* buffer2 = malloc(buf_size);
    for (int i = 0; i < 10; i++) {
        int fd = open(argv[1], O_DIRECT | O_RDONLY);
        read(fd, buffer, buf_size);
        memcpy(buffer2, buffer, buf_size);
        close(fd);
    }
    free(buffer);
    free(buffer2);
    return 0;
}
I get the following result using perf stat when I run the program on a 1 GiB input file:
# perf stat -B -e l2_request_g1.all_no_prefetch:k,l2_request_g1.l2_hw_pf:k,cache-references:k ./main sample.txt
Performance counter stats for './main sample.txt':
651,263,793 l2_request_g1.all_no_prefetch:k
600,476,712 l2_request_g1.l2_hw_pf:k
1,251,740,542 cache-references:k
When I comment out the read(fd, buffer, buf_size) call, I get the following:
36,037,824 l2_request_g1.all_no_prefetch:k
33,416,410 l2_request_g1.l2_hw_pf:k
69,454,244 cache-references:k
Looking at the cache line size, I get the following (the same for index 0-3):
# cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
64
Transparent HugePage Support (THP) is enabled:
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
I've checked that huge pages are allocated at runtime (one way to verify this is sketched after the calculation below). Putting hardware prefetch accesses aside, it seems to me that read is responsible for more than 3 GiB of cache references:
64 x (651,263,793 - 36,037,824) / (1024^3 x 10) = 3.6 GiB
I'm now wondering how reading a 1 GiB file generates 3.6 GiB of memory traffic.
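One way to check that at runtime (a hypothetical helper, not part of the benchmark program above) is to total the AnonHugePages fields in /proc/self/smaps after the buffers have been touched:

#include <stdio.h>

/* Returns how many KiB of this process's anonymous memory are backed by
   transparent huge pages, by summing the AnonHugePages lines in
   /proc/self/smaps; -1 if the file can't be opened. */
static long anon_hugepages_kb(void) {
    FILE* f = fopen("/proc/self/smaps", "r");
    if (!f)
        return -1;
    char line[256];
    long total_kb = 0, kb;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
            total_kb += kb;
    fclose(f);
    return total_kb;
}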
[Update] More Info About the System:
This is running on a double-socket server powered by AMD EPYC 7H12 64-core processors. The Linux kernel version is 6.8.0-41, and the distribution is Ubuntu 24.04.1 LTS. I compile the code using the following command:
# gcc -D_GNU_SOURCE main.c -o main
The filesystem is ZFS:
# df -Th
Filesystem Type Size Used Avail Use% Mounted on
...
home zfs x.yT xyzG x.yT xy% /home
When I remove O_DIRECT, I get the following results (which are not significantly different from when it's included):
650,395,869 l2_request_g1.all_no_prefetch:k
599,548,912 l2_request_g1.l2_hw_pf:k
1,249,944,793 cache-references:k
Finally, if I replace malloc with valloc, I get the following results (again, not much different from the original values):
651,092,248 l2_request_g1.all_no_prefetch:k
558,542,553 l2_request_g1.l2_hw_pf:k
1,209,634,821 cache-references:k
You're using ZFS, but your Linux kernel almost certainly doesn't support O_DIRECT on ZFS. https://www.phoronix.com/news/OpenZFS-Direct-IO says OpenZFS only merged support for it into its mainline five days ago, so unless distro kernels have picked up that patch (which dates back to 2020) earlier, O_DIRECT is probably just being silently ignored.
That probably explains your result of L2 traffic about 4x the size of your read. Two copies (ZFS's ARC to the pagecache, and the pagecache to user space), each reading and writing the whole data. Or even just one copy_to_user, if it isn't avoiding MESI RFOs (Read For Ownership), would have to read the destination into cache before updating it with the newly stored values, so the total traffic is 3x the copy size. The extra 0.6 of a copy could be from the initial copy into the pagecache, plus other L2 traffic that happens while your program runs.
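As a rough illustration of the RFO point (a standalone sketch, not anything the kernel or ZFS literally runs): a plain copy costs about 3x its size in traffic because each destination line is read into cache before being overwritten, while non-temporal stores skip that read and bring it down to about 2x. This assumes x86-64 with AVX (compile with something like gcc -O2 -mavx) and sizes that are multiples of 32 bytes:

#include <immintrin.h>
#include <stdlib.h>
#include <string.h>

/* Plain copy: stores that miss in cache trigger an RFO read of the
   destination line before it's overwritten and later written back,
   so traffic is ~3x the copy size (src read + dst RFO + dst write-back). */
static void copy_plain(char* dst, const char* src, size_t n) {
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), v);
    }
}

/* Non-temporal copy: streaming stores write whole lines without reading
   them first, so traffic is ~2x the copy size.  dst must be 32-byte
   aligned for _mm256_stream_si256. */
static void copy_nt(char* dst, const char* src, size_t n) {
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_stream_si256((__m256i*)(dst + i), v);
    }
    _mm_sfence();   /* order NT stores before later loads/stores */
}

int main(void) {
    size_t n = 64 * 1024 * 1024;          /* 64 MiB */
    char* src = aligned_alloc(32, n);
    char* dst = aligned_alloc(32, n);
    if (!src || !dst) return 1;
    memset(src, 1, n);
    copy_plain(dst, src, n);              /* ~3x n of memory traffic */
    copy_nt(dst, src, n);                 /* ~2x n of memory traffic */
    free(src);
    free(dst);
    return 0;
}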
There are potentially also extra reads for ZFS to verify checksums of the data (not just metadata). Hopefully they cache-block that somewhat so those reads get L1d or at least L2 hits, but IDK. That verify only has to happen after reading from actual disk, though, and with O_DIRECT being fully ignored the data probably just stays hot in the pagecache and/or the ARC. IDK whether any of that checksumming happens in a kernel thread rather than in your own process, where perf stat (without -a) would count it.
Filesystems like XFS and ext4 definitely support O_DIRECT. You will need valloc or aligned_alloc: for big allocations, glibc malloc uses mmap to get new pages and uses the first 16 bytes of that mapping for its bookkeeping metadata, so big allocations are misaligned for every alignment of 32 and larger, including the page size.
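A minimal sketch of what that could look like on a filesystem that honours O_DIRECT, assuming a 512-byte or 4096-byte logical block size (4096 is used below) and compiled with -D_GNU_SOURCE as in the question; the error handling and the read loop are my additions, not from the question's code:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char** argv) {
    size_t buf_size = 1024*1024*1024;
    char* buffer;

    /* posix_memalign returns page-aligned memory, unlike glibc malloc,
       which offsets huge allocations by 16 bytes. */
    if (posix_memalign((void**)&buffer, 4096, buf_size) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    /* If a filesystem can't do direct I/O at all, open() normally fails
       with EINVAL instead of silently ignoring the flag. */
    int fd = open(argv[1], O_DIRECT | O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* read() may return less than requested, so loop until EOF or error;
       with O_DIRECT, the buffer, count, and file offset must stay
       block-aligned. */
    size_t total = 0;
    ssize_t n;
    while (total < buf_size &&
           (n = read(fd, buffer + total, buf_size - total)) > 0)
        total += (size_t)n;

    close(fd);
    free(buffer);
    return 0;
}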
FSes that support compression (like BTRFS) also can't do O_DIRECT for compressed files, and ZFS / BTRFS checksum data, which they have to verify at some point. XFS only checksums metadata.
DMA shouldn't be touching L2 except perhaps to evict cache lines it's overwriting, and it can happen while your process isn't current on a CPU core because it's blocked on I/O and asleep. So you'd actually expect no counts due to that I/O if O_DIRECT worked, unless you used system-wide mode (perf stat -a). And maybe only if you counted events for DRAM or L3. Or, with some of the data hot in L2 from memcpy, that would have to be evicted before the next DMA.
x86 DMA is cache-coherent (early x86 CPUs didn't have caches, and requiring software to invalidate before DMA would have broken backwards compatibility once caches were added). Intel Xeons can even DMA directly into L3, instead of just writing back and invalidating any cached data. I don't know if AMD Zen does anything similar. With each core complex (CCX) having its own L3, it would have to know which L3 to target to be most useful.