I would expect copying an array with a simple for loop to achieve my machine's peak bandwidth, but it does not. I ran the following example code with an input of 3 GB, ensuring that it did not swap. It got 13 GB/s. (Ran 10 times; stdev was < 1 GB/s.)
My CPU is a Zen 2 running at 4 GHz. A vmovupd has a reciprocal throughput of 1, so the CPU should be able to handle 4 * 32 = 128 GB/s on a single core, meaning RAM bandwidth should be the bottleneck. I have two 4 GB sticks (single channel) at 3200 MT/s, so that should be 25 GB/s, not 13 GB/s.
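For reference, the arithmetic behind those two figures (one 32-byte vmovupd per cycle at 4 GHz, and 8 bytes per transfer at 3200 MT/s on a single channel):
#include <stdio.h>
int main(void)
{
    /* Single-core store bound: one 32-byte vector store per cycle at 4 GHz */
    double core_bw = 4e9 * 32;   /* 128 GB/s */
    /* Single-channel DDR4-3200: 3200e6 transfers/s * 8 bytes per transfer */
    double ram_bw = 3200e6 * 8;  /* 25.6 GB/s */
    printf("core store bound: %.1f GB/s\n", core_bw / 1e9);
    printf("RAM bound:        %.1f GB/s\n", ram_bw / 1e9);
    return 0;
}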
So what is going on, and what can I do to get the peak performance here?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
int main(int argc, char **argv)
{
if (argc != 2) {
printf("Usage: %s <BYTES>\n", argv[0]);
return EXIT_FAILURE;
}
size_t n = atol(argv[1]) / sizeof(double);
double *source = malloc(n * sizeof(double));
double *target = malloc(n * sizeof(double));
struct timeval tv1, tv2;
gettimeofday(&tv1, NULL);
for (size_t i = 0; i < n; i++) {
target[i] = source[i];
}
gettimeofday(&tv2, NULL);
double duration = (double)(tv2.tv_usec - tv1.tv_usec) / 1e6 +
(double)(tv2.tv_sec - tv1.tv_sec);
/* 3 because cache write-back policy causes target to be loaded to the CPU */
fprintf(stderr, "Bandwidth is %lf GB/s\n", 3 * n * sizeof(double) / duration / 1e9);
/* Anti-optimisation */
fprintf(stderr, "%lf\n", target[0]);
return EXIT_SUCCESS;
}
Compiled with -O3 -march=native -mtune=native (gcc 12.2.0), the tight loop is:
13e0: c4 c1 7d 10 0c 09 vmovupd (%r9,%rcx,1),%ymm1
13e6: c4 c1 7d 11 0c 0a vmovupd %ymm1,(%r10,%rcx,1)
13ec: 48 83 c1 20 add $0x20,%rcx
13f0: 49 39 cb cmp %rcx,%r11
13f3: 75 eb jne 13e0 <main._omp_fn.0+0xd0>
With OpenMP I could get it up to 18 GB/s, which is still pretty far from peak. (Ran with OMP_PLACES=cores.)
Your benchmark is biased and contains undefined behavior (mentioned by Simon Goater in the comments), not to mention it does not free the allocated memory.
First of all, compilers like GCC (v13.2) can replace your loop with a call to memmove (and possibly even memcpy). In fact, GCC does so. A good implementation of memmove will use non-temporal stores so there is no cache-line write-allocate (i.e. cache lines are not loaded from RAM). This means the computation of the bandwidth is wrong in this case: it should be 2*n*... rather than 3*n*... Nowadays, I expect all x86-64 implementations to use non-temporal stores when copying big arrays in memory. This is what happens on my machine (Debian Testing with an i5-9600KF CPU). It can be seen in profilers: the function __memmove_avx_unaligned_erms is called during the benchmark and takes a significant part of the overall time (~50%).
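If you want to see the effect of non-temporal stores explicitly rather than relying on what memmove does internally, here is a minimal sketch using AVX intrinsics. It assumes the target pointer is 32-byte aligned (e.g. from aligned_alloc, which plain malloc does not guarantee) and that n is a multiple of 4:
#include <immintrin.h>
#include <stddef.h>

/* Sketch: copy with streaming (non-temporal) stores, bypassing the cache
   so no write-allocate happens. Assumes target is 32-byte aligned and
   n is a multiple of 4 doubles. */
static void copy_nt(double *restrict target, const double *restrict source, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m256d v = _mm256_loadu_pd(&source[i]);  /* regular (cached) load */
        _mm256_stream_pd(&target[i], v);          /* non-temporal store */
    }
    _mm_sfence();  /* order the streaming stores before later accesses */
}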
Moreover, your benchmark includes the overhead of page faults. Indeed, malloc does not directly map virtual pages to physical ones in RAM. This mapping is performed lazily at runtime on the first touch, that is, in the middle of your benchmark, and it is particularly expensive. This can be seen with a low-level profiler: the kernel function clear_page_erms is called during the benchmark and takes a significant part of the overall time (~45%).
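To keep those page faults out of the timed region, touch every page before starting the timer (this is what the init function in the code below does with memset). A minimal sketch, assuming 4 KiB pages:
#include <stddef.h>

/* Touch each page once so the kernel maps and zeroes it before the
   timed copy starts. Assumes 4 KiB pages (larger pages still get
   touched, just more often than necessary). */
static void prefault(unsigned char *buf, size_t bytes)
{
    for (size_t off = 0; off < bytes; off += 4096)
        buf[off] = 0;
    if (bytes)
        buf[bytes - 1] = 0;  /* make sure the last page is touched too */
}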
On top of that, one core often cannot saturate the RAM on its own. This is because the speed of memory accesses is limited by concurrent_accesses / latency. More specifically, the latency of RAM accesses is quite high and the buffers that track in-flight accesses are, AFAIK, often not large enough to saturate the RAM on most mainstream platforms. Because of that, it is not rare for multiple cores to be needed to saturate the RAM (1-3 are often enough on mainstream x86-64 PCs).
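As a rough illustration of that formula (the numbers below are assumptions for a typical desktop core, not measurements of a Zen 2): a dozen outstanding cache-line requests at ~80 ns of memory latency already cap a single core at roughly 10 GB/s.
#include <stdio.h>

int main(void)
{
    /* Illustrative, assumed numbers: ~12 in-flight cache-line requests per
       core, 64-byte lines, ~80 ns memory latency. */
    double concurrent_accesses = 12;
    double line_bytes = 64;
    double latency = 80e-9;
    double per_core_bw = concurrent_accesses * line_bytes / latency;
    printf("per-core bandwidth bound: %.1f GB/s\n", per_core_bw / 1e9);  /* ~9.6 GB/s */
    return 0;
}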
Finally, you need to be careful about which NUMA node each page is mapped to on NUMA architectures. Partitioning the work per thread (as the code below does) is the key to avoiding sneaky NUMA effects. AFAIK, AMD CPUs are NUMA ones (due to CCXs/CCDs, though I did not check in practice).
A simple solution is to use a better benchmark like the STREAM Triad (designed decades ago and still quite good if tuned correctly).
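For reference, the core of the STREAM Triad kernel is just the loop below (a sketch; the real benchmark adds repetition, timing and result validation):
#include <stddef.h>

/* Sketch of the STREAM Triad kernel: a[i] = b[i] + scalar * c[i].
   Compile with -fopenmp; the arrays must not alias. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double scalar, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}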
Alternatively, you can use the following better benchmark:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>
#include <omp.h>
void __attribute__ ((noinline)) init(double* target, double* source, size_t size)
{
#pragma omp parallel
{
const size_t start = (size * omp_get_thread_num()) / omp_get_num_threads();
const size_t stop = (size * (omp_get_thread_num() + 1)) / omp_get_num_threads();
memset(&source[start], 0, (stop - start) * sizeof(double));
memset(&target[start], 0, (stop - start) * sizeof(double));
}
}
void __attribute__ ((noinline)) benchmark_classical_stores(double* target, double* source, size_t size)
{
// GCC generates a SIMD loop using classical loads/stores with OpenMP, but not for the sequential version
#pragma omp parallel for // Clause ignored by GCC: simd nontemporal(target)
for (size_t i = 0; i < size; i++) {
target[i] = source[i];
}
}
void __attribute__ ((noinline)) benchmark_nt_stores(double* target, double* source, size_t size)
{
#pragma omp parallel
{
const size_t start = (size * omp_get_thread_num()) / omp_get_num_threads();
const size_t stop = (size * (omp_get_thread_num() + 1)) / omp_get_num_threads();
memcpy(&target[start], &source[start], (stop - start) * sizeof(double));
}
}
int main(int argc, char **argv)
{
if (argc != 2) {
printf("Usage: %s <BYTES>\n", argv[0]);
return EXIT_FAILURE;
}
size_t n = atol(argv[1]) / sizeof(double);
double *source = malloc(n * sizeof(double));
double *target = malloc(n * sizeof(double));
init(target, source, n);
for (int j = 0; j < 10; ++j)
{
struct timeval tv1, tv2;
gettimeofday(&tv1, NULL);
benchmark_nt_stores(target, source, n);
gettimeofday(&tv2, NULL);
double duration = (double)(tv2.tv_usec - tv1.tv_usec) / 1e6 +
(double)(tv2.tv_sec - tv1.tv_sec);
fprintf(stderr, "Bandwidth is %lf GB/s\n", 2 * n * sizeof(double) / duration / 1e9);
fprintf(stderr, "%lf\n", target[0]);
}
free(source);
free(target);
return EXIT_SUCCESS;
}
I get 37.9 GiB/s on my machine for the solution using non-temporal writes and 41.2 GiB/s for the one using classical loads/stores. The maximum practical throughput of my (3200 MHz, dual-channel) RAM is 42-43 GiB/s, while its theoretical bandwidth is 47.7 GiB/s. These results are good, since my i5-9600KF CPU is designed to achieve a throughput of 38.7 GiB/s. I think the non-temporal stores result in a lower throughput because of their limited concurrency compared to the alternative solution (which can benefit from the whole cache hierarchy).
Slightly better results can certainly be achieved with huge pages (to avoid TLB misses) and aligned loads (memcpy performs unaligned loads on my machine).
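As a sketch of the huge-page idea on Linux (using transparent huge pages via madvise; MADV_HUGEPAGE is only a hint and the kernel may ignore it):
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

/* Sketch: allocate a 2 MiB-aligned buffer and ask Linux to back it with
   transparent huge pages, which reduces TLB misses for big arrays.
   Falls back to regular 4 KiB pages if THP is unavailable. */
static void *alloc_huge(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 2 * 1024 * 1024, bytes) != 0)
        return NULL;
    madvise(p, bytes, MADV_HUGEPAGE);  /* hint only; safe to ignore failure */
    return p;
}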
Note that saturating the RAM is notoriously hard for CPUs, especially when loads and stores are mixed (and even more so with non-temporal stores). If you reach >80% of the theoretical bandwidth with mixed reads/writes, you can consider that the code saturates your memory (>85% for reads only).