I am new to CUDA/C++ and I am studying about Unified Memory. I have found this introduction to this topic. However, I have a question regarding one of the examples.
To mitigate migration overhead there is one example in which the data is initialized in a kernel:
#include <iostream>
#include <math.h>
// initialize arrays on device
__global__ void init(int n, float *x, float *y) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride) {
x[i] = 1.0f;
y[i] = 2.0f;
}
}
// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y){
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride){
y[i] = x[i] + y[i];
}
}
int main(void)
{
int N = 1<<20;
float *x, *y;
// Allocate Unified Memory -- accessible from CPU or GPU
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
// Launch kernel on 1M elements on the GPU
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
init<<<numBlocks, blockSize>>>(N, x, y);
add<<<numBlocks, blockSize>>>(N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Free memory
cudaFree(x);
cudaFree(y);
return 0;
}
In the link I have put previously it is said that for this case "There are still device-to-host page faults, but this is due to the loop at the end of the program that checks the results on the CPU.". However, I have deleted the loop at the end and the profiling for this is
==4242== NVPROF is profiling process 4242, command: /content/src/add_unifmem_initonkernel
==4242== Profiling application: /content/src/add_unifmem_initonkernel
==4242== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 96.00% 1.4178ms 1 1.4178ms 1.4178ms 1.4178ms init(int, float*, float*)
4.00% 59.070us 1 59.070us 59.070us 59.070us add(int, float*, float*)
API calls: 99.21% 263.47ms 2 131.74ms 54.879us 263.42ms cudaMallocManaged
0.54% 1.4273ms 1 1.4273ms 1.4273ms 1.4273ms cudaDeviceSynchronize
0.15% 401.83us 2 200.91us 197.33us 204.49us cudaFree
0.05% 120.55us 101 1.1930us 139ns 50.860us cuDeviceGetAttribute
0.04% 96.692us 2 48.346us 40.043us 56.649us cudaLaunchKernel
0.01% 28.565us 1 28.565us 28.565us 28.565us cuDeviceGetName
0.00% 6.9460us 1 6.9460us 6.9460us 6.9460us cuDeviceGetPCIBusId
0.00% 2.0890us 3 696ns 225ns 1.5490us cuDeviceGetCount
0.00% 1.0370us 2 518ns 314ns 723ns cuDeviceGet
0.00% 502ns 1 502ns 502ns 502ns cuDeviceTotalMem
0.00% 500ns 1 500ns 500ns 500ns cuModuleGetLoadingMode
0.00% 230ns 1 230ns 230ns 230ns cuDeviceGetUuid
==4242== Unified Memory profiling result:
Device "Tesla T4 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
13 - - - - 1.695805ms Gpu page fault groups
There is still some GPU page faults happening, but if I have got it correctly it should not happen for this case.
What am I missing here?
Your init
kernel is still experiencing page faults. You can get an additional clue of this by noting the huge time disparity between the duration of the init
kernel (~1400 microseconds) and the add
kernel (~60 microseconds).
The reason for this is that page faults may occur in at least two cases, related to the same core issue: the page touched by the code is not present in device memory. Perhaps the typical case for this is when the data is physically present on some other processor, and needs to be migrated. In this case the page fault serves the purpose to trigger the migration, and when that happens nvprof
will usually report additional data associated with the faults, such as the amount of data migrated, size of the blocks, number of migrations, etc.
But all that is missing in your report. This is a second kind of clue that these page faults have a slightly different origin and purpose. The basic idea is that some allocators are so-called "lazy allocators". The allocator creates the possibility for the data to exist, including an address range, but does not actually assign or "map" physical memory to store it. cudaMallocManaged
is a lazy allocator in this respect. The assignment of memory will happen on "first touch". And in your case, first touch takes place in the init
kernel. Since the data is not actually present or fully allocated in device memory at that point, page faults occur, and these page faults have the purpose of "bringing pages into existence" as opposed to migration of data.
If you want to make this effect disappear altogether, you will need to actually instantiate the data somewhere. In typical programmatic usage, you would normally do this by initializing the data "somewhere", so if you do it in the init
kernel you will get these kind of GPU page faults and if you do it in host code and then allow the data to be migrated to the GPU, you will get these kind of page faults in CPU code.
For your particular program here, one approach you could take to remove this effect is to insert the following before your first (init
) kernel call:
cudaMemPrefetchAsync(x, N*sizeof(float), 0);
cudaMemPrefetchAsync(y, N*sizeof(float), 0);
Here is a full example:
$ cat t65.cu
#include <iostream>
#include <math.h>
// initialize arrays on device
__global__ void init(int n, float *x, float *y) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride) {
x[i] = 1.0f;
y[i] = 2.0f;
}
}
// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y){
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride){
y[i] = x[i] + y[i];
}
}
int main(void)
{
int N = 1<<20;
float *x, *y;
// Allocate Unified Memory -- accessible from CPU or GPU
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
// Launch kernel on 1M elements on the GPU
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
#ifdef USE_FIX
cudaMemPrefetchAsync(x, N*sizeof(float), 0);
cudaMemPrefetchAsync(y, N*sizeof(float), 0);
#endif
init<<<numBlocks, blockSize>>>(N, x, y);
add<<<numBlocks, blockSize>>>(N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Free memory
cudaFree(x);
cudaFree(y);
return 0;
}
$ nvcc -arch=sm_75 -o t65 t65.cu
$ nvprof ./t65
==2152== NVPROF is profiling process 2152, command: ./t65
==2152== Profiling application: ./t65
==2152== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 94.67% 847.68us 1 847.68us 847.68us 847.68us init(int, float*, float*)
5.33% 47.771us 1 47.771us 47.771us 47.771us add(int, float*, float*)
API calls: 98.80% 100.54ms 2 50.269ms 20.319us 100.52ms cudaMallocManaged
0.84% 858.64us 1 858.64us 858.64us 858.64us cudaDeviceSynchronize
0.09% 91.105us 2 45.552us 38.187us 52.918us cudaFree
0.09% 90.522us 114 794ns 95ns 34.440us cuDeviceGetAttribute
0.08% 85.802us 1 85.802us 85.802us 85.802us cuLibraryLoadData
0.08% 78.436us 2 39.218us 37.040us 41.396us cudaLaunchKernel
0.01% 9.4660us 1 9.4660us 9.4660us 9.4660us cuDeviceGetPCIBusId
0.01% 8.8140us 1 8.8140us 8.8140us 8.8140us cuDeviceGetName
0.00% 982ns 3 327ns 156ns 670ns cuDeviceGetCount
0.00% 518ns 2 259ns 122ns 396ns cuDeviceGet
0.00% 472ns 1 472ns 472ns 472ns cuDeviceTotalMem
0.00% 279ns 1 279ns 279ns 279ns cuDeviceGetUuid
0.00% 236ns 1 236ns 236ns 236ns cuModuleGetLoadingMode
==2152== Unified Memory profiling result:
Device "NVIDIA GeForce GTX 1660 SUPER (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
12 - - - - 850.3410us Gpu page fault groups
$ nvcc -arch=sm_75 -o t65 t65.cu -DUSE_FIX
$ nvprof ./t65
==2192== NVPROF is profiling process 2192, command: ./t65
==2192== Profiling application: ./t65
==2192== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 62.56% 47.167us 1 47.167us 47.167us 47.167us add(int, float*, float*)
37.44% 28.223us 1 28.223us 28.223us 28.223us init(int, float*, float*)
API calls: 99.39% 96.413ms 2 48.206ms 20.312us 96.392ms cudaMallocManaged
0.19% 182.79us 2 91.394us 16.462us 166.33us cudaMemPrefetchAsync
0.12% 112.34us 114 985ns 93ns 35.158us cuDeviceGetAttribute
0.09% 87.379us 1 87.379us 87.379us 87.379us cuLibraryLoadData
0.08% 78.244us 2 39.122us 33.997us 44.247us cudaFree
0.07% 64.200us 1 64.200us 64.200us 64.200us cudaDeviceSynchronize
0.05% 51.041us 2 25.520us 11.435us 39.606us cudaLaunchKernel
0.01% 8.3590us 1 8.3590us 8.3590us 8.3590us cuDeviceGetName
0.01% 7.5140us 1 7.5140us 7.5140us 7.5140us cuDeviceGetPCIBusId
0.00% 963ns 3 321ns 167ns 622ns cuDeviceGetCount
0.00% 487ns 2 243ns 108ns 379ns cuDeviceGet
0.00% 434ns 1 434ns 434ns 434ns cuDeviceTotalMem
0.00% 226ns 1 226ns 226ns 226ns cuModuleGetLoadingMode
0.00% 192ns 1 192ns 192ns 192ns cuDeviceGetUuid
$
We see that the duration of the init
kernel drops to something that is roughly comparable to the duration of the add
kernel, and it removes all page fault reporting.
You can get additional discussion of managed memory behavior in unit 6 of this online training series.