Tags: cuda, page-fault

Getting "GPU page fault" when initializing data in a kernel


I am new to CUDA/C++ and I am studying Unified Memory. I found this introduction to the topic, but I have a question about one of its examples.

To mitigate migration overhead, one of the examples initializes the data in a kernel:

#include <iostream>
#include <math.h>

// initialize arrays on device
__global__ void init(int n, float *x, float *y) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
}

// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y){
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < n; i += stride){
        y[i] = x[i] + y[i];
    }
}

int main(void)
{
    int N = 1<<20;
    float *x, *y;

    // Allocate Unified Memory -- accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;

    init<<<numBlocks, blockSize>>>(N, x, y);
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}

The article linked above states that for this case, "There are still device-to-host page faults, but this is due to the loop at the end of the program that checks the results on the CPU." However, I deleted that loop, and this is the profiling output:

==4242== NVPROF is profiling process 4242, command: /content/src/add_unifmem_initonkernel
==4242== Profiling application: /content/src/add_unifmem_initonkernel
==4242== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   96.00%  1.4178ms         1  1.4178ms  1.4178ms  1.4178ms  init(int, float*, float*)
                    4.00%  59.070us         1  59.070us  59.070us  59.070us  add(int, float*, float*)
      API calls:   99.21%  263.47ms         2  131.74ms  54.879us  263.42ms  cudaMallocManaged
                    0.54%  1.4273ms         1  1.4273ms  1.4273ms  1.4273ms  cudaDeviceSynchronize
                    0.15%  401.83us         2  200.91us  197.33us  204.49us  cudaFree
                    0.05%  120.55us       101  1.1930us     139ns  50.860us  cuDeviceGetAttribute
                    0.04%  96.692us         2  48.346us  40.043us  56.649us  cudaLaunchKernel
                    0.01%  28.565us         1  28.565us  28.565us  28.565us  cuDeviceGetName
                    0.00%  6.9460us         1  6.9460us  6.9460us  6.9460us  cuDeviceGetPCIBusId
                    0.00%  2.0890us         3     696ns     225ns  1.5490us  cuDeviceGetCount
                    0.00%  1.0370us         2     518ns     314ns     723ns  cuDeviceGet
                    0.00%     502ns         1     502ns     502ns     502ns  cuDeviceTotalMem
                    0.00%     500ns         1     500ns     500ns     500ns  cuModuleGetLoadingMode
                    0.00%     230ns         1     230ns     230ns     230ns  cuDeviceGetUuid

==4242== Unified Memory profiling result:
Device "Tesla T4 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      13         -         -         -           -  1.695805ms  Gpu page fault groups

There are still some GPU page faults happening, but if I understand correctly, they should not occur in this case.

What am I missing here?


Solution

  • Your init kernel is still experiencing page faults. An additional clue is the huge time disparity between the duration of the init kernel (~1400 microseconds) and that of the add kernel (~60 microseconds).

    The reason for this is that page faults may occur in at least two cases, both related to the same core issue: the page touched by the code is not present in device memory. Perhaps the most typical case is when the data is physically present on some other processor and needs to be migrated. In that case the page fault serves to trigger the migration, and when that happens nvprof will usually report additional data associated with the faults, such as the amount of data migrated, the size of the blocks, the number of migrations, etc.

    But all of that is missing from your report. This is a second kind of clue that these page faults have a slightly different origin and purpose. The basic idea is that some allocators are so-called "lazy allocators": the allocator creates the possibility for the data to exist, including an address range, but does not actually assign or "map" physical memory to store it. cudaMallocManaged is a lazy allocator in this respect. The assignment of memory happens on "first touch", and in your case first touch takes place in the init kernel. Since the data is not actually present or fully allocated in device memory at that point, page faults occur, and these page faults have the purpose of "bringing pages into existence" rather than migrating existing data.

    If you want to make this effect disappear altogether, you need to actually instantiate the data somewhere. In typical usage you do that simply by initializing the data: if you do it in the init kernel you get this kind of GPU page fault, and if you do it in host code and then let the data migrate to the GPU, you get this kind of page fault in CPU code instead (a host-side sketch follows below).
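
    For contrast, here is a minimal sketch (my addition, not code from the article) of the host-side alternative. First touch happens in a CPU loop, so the pages are instantiated in host memory, and the add kernel then triggers ordinary migration faults (for which nvprof reports transfer sizes) rather than "bring into existence" faults:

    int N = 1<<20;
    float *x, *y;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // CPU first touch: the pages are instantiated in host memory here
    for (int i = 0; i < N; i++) {
      x[i] = 1.0f;
      y[i] = 2.0f;
    }

    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    // add is the same kernel as in the question; touching x and y on the
    // GPU now causes host-to-device migration faults instead
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);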

    For your particular program here, one approach you could take to remove this effect is to insert the following before your first (init) kernel call:

    cudaMemPrefetchAsync(x, N*sizeof(float), 0);
    cudaMemPrefetchAsync(y, N*sizeof(float), 0);
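
    The third argument of cudaMemPrefetchAsync is the destination device ordinal; 0 here assumes the kernels run on device 0. If you prefer not to hard-code it, here is a small variation (my sketch, using the standard cudaGetDevice runtime call):

    int device = -1;
    cudaGetDevice(&device);                            // ordinal of the GPU the kernels will run on
    cudaMemPrefetchAsync(x, N*sizeof(float), device);
    cudaMemPrefetchAsync(y, N*sizeof(float), device);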
    

    Here is a full example:

    $ cat t65.cu
    #include <iostream>
    #include <math.h>
    
    // initialize arrays on device
    __global__ void init(int n, float *x, float *y) {
      int index = threadIdx.x + blockIdx.x * blockDim.x;
      int stride = blockDim.x * gridDim.x;
      for (int i = index; i < n; i += stride) {
        x[i] = 1.0f;
        y[i] = 2.0f;
      }
    }
    
    // CUDA kernel to add elements of two arrays
    __global__ void add(int n, float *x, float *y){
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
    
        for (int i = index; i < n; i += stride){
            y[i] = x[i] + y[i];
        }
    }
    
    int main(void)
    {
        int N = 1<<20;
        float *x, *y;
    
        // Allocate Unified Memory -- accessible from CPU or GPU
        cudaMallocManaged(&x, N*sizeof(float));
        cudaMallocManaged(&y, N*sizeof(float));
    
        // Launch kernel on 1M elements on the GPU
        int blockSize = 256;
        int numBlocks = (N + blockSize - 1) / blockSize;
    #ifdef USE_FIX
        cudaMemPrefetchAsync(x, N*sizeof(float), 0);
        cudaMemPrefetchAsync(y, N*sizeof(float), 0);
    #endif
        init<<<numBlocks, blockSize>>>(N, x, y);
        add<<<numBlocks, blockSize>>>(N, x, y);
    
        // Wait for GPU to finish before accessing on host
        cudaDeviceSynchronize();
    
        // Free memory
        cudaFree(x);
        cudaFree(y);
    
        return 0;
    }
    $ nvcc -arch=sm_75 -o t65 t65.cu
    $ nvprof ./t65
    ==2152== NVPROF is profiling process 2152, command: ./t65
    ==2152== Profiling application: ./t65
    ==2152== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:   94.67%  847.68us         1  847.68us  847.68us  847.68us  init(int, float*, float*)
                        5.33%  47.771us         1  47.771us  47.771us  47.771us  add(int, float*, float*)
          API calls:   98.80%  100.54ms         2  50.269ms  20.319us  100.52ms  cudaMallocManaged
                        0.84%  858.64us         1  858.64us  858.64us  858.64us  cudaDeviceSynchronize
                        0.09%  91.105us         2  45.552us  38.187us  52.918us  cudaFree
                        0.09%  90.522us       114     794ns      95ns  34.440us  cuDeviceGetAttribute
                        0.08%  85.802us         1  85.802us  85.802us  85.802us  cuLibraryLoadData
                        0.08%  78.436us         2  39.218us  37.040us  41.396us  cudaLaunchKernel
                        0.01%  9.4660us         1  9.4660us  9.4660us  9.4660us  cuDeviceGetPCIBusId
                        0.01%  8.8140us         1  8.8140us  8.8140us  8.8140us  cuDeviceGetName
                        0.00%     982ns         3     327ns     156ns     670ns  cuDeviceGetCount
                        0.00%     518ns         2     259ns     122ns     396ns  cuDeviceGet
                        0.00%     472ns         1     472ns     472ns     472ns  cuDeviceTotalMem
                        0.00%     279ns         1     279ns     279ns     279ns  cuDeviceGetUuid
                        0.00%     236ns         1     236ns     236ns     236ns  cuModuleGetLoadingMode
    
    ==2152== Unified Memory profiling result:
    Device "NVIDIA GeForce GTX 1660 SUPER (0)"
       Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
          12         -         -         -           -  850.3410us  Gpu page fault groups
    $ nvcc -arch=sm_75 -o t65 t65.cu -DUSE_FIX
    $ nvprof ./t65
    ==2192== NVPROF is profiling process 2192, command: ./t65
    ==2192== Profiling application: ./t65
    ==2192== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:   62.56%  47.167us         1  47.167us  47.167us  47.167us  add(int, float*, float*)
                       37.44%  28.223us         1  28.223us  28.223us  28.223us  init(int, float*, float*)
          API calls:   99.39%  96.413ms         2  48.206ms  20.312us  96.392ms  cudaMallocManaged
                        0.19%  182.79us         2  91.394us  16.462us  166.33us  cudaMemPrefetchAsync
                        0.12%  112.34us       114     985ns      93ns  35.158us  cuDeviceGetAttribute
                        0.09%  87.379us         1  87.379us  87.379us  87.379us  cuLibraryLoadData
                        0.08%  78.244us         2  39.122us  33.997us  44.247us  cudaFree
                        0.07%  64.200us         1  64.200us  64.200us  64.200us  cudaDeviceSynchronize
                        0.05%  51.041us         2  25.520us  11.435us  39.606us  cudaLaunchKernel
                        0.01%  8.3590us         1  8.3590us  8.3590us  8.3590us  cuDeviceGetName
                        0.01%  7.5140us         1  7.5140us  7.5140us  7.5140us  cuDeviceGetPCIBusId
                        0.00%     963ns         3     321ns     167ns     622ns  cuDeviceGetCount
                        0.00%     487ns         2     243ns     108ns     379ns  cuDeviceGet
                        0.00%     434ns         1     434ns     434ns     434ns  cuDeviceTotalMem
                        0.00%     226ns         1     226ns     226ns     226ns  cuModuleGetLoadingMode
                        0.00%     192ns         1     192ns     192ns     192ns  cuDeviceGetUuid
    $
    

    With the fix, the duration of the init kernel drops to something roughly comparable to the duration of the add kernel, and all page fault reporting disappears.

    You can get additional discussion of managed memory behavior in unit 6 of this online training series.
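
    As a side note (my addition, not part of the quoted article): the device-to-host page faults that the article attributes to the CPU verification loop can be removed with the same prefetch technique in the other direction. A minimal sketch, assuming the verification loop from the article is placed after the kernels:

    cudaDeviceSynchronize();                                    // make sure both kernels have finished
    cudaMemPrefetchAsync(y, N*sizeof(float), cudaCpuDeviceId);  // migrate y back to host memory up front
    cudaDeviceSynchronize();                                    // make sure the prefetch has completed

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
      maxError = fmax(maxError, fabs(y[i] - 3.0f));             // each element should be 1.0f + 2.0f
    std::cout << "Max error: " << maxError << std::endl;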