c++ linux memory cuda

CUDA and pinned (page locked) memory not page locked at all?


I am trying to figure out whether CUDA (or the OpenCL implementation) tells the truth when I request pinned (page-locked) memory.

I tried cudaMallocHost and looked at the Mlocked and Unevictable values in /proc/meminfo; both stay at 0 and never go up (/proc/<pid>/status also reports VmLck as 0). When I use mlock to page-lock memory instead, the values go up as expected.
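
For reference, the mlock comparison can be reproduced with a sketch along these lines. It assumes Linux with /proc mounted; find_kb, vm_lck_kb, probe_mlock and LockProbe are illustrative names rather than part of any API, and mlock() can fail if RLIMIT_MEMLOCK is too small.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <sys/mman.h>

// Scan a /proc-style stream token by token and return the number that
// follows `label` (e.g. "VmLck:"), or -1 if the label is not found.
long find_kb(std::istream& in, const std::string& label) {
    std::string token;
    while (in >> token) {
        if (token == label) {
            long kb = -1;
            if (in >> kb) return kb;
            return -1;
        }
    }
    return -1;
}

// Locked memory of the calling process in kB, as reported by the kernel.
long vm_lck_kb() {
    std::ifstream status("/proc/self/status");
    return find_kb(status, "VmLck:");
}

struct LockProbe {
    bool locked;     // did mlock() succeed?
    long before_kb;  // VmLck before mlock()
    long after_kb;   // VmLck while the buffer is locked
};

// Map `bytes` of anonymous memory, mlock() it, and record VmLck around
// the call. The buffer is unlocked and unmapped before returning.
LockProbe probe_mlock(std::size_t bytes) {
    assert(bytes > 0);
    LockProbe p = {false, -1, -1};
    void* buf = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return p;
    p.before_kb = vm_lck_kb();
    p.locked = (mlock(buf, bytes) == 0);
    p.after_kb = vm_lck_kb();
    if (p.locked) munlock(buf, bytes);
    munmap(buf, bytes);
    return p;
}
```

Calling probe_mlock(16 * 1024) and comparing before_kb with after_kb shows the counter moving when mlock() succeeds. Because find_kb accepts any std::istream, it can also be exercised on a std::istringstream holding saved /proc output.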

So two possible reasons for this behavior might be:

  1. I don't actually get page-locked memory from the CUDA API, and the cudaSuccess return value is misleading
  2. CUDA bypasses the OS counters for page-locked memory because it does some magic with the Linux kernel

So the actual question is: Why can’t I get the values for page locked memory from the OS when I use CUDA to allocate page locked memory?

Additionally: Where can I get the right values if not from /proc/meminfo or /proc/<pid>/status?

Thanks!

System: Ubuntu 14.04.01 LTS; CUDA 6.5; Nvidia Driver 340.29; Nvidia Tesla K20c


Solution

  • It would seem that the pinned allocator on CUDA 6.5 under the hood is using mmap() with MAP_FIXED. Although I am not an OS expert, this apparently has the effect of "pinning" memory, i.e. ensuring that its address never changes. However this is not a complete explanation. Refer to the answer by @Jeff which points out what is almost certainly the "missing piece".

    Let's consider a short test program:

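    #include <stdio.h>
    #include <stdlib.h>        // system()
    #include <cuda_runtime.h>  // cudaHostAlloc(), cudaFreeHost()
    #define DSIZE (1048576*1024)  // 1 GiB
    
    int main(){
    
      int *data;
      cudaFree(0);  // create the CUDA context before taking the first snapshot
      system("cat /proc/meminfo > out1.txt");  // before the pinned allocation
      printf("*$*before alloc\n");  // marker to find the allocation in strace output
      cudaHostAlloc(&data, DSIZE, cudaHostAllocDefault);
      printf("*$*after alloc\n");
      system("cat /proc/meminfo > out2.txt");  // while the allocation is live
      cudaFreeHost(data);
      system("cat /proc/meminfo > out3.txt");  // after freeing it
      return 0;
    }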
    

    If we run this program with strace, and excerpt the output part between the printf statements, we have:

    write(1, "*$*before alloc\n", 16*$*before alloc)       = 16
    mmap(0x204500000, 1073741824, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x204500000
    ioctl(11, 0xc0304627, 0x7fffcf72cce0)   = 0
    ioctl(3, 0xc0384657, 0x7fffcf72cd70)    = 0
    write(1, "*$*after alloc\n", 15*$*after alloc)        = 15
    

    (note that 1073741824 bytes is exactly 1 GiB, i.e. the same as the requested 1048576*1024)

    Reviewing the description of mmap, we have:

    address gives a preferred starting address for the mapping. NULL expresses no preference. Any previous mapping at that address is automatically removed. The address you give may still be changed, unless you use the MAP_FIXED flag.

    Therefore, assuming the mmap call succeeds, the requested virtual address will be fixed, which is probably useful, but is not the whole story.
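
    To make the effect of MAP_FIXED concrete, here is a minimal sketch that is not taken from the CUDA runtime. It first reserves an address range with PROT_NONE and then maps usable memory at exactly that address with the same flag combination seen in the strace output. The helper name map_at_fixed_address is made up for illustration. Note that this only fixes the virtual address; it does nothing to keep the underlying physical pages resident.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Reserve `bytes` of address space, then map readable/writable memory at
// exactly that address using MAP_FIXED. The reserved address is stored in
// *requested_out so the caller can compare it with the returned address.
// Returns MAP_FAILED on error.
void* map_at_fixed_address(std::size_t bytes, void** requested_out) {
    assert(requested_out != nullptr);
    // Step 1: let the kernel pick a free range; PROT_NONE makes it a pure
    // reservation, so the MAP_FIXED call below cannot clobber live data.
    void* reserved = mmap(nullptr, bytes, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (reserved == MAP_FAILED) return MAP_FAILED;
    *requested_out = reserved;
    // Step 2: replace the reservation with a real mapping at that address.
    void* mapped = mmap(reserved, bytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
    if (mapped == MAP_FAILED) munmap(reserved, bytes);
    return mapped;
}
```

    On success the returned pointer equals the requested address, and the memory is usable like any other mapping, yet VmLck and Mlocked are unaffected because no locking call was made.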

    As I mentioned, I am not an OS expert, and it's not obvious to me what exactly about this system call would create a "pinned" mapping/allocation. It may be that the combination of MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS somehow creates a pinned underlying allocation, but I've not found any evidence to support that.

    Based on this article it seems that even mlock()-ed pages would not meet the needs of DMA activity, which is one of the key goals of pinned host pages in CUDA. Therefore, it seems that something else is providing the actual "pinning" (i.e. guaranteeing that the underlying physical pages are always memory-resident, and that their virtual-to-physical mapping doesn't change -- the latter part of this is possibly accomplished by MAP_FIXED along with whatever mechanism guarantees that the underlying physical pages don't move in any way).

    This mechanism apparently does not use mlock(), so the Mlocked count is the same before and after the allocation. However, we would expect a change in the Mapped statistic, and if we diff the out1.txt and out2.txt files produced by the above program, we see (excerpted):

    < Mapped:            87488 kB
    ---
    > Mapped:          1135904 kB
    

    The difference is 1048416 kB, approximately one gigabyte, which is the amount of "pinned" memory requested.
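
    The same comparison can be scripted instead of diffed by hand. The sketch below extracts a field from the text of a /proc/meminfo snapshot and computes the change between two snapshots; meminfo_kb and mapped_delta_kb are hypothetical helper names, and the snapshot text is assumed to be the contents of files such as out1.txt and out2.txt.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Return the value in kB of the line "key:   <n> kB" in a /proc/meminfo
// snapshot, or -1 if the snapshot has no such line.
long meminfo_kb(const std::string& text, const std::string& key) {
    const std::string prefix = key + ":";
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream rest(line.substr(prefix.size()));
            long kb = -1;
            if (rest >> kb) return kb;
        }
    }
    return -1;
}

// Change in the Mapped field between two snapshots, in kB.
// Both snapshots must contain a Mapped line.
long mapped_delta_kb(const std::string& before, const std::string& after) {
    const long b = meminfo_kb(before, "Mapped");
    const long a = meminfo_kb(after, "Mapped");
    assert(b >= 0 && a >= 0);
    return a - b;
}
```

    Fed the two Mapped lines shown above, mapped_delta_kb returns 1048416, which is 160 kB short of the 1048576 kB (1 GiB) requested.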