Is there a function that provides these features simultaneously? I am looking for a function that allocates memory with the traits of both "memory-mapped" memory (like memory allocated with `mmap`) and UVM (accessible from both the host and GPU devices). I see that `cudaHostAlloc` allocates memory on the host that is accessible to the devices, but there is no apparent way to declare the allocated memory range as memory-mapped!
My question is this: is there an API function that allocates memory with the above-mentioned traits?
If the answer to the above question is "no", then is there a set of API functions I can call that leads to the same behavior?
For instance, could we first use `cudaMallocManaged` to allocate UVM-based memory and then use a specific API (either POSIX or CUDA) to declare the previously allocated memory as "memory-mapped" (just like `mmap`)? Or vice versa (allocate with `mmap` and then declare the range as UVM to the CUDA driver)?
Any other suggestions will also be appreciated!
UPDATE on Dec. 13, 2018:
Unfortunately, the suggestion provided by @tera does not seem to work as expected: when the kernel executes on the device, the device does not appear to be able to see the memory on the host!
Below is the code I am using, together with the compilation command.
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <assert.h>

__global__
void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n)
        return;
    d[index] = init;
}

void process_file(char* filename, int n) {
    if (n < 0) {
        printf("Error in n: %d\n", n);
        exit(1);
    }
    size_t filesize = n * sizeof(char);
    size_t pagesize = (size_t) sysconf(_SC_PAGESIZE);
    // Open file
    int fd = open(filename, O_RDWR | O_CREAT, 0666);
    // assert(fd != -1);
    if (fd == -1) {
        perror("Open API");
        exit(1);
    }
    ftruncate(fd, filesize);
    // Execute mmap
    char* mmappedData = (char*) mmap(0, filesize, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
    assert(mmappedData != MAP_FAILED);
    printf("mmappedData: %p\n", mmappedData);
    for (int i = 0; i < n; i++)
        mmappedData[i] = 'z';
    if (cudaSuccess != cudaHostRegister(mmappedData, filesize, cudaHostRegisterDefault)) {
        printf("Unable to register with CUDA!\n");
        exit(1);
    }
    int vec = 256;
    int gang = n / vec + 1;
    printf("gang: %d - vec: %d\n", gang, vec);
    touchKernel<<<gang, vec>>>((char*) mmappedData, 'a', n);
    cudaDeviceSynchronize();
    // Cleanup
    int rc = munmap(mmappedData, filesize);
    assert(rc == 0);
    close(fd);
}

int main(int argc, char const *argv[])
{
    process_file("buffer.obj", 10);
    return 0;
}
And to compile and run it:
nvcc -g -O0 f1.cu && cuda-memcheck ./a.out
`cuda-memcheck` generates output informing the user that the threads could not access the memory addresses, similar to the output below:
========= Invalid __global__ write of size 1
========= at 0x000000b0 in touchKernel(char*, char, int)
========= by thread (2,0,0) in block (0,0,0)
========= Address 0x7fdc8e137002 is out of bounds
========= Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
========= Host Frame:./a.out [0x22b22]
========= Host Frame:./a.out [0x22d17]
========= Host Frame:./a.out [0x570d5]
========= Host Frame:./a.out [0x6db8]
========= Host Frame:./a.out [0x6c76]
========= Host Frame:./a.out [0x6cc3]
========= Host Frame:./a.out [0x6a4c]
========= Host Frame:./a.out [0x6ade]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:./a.out [0x673a]
=========
========= Invalid __global__ write of size 1
========= at 0x000000b0 in touchKernel(char*, char, int)
========= by thread (1,0,0) in block (0,0,0)
========= Address 0x7fdc8e137001 is out of bounds
========= Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
========= Host Frame:./a.out [0x22b22]
========= Host Frame:./a.out [0x22d17]
========= Host Frame:./a.out [0x570d5]
========= Host Frame:./a.out [0x6db8]
========= Host Frame:./a.out [0x6c76]
========= Host Frame:./a.out [0x6cc3]
========= Host Frame:./a.out [0x6a4c]
========= Host Frame:./a.out [0x6ade]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:./a.out [0x673a]
=========
========= Invalid __global__ write of size 1
========= at 0x000000b0 in touchKernel(char*, char, int)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7fdc8e137000 is out of bounds
========= Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
========= Host Frame:./a.out [0x22b22]
========= Host Frame:./a.out [0x22d17]
========= Host Frame:./a.out [0x570d5]
========= Host Frame:./a.out [0x6db8]
========= Host Frame:./a.out [0x6c76]
========= Host Frame:./a.out [0x6cc3]
========= Host Frame:./a.out [0x6a4c]
========= Host Frame:./a.out [0x6ade]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:./a.out [0x673a]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x351c13]
========= Host Frame:./a.out [0x40a16]
========= Host Frame:./a.out [0x6a51]
========= Host Frame:./a.out [0x6ade]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:./a.out [0x673a]
=========
The above output means that the code did not execute successfully on the device.
Any suggestions?
UPDATE on Dec. 14, 2018
I changed the kernel to the following:
__global__
void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n || index < 0)
        return;
    printf("index %d\n", index);
    d[index] = init + (index % 20);
    printf("index %d - Done\n", index);
}
If the old kernel is replaced with the one above, the output of both `printf` calls appears, and checking the `buffer.obj` file shows that it contains the correct output!
UPDATE on Dec. 14, 2018
`cuda-memcheck` probably has some issues of its own. It turns out that if the executable is run without `cuda-memcheck`, the contents of `buffer.obj` are completely correct. However, if it is run under `cuda-memcheck`, the contents of the output file (`buffer.obj`) are completely incorrect!
Coincidentally, I have just replied to a similar question on Nvidia's forum.
You can `cudaHostRegister()` mmapped memory if you pass the `MAP_LOCKED` flag to `mmap()`.
You may need to increase the limit for locked memory (`ulimit -l` in bash) when doing so.
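For reference, a quick way to check the locked-memory limit from a bash shell (the exact value and whether you can raise it depend on your system configuration):

```shell
# Query the current locked-memory limit (RLIMIT_MEMLOCK), in KiB or "unlimited":
ulimit -l
# To raise it for the current shell session (may require root privileges or
# a matching entry in /etc/security/limits.conf):
#   ulimit -l unlimited
```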
Update:
It turns out the `MAP_LOCKED` flag to `mmap()` isn't even necessary. The documentation for `cudaHostRegister()` however lists a few other requirements:
- The `cudaHostRegisterMapped` flag needs to be passed to `cudaHostRegister()`, or the memory will not be mapped. Unless the device has a non-zero value for the `cudaDevAttrCanUseHostPointerForRegisteredMem` attribute, this also means you need to query the device address for the mapped memory range via `cudaHostGetDevicePointer()`.
- The context needs to have been created with the `cudaMapHost` flag in order for mapping to be possible. Since the context is created lazily by the runtime API, you would need to create the context yourself using the driver API, before any invocation of the runtime API, in order to be able to affect the flags the context is created with.
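Putting these points together, a minimal sketch might look like the following. This is only an illustration of the steps above, not a tested solution: it assumes device 0 supports mapped pinned memory, uses `buffer.obj` as an example file name, and abbreviates error handling.

```cuda
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <cuda.h>          // driver API, for explicit context creation
#include <cuda_runtime.h>

__global__ void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)
        d[index] = init;
}

int main() {
    // 1. Create the context ourselves, before any runtime API call, so we
    //    can request host mapping (CU_CTX_MAP_HOST is the driver-API
    //    counterpart of the cudaMapHost flag).
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_MAP_HOST, dev);

    // 2. mmap a file; per the update above, MAP_LOCKED is not required.
    size_t size = 4096;
    int fd = open("buffer.obj", O_RDWR | O_CREAT, 0666);
    ftruncate(fd, size);
    char *h = (char *) mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // 3. Register the range as mapped pinned memory.
    if (cudaHostRegister(h, size, cudaHostRegisterMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed\n");
        return 1;
    }

    // 4. Query the device-side pointer. On devices where
    //    cudaDevAttrCanUseHostPointerForRegisteredMem is non-zero this is
    //    the same as h, but querying is always safe.
    char *d;
    cudaHostGetDevicePointer((void **) &d, h, 0);

    touchKernel<<<(int)((size + 255) / 256), 256>>>(d, 'a', (int) size);
    cudaDeviceSynchronize();

    // 5. Cleanup.
    cudaHostUnregister(h);
    munmap(h, size);
    close(fd);
    cuCtxDestroy(ctx);
    return 0;
}
```

Compiled with `nvcc -lcuda`, this writes the kernel's output back to `buffer.obj` through the mapping when the registration and mapping succeed.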