I want to achieve the effect of the code below, which uses flags to control kernel behavior from the host. So far the flags allocated with unified memory work as I expect, but when I try to update data on the host and copy it to the device while the kernel is running, it does not work.
So my question is: can CUDA achieve this effect, i.e. update data on the host, copy it to a device-side kernel function that is already executing, and then inform the kernel that the data is ready by updating a data-ready flag?
cudaMemcpy:
When I use cudaMemcpy, the data_ready flag never changes and the kernel keeps printing x.
cudaMemcpyAsync:
When I use cudaMemcpyAsync to copy the updated data, the program can finish, since data_ready does get changed, but the value of data remains the same.
Unified memory for data:
I also thought about using unified memory for data itself, but in a more complex scenario the data could be really large (more than 1 GB), and I don't think unified memory could handle that.
#include <iostream>
#include <cstdio>
#include <cuda_runtime.h>
using namespace std;

__global__ void test (int *flag, int *data_ready, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (true) {
        if (*flag == 0) {
            // wait for data transfer
            while (true) {
                if (*data_ready == 0) {
                    printf("x");
                }
                else {
                    break;
                }
            }
            printf("data %d\n", *data);
            __syncthreads();
        }
        else {
            break;
        }
    }
    printf("gpu finish %d\n", tid);
}

int main() {
    // flags
    int *flag;
    cudaMallocManaged(&flag, sizeof(int));
    *flag = 0;
    int *data_ready;
    cudaMallocManaged(&data_ready, sizeof(int));
    *data_ready = 0;
    // data
    int *data = (int *)malloc(sizeof(int));
    int *data_device;
    *data = 777;
    cudaMalloc(&data_device, sizeof(int));
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
    // launch kernel
    int block = 8, grid = 1;
    test<<<grid, block>>> (flag, data_ready, data_device);
    // random host code
    for (int i = 0; i < 1e5; i++);
    printf("host do something\n");
    // update data
    *data = 987;
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
    printf("host copied\n");
    *data_ready = 1;
    // update flag
    *flag = 1;
    cudaDeviceSynchronize();
    // free memory
    cudaFree(flag);
    printf("host finish\n");
}
The general topic of "how to communicate data to a running kernel" is already covered in various posts such as here and here. There are many other examples; see the items linked from that first example for a list of relevant material.
Several concepts are needed to make it work.
Possibly the most important concept is understanding what CUDA streams are. Even if you don't explicitly use CUDA streams, you are launching work into a particular stream, the so-called "null" stream. Stream semantics dictate that work issued into the same stream will serialize: item B, issued into stream s (s may be the null stream), will not begin until item A, previously issued into stream s, has completed. So these two items you have issued will never run concurrently, and the cudaMemcpy operation will wait (forever) for the kernel to complete:
test<<<grid, block>>> (flag, data_ready, data_device);
...
cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
because both are issued into the same (null) stream.
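For example, putting the kernel and the copy into two different, non-null streams (as the full code below does with s1 and s2) removes that false serialization:
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
test<<<grid, block, 0, s1>>> (flag, data_ready, data_device);  // long-running kernel issued into s1
...
cudaMemcpyAsync(data_device, data, sizeof(int), cudaMemcpyHostToDevice, s2);  // copy issued into s2 can proceed while the kernel runs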
Furthermore, in your case, you are using managed memory to facilitate some of the communication. For that, you must be on a system for which the concurrentManagedAccess attribute is true; managed memory is not a suitable vehicle for this otherwise. I don't intend to give a tutorial on UM usage here, but there are many resources online.
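You can verify this at runtime; the same check is included in the code below:
int attr = 0;
cudaDeviceGetAttribute(&attr, cudaDevAttrConcurrentManagedAccess, 0);  // query device 0
if (attr == 0) {printf("device does not support this case\n"); return 0;}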
Finally, in some cases it is necessary to mark items that will be used for global communication to a kernel with the volatile qualifier, so as to prevent the compiler from doing any optimizations (such as caching a value in a register) that would affect the "visibility" of that item, since it is being updated by a separate entity (the host, in this case).
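As a minimal sketch of that pattern (the kernel name wait_for_host is just illustrative; the full code below simply applies volatile to the parameters of the actual test kernel):
__global__ void wait_for_host(volatile int *data_ready, volatile int *data) {
    // because the pointers are volatile, each iteration re-reads global memory
    // instead of reusing a value the compiler may have cached in a register
    while (*data_ready == 0) { }   // spin until the host sets the flag
    printf("data %d\n", *data);    // observe the value the host wrote
}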
The following code has some of these items addressed and seems to finish in a sensible way, for me:
$ cat t2246.cu
#include <iostream>
#include <cstdio>
#include <cuda_runtime.h>
using namespace std;

__global__ void test (volatile int *flag, volatile int *data_ready, volatile int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (true) {
        if (*flag == 0) {
            // wait for data transfer
            while (true) {
                if (*data_ready == 0) {
                    printf("x");
                }
                else {
                    break;
                }
            }
            printf("data %d\n", *data);
            __syncthreads();
        }
        else {
            break;
        }
    }
    printf("gpu finish %d\n", tid);
}

int main() {
    int attr = 0;
    cudaDeviceGetAttribute(&attr, cudaDevAttrConcurrentManagedAccess, 0);
    if (attr == 0) {printf("device does not support this case\n"); return 0;}
    // flags
    int *flag;
    cudaMallocManaged(&flag, sizeof(int));
    *flag = 0;
    int *data_ready;
    cudaMallocManaged(&data_ready, sizeof(int));
    *data_ready = 0;
    // data
    int *data = (int *)malloc(sizeof(int));
    int *data_device;
    *data = 777;
    cudaMalloc(&data_device, sizeof(int));
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // launch kernel
    int block = 8, grid = 1;
    test<<<grid, block, 0, s1>>> (flag, data_ready, data_device);
    // random host code
    for (int i = 0; i < 1e5; i++);
    printf("host do something\n");
    // update data
    *data = 987;
    cudaMemcpyAsync(data_device, data, sizeof(int), cudaMemcpyHostToDevice, s2);
    printf("host copied\n");
    *data_ready = 1;
    // update flag
    *flag = 1;
    cudaDeviceSynchronize();
    // free memory
    cudaFree(flag);
    printf("host finish\n");
}
$ nvcc -o t2246 t2246.cu
$ ./t2246
host do something
host copied
xxxxxxxxdata 987
data 987
data 987
data 987
data 987
data 987
data 987
data 987
gpu finish 0
gpu finish 1
gpu finish 2
gpu finish 3
gpu finish 4
gpu finish 5
gpu finish 6
gpu finish 7
host finish
$