I have a CUDA kernel CreateBasePopulation in which I use printf to print struct values from inside the kernel. However, no output is printed when I execute the kernel. Here's the relevant code:
__global__ void CreateBasePopulation(Population* pop, int pop_num, int input_num, int output_num) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= pop_num) {
        return;
    }

    Network* net = &pop->Networks[idx];
    net->num_neurons = input_num + output_num;
    net->num_connections = input_num * output_num;
    net->fitness = 0.0f;

    curandState state;
    curand_init(clock64(), idx, 0, &state);

    for (int i = 0; i < output_num; ++i) {
        net->Neurons[i].type = 2;
        net->Neurons[i].bias = (2.0f * sqrtf((float)input_num) * curand_uniform(&state)) - sqrtf((float)input_num);
        net->Neurons[i].output = 0.0f;
        net->Neurons[i].input_sum = 0.0f;
        printf("%f\n", net->Neurons[i].bias);
    }

    for (int i = 0; i < input_num; ++i) {
        net->Neurons[i].type = 0;
        net->Neurons[i].bias = 0.0f;
        net->Neurons[i].output = 0.0f;
        net->Neurons[i].input_sum = 0.0f;
        for (int j = 0; j < output_num; ++j) {
            int offset = j + (output_num * i);
            net->Connections[offset].from = i;
            net->Connections[offset].to = j;
            net->Connections[offset].innovationid = offset;
            net->Connections[offset].enabled = true;
            net->Connections[offset].weight = (2.0f * curand_uniform(&state)) - 1.0f;
            printf("Weight [%d]: %f\n", offset, net->Connections[offset].weight);
        }
    }
}
I also tried to allocate the memory inside the kernel like this:
....
curandState state;
curand_init(clock64(), idx, 0, &state);
cudaMalloc((void**)&(net->Neurons), sizeof(Neuron) * net->num_neurons);
cudaMalloc((void**)&(net->Connections), sizeof(Connection) * net->num_connections);
....
But that doesn't produce any output either. (I also tried a cudaMemcpy from device to host, and the values I got back were nonsensically large and small, so it seems like there is an issue with memory management.)
Here is the main function:
int main() {
    int population_size = 1024;
    int input_num = 10;
    int output_num = 5;

    Population* d_population;
    cudaMalloc(&d_population, sizeof(Population));

    Network* d_networks;
    cudaMalloc(&d_networks, sizeof(Network) * population_size);
    cudaMemcpy(&(d_population->Networks), &d_networks, sizeof(Network*), cudaMemcpyHostToDevice);

    int threadsPerBlock = 512;
    int blocks = (population_size + threadsPerBlock - 1) / threadsPerBlock;
    CreateBasePopulation<<<blocks, threadsPerBlock>>>(d_population, population_size, input_num, output_num);
    cudaDeviceSynchronize();

    std::cout << "Population created successfully!" << std::endl;

    cudaFree(d_networks);
    cudaFree(d_population);
    return 0;
}
And the structs:
struct Connection {
    int innovationid;
    int from;
    int to;
    float weight;
    bool enabled;
};

struct Neuron {
    int type;        // 0 = input, 1 = hidden, 2 = output
    float input_sum; // Sum of inputs into neuron
    float bias;
    float output;    // Activated output
};

struct Network {
    Connection* Connections;
    Neuron* Neurons;
    int num_connections;
    int num_neurons;
    float fitness;
};

struct Population {
    Network* Networks;
    int num_networks;
    int generation_id;
};
First of all, you need to allocate the memory that net->Neurons and net->Connections point to; the initial version of the kernel writes through uninitialized pointers and is therefore clearly broken.
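For illustration, here is a minimal, hypothetical sketch of providing that memory from the host instead: two flat buffers, one slice per network. The names and layout are invented for this example, and the kernel would have to take the buffers as parameters and index into them rather than allocating anything itself:

// Hypothetical host-side setup (names invented for illustration):
int num_neurons = input_num + output_num;
int num_connections = input_num * output_num;

Neuron* d_neurons;
Connection* d_connections;
cudaMalloc(&d_neurons, sizeof(Neuron) * population_size * num_neurons);
cudaMalloc(&d_connections, sizeof(Connection) * population_size * num_connections);

// Passed to the kernel, each thread would then slice into the buffers:
//   net->Neurons     = d_neurons     + idx * num_neurons;
//   net->Connections = d_connections + idx * num_connections;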
Using cudaMalloc() in device code, as in OP's second snippet, is a feature of the CUDA device runtime API introduced for CUDA Dynamic Parallelism (CDP). To use these APIs with NVCC, one needs to compile with -rdc=true, which enables relocatable device code. According to the documentation one also needs to explicitly link against the library with -lcudadevrt, although I did not need to do so to get OP's code running. Interestingly, using other functions from this API, like the device-side cudaGetErrorString(), does result in a compilation error (Unresolved extern function '__cudaCDP2GetErrorString') in the absence of -rdc=true. The fact that cudaMalloc() behaves differently (it compiles but causes a runtime error) might be a bug.
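As a point of reference, a minimal sketch of what device-side cudaMalloc() looks like (purely illustrative; the kernel and file names are made up):

#include <cstdio>

// Compile with relocatable device code, e.g.:
//   nvcc -rdc=true alloc_demo.cu -o alloc_demo
__global__ void AllocDemo() {
    float* buf = nullptr;
    cudaError_t err = cudaMalloc((void**)&buf, 16 * sizeof(float));
    if (err != cudaSuccess) {
        printf("device-side cudaMalloc failed\n");
        return;
    }
    buf[0] = 42.0f;
    printf("buf[0] = %f\n", buf[0]);
    cudaFree(buf); // must be freed from device code as well
}

int main() {
    AllocDemo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}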
To avoid these complications one can use malloc()/free() in device code (C++ new/delete should also work) to get a so-called device memory heap allocation. Just keep in mind that the amount of memory available this way is limited and that the limit can be configured from the host via cudaDeviceSetLimit() with cudaLimitMallocHeapSize. This memory cannot be used with CUDA runtime APIs like cudaMemcpy().
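Applied to OP's kernel, the device-heap route might look like this (a sketch; the 128 MiB heap size is an arbitrary example value):

// Host side, before the first kernel launch:
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);

// Device side, inside CreateBasePopulation:
net->Neurons = (Neuron*)malloc(sizeof(Neuron) * net->num_neurons);
net->Connections = (Connection*)malloc(sizeof(Connection) * net->num_connections);
if (net->Neurons == nullptr || net->Connections == nullptr) {
    printf("device heap allocation failed for network %d\n", idx);
    return;
}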
As this memory seems to be sub-allocated from a pre-reserved buffer instead of going through the driver (which is more flexible due to virtual addressing), there is no way to guarantee how much of the heap is actually usable even if one increases its size. So arguably it is not well suited for OP's application with its many small allocations (one per thread). Using one thread per block to allocate a single buffer for the whole block, as sketched below, might improve reliability (and performance) but also increases code complexity, especially when the memory needs to be freed later on.
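A hypothetical sketch of that block-level variant (all names invented; note that here the buffer is freed in the same kernel, while OP would need to keep it alive across launches, which is where the extra complexity comes in):

__global__ void BlockLevelAlloc(int per_thread_elems) {
    __shared__ float* block_buf;
    if (threadIdx.x == 0) {
        // One allocation for the whole block instead of one per thread.
        block_buf = (float*)malloc(sizeof(float) * per_thread_elems * blockDim.x);
    }
    __syncthreads(); // make the pointer visible to all threads of the block
    if (block_buf == nullptr) {
        return; // allocation failed
    }

    float* my_part = block_buf + threadIdx.x * per_thread_elems;
    for (int i = 0; i < per_thread_elems; ++i) {
        my_part[i] = 0.0f;
    }

    __syncthreads(); // all threads must be done before freeing
    if (threadIdx.x == 0) {
        free(block_buf);
    }
}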
By now I'm pretty sure that the same size limit does not apply to cudaMalloc() in device code (I could not find any hint in the docs that it would). Therefore I would expect device-side cudaMalloc() to have an even bigger overhead than device-side malloc(). Both kinds of allocations can only be freed from device code, which is why I originally assumed that device-side cudaMalloc() also uses the device memory heap.
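In OP's case that means the buffers would eventually have to be released by another kernel, roughly like this (hypothetical sketch):

__global__ void DestroyBasePopulation(Population* pop, int pop_num) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= pop_num) {
        return;
    }
    Network* net = &pop->Networks[idx];
    free(net->Neurons);     // use cudaFree() here instead if the buffers
    free(net->Connections); // came from device-side cudaMalloc()
    net->Neurons = nullptr;
    net->Connections = nullptr;
}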
Independent of which API you choose, you should check whether each allocation was successful, at least in debug builds (e.g. using assert()). For the CUDA device runtime API there is an error-checking macro given at the bottom of the top answer to "What is the canonical way to check for errors using the CUDA runtime API?".
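For the device-heap variant, an assert()-based check could be as simple as this sketch:

#include <cassert>

// Inside the kernel, right after the allocation:
net->Neurons = (Neuron*)malloc(sizeof(Neuron) * net->num_neurons);
assert(net->Neurons != nullptr); // traps in debug builds; compiled out with -DNDEBUG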