I have a CUDA kernel CreateBasePopulation in which I use printf to print struct values from inside the kernel. However, no output is printed when I execute the kernel. Here's the relevant code:
__global__ void CreateBasePopulation(Population* pop, int pop_num, int input_num, int output_num) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= pop_num) {
        return;
    }

    Network* net = &pop->Networks[idx];
    net->num_neurons = input_num + output_num;
    net->num_connections = input_num * output_num;
    net->fitness = 0.0f;

    curandState state;
    curand_init(clock64(), idx, 0, &state);

    for (int i = 0; i < output_num; ++i) {
        net->Neurons[i].type = 2;
        net->Neurons[i].bias = (2.0f * sqrtf((float)input_num) * curand_uniform(&state)) - sqrtf((float)input_num);
        net->Neurons[i].output = 0.0f;
        net->Neurons[i].input_sum = 0.0f;
        printf("%f\n", net->Neurons[i].bias);
    }

    for (int i = 0; i < input_num; ++i) {
        net->Neurons[i].type = 0;
        net->Neurons[i].bias = 0.0f;
        net->Neurons[i].output = 0.0f;
        net->Neurons[i].input_sum = 0.0f;
        for (int j = 0; j < output_num; ++j) {
            int offset = j + (output_num * i);
            net->Connections[offset].from = i;
            net->Connections[offset].to = j;
            net->Connections[offset].innovationid = offset;
            net->Connections[offset].enabled = true;
            net->Connections[offset].weight = (2.0f * curand_uniform(&state)) - 1.0f;
            printf("Weight [%d]: %f\n", offset, net->Connections[offset].weight);
        }
    }
}
I also tried to allocate the memory inside the kernel like this:
....
curandState state;
curand_init(clock64(), idx, 0, &state);
cudaMalloc((void**)&(net->Neurons), sizeof(Neuron) * net->num_neurons);
cudaMalloc((void**)&(net->Connections), sizeof(Connection) * net->num_connections);
....
But that doesn't produce any output either. (I also tried a cudaMemcpy from device to host, and the values I got back were nonsensically large and small, so it seems like there is an issue with memory management.)
Here is the main function:
int main() {
    int population_size = 1024;
    int input_num = 10;
    int output_num = 5;

    Population* d_population;
    cudaMalloc(&d_population, sizeof(Population));

    Network* d_networks;
    cudaMalloc(&d_networks, sizeof(Network) * population_size);
    cudaMemcpy(&(d_population->Networks), &d_networks, sizeof(Network*), cudaMemcpyHostToDevice);

    int threadsPerBlock = 512;
    int blocks = (population_size + threadsPerBlock - 1) / threadsPerBlock;
    CreateBasePopulation<<<blocks, threadsPerBlock>>>(d_population, population_size, input_num, output_num);
    cudaDeviceSynchronize();

    std::cout << "Population created successfully!" << std::endl;

    cudaFree(d_networks);
    cudaFree(d_population);
    return 0;
}
And the structs:
struct Connection {
    int innovationid;
    int from;
    int to;
    float weight;
    bool enabled;
};

struct Neuron {
    int type;        // 0 = input, 1 = hidden, 2 = output
    float input_sum; // Sum of inputs into neuron
    float bias;
    float output;    // Activated output
};

struct Network {
    Connection* Connections;
    Neuron* Neurons;
    int num_connections;
    int num_neurons;
    float fitness;
};

struct Population {
    Network* Networks;
    int num_networks;
    int generation_id;
};
First of all, you need to allocate the memory that net->Neurons and net->Connections point to; the initial version of the kernel writes through uninitialized pointers and is therefore clearly broken.
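For illustration, here is a minimal, hypothetical sketch of providing that memory from the host instead: two flat buffers, one slice per network. The names and layout are invented for this example, and the kernel would have to take the buffers as parameters and index into them rather than allocating anything itself:

// Hypothetical host-side setup (names invented for illustration):
int num_neurons = input_num + output_num;
int num_connections = input_num * output_num;

Neuron* d_neurons;
Connection* d_connections;
cudaMalloc(&d_neurons, sizeof(Neuron) * population_size * num_neurons);
cudaMalloc(&d_connections, sizeof(Connection) * population_size * num_connections);

// Passed to the kernel, each thread would then slice into the buffers:
//   net->Neurons     = d_neurons     + idx * num_neurons;
//   net->Connections = d_connections + idx * num_connections;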
Using cudaMalloc() in device code, as in OP's second snippet, is a feature of the CUDA device runtime API introduced for CUDA Dynamic Parallelism (CDP). To use these APIs with NVCC, one needs to compile with -rdc=true, which enables relocatable device code. According to the documentation one also needs to explicitly link against the library with -lcudadevrt, although I did not need to do so to get OP's code running. Interestingly, using other functions from this API, like the device-side cudaGetErrorString(), does result in a compilation error (Unresolved extern function '__cudaCDP2GetErrorString') in the absence of -rdc=true. The fact that cudaMalloc() behaves differently (it compiles but causes a runtime error) might be a bug.
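As a point of reference, a minimal sketch of what device-side cudaMalloc() looks like (purely illustrative; the kernel and file names are made up):

#include <cstdio>

// Compile with relocatable device code, e.g.:
//   nvcc -rdc=true alloc_demo.cu -o alloc_demo
__global__ void AllocDemo() {
    float* buf = nullptr;
    cudaError_t err = cudaMalloc((void**)&buf, 16 * sizeof(float));
    if (err != cudaSuccess) {
        printf("device-side cudaMalloc failed\n");
        return;
    }
    buf[0] = 42.0f;
    printf("buf[0] = %f\n", buf[0]);
    cudaFree(buf); // must be freed from device code as well
}

int main() {
    AllocDemo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}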
To avoid these complications one can use malloc()/free() in device code (C++ new/delete should also work) to get a so-called device memory heap allocation. Just keep in mind that the amount of memory available this way is limited and that the limit can be configured from the host via cudaDeviceSetLimit() with cudaLimitMallocHeapSize. This memory cannot be used with CUDA runtime APIs like cudaMemcpy().
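Applied to OP's kernel, the device-heap route might look like this (a sketch; the 128 MiB heap size is an arbitrary example value):

// Host side, before the first kernel launch:
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);

// Device side, inside CreateBasePopulation:
net->Neurons = (Neuron*)malloc(sizeof(Neuron) * net->num_neurons);
net->Connections = (Connection*)malloc(sizeof(Connection) * net->num_connections);
if (net->Neurons == nullptr || net->Connections == nullptr) {
    printf("device heap allocation failed for network %d\n", idx);
    return;
}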
As this memory seems to be sub-allocated from a pre-reserved buffer instead of going through the driver (which is more flexible due to virtual addressing), there is no way to guarantee how much of the heap is actually usable even if one increases its size. So arguably it is not well suited for OP's application with its many small allocations (one per thread). Using one thread per block to allocate a single buffer for the whole block, as sketched below, might improve reliability (and performance) but also increases code complexity, especially when the memory needs to be freed later on.
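A hypothetical sketch of that block-level variant (all names invented; note that here the buffer is freed in the same kernel, while OP would need to keep it alive across launches, which is where the extra complexity comes in):

__global__ void BlockLevelAlloc(int per_thread_elems) {
    __shared__ float* block_buf;
    if (threadIdx.x == 0) {
        // One allocation for the whole block instead of one per thread.
        block_buf = (float*)malloc(sizeof(float) * per_thread_elems * blockDim.x);
    }
    __syncthreads(); // make the pointer visible to all threads of the block
    if (block_buf == nullptr) {
        return; // allocation failed
    }

    float* my_part = block_buf + threadIdx.x * per_thread_elems;
    for (int i = 0; i < per_thread_elems; ++i) {
        my_part[i] = 0.0f;
    }

    __syncthreads(); // all threads must be done before freeing
    if (threadIdx.x == 0) {
        free(block_buf);
    }
}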
By now I'm pretty sure that the same size limit does not apply to cudaMalloc() in device code (I could not find any hint in the docs that it would). Therefore I would expect device-side cudaMalloc() to have an even bigger overhead than device-side malloc(). Both kinds of allocations can only be freed from device code, which is why I originally assumed that device-side cudaMalloc() also uses the device memory heap.
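In OP's case that means the buffers would eventually have to be released by another kernel, roughly like this (hypothetical sketch):

__global__ void DestroyBasePopulation(Population* pop, int pop_num) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= pop_num) {
        return;
    }
    Network* net = &pop->Networks[idx];
    free(net->Neurons);     // use cudaFree() here instead if the buffers
    free(net->Connections); // came from device-side cudaMalloc()
    net->Neurons = nullptr;
    net->Connections = nullptr;
}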
Independent of which API you choose, you should check whether each allocation was successful, at least in debug builds (e.g. using assert()). For the CUDA device runtime API there is an error-checking macro given at the bottom of the top answer to "What is the canonical way to check for errors using the CUDA runtime API?".
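For the device-heap variant, an assert()-based check could be as simple as this sketch:

#include <cassert>

// Inside the kernel, right after the allocation:
net->Neurons = (Neuron*)malloc(sizeof(Neuron) * net->num_neurons);
assert(net->Neurons != nullptr); // traps in debug builds; compiled out with -DNDEBUG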