cuda

Why do I need to declare CUDA variables on the Host before allocating them on the Device?


I've just started trying to learn CUDA again and came across some code I don't fully understand.

// declare GPU memory pointers
float * d_in;
float * d_out;

// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);

When the GPU memory pointers are declared, storage for the pointer variables themselves is allocated on the host. The (void**) casts in the cudaMalloc calls then throw away the information that d_in and d_out are pointers to floats.

I can't think why cudaMalloc would need to know where in host memory d_in and d_out are stored. It's not even clear why I need to spend host bytes to store whatever device address d_in and d_out end up holding.
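
To make concrete what I mean, here's a complete toy program (the 8-byte pointer size assumes a 64-bit host):

#include <cuda_runtime.h>
#include <stdio.h>

#define ARRAY_BYTES (64 * sizeof(float))

int main(void) {
    float *d_in;                              // host memory for the pointer itself
    cudaMalloc((void **)&d_in, ARRAY_BYTES);  // fills it with a device address
    printf("%zu host bytes hold device address %p\n",
           sizeof(d_in), (void *)d_in);       // prints 8 on a 64-bit host
    cudaFree(d_in);
    return 0;
}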

So, what is the purpose of the original variable declarations on the host?

======================================================================

I would've thought something like this would make more sense:

// declare GPU memory pointers
cudaFloat * d_in;
cudaFloat * d_out;

// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);

This way, everything GPU-related would take place on the GPU. If d_in or d_out were accidentally used in host code, an error could be thrown at compile time, since those variables wouldn't be defined on the host.
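
For what it's worth, a thin C++ wrapper like the following would give the compile-time separation I'm imagining (DevicePtr is my own invention, not part of the CUDA API):

#include <cstddef>
#include <cuda_runtime.h>

template <typename T>
class DevicePtr {
    T *raw_;  // device address, hidden from host code
public:
    static DevicePtr alloc(std::size_t bytes) {
        DevicePtr p;
        cudaMalloc((void **)&p.raw_, bytes);
        return p;
    }
    T *get() const { return raw_; }  // only for CUDA API calls and kernel arguments
    void release() { cudaFree(raw_); }
};

// DevicePtr<float> d_in = DevicePtr<float>::alloc(ARRAY_BYTES);
// *d_in would not compile: there is no way to dereference it in host code.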

I guess what I also find confusing is that storing device memory addresses on the host makes it feel as though the device isn't fully in charge of managing its own memory. There seems to be a risk that host code accidentally overwrites the value of d_in or d_out, either through a stray assignment or some more subtle error, causing the GPU to lose access to its own memory. It also seems strange that the addresses assigned to d_in and d_out are chosen by the host rather than the device: why should the host know anything about which addresses are or aren't available on the device?

What am I failing to understand here?


Solution

  • Your fundamental conceptual mix-up is between host-side code and device-side code. If you call cudaMalloc() from code executing on the CPU, then, well, it runs on the CPU: its arguments live in CPU memory, and so does its result. You asked for it. cudaMalloc() tells the GPU/device how much of its (the device's) memory to allocate, but if the CPU/host later wants to access that memory, it needs a way to refer to it that the device will understand. The device memory address, stored in a host-side pointer variable, is exactly that handle (see the first sketch below).

    Alternatively, you can allocate from device-side code, in which case everything does take place on the GPU (see the second sketch below). Frankly, though, I've never done that myself, and it's not such a great idea except in special cases.
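
    As a minimal sketch of why the host has to hold on to the device address (ARRAY_SIZE, ARRAY_BYTES, and the square kernel are invented here for illustration): every later cudaMemcpy, kernel launch, and cudaFree takes that saved address as an argument.

    #include <cuda_runtime.h>
    #include <stdio.h>

    #define ARRAY_SIZE 64
    #define ARRAY_BYTES (ARRAY_SIZE * sizeof(float))

    __global__ void square(float *out, const float *in) {
        int i = threadIdx.x;
        out[i] = in[i] * in[i];
    }

    int main(void) {
        float h_in[ARRAY_SIZE], h_out[ARRAY_SIZE];
        for (int i = 0; i < ARRAY_SIZE; i++) h_in[i] = (float)i;

        // Host variables holding device addresses, as in the question.
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in, ARRAY_BYTES);
        cudaMalloc((void **)&d_out, ARRAY_BYTES);

        // Every one of these calls needs the device address the host saved:
        cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
        square<<<1, ARRAY_SIZE>>>(d_out, d_in);  // the address is passed to the kernel
        cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

        cudaFree(d_in);                          // ...and to free the memory again
        cudaFree(d_out);

        printf("h_out[5] = %f\n", h_out[5]);     // prints 25.000000
        return 0;
    }

    And a sketch of the device-side alternative, using in-kernel malloc()/free() (available since compute capability 2.0); kernel_with_scratch is an invented name, and the host never sees the allocated address:

    __global__ void kernel_with_scratch(void) {
        // malloc() in device code draws from the device heap; the resulting
        // pointer exists only on the GPU and is never visible to the host.
        float *scratch = (float *)malloc(64 * sizeof(float));
        if (scratch != NULL) {
            for (int i = 0; i < 64; i++)
                scratch[i] = (float)i;
            /* ...use scratch... */
            free(scratch);  // freed on the device as well
        }
    }

    Note that the device heap backing in-kernel malloc() is small by default; it can, ironically, be enlarged only from the host, with cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes).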