cuda, dynamic-parallelism

Do kernel-launched child kernels have the same warp size as host-launched kernels?


When a kernel is launched from the host, its blocks execute in warps of 32 threads. Is the warp size the same for child kernels launched via dynamic parallelism? My guess would be yes, but I haven't seen it stated in the docs.

The larger question, of course, is: is it worth it?

__global__ void kernel(const int *vec, float *outvec){
    int idx = threadIdx.x;
    float random_sum = 0;
    // The loop length varies per thread, so threads in the same warp diverge.
    for(int j = 0; j < vec[idx]; j++){
        // threadsafe_rand_uniform() is a stand-in for some thread-safe device RNG.
        random_sum += threadsafe_rand_uniform();
    }
    outvec[idx] = random_sum;
}

OK, this example is kind of contrived. The point, though, is that when the loop length varies from thread to thread, it's tempting to try to parallelize it dynamically. However, if the warp size is still 32, you're going to waste a lot of execution resources on partially filled, divergent warps. In this particular example, you may want to sort the data first, then dispatch the indexes that parallelize well in one kernel and the poorly shaped ones in another.
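
For concreteness, the dynamic-parallelism rewrite I'm tempted by would look something like this. This is only a sketch: threadsafe_rand_uniform is still a hypothetical device RNG, the child launch assumes vec[idx] fits in a single block, and dynamic parallelism itself requires compute capability 3.5+ and compilation with -rdc=true -lcudadevrt:

__global__ void child_sum(float *out){
    // One random sample per child thread, reduced crudely with an atomic.
    // threadsafe_rand_uniform() is the same hypothetical RNG as above.
    atomicAdd(out, threadsafe_rand_uniform());
}

__global__ void kernel_dp(const int *vec, float *outvec){
    int idx = threadIdx.x;
    outvec[idx] = 0.0f;
    if(vec[idx] > 0)
        child_sum<<<1, vec[idx]>>>(&outvec[idx]);
    // Each child grid still runs in 32-thread warps, so vec[idx] == 3
    // still occupies a full warp.
}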


Solution

  • They do indeed have the same warp size. But that is because the warp size is a property of the graphics card: all kernels running on the same card are going to have the same warp size.

    Today virtually all GPUs use a warp size of 32, but that may change in the future.
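
    If you do not want to hard-code 32, you can query the warp size at runtime through the standard device-properties API; a minimal host-side sketch (device code can also read the built-in warpSize variable):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(){
        cudaDeviceProp prop;
        // Query device 0; prop.warpSize is 32 on all current hardware.
        cudaGetDeviceProperties(&prop, 0);
        printf("warp size: %d\n", prop.warpSize);
        return 0;
    }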

    Were you possibly thinking about the number of threads in the kernel rather than the warp size? If so, then no, they are not necessarily the same. You launch a new kernel with dynamic parallelism the same way you launch one from the host:

    // Note: this only works if threadsafe_rand_uniform is declared __global__.
    threadsafe_rand_uniform<<<blocks, threads>>>();
    

    Beware that this is not the same thing as just calling a device function, which is what you are currently doing.
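
    To make the distinction concrete, here is a minimal sketch (the names parent, child and device_func are made up for illustration; device-side launches require compute capability 3.5+ and compilation with -rdc=true -lcudadevrt):

    __device__ float device_func(){
        return 1.0f;  // an ordinary device function
    }

    __global__ void child(float *out){
        out[threadIdx.x] = device_func();  // each child thread writes one element
    }

    __global__ void parent(float *out){
        // Device-side launch: child runs as a new grid with its own blocks,
        // scheduled in 32-thread warps like any other kernel.
        child<<<2, 64>>>(out);
        // Calling device_func() here directly would instead execute inline
        // in this one parent thread, with no new grid at all.
    }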

    As for your question of whether it is worth it: that is hard to tell without considering the alternative. If the alternative is to return data to the host so that the host can launch a new, appropriately shaped kernel, then it may well be worth it. But it all depends on the context.