CUDA's implementation of the Mersenne Twister (MT) random number generator is limited to a maximum of 256 threads per block and 200 blocks per grid, i.e. the maximum number of threads is 51200.
Therefore, it is not possible to launch the kernel that uses the MT with
    kernel<<<blocksPerGrid, threadsPerBlock>>>(devMTGPStates, ...)
where
    int blocksPerGrid = (n+threadsPerBlock-1)/threadsPerBlock;
and n is the total number of threads.
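To make the limit concrete, here is a minimal host-side sketch of the arithmetic. It is plain host code that should compile unchanged in a .cu file. The helper names and the clamping step are only an illustration, not part of cuRAND; the two constants are the limits quoted above.

```cuda
#include <algorithm>

// Limits quoted above for cuRAND's MTGP32 generator.
constexpr int kMaxThreadsPerBlock = 256;
constexpr int kMaxBlocksPerGrid   = 200;

// Usual "one thread per element" grid size (ceiling division).
inline int blocksForOneThreadPerElement(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: never request more blocks than MTGP32 supports.
// With a clamped grid, each thread has to handle more than one element.
inline int clampedBlocks(int n, int threadsPerBlock)
{
    return std::min(blocksForOneThreadPerElement(n, threadsPerBlock),
                    kMaxBlocksPerGrid);
}
```

For n = 100000 and 256 threads per block, the unclamped formula asks for 391 blocks, well above the 200 that are allowed.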
What is the best way to use the MT with more than 51200 threads?
My approach is to use constant values for blocksPerGrid and threadsPerBlock, e.g. <<<128,128>>>, and use the following in the kernel code:
    __global__ void kernel(curandStateMtgp32 *state, int n, ...) {
        int id = threadIdx.x+blockIdx.x*blockDim.x;
        while (id < n) {
            float x = curand_normal(&state[blockIdx.x]);
            /* some more calls to curand_normal() followed
               by the algorithm that works with the data */
            id += blockDim.x*gridDim.x;
        }
    }
I am not sure whether this is the correct way, or whether it can influence the MT state in an undesired way.
Thank you.
I suggest you read the CURAND documentation carefully and thoroughly.
The MT API will be most efficient when using 256 threads per block with up to 64 blocks to generate numbers.
If you need more than that, you have a variety of options:
Generally, I don't see a problem with the kernel that you've outlined, and it's roughly in line with choice 1 above. However, it does not allow you to exceed 51200 threads (your example has <<<128, 128>>>, so 16384 threads).
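A minimal sketch of such a strided kernel, with a few assumptions worth flagging. The cuRAND documentation describes the MTGP32 curand call as advancing the generator position by the number of threads in the block, so it seems safest to have every thread of a block make the same number of calls. The sketch therefore fixes the iteration count up front and guards only the use of the result. The names fillNormal, iterationsPerThread and generateNormals, the 64 x 256 launch and the seed 1234 are illustrative choices, not taken from the question.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <curand.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>       // curandMakeMTGP32Constants / KernelState
#include <curand_mtgp32dc_p_11213.h>  // pre-computed MTGP32 parameter sets

// Iterations each thread performs so that totalThreads threads cover n items.
__host__ __device__ inline int iterationsPerThread(int n, int totalThreads)
{
    return (n + totalThreads - 1) / totalThreads;
}

// Hypothetical kernel: writes n normally distributed floats to out.
__global__ void fillNormal(curandStateMtgp32 *state, float *out, int n)
{
    const int stride = blockDim.x * gridDim.x;
    const int iterations = iterationsPerThread(n, stride);
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = 0; i < iterations; ++i, id += stride) {
        // Every thread of the block makes this call, even past the end of the data.
        float x = curand_normal(&state[blockIdx.x]);
        if (id < n) out[id] = x;  // only the use of x is guarded
    }
}

// Host driver (assumed values: 64 blocks x 256 threads, seed 1234).
bool generateNormals(int n, std::vector<float> &result)
{
    const int blocks = 64, threads = 256;
    curandStateMtgp32 *devStates = nullptr;
    mtgp32_kernel_params *devParams = nullptr;
    float *devOut = nullptr;
    bool ok =
        cudaMalloc(&devStates, blocks * sizeof(curandStateMtgp32)) == cudaSuccess &&
        cudaMalloc(&devParams, sizeof(mtgp32_kernel_params)) == cudaSuccess &&
        cudaMalloc(&devOut, n * sizeof(float)) == cudaSuccess &&
        curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devParams)
            == CURAND_STATUS_SUCCESS &&
        curandMakeMTGP32KernelState(devStates, mtgp32dc_params_fast_11213,
                                    devParams, blocks, 1234ULL)
            == CURAND_STATUS_SUCCESS;
    if (ok) {
        fillNormal<<<blocks, threads>>>(devStates, devOut, n);
        result.resize(n);
        // cudaMemcpy waits for the kernel on the default stream to finish.
        ok = cudaMemcpy(result.data(), devOut, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(devOut);
    cudaFree(devParams);
    cudaFree(devStates);
    return ok;
}
```

Calling generateNormals(1000000, v) from main should then fill v with one million samples while launching only 16384 threads; building with something like nvcc example.cu -lcurand is assumed.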