[SOLVED] CUDA's Mersenne Twister for an arbitrary number of threads

CUDA's implementation of the Mersenne Twister (MT) random number generator is limited to at most 256 threads per block and 200 blocks per grid, i.e. at most 256 × 200 = 51200 threads in total.

Therefore, it is not possible to launch the kernel that uses the MT with

kernel<<<blocksPerGrid, threadsPerBlock>>>(devMTGPStates, ...)

where

int blocksPerGrid = (n+threadsPerBlock-1)/threadsPerBlock;

and n is the total number of threads.

What is the best way to use the MT when more than 51200 threads are needed?

My approach is to use constant values for blocksPerGrid and threadsPerBlock, e.g. <<<128,128>>>, and to use the following in the kernel code:

__global__ void kernel(curandStateMtgp32 *state, int n, ...) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    /* grid-stride loop: each thread handles items id, id + stride, ... */
    while (id < n) {
        float x = curand_normal(&state[blockIdx.x]);
        /* some more calls to curand_normal() followed
           by the algorithm that works with the data */

        id += blockDim.x * gridDim.x;
    }
}

I am not sure whether this is the correct way to do it, or whether it can affect the MT state in an undesired way.

Thank you.

Solution

I suggest you read the CURAND documentation carefully and thoroughly.

The MT API will be most efficient when using 256 threads per block with up to 64 blocks to generate numbers.

If you need more than that, you have a variety of options:

1. Simply generate more numbers from the existing state-set (i.e. 64 blocks, 256 threads), and distribute these numbers amongst the threads that need them.
2. Use more than a single state per block (but this does not allow you to exceed the overall limit within a state-set, it just addresses the need for a single block).
3. Create multiple MT generators with independent seeds (and therefore independent state-sets).
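As a rough illustration of the last option, here is a minimal host-side sketch that builds several state-sets in one device array and gives each set its own seed, so that block b can keep using state[b]. The names stateSetsNeeded, makeStateSets and kStatesPerSet are hypothetical, the seed spacing is arbitrary, and reusing the same pre-computed parameter table for every set is my assumption rather than something the documentation prescribes. Each block is still limited to 256 threads.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>        // host-side MTGP32 setup helpers
#include <curand_mtgp32dc_p_11213.h>   // pre-computed MTGP32 parameter sets

// Limit quoted in the question: at most 200 states (blocks) per state-set.
static const int kStatesPerSet = 200;

// Hypothetical helper: number of state-sets needed for `blocks` blocks.
inline int stateSetsNeeded(int blocks) {
    return (blocks + kStatesPerSet - 1) / kStatesPerSet;
}

// Hypothetical helper: allocates one state per block and initialises the
// states in chunks of up to 200, each chunk with a different seed.
// Returns nullptr on failure; on success the caller owns both allocations.
curandStateMtgp32 *makeStateSets(int blocks, unsigned long long baseSeed,
                                 mtgp32_kernel_params **devParamsOut) {
    curandStateMtgp32 *devStates = nullptr;
    mtgp32_kernel_params *devParams = nullptr;

    bool ok =
        cudaMalloc(&devStates, blocks * sizeof(curandStateMtgp32)) == cudaSuccess &&
        cudaMalloc(&devParams, sizeof(mtgp32_kernel_params)) == cudaSuccess &&
        curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devParams)
            == CURAND_STATUS_SUCCESS;

    for (int s = 0; ok && s < stateSetsNeeded(blocks); ++s) {
        int first = s * kStatesPerSet;
        int left  = blocks - first;
        int count = left < kStatesPerSet ? left : kStatesPerSet;
        // Assumption: widely spaced seeds are enough to keep the sets distinct.
        ok = curandMakeMTGP32KernelState(devStates + first,
                                         mtgp32dc_params_fast_11213, devParams,
                                         count, baseSeed + 100000ULL * s)
             == CURAND_STATUS_SUCCESS;
    }

    if (!ok) {
        cudaFree(devStates);   // cudaFree(nullptr) is a no-op
        cudaFree(devParams);
        return nullptr;
    }
    *devParamsOut = devParams;
    return devStates;
}
```

For example, makeStateSets(391, 1234ULL, &devParams) would set up two sets (200 + 191 states). Build with something like nvcc example.cu -lcurand.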

Generally, I don't see a problem with the kernel that you've outlined, and it's roughly in line with choice 1 above. However, it does not allow you to exceed 51200 threads (your example uses <<<128, 128>>>, i.e. 16384 threads).
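One detail that may be worth checking against the CURAND documentation: as far as I understand it, an MTGP32 state is shared by the whole block and is advanced by the block as a unit, so it seems safest to have every thread of a block make the same number of curand_normal() calls. With a plain while (id < n) loop, only some threads may be active in the final pass. Below is a minimal sketch of a block-uniform variant; passCount, kernelUniform and the out buffer are hypothetical names introduced for illustration.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical helper: number of grid-stride passes needed to cover n items.
__host__ __device__ inline int passCount(int n, int stride) {
    return (n + stride - 1) / stride;
}

// Variant of the question's kernel in which all threads of a block call
// curand_normal() the same number of times. `out` is a hypothetical buffer
// standing in for whatever the real algorithm does with the numbers.
__global__ void kernelUniform(curandStateMtgp32 *state, int n, float *out) {
    int stride = blockDim.x * gridDim.x;
    int id     = threadIdx.x + blockIdx.x * blockDim.x;
    int passes = passCount(n, stride);   // same value in every thread

    for (int p = 0; p < passes; ++p, id += stride) {
        // Every thread draws, including those with id >= n in the last pass.
        float x = curand_normal(&state[blockIdx.x]);
        if (id < n) {
            out[id] = x;   // only in-range threads use the value
        }
    }
}
```

Launched as kernelUniform<<<64, 256>>>(devMTGPStates, n, devOut) with n = 100000, each thread makes 7 passes, and the surplus draws in the last pass are simply discarded.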