I'm trying to generate 32 64x64 bitmaps with a single CUDA kernel call. When rendering these images, I want to randomize the parameters of image generation both per-image and per-pixel. That is, some randomized decisions happen once and apply consistently to all pixels in an image, while other decisions are made independently for each pixel. I'm trying to figure out a cuRAND setup to enable this.
The approach I have so far uses two state arrays: one with 32 sequences (one per image) and another with 4096 sequences (one per pixel). I pass both into my kernel and compute each pixel value based on both. This sorta works, but I'm seeing some weird artifacts. I'm looking for advice on how to fix this, or suggestions for an alternative approach that would work better.
If I render the images using only the per-pixel noise, I would expect to get the same image of random static 32 times. What I actually get is different but highly correlated images of random static. Interestingly, the first several images are almost identical, and the later images (larger img_id) become more different.
If I render the images using only the per-image noise, I would expect each image to be a solid block of some random color. What I actually get is mostly images of a solid color, but sometimes the four quadrants of the image aren't the same. Again, the first images are all consistent, and the later images are more varied.
I suspect part of my problem is that each 64x64 image is actually composed of a 2x2 grid of blocks that are 32x32 threads each (my device supports at most 1024 threads per block). The cuRAND docs say "two different blocks can not operate on the same state safely," but I don't see any guidance on what to do about that.
Can anyone offer some insight into what's going wrong here? Any advice on how to fix this, or another approach that would work better?
Code snippet below:
__global__ void init_rngs(curandState* per_img_rng_state, curandState* per_pxl_rng_state) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z;  // blockDim.z is 1, so blockIdx.z identifies the image
    int pxl_id = col * 64 + row;
    // Same seed, different subsequence for each image and each pixel.
    curand_init(42, img_id, 0, &per_img_rng_state[img_id]);
    curand_init(42, pxl_id, 0, &per_pxl_rng_state[pxl_id]);
}
__global__ void make_images(curandState* per_img_rng_state, curandState* per_pxl_rng_state, unsigned char* image) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z;  // blockDim.z is 1, so blockIdx.z identifies the image
    int pxl_id = col * 64 + row;
    unsigned int per_img_noise = curand(&per_img_rng_state[img_id]);
    unsigned int per_pxl_noise = curand(&per_pxl_rng_state[pxl_id]);
    // An example of logic mixing the two sources of noise.
    unsigned int density = per_img_noise;
    unsigned int value = per_img_noise ^ per_pxl_noise;
    image[(img_id * 64 + row) * 64 + col] = (value >= density) ? 0x00 : 0xFF;
    // An example using only per-pixel noise:
    // image[(img_id * 64 + row) * 64 + col] = (per_pxl_noise & 1) ? 0x00 : 0xFF;
    // An example using only per-image noise (top 8 bits of the 32-bit draw):
    // image[(img_id * 64 + row) * 64 + col] = per_img_noise / 16777216;
}
void randomize_images() {
    curandState* per_img_rng_state = nullptr;
    curandState* per_pxl_rng_state = nullptr;
    unsigned char* image = nullptr;
    cudaMalloc(&image, 32 * 64 * 64 * sizeof(unsigned char));
    cudaMalloc(&per_img_rng_state, 32 * sizeof(curandState));
    cudaMalloc(&per_pxl_rng_state, 64 * 64 * sizeof(curandState));
    // The grid is 2x2x32 blocks: each image is built from a 2x2 grid of blocks,
    // and the 32 z-slices of the grid correspond to the 32 images.
    // Each block is 32x32 threads, one thread per pixel of its quadrant.
    init_rngs<<<dim3(2, 2, 32), dim3(32, 32)>>>(per_img_rng_state, per_pxl_rng_state);
    make_images<<<dim3(2, 2, 32), dim3(32, 32)>>>(per_img_rng_state, per_pxl_rng_state, image);
}
Okay, I think I've got this working, but if anyone sees a problem with this or has a better idea, please let me know.
The issue here is that cuRAND generally expects each thread to have its own unique sequence, while I'm deliberately sharing sequences between many threads. Sharing the state for reads would be fine on its own, but every call to curand() also advances the position stored in that state, so when many threads update the same state object concurrently you get race conditions. That's why the random noise I was generating had undesired correlations and wasn't deterministic.
I dealt with this problem by managing the sequence positions manually. Whenever I read a value from an RNG, I make a local copy of its state and never copy it back to the global state when I'm done. This prevents threads from interfering with each other's randomness. I then have to advance the global states myself using skipahead().
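Here's roughly what the body of make_images looks like with that change (a sketch that keeps the same mixing logic as the original question):

__global__ void make_images(curandState* per_img_rng_state, curandState* per_pxl_rng_state, unsigned char* image) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z;
    int pxl_id = col * 64 + row;
    // Copy the shared states into thread-local variables. curand() advances
    // only these copies; the global states are never written by this kernel,
    // so there is nothing to race on.
    curandState img_rng = per_img_rng_state[img_id];
    curandState pxl_rng = per_pxl_rng_state[pxl_id];
    unsigned int per_img_noise = curand(&img_rng);
    unsigned int per_pxl_noise = curand(&pxl_rng);
    unsigned int density = per_img_noise;
    unsigned int value = per_img_noise ^ per_pxl_noise;
    image[(img_id * 64 + row) * 64 + col] = (value >= density) ? 0x00 : 0xFF;
    // The local copies go out of scope here; the global states are advanced
    // afterwards by separate inc_*_rng kernels.
}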
This seems to work well, though it has a few drawbacks. Most seriously, I have to keep track of how many random numbers I generate and make sure I skipahead() by exactly that many positions, which is tricky and error-prone. The other drawback is efficiency: this approach needs extra copies of state from global memory and frequent calls to skipahead(), and every call to curand() advances a local state object that just gets thrown away. It still seems better than generating the randomness on the host, though, since it lets the program run entirely on the GPU without any host/device memory transfers.
This changes my kernel invocations to look more like this:
init_img_rng<<<1, 32>>>(img_rng);                          // one thread per image
init_pxl_rng<<<dim3(2, 2), dim3(32, 32)>>>(pxl_rng);       // one thread per pixel
make_images<<<dim3(2, 2, 32), dim3(32, 32)>>>(img_rng, pxl_rng, image);
inc_img_rng<<<1, 32>>>(img_rng);                           // advance the 32 per-image states
inc_pxl_rng<<<dim3(2, 2), dim3(32, 32)>>>(pxl_rng);        // advance the 4096 per-pixel states
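The init and inc kernels look roughly like this (simplified sketches; the skipahead counts have to match exactly how many positions each sequence is consumed per make_images launch, which for me is one per image sequence and 32 per pixel sequence because of the per-image offsets described at the end of this post):

// One thread per image: initialize the 32 per-image sequences.
__global__ void init_img_rng(curandState* img_rng) {
    curand_init(42, threadIdx.x, 0, &img_rng[threadIdx.x]);
}

// One thread per pixel: initialize the 64x64 per-pixel sequences.
__global__ void init_pxl_rng(curandState* pxl_rng) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    curand_init(42, col * 64 + row, 0, &pxl_rng[col * 64 + row]);
}

// Each per-image sequence yields one value per make_images launch.
__global__ void inc_img_rng(curandState* img_rng) {
    skipahead(1ULL, &img_rng[threadIdx.x]);
}

// Each per-pixel sequence is read at offsets 0..31 (one per image),
// so advance it by 32 to get fresh values on the next launch.
__global__ void inc_pxl_rng(curandState* pxl_rng) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    skipahead(32ULL, &pxl_rng[col * 64 + row]);
}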
Note this also fixes an inefficiency in the original code, where I used the same launch configuration for every kernel even though init_rngs touches far less data than make_images. Now each kernel invocation is sized to match the data it's operating on.
In the original question, I generated per-pixel noise that was unique to each image by XORing the per-image and per-pixel noise together. I'm not sure that's really safe, so instead I decided to call skipahead(img_id, pxl_rng) before reading from the per-pixel RNG. That way, each image reads from the same per-pixel sequences, but at a different offset.
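With the local-copy approach, that's just a couple of lines in make_images (replacing the XOR mixing in the sketch above):

// Per-pixel noise that differs per image: every image reads the same shared
// per-pixel sequence, but at its own offset.
curandState pxl_rng = per_pxl_rng_state[pxl_id];    // local copy, never written back
skipahead((unsigned long long)img_id, &pxl_rng);    // image img_id reads position img_id
unsigned int per_pxl_noise = curand(&pxl_rng);      // already unique per image, no XOR needed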