Summing up elements in array using managedCuda

Problem Description

I try to get a kernel summing up all elements of an array to work. The kernel is intended to be launched with 256 threads per block and an arbitary number of blocks. The length of the array passsed in as a is always a multiple of 512, in fact it is #blocks * 512. One block of the kernel should sum up 'its' 512 elements (256 threads can sum up 512 elements using this algorithm), storing the result in out[blockIdx.x]. The final summation over the values in out ,and therefore the results of the blocks, will be done on the host.
This kernel works fine for up to 6 blocks, meaning up to 3072 elements. But launching it with more than 6 blocks result in the first block calculating a strictly greater, wrong result than the other blocks (i. e. out = {572, 512, 512, 512, 512, 512, 512}), this wrong result is reproducable, the wrong value is the same for multiple executions.
I guess this means there is a structural error somewhere in my code, which has something to do with blockIdx.x, but the only use this is to calculate blockStart, and this seams to be a correct calculation, also for the first block.
I verified if my host code computes the correct number of blocks for the kernel and passes in an array of correct size. That's not the problem.
Of course I read a lot of similar questions here on stackoverflow, but none seems to describe my problem (See i. e. here or here)
The kernel is called via managedCuda (C#), I don't know if this might be a problem.

Hardware

I use a MX150 with the follwing specifications:

Revision Number: 6.1
Total global memory: 2147483648
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Max Threads per block: 1024
Max Blocks: 2147483648
Number of multiprocessors: 3

Code

Kernel

__global__ void Vector_Reduce_As_Sum_Kernel(float* out, float* a)
{   
int tid = threadIdx.x;
int blockStart = blockDim.x * blockIdx.x * 2;
int i = tid + blockStart;

int leftSumElementIdx =  blockStart + tid * 2;

a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];

__syncthreads();

if (tid < 128) 
{
    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}

__syncthreads();

if(tid < 64)
{
    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}

__syncthreads();

if (tid < 32)
{
    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}

__syncthreads();

if (tid < 16)
{
    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}

__syncthreads();

if (tid < 8)
{
    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}

__syncthreads();

if (tid < 4)
{
    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}

__syncthreads();

if (tid < 2)
{
    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}

__syncthreads();

if (tid == 0)
{
    out[blockIdx.x] = a[blockStart] + a[blockStart + 1];
}
}

Kernel Invocation

//Get the cuda kernel
//PathToPtx and MangledKernelName must be replaced
CudaContext cntxt = new CudaContext();
CUmodule module = cntxt.LoadModule("pathToPtx");    
CudaKernel vectorReduceAsSumKernel = new CudaKernel("MangledKernelName", module, cntxt);

//Get an array to reduce
float[] array = new float[4096];
for(int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

//Calculate execution info for the kernel
int threadsPerBlock = 256;
int numOfBlocks = array.Length / (threadsPerBlock * 2);

//Memory on the device
CudaDeviceVariable<float> m_d = array;
CudaDeviceVariable<float> out_d = new CudaDeviceVariable<float>(numOfBlocks);

//Give the kernel necessary execution info
vectorReduceAsSumKernel.BlockDimensions = threadsPerBlock;
vectorReduceAsSumKernel.GridDimensions = numOfBlocks;

//Run the kernel on the device
vectorReduceAsSumKernel.Run(out_d.DevicePointer, m_d.DevicePointer);

//Fetch the result
float[] out_h = out_d;

//Sum up the partial sums on the cpu
float sum = 0;
for(int i = 0; i < out_h.Length; i++)
{
    sum += out_h[i];
}

//Verify the correctness
if(sum != 4096)
{
    throw new Exception("Thats the wrong result!");
}

Update:

The very helpfull and only answer did address all my problems. Thank you! The problem was an unforeseen race condition.

Important Hint:

In the comments the author of managedCuda pointed out all NPPs methods are indeed already implmented in managedCuda (using ManagedCuda.NPP.NPPsExtensions;). I wasn't aware of that, and i guess so are many people reading ths question.

Solution

You are not correctly incorporating into your code the idea that each block will process 512 elements out of your total array. According to my testing, you need to make at least 2 changes to fix this:

In the kernel, you have incorrectly calculated the starting point for each block:
```
int blockStart = blockDim.x * blockIdx.x;
```
since blockDim.x is 256, but each block processes 512 elements, you must multiply this by 2. (the multiplication by 2 in your calculation of leftSumElementIdx doesn't take care of this -- since it is only multiplying tid).
In your host code, your number of blocks calculation is incorrect:
```
vectorReduceAsSumKernel.GridDimensions = array.Length / threadsPerBlock;
```
for a value of 2048 for array.Length and a value of 256 for threadsPerBlock, this creates 8 blocks. But as you already indicate, your intention is to launch for blocks (2048/512). So you need to multiply the denominator by 2:
```
vectorReduceAsSumKernel.GridDimensions = array.Length / (2*threadsPerBlock);
```

In addition, your reduction sweep pattern is broken. It is warp-execution-order dependent, to give the proper result, and CUDA does not specify a warp execution order.

To see why, let's take a simple example. Let's consider just a single threadblock, with a starting point of the array being all 1, just as you have initialized it.

Now, warp 0 consists of threads 0-31. Your reduction sweep operation is like this:

a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];

So each thread in warp 0 will collect two other values and add them, and store them. Thread 31 will take the values a[62] and a[63] and add them together. If the values of a[62] and a[63] are still 1, as initialized, then this will work as expected. But the values of a[62] and a[63] are written to by warp 1, consisting of threads 32-63. So if warp 1 executes before warp 0 (perfectly legal), then you will get a different result. This is a global memory race condition. It is arising due to the fact that your input array is both the source and destination of your intermediate results, and __syncthreads() will not sort this out for you. It doesn't force warps to execute in any particular order.

One possible solution is to fix your sweep pattern. On any given reduction cycle, let's have a sweep pattern where each thread writes and reads values that are not touched by any other thread during that cycle. The following adaptation of your kernel code accomplishes that:

__global__ void Vector_Reduce_As_Sum_Kernel(float* out, float* a)
{
  int tid = threadIdx.x;
  int blockStart = blockDim.x * blockIdx.x * 2;
  int i = tid + blockStart;

  for (int j = blockDim.x; j > 0; j>>=1){
    if (tid < j)
      a[i] += a[i+j];

    __syncthreads();}

  if (tid == 0)
  {
    out[blockIdx.x] = a[i];
  }
}

For general purpose reductions, this is still a very slow method. This tutorial covers how to write faster reductions. And, as already pointed out, managedCuda may have methods to avoid writing a kernel at all.