
Reduction in CUDA


I'm just starting to learn CUDA programming, and I have some confusion about reduction.

I know that global memory has much higher access latency than shared memory, but can I use global memory to (at least) simulate behavior similar to shared memory?

For example, I want to sum the elements of a large array whose length is exactly BLOCK_SIZE * THREAD_SIZE (both the grid and block dimensions are powers of 2), and I've tried to use the code below:

    __global__ void parallelSum(unsigned int* array) {
        unsigned int totalThreadsNum = gridDim.x * blockDim.x;
        unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;

        int i = totalThreadsNum / 2;
        while (i != 0) {
            if (idx < i) {
                array[idx] += array[idx + i];
            }
            __syncthreads();
            i /= 2;
        }
    }

I compared the result of this code against a serial sum computed on the host, and the weird thing is that sometimes the results match, but sometimes they are clearly different. Is there something about my use of global memory that causes this?


Solution

  • Your best bet is to start with the reduction example in the CUDA Samples. The scan example is also good for learning the principles of parallel computing on a throughput architecture.

    That said, if you actually just want to use a reduction operator in your code then you should look at Thrust (calling from host, cross-platform) and CUB (CUDA GPU specific).
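
    For illustration, here is a minimal sketch of both library approaches, summing n unsigned integers on the device. The names d_vec, d_in, d_out, d_temp and n are illustrative, not from your post:

        // Minimal sketch: summing device data with Thrust and with CUB.
        #include <cstdio>
        #include <thrust/device_vector.h>
        #include <thrust/functional.h>
        #include <thrust/reduce.h>
        #include <cub/cub.cuh>

        int main() {
            const unsigned int n = 1 << 20;
            thrust::device_vector<unsigned int> d_vec(n, 1u);  // n ones

            // Thrust: a single host-side call, no kernel code needed.
            unsigned int sum = thrust::reduce(d_vec.begin(), d_vec.end(),
                                              0u, thrust::plus<unsigned int>());
            printf("thrust sum = %u\n", sum);

            // CUB: query the required scratch-space size first, then reduce.
            unsigned int* d_in = thrust::raw_pointer_cast(d_vec.data());
            unsigned int* d_out;
            cudaMalloc(&d_out, sizeof(unsigned int));

            void*  d_temp     = nullptr;
            size_t temp_bytes = 0;
            cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // size query
            cudaMalloc(&d_temp, temp_bytes);
            cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // reduction

            unsigned int cubSum;
            cudaMemcpy(&cubSum, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
            printf("cub sum = %u\n", cubSum);

            cudaFree(d_temp);
            cudaFree(d_out);
            return 0;
        }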

    To look at your specific questions:

    • There's no reason you can't use global memory for a reduction: the reduction example in the toolkit walks through several levels of optimisation, and in every version the data starts in global memory.

    • Your code is inefficient compared to that example, since after the first few iterations most threads have no work left to do.

    • Your code tries to communicate between threads in different blocks, but __syncthreads() only synchronises the threads within a single block, never across blocks (which would be impossible in general, since the GPU is usually oversubscribed and not all blocks are resident at the same time).

    The last point is the most important. If a thread in block X wants to read data that is written by block Y, then you need to break the work across two kernel launches. That's why a typical parallel reduction takes a multi-phase approach: reduce batches within blocks, then reduce between the batches, as sketched below.
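
    Here is a minimal sketch of the per-block phase, assuming a power-of-2 block size; blockSum, in, out and n are illustrative names, not taken from your question:

        #define BLOCK_SIZE 256

        __global__ void blockSum(const unsigned int* in,
                                 unsigned int* out,
                                 unsigned int n) {
            __shared__ unsigned int sdata[BLOCK_SIZE];

            unsigned int tid = threadIdx.x;
            unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

            // Each thread loads one element into shared memory
            // (padding with 0 past the end of the input).
            sdata[tid] = (idx < n) ? in[idx] : 0u;
            __syncthreads();

            // Tree reduction within the block; __syncthreads() is valid
            // here because all cooperating threads are in the same block.
            for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
                if (tid < s) {
                    sdata[tid] += sdata[tid + s];
                }
                __syncthreads();
            }

            // Thread 0 publishes this block's partial sum.
            if (tid == 0) {
                out[blockIdx.x] = sdata[0];
            }
        }

    The host launches blockSum once to produce one partial sum per block, then runs it again over those partial sums (or copies them back and finishes on the host). The kernel launch boundary provides the grid-wide synchronisation that __syncthreads() cannot.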