CUDA synchronization

How many processors can I get in a block on a CUDA GPU?


I have three questions:

  1. If I create only one block of threads in CUDA and execute a parallel program on it, is it possible that more than one processor would be given to that single block, so that my program gets some benefit of a multiprocessor platform? To be clearer: if I use only one block of threads, how many processors will be allocated to it? As far as I know (I might have misunderstood), one warp is given only a single processing element.
  2. Can I synchronize the threads of different blocks? If yes, please give some hints on how to do it.
  3. How do I find out the warp size? Is it fixed for particular hardware?

Solution

  • 1. Is it possible that more than one processor would be given to a single block so that my program gets some benefit of a multiprocessor platform?

    Simple answer: No.

    The CUDA programming model maps one thread block to one multiprocessor (SM); the block cannot be split across two or more multiprocessors and, once started, it will not move from one multiprocessor to another.

    As you have seen, CUDA provides __syncthreads() to allow threads within a block to synchronise. This is a very low cost operation, and that's partly because all the threads within a block are in close proximity (on the same SM). If they were allowed to split then this would no longer be possible. In addition, threads within a block can cooperate by sharing data in the shared memory; the shared memory is local to a SM and hence splitting the block would break this too.
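
    As a minimal sketch (the kernel name blockSum and the constant BLOCK_SIZE below are illustrative, not from the original answer), here is a block-wide sum in which the threads of one block cooperate through shared memory and __syncthreads(). Both the barrier and the shared array only work because every thread of the block lives on the same SM:

    #define BLOCK_SIZE 256

    __global__ void blockSum(const float *in, float *out, int n)
    {
        __shared__ float sdata[BLOCK_SIZE];     // shared memory is local to the block's SM

        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                        // every thread of the block must reach the barrier

        // Tree reduction within the block; each step needs a barrier.
        for (int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            out[blockIdx.x] = sdata[0];         // one partial result per block
    }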

    2. Can I synchronize the threads of different blocks?

    Not really, no. There are some things you can do, like getting the very last block to do something special (see the threadFenceReduction sample in the SDK), but general synchronisation is not really possible. When you launch a grid, you have no control over the scheduling of the blocks onto the multiprocessors, so any attempt to do global synchronisation would risk deadlock.
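
    As a rough sketch of that "last block" idea (the names blocksDone, partialSums and finishBlockSum below are illustrative, not the SDK's): each block publishes a partial result, makes it visible with __threadfence(), and atomically bumps a counter; whichever block sees the final count knows all other blocks have finished and can combine the partial results. This assumes blocksDone is zeroed before the launch and that each block writes partialSums[blockIdx.x] earlier in the kernel:

    __device__ unsigned int blocksDone = 0;     // zeroed before the launch

    __global__ void finishBlockSum(float *partialSums, float *result, int numBlocks)
    {
        // ... each block computes and stores partialSums[blockIdx.x] here ...

        __shared__ bool isLastBlock;

        if (threadIdx.x == 0)
        {
            __threadfence();                    // make this block's writes visible to all blocks
            unsigned int done = atomicInc(&blocksDone, numBlocks);
            isLastBlock = (done == numBlocks - 1);
        }
        __syncthreads();

        if (isLastBlock && threadIdx.x == 0)
        {
            // Only the last block to finish gets here; it can safely read
            // every block's partial result and produce the final answer.
            float sum = 0.0f;
            for (int i = 0; i < numBlocks; ++i)
                sum += partialSums[i];
            *result = sum;
            blocksDone = 0;                     // reset for the next launch
        }
    }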

    3. How do I find out the warp size? Is it fixed for particular hardware?

    Yes, it is fixed. In fact, for all current CUDA-capable devices (compute capability 1.x and 2.0) it is fixed to 32. If your code relies on the warp size, you should still query it to ensure forward compatibility.

    In device code you can just use the special variable warpSize. In host code you can query the warp size for a specific device with:

    // Query the properties of the current device and read its warp size.
    cudaError_t result;
    int deviceID;
    struct cudaDeviceProp prop;
    
    result = cudaGetDevice(&deviceID);
    if (result != cudaSuccess)
    {
        // handle the error (e.g. report it and bail out)
    }
    result = cudaGetDeviceProperties(&prop, deviceID);
    if (result != cudaSuccess)
    {
        // handle the error (e.g. report it and bail out)
    }
    
    int warpSize = prop.warpSize;
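
    And a small device-side sketch (the kernel name laneInfo is illustrative) using the built-in warpSize variable to work out which warp within its block a thread belongs to, and its lane within that warp:

    __global__ void laneInfo(int *warpIds, int *laneIds)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        warpIds[tid] = threadIdx.x / warpSize;   // warp index within the block
        laneIds[tid] = threadIdx.x % warpSize;   // lane of the thread within its warp
    }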