I have a CUDA program whose kernel basically does the following.
Suppose n/2 of those points are of one kind and the rest are of other. A sample image generated looks as follows.
Especially when precision of the time was set to 1 I noticed that the number of unique 3 tuples of time (t_0,t_1,t_2) was just 2.5% of the total data points i.e. number of points on the plane. This meant that most of the times the kernel was uselessly simulating when it could just use the values from previous simulations. Hence I could use a dictionary having key as 3-tuple of times and value as index and prob. Since as far as I know and tested, STL can't be accessed inside a kernel, I constructed an array of floats of size 201000000. This choice was by experimentation since none of the top 3 times exceeded 20 seconds. Hence t_0 could take any value from {0.0,0.1,0.2,...,20.0} thus having 201 choices. I could construct a key for such a dictionary like the following
As far as the value is concerned I could make it as (prob+idx). Since idx is an integer and 0.0<=prob<=1.0, I could retrieve both of those values later by
So now my kernel looks like the following
__global__ my_kernel(float* points,float* dict,float *p,float *i,size_t w,...){
unsigned int col = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int row = blockIdx.x*blockDim.x + threadIdx.x;
//Calculate time taken for each of the points to reach a particular point on the plane
//Order the times in increasing order t_0,t_1,...,t_n
//Calculate Key = t_o * 10^6 + t_1 * 10^3 + t_2
if(dict[key]>0.0){
prob=dict[key]-floor(dict[key])
idx = floor(dict[key])
}
else{
//Simulate and find prob and idx
dict[key]=(prob+idx)
}
p[row*width+col]=prob;
i[row*width+col]=idx;
}
The result is quite similar to the original program for most points but for some it is wrong.
I am quite sure that this is due to race condition. Notice that dict was initialized with all zeroes. The basic idea would be to make the data structure "read many write once" in a particular location of the dict.
I am aware that there might be much more optimized ways of solving this problem rather than allocating so much memory. Please let me know in that case. But I would really like to understand why this particular solution is failing. In particular I would like to know how to use atomicAdd in this setting. I have failed to use it.
Unless your simulation in the else
branch is very long (~100s of floating-point operations), a lookup table in global memory is likely to be slower than running the computation. Global memory access is very expensive!
In any case, there is no way to save time by "skipping work" using conditional branching. The Single Instruction, Multiple Thread architecture of a GPU means that the instructions for both sides of the branch will be executed serially, unless all of the threads in a block follow the same branch.
The fact that you are seeing a performance increase as a result of introducing the conditional branch and you didn't have any problems with deadlock suggests that all the threads in each block are always taking the same branch. I suspect that once dict
starts getting populated, the performance increase will go away.
Perhaps I have misunderstood something, but if you want to calculate the probability of an event x
, assuming a normal distribution and given the mean mu
and standard deviation sigma
, there is no need to generate a load of random numbers and approximate a Gaussian curve. You can directly calculate the probability:
p = exp(-((x - mu) * (x - mu) / (2.0f * sigma * sigma))) /
(sigma * sqrt(2.0f * M_PI));