I have a boolean 1D array T[N]
controlling the value of the shifts as follows:
**a
: an array of pointers to n*n
matrices in global memory
For each matrix a
I want to subtract shift*Identity, i.e. to obtain:
a = a - shift*eye(n)
I have:
__device__ bool T[N];
__device__ float shift1[N];
__device__ float shift2[N];
__device__ float* a[N];
The value of shift is controlled by T: if T[i]==true then shift=shift1, else shift=shift2.
int tid = threadIdx.x;
if (tid < N) {
    if (T[tid]) {
        for (int i = 0; i < n; i++) {
            a[tid][i*n+i] = a[tid][i*n+i] - shift1[tid];
        }
    }
    else {
        for (int i = 0; i < n; i++) {
            a[tid][i*n+i] = a[tid][i*n+i] - shift2[tid];
        }
    }
}
__syncthreads();
This will cause warp divergence and slow down my code. Is there a trick to avoid warp divergence for the above loop?
As suggested by @AnastasiyaAsadullayeva, I believe a fairly simple transformation of your code might reduce your concerns about warp divergence:
int tid = threadIdx.x;
if (tid < N) {
    // select the shift up front (note: reading T[tid] only after the
    // bounds check avoids an out-of-bounds access when tid >= N)
    float myshift;
    if (T[tid]) myshift = shift1[tid];
    else        myshift = shift2[tid];
    for (int i = 0; i < n; i++) {
        a[tid][i*n+i] = a[tid][i*n+i] - myshift;
    }
}
__syncthreads();
The compiler will predicate the loads for myshift
(creating the "conditional load" already mentioned). This predication minimizes the cost of divergence for the load itself. The rest of the code under this transformation is non-divergent (except where tid >= N
, which should be of no concern).
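For illustration, the same selection can also be written with a ternary operator, which the compiler typically lowers to predicated or select instructions rather than a divergent branch. This is only a sketch under the declarations shown above; the kernel name diag_shift and the n parameter passing are made up for the example:

```cuda
// Hypothetical kernel wrapper around the snippet above.
// Assumes the __device__ arrays T, shift1, shift2, a declared earlier.
__global__ void diag_shift(int n)
{
    int tid = threadIdx.x;
    if (tid < N) {
        // Branchless selection: both loads can be predicated,
        // so no divergent branch is needed here.
        float myshift = T[tid] ? shift1[tid] : shift2[tid];
        for (int i = 0; i < n; i++) {
            a[tid][i*n+i] -= myshift;  // subtract shift from the diagonal
        }
    }
    __syncthreads();
}
```

Whether the ternary form and the if/else form compile to the same SASS can only be verified by inspecting the generated code (e.g. with cuobjdump).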
Again, as already mentioned, the compiler may well perform this whole transformation on its own. That is possible, but it cannot be confirmed without an actual complete test case, which you haven't provided.
A better approach is to write the code in a way that seems natural to you, and then let the compiler handle it. At that point, use a profiler and analysis-driven optimization to decide whether warp divergence is actually a performance problem in your code (the profilers have metrics and other ways to evaluate warp divergence and indicate its severity).
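For example, divergence can be inspected from the command line (a sketch; ./myapp stands in for your compiled executable, and metric names vary by toolkit version):

```shell
# Nsight Compute: average branch-target uniformity per warp
# (100% means warps never diverge at branches)
ncu --metrics smsp__sass_average_branch_targets_threads_uniform.pct ./myapp

# Legacy nvprof equivalent on older toolkits
nvprof --metrics branch_efficiency ./myapp
```

If the reported efficiency is already near 100%, warp divergence is not worth optimizing further in this kernel.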