Why doesn't the compiler perform some trivial optimizations that could be done in the kernel? I have the following code for matrix multiplication:
__global__ void matrixMultiply(float * A, float * B, float * C,
                               int numARows, int numAColumns,
                               int numBRows, int numBColumns,
                               int numCRows, int numCColumns) {
    int n = numAColumns;
    int Row = blockIdx.x * blockDim.x + threadIdx.x;
    int Col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((Row < numCRows) && (Col < numCColumns)) {
        for (int k = 0; k < n; ++k) {
            C[Row * numCColumns + Col] +=
                A[Row * numAColumns + k] * B[k * numBColumns + Col];
        }
    }
}
The kernel runs much faster if I use a temporary register Cvalue to accumulate the sum:
__global__ void matrixMultiply(float * A, float * B, float * C,
                               int numARows, int numAColumns,
                               int numBRows, int numBColumns,
                               int numCRows, int numCColumns) {
    int n = numAColumns;
    int Row = blockIdx.x * blockDim.x + threadIdx.x;
    int Col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((Row < numCRows) && (Col < numCColumns)) {
        float Cvalue = 0;
        for (int k = 0; k < n; ++k) {
            Cvalue += A[Row * numAColumns + k] * B[k * numBColumns + Col];
        }
        C[Row * numCColumns + Col] = Cvalue;
    }
}
In the second version, global memory for C is accessed only once, whereas in the first version it is read and written on every loop iteration. Isn't this kind of optimization normally done by compilers? The two versions differ by about 30% in performance in my tests, and I'm compiling with nvcc -O3 ...
Since C is not declared as __restrict__, the compiler cannot know whether C is the same matrix as A or B: if they overlapped, the store to C on each iteration could change the values later loaded from A or B, so the compiler must keep every per-iteration access to global memory instead of hoisting the sum into a register. When I instead declared the parameter as float* __restrict__ C, the times for both versions became practically the same. Thanks Chris Dodd.
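For reference, here is a minimal sketch of the restrict-qualified version suggested above; the body is unchanged from the original kernel in the question, and only the qualifier on C is new:

__global__ void matrixMultiply(float * A, float * B, float * __restrict__ C,
                               int numARows, int numAColumns,
                               int numBRows, int numBColumns,
                               int numCRows, int numCColumns) {
    int n = numAColumns;
    int Row = blockIdx.x * blockDim.x + threadIdx.x;
    int Col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((Row < numCRows) && (Col < numCColumns)) {
        // __restrict__ promises the compiler that C does not alias A or B,
        // so it may legally keep the running sum in a register and write
        // C[Row*numCColumns+Col] back only once, as in the hand-optimized kernel.
        for (int k = 0; k < n; ++k) {
            C[Row * numCColumns + Col] +=
                A[Row * numAColumns + k] * B[k * numBColumns + Col];
        }
    }
}

Marking A and B as const float * __restrict__ as well is a common further step (on newer GPUs it can also let the compiler route those loads through the read-only cache), but that goes beyond what was tested here.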