In a Fortran program, I need to compute several expressions like M · v, MT · v, MT · M, M · MT, etc ... Here, M and v are 2D and 1D arrays of small size (less than 100, typically around 2-10)
I was wondering if writing MATMUL(TRANSPOSE(M),v)
would unfold at compile time into some code as efficient as MATMUL(N,v)
, where N
is explicitly stored as N=TRANSPOSE(M)
. I am specifically interested in the gnu and ifort compilers with "strong" optimization flags (-O2, -O3 or -Ofast for instance).
Below you find a couple of execution times of various methods.
system:
- Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz
- cache size : 6144 KB
- RAM : 16MB
- GNU Fortran (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
- ifort (IFORT) 18.0.5 20180823
- BLAS : for gnu compiler, the used blas is the default version
compilation:
[gnu] $ gfortran -O3 x.f90 -lblas [intel] $ ifort -O3 -mkl x.f90
execution:
[gnu] $ ./a.out > matmul.gnu.txt [intel] $ EXPORT MKL_NUM_THREADS=1; ./a.out > matmul.intel.txt
In order, to make the results as neutral as possible, I've rescaled the answers with the average time of an equivalent set of operations done. I ignored threading.
Six different implementations were compared:
do j=1,n; do k=1,n; w(j) = P(j,k)*v(k); end do; end do
matmul(P,v)
dgemv('N',n,n,1.0D0,P,n,v,1,0,w,1)
matmul(transpose(P),v)
Q=transpose(P); w=matmul(Q,v)
dgemv('T',n,n,1.0D0,P,n,v,1,0,w,1)
In Figure 1 and Figure 2 you can compare the timing results for the above cases. Overall we can say that the usage of a temporary is in both gfortran
and ifort
not advised. Both compilers can optimize MATMUL(TRANSPOSE(P),v)
much better. While in gfortran
, the implementation of MATMUL
is faster than default BLAS, ifort
clearly shows that mkl-blas
is faster.
figure 1: Matrix-vector multiplication. Comparison of various implementations ran on gfortran
. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n2 × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.
figure 2: Matrix-vector multiplication. Comparison of various implementations ran on a single-threaded ifort
compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n2 × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.
Six different implementations were compared:
do l=1,n; do j=1,n; do k=1,n; Q(j,l) = P(j,k)*P(k,l); end do; end do; end do
matmul(P,P)
dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
matmul(transpose(P),P)
Q=transpose(P); matmul(Q,P)
dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
In Figure 3 and Figure 4 you can compare the timing results for the above cases. In contrast to the vector-case, the usage of a temporary is only advised for gfortran. While in gfortran
, the implementation of MATMUL
is faster than default BLAS, ifort
clearly shows that mkl-blas
is faster. Remarkably, the manual implementation is comparable to mkl-blas
.
figure 3: Matrix-matrix multiplication. Comparison of various implementations ran on gfortran
. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n3 × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.
figure 4: Matrix-matrix multiplication. Comparison of various implementations ran on a single-threaded ifort
compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n3 × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.
The used code:
program matmul_test
implicit none
double precision, dimension(:,:), allocatable :: P,Q,R
double precision, dimension(:), allocatable :: v,w
integer :: n,i,j,k,l
double precision,dimension(12) :: t1,t2
do n = 1,1000
allocate(P(n,n),Q(n,n), R(n,n), v(n),w(n))
call random_number(P)
call random_number(v)
i=0
i=i+1
call cpu_time(t1(i))
do j=1,n; do k=1,n; w(j) = P(j,k)*v(k); end do; end do
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
w=matmul(P,v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemv('N',n,n,1.0D0,P,n,v,1,0,w,1)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
w=matmul(transpose(P),v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=transpose(P)
w=matmul(Q,v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemv('T',n,n,1.0D0,P,n,v,1,0,w,1)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
do l=1,n; do j=1,n; do k=1,n; Q(j,l) = P(j,k)*P(k,l); end do; end do; end do
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=matmul(P,P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=matmul(transpose(P),P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=transpose(P)
R=matmul(Q,P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
call cpu_time(t2(i))
write(*,'(I6,12D25.17)') n, t2-t1
deallocate(P,Q,R,v,w)
end do
end program matmul_test