fortran · gfortran · intel-fortran · micro-optimization

Do most compilers optimize MATMUL(TRANSPOSE(A),B)?


In a Fortran program, I need to compute several expressions like M · v, Mᵀ · v, Mᵀ · M, M · Mᵀ, etc. Here, M and v are 2D and 1D arrays of small size (less than 100, typically around 2-10).

I was wondering whether MATMUL(TRANSPOSE(M),v) would be unfolded at compile time into code as efficient as MATMUL(N,v), where N is explicitly stored as N=TRANSPOSE(M). I am specifically interested in the GNU (gfortran) and Intel (ifort) compilers with "strong" optimization flags (-O2, -O3 or -Ofast, for instance).


Solution

  • Below are the measured execution times of the various methods.

    system:

    • Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz
    • cache size : 6144 KB
    • RAM : 16 GB
    • GNU Fortran (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
    • ifort (IFORT) 18.0.5 20180823
    • BLAS : for the GNU compiler, the default (reference) BLAS was used

    compilation:

    [gnu] $ gfortran -O3 x.f90 -lblas
    [intel] $ ifort -O3 -mkl x.f90
    

    execution:

    [gnu] $ ./a.out > matmul.gnu.txt
    [intel] $ export MKL_NUM_THREADS=1; ./a.out > matmul.intel.txt
    

    In order to make the results as neutral as possible, I rescaled the timings by the average time of an equivalent set of operations. Threading was not considered.

    matrix times vector

    Six different implementations were compared:

    1. manual: do j=1,n; w(j)=0.0d0; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do
    2. matmul: matmul(P,v)
    3. blas N: dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
    4. matmul-transpose: matmul(transpose(P),v)
    5. matmul-transpose-tmp: Q=transpose(P); w=matmul(Q,v)
    6. blas T: dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
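    To make concrete what "optimizing the transpose away" means in case 4: the compiler can fuse the transpose into the loop by swapping the indices, so the inner loop walks down a column of P with stride 1 and no temporary copy is ever formed. A minimal sketch (the subroutine name is mine, not part of the benchmark):

    ```fortran
    ! Sketch of the fused loop a compiler can generate for
    ! matmul(transpose(P),v): w(j) = sum_k P(k,j)*v(k), which reads
    ! column j of P contiguously, so no temporary transpose is needed.
    subroutine matvec_t(n, P, v, w)
      implicit none
      integer, intent(in) :: n
      double precision, intent(in) :: P(n,n), v(n)
      double precision, intent(out) :: w(n)
      integer :: j, k
      do j = 1, n
         w(j) = 0.0d0
         do k = 1, n
            w(j) = w(j) + P(k,j)*v(k)   ! stride-1 walk down column j
         end do
      end do
    end subroutine matvec_t
    ```

    This access pattern is also why the explicit temporary in case 5 buys nothing: the transposed read is already contiguous.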

    In Figure 1 and Figure 2 you can compare the timing results for the above cases. Overall, the use of a temporary is not advisable with either gfortran or ifort: both compilers optimize MATMUL(TRANSPOSE(P),v) much better. While gfortran's implementation of MATMUL is faster than the reference BLAS, with ifort MKL BLAS is clearly faster.

    Figure 1: Matrix-vector multiplication. Comparison of the various implementations, run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.

    Figure 2: Matrix-vector multiplication. Comparison of the various implementations, run with a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.

    matrix times matrix

    Six different implementations were compared:

    1. manual: do l=1,n; do j=1,n; Q(j,l)=0.0d0; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do
    2. matmul: matmul(P,P)
    3. blas N: dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
    4. matmul-transpose: matmul(transpose(P),P)
    5. matmul-transpose-tmp: Q=transpose(P); R=matmul(Q,P)
    6. blas T: dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
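    The same fusion applies in the matrix case: matmul(transpose(P),P) can be evaluated as dot products of columns of P, again without materializing the transpose. A hedged sketch (the subroutine name is mine; I use a general second operand B, with B=P corresponding to case 4):

    ```fortran
    ! Sketch: R = matmul(transpose(P), B) via column dot products.
    ! Both inner-loop reads, P(k,j) and B(k,l), are stride 1, which is
    ! why no temporary transpose of P is needed.
    subroutine matmat_t(n, P, B, R)
      implicit none
      integer, intent(in) :: n
      double precision, intent(in) :: P(n,n), B(n,n)
      double precision, intent(out) :: R(n,n)
      integer :: j, l, k
      do l = 1, n
         do j = 1, n
            R(j,l) = 0.0d0
            do k = 1, n
               R(j,l) = R(j,l) + P(k,j)*B(k,l)
            end do
         end do
      end do
    end subroutine matmat_t
    ```

    Note this is only a naive triple loop; a tuned BLAS additionally blocks for cache, which is where MKL's advantage in the ifort results comes from.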

    In Figure 3 and Figure 4 you can compare the timing results for the above cases. In contrast to the vector case, the use of a temporary is advisable only with gfortran. While gfortran's implementation of MATMUL is faster than the reference BLAS, with ifort MKL BLAS is clearly faster. Remarkably, the manual implementation is comparable to MKL BLAS.

    Figure 3: Matrix-matrix multiplication. Comparison of the various implementations, run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.

    Figure 4: Matrix-matrix multiplication. Comparison of the various implementations, run with a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.


    The used code:

    program matmul_test
    
      implicit none
    
      double precision, dimension(:,:), allocatable :: P,Q,R
      double precision, dimension(:), allocatable :: v,w
    
      integer :: n,i,j,k,l
      double precision,dimension(12) :: t1,t2
    
      do n = 1,1000
         allocate(P(n,n),Q(n,n), R(n,n), v(n),w(n))
         call random_number(P)
         call random_number(v)
    
         i=0
    
         i=i+1
         call cpu_time(t1(i))
     do j=1,n; w(j)=0.0d0; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
         w=matmul(P,v)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
     call dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
         w=matmul(transpose(P),v)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
         Q=transpose(P)
         w=matmul(Q,v)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
     call dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
     do l=1,n; do j=1,n; Q(j,l)=0.0d0; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
         Q=matmul(P,P)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
         call dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
         Q=matmul(transpose(P),P)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
         Q=transpose(P)
         R=matmul(Q,P)
         call cpu_time(t2(i))
    
         i=i+1
         call cpu_time(t1(i))
         call dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
         call cpu_time(t2(i))
    
         write(*,'(I6,12D25.17)') n, t2-t1
         deallocate(P,Q,R,v,w)
      end do
    
    end program matmul_test