I am trying to understand why my OpenACC code runs roughly 15,000 times faster on an NVIDIA V100 GPU than on an AMD MI250 GPU (329.87 s vs. 0.022 s). It is a simple matrix-matrix multiplication code. Here is the output I obtained on the NVIDIA V100, where the run took 2.2043999284505844E-002 s:
[ilkhom@topaz-3 MCCC-FN-GPU_DEV]$ cat acc.f90
!nvfortran -fast -Minfo=accel -acc -gpu=lineinfo,ptxinfo acc.f90
program main
  implicit none
  integer :: nkgmax, nchmax, i, f, j, nr, k
  real(kind=8), allocatable, dimension(:) :: cont_wave
  real(kind=8), allocatable, dimension(:,:) :: vmat2D
  real(kind=8) :: tmp
  integer :: time1, time2, dt, count_rate, count_max
  real(kind=8) :: secs_acc

  nkgmax = 2000
  nr = 2000
  allocate(cont_wave(1:nkgmax*nr))
  cont_wave(:) = 0.d0
  do i = 1, nkgmax
    do j = 1, nr
      cont_wave((i-1)*nr+j) = dble(i-j)/dble(i+j) !tmp !1.d0
    enddo
  enddo

  !!!! OpenACC test:
  !$acc enter data copyin(cont_wave,nr,nkgmax,nchmax)
  allocate(vmat2D(1:nkgmax,1:nkgmax))
  call system_clock(count_max=count_max, count_rate=count_rate)
  call system_clock(time1)
  !$acc kernels copyout(vmat2D) present(cont_wave,nkgmax)
  !$acc loop independent vector(16)
  do i = 1, nkgmax
    !$acc loop independent vector(16)
    do j = 1, nkgmax
      if (j .gt. i) cycle
      tmp = 0.d0
      !$acc loop seq
      do k = 1, nr
        tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
      enddo
      vmat2D(i,j) = tmp
      if (i /= j) vmat2D(j,i) = vmat2D(i,j)
    enddo
  enddo
  !$acc end kernels
  call system_clock(time2)
  dt = time2 - time1
  secs_acc = real(dt)/real(count_rate)
  print *, 'time in secs in OpenACC', secs_acc
  print *, 'min=', minval(vmat2D(1:nkgmax,1:nkgmax))
  print *, 'max=', maxval(vmat2D(1:nkgmax,1:nkgmax))
  print *, 'mean=', sum(vmat2D(1:nkgmax,1:nkgmax))/dble(nkgmax*nkgmax)
end program main
[ilkhom@t006 MCCC-FN-GPU_DEV]$ nvfortran -fast -Minfo=accel -acc -gpu=lineinfo,ptxinfo acc.f90 ; ./a.out
main:
25, Generating enter data copyin(cont_wave(:),nchmax,nr,nkgmax)
30, Generating copyout(vmat2d(:,:)) [if not already present]
Generating present(nkgmax,cont_wave(:))
32, Loop is parallelizable
34, Loop is parallelizable
Generating Tesla code
32, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
34, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
38, !$acc loop seq
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'main_34_gpu' for 'sm_70'
ptxas info : Function properties for main_34_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 176 registers, 392 bytes cmem[0]
time in secs in OpenACC 2.2043999284505844E-002
min= -760.4901596366437
max= 1973.862266351370
mean= 221.6705356107172
And here is the output on the AMD MI250 GPU, where the same run took 329.869873046875 s:
abdurakhmanov@uan01:/scratch/project_462000053/ilkhom/openacc/TEST> ftn -h acc -O3 acc_cray.f90 -o check_acc; srun ./check_acc
time in secs in OpenACC 329.869873046875
min= -760.49015963664374
max= 1973.8622663513693
mean= 221.67053561071717
One note: on the AMD GPU I am using Cray ftn, and since the original directive
!$acc loop independent vector(16)
produced this warning:
ftn-7271 ftn: WARNING MAIN, File = acc_cray.f90, Line = 36
Unsupported OpenACC vector_length expression: Converting 16 to 1.
I changed !$acc loop independent vector(16) to !$acc loop independent vector(32) in the source code (acc_cray.f90).
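With that change, the kernel region in acc_cray.f90 reads as follows (only the vector length differs from acc.f90 above):
!$acc kernels copyout(vmat2D) present(cont_wave,nkgmax)
!$acc loop independent vector(32)
do i = 1, nkgmax
  !$acc loop independent vector(32)
  do j = 1, nkgmax
    ! ... body identical to acc.f90 above ...
  enddo
enddo
!$acc end kernels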
I also have more detailed logs from the MI250 run, generated by setting export CRAY_ACC_DEBUG=3, which I can attach if required.
Cheers, Ilkhom
I expected to see at least similar runtimes on NVIDIA V100 and AMD MI250 GPUs.
I don't have experience with ftn or AMD GPUs (I work on the NVHPC compiler team), but other compilers have had issues with the "kernels" directive given it takes quite a bit of compiler analysis to support the auto-parallelization, especially with multiple levels of parallelism.
You might try using "parallel" instead as well as collapsing the outer loops:
!$acc parallel loop collapse(2) copyout(vmat2D) present(cont_wave,nkgmax)
do i = 1, nkgmax
  do j = 1, nkgmax
    if (j .gt. i) cycle
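    ! completing the nest with the body from your original kernel (untested sketch):
    tmp = 0.d0
    !$acc loop seq
    do k = 1, nr
      tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
    enddo
    vmat2D(i,j) = tmp
    if (i /= j) vmat2D(j,i) = vmat2D(i,j)
  enddo
enddo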
Multiple nested vector loops are not actually legal OpenACC, so another possibility is that ftn is only applying "vector" to the outer loop, and possibly not scheduling any gangs. We support it as an extension because the "kernels" model is based largely on the PGI Accelerator model, which did use it.
In standard OpenACC, you'd want to add a "worker" loop or use the "tile" clause for multi-level parallelism. For example:
!$acc parallel loop gang worker &
!$acc copyout(vmat2D) present(cont_wave,nkgmax)
do i = 1, nkgmax
  !$acc loop vector
  do j = 1, nkgmax
    if (j .gt. i) cycle
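    ! body as in the original kernel; the k loop stays sequential within each vector lane
    tmp = 0.d0
    !$acc loop seq
    do k = 1, nr
      tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
    enddo
    vmat2D(i,j) = tmp
    if (i /= j) vmat2D(j,i) = vmat2D(i,j)
  enddo
enddo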
Or with the "tile" clause:
!$acc parallel loop tile(16,16) copyout(vmat2D) present(cont_wave,nkgmax)
do i = 1, nkgmax
  do j = 1, nkgmax
    if (j .gt. i) cycle
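    ! body unchanged; tile(16,16) strip-mines i and j into 16x16 tiles,
    ! roughly matching the vector(16) x vector(16) schedule of the original
    tmp = 0.d0
    !$acc loop seq
    do k = 1, nr
      tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
    enddo
    vmat2D(i,j) = tmp
    if (i /= j) vmat2D(j,i) = vmat2D(i,j)
  enddo
enddo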