I am trying to understand why my OpenACC code runs roughly 15,000 times faster on an NVIDIA V100 GPU than on an AMD MI250 GPU (329.87 s vs. 0.022 s). It is a simple matrix-matrix multiplication code. Here is the output I obtained on the NVIDIA V100, where the run took 2.2043999284505844E-002 s:
[ilkhom@topaz-3 MCCC-FN-GPU_DEV]$ cat acc.f90
!nvfortran -fast -Minfo=accel -acc -gpu=lineinfo,ptxinfo acc.f90
program main
  implicit none
  integer :: nkgmax, nchmax, i, f, j, nr, k
  real(kind=8), allocatable, dimension(:) :: cont_wave
  real(kind=8), allocatable, dimension(:,:) :: vmat2D
  real(kind=8) :: tmp
  integer :: time1, time2, dt, count_rate, count_max
  real(kind=8) :: secs_acc

  nkgmax = 2000
  nr = 2000
  allocate(cont_wave(1:nkgmax*nr))
  cont_wave(:) = 0.d0
  do i = 1, nkgmax
    do j = 1, nr
      cont_wave((i-1)*nr+j) = dble(i-j)/dble(i+j) !tmp !1.d0
    enddo
  enddo

  !!!! OpenACC test:
  !$acc enter data copyin(cont_wave,nr,nkgmax,nchmax)
  allocate(vmat2D(1:nkgmax,1:nkgmax))
  call system_clock(count_max=count_max, count_rate=count_rate)
  call system_clock(time1)
  !$acc kernels copyout(vmat2D) present(cont_wave,nkgmax)
  !$acc loop independent vector(16)
  do i = 1, nkgmax
    !$acc loop independent vector(16)
    do j = 1, nkgmax
      if (j .gt. i) cycle
      tmp = 0.d0
      !$acc loop seq
      do k = 1, nr
        tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
      enddo
      vmat2D(i,j) = tmp
      if (i /= j) vmat2D(j,i) = vmat2D(i,j)
    enddo
  enddo
  !$acc end kernels
  call system_clock(time2)
  dt = time2 - time1
  secs_acc = real(dt)/real(count_rate)
  print *, 'time in secs in OpenACC', secs_acc
  print *, 'min=', minval(vmat2D(1:nkgmax,1:nkgmax))
  print *, 'max=', maxval(vmat2D(1:nkgmax,1:nkgmax))
  print *, 'mean=', sum(vmat2D(1:nkgmax,1:nkgmax))/dble(nkgmax*nkgmax)
end program main
[ilkhom@t006 MCCC-FN-GPU_DEV]$ nvfortran -fast -Minfo=accel -acc -gpu=lineinfo,ptxinfo acc.f90 ; ./a.out
main:
25, Generating enter data copyin(cont_wave(:),nchmax,nr,nkgmax)
30, Generating copyout(vmat2d(:,:)) [if not already present]
Generating present(nkgmax,cont_wave(:))
32, Loop is parallelizable
34, Loop is parallelizable
Generating Tesla code
32, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
34, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
38, !$acc loop seq
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'main_34_gpu' for 'sm_70'
ptxas info : Function properties for main_34_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 176 registers, 392 bytes cmem[0]
time in secs in OpenACC 2.2043999284505844E-002
min= -760.4901596366437
max= 1973.862266351370
mean= 221.6705356107172
And here is the output on the AMD MI250 GPU, where the same run took 329.869873046875 s:
abdurakhmanov@uan01:/scratch/project_462000053/ilkhom/openacc/TEST> ftn -h acc -O3 acc_cray.f90 -o check_acc; srun ./check_acc
time in secs in OpenACC 329.869873046875
min= -760.49015963664374
max= 1973.8622663513693
mean= 221.67053561071717
One note: on the AMD GPU I am using Cray ftn, and since the original directive
!$acc loop independent vector(16)
produced this warning:
ftn-7271 ftn: WARNING MAIN, File = acc_cray.f90, Line = 36
Unsupported OpenACC vector_length expression: Converting 16 to 1.
I changed !$acc loop independent vector(16) to !$acc loop independent vector(32) in the source code (acc_cray.f90).
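With that change, the kernel region in acc_cray.f90 reads as follows (only the vector length differs from acc.f90 above):
!$acc kernels copyout(vmat2D) present(cont_wave,nkgmax)
!$acc loop independent vector(32)
do i = 1, nkgmax
  !$acc loop independent vector(32)
  do j = 1, nkgmax
    ! ... body identical to acc.f90 above ...
  enddo
enddo
!$acc end kernels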
I also have more detailed logs from the MI250 run, generated by setting export CRAY_ACC_DEBUG=3, which I can attach if required.
Cheers, Ilkhom
I expected to see at least similar runtimes on NVIDIA V100 and AMD MI250 GPUs.
I don't have experience with ftn or AMD GPUs (I work on the NVHPC compiler team), but other compilers have had issues with the "kernels" directive given it takes quite a bit of compiler analysis to support the auto-parallelization, especially with multiple levels of parallelism.
You might try using "parallel" instead as well as collapsing the outer loops:
!$acc parallel loop collapse(2) copyout(vmat2D) present(cont_wave,nkgmax)
do i = 1, nkgmax
  do j = 1, nkgmax
    if (j .gt. i) cycle
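    ! completing the nest with the body from your original kernel (untested sketch):
    tmp = 0.d0
    !$acc loop seq
    do k = 1, nr
      tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
    enddo
    vmat2D(i,j) = tmp
    if (i /= j) vmat2D(j,i) = vmat2D(i,j)
  enddo
enddo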
Multiple nested vector loops are not actually legal OpenACC, so another possibility is that ftn is only applying "vector" to the outer loop, and possibly not scheduling any gangs. We support it as an extension because the "kernels" model is based largely on the PGI Accelerator model, which did use it.
In standard OpenACC, you'd want to add a "worker" loop or use the "tile" clause for multi-level parallelism. For example:
!$acc parallel loop gang worker &
!$acc copyout(vmat2D) present(cont_wave,nkgmax)
do i = 1, nkgmax
  !$acc loop vector
  do j = 1, nkgmax
    if (j .gt. i) cycle
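    ! body as in the original kernel; the k loop stays sequential within each vector lane
    tmp = 0.d0
    !$acc loop seq
    do k = 1, nr
      tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
    enddo
    vmat2D(i,j) = tmp
    if (i /= j) vmat2D(j,i) = vmat2D(i,j)
  enddo
enddo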
Or with the "tile" clause:
!$acc parallel loop tile(16,16) copyout(vmat2D) present(cont_wave,nkgmax)
do i = 1, nkgmax
  do j = 1, nkgmax
    if (j .gt. i) cycle
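    ! body unchanged; tile(16,16) strip-mines i and j into 16x16 tiles,
    ! roughly matching the vector(16) x vector(16) schedule of the original
    tmp = 0.d0
    !$acc loop seq
    do k = 1, nr
      tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
    enddo
    vmat2D(i,j) = tmp
    if (i /= j) vmat2D(j,i) = vmat2D(i,j)
  enddo
enddo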