I have a problem: my code gives me a runtime error when I compile it with PGI instead of Cray. It fails with a segmentation fault as soon as it enters the parallel region.
I tried several things and found that passing scalars (var4, see code) to the subroutine works properly, but passing arrays (var3, see code) to the routine fails.
The code works without any problems when compiled with Cray; only the PGI build fails.
So here is my question: is there a difference in how arrays are allocated on the device between PGI and Cray?
The parallel region with the call looks like this:
!$acc data present(var2,var3,var4)
!$acc parallel
!$acc loop gang vector collapse(2) private(var1)
DO j = 1, jend
  DO i = 1, iend
    IF (var2(i,j) .gt. 100.0) THEN
      CALL routine_seq ( var3(i,j,:), &
                         var4(i,j),   &
                         var1)
    END IF
  END DO
END DO
!$acc end parallel
In the routine I have included the !$acc routine seq directive. It looks like this:
SUBROUTINE routine_seq(var3,var4,var1)
!$acc routine seq
REAL (KIND=wp), DIMENSION( : ), &
INTENT( IN ) :: var3
REAL (KIND=wp), &
INTENT( IN ) :: var4
REAL (KIND=wp), &
INTENT( OUT ) :: var1
var3 and var4 are allocated this way:
ALLOCATE ( var3(iend,jend,kend) , STAT=ierr); IF (ierr/=0) istat=ierr
ALLOCATE ( var4(iend,jend) , STAT=ierr); IF (ierr/=0) istat=ierr
!$acc enter data create(var3,var4)
Since the error is a seg fault, the problem is most likely on the host side rather than in the device code.
Try adding a "present" clause to your "parallel" region:
!$acc parallel present(var2,var3)
It's a bit difficult to tell since you don't provide a complete reproducing example, but my best guess is that the compiler can't properly determine how much of the array is needed for the implicit copy into the region, given that var3 is not used in the region except as an argument to the call. PGI attempts to implicitly copy only the minimal amount of the array. Adding "present" disables the implicit copy, so the runtime only checks the present table to confirm that the array already has a device copy. Alternatively, you could use "copy(var2,var3)", which has "present_or_copy" semantics, so the present check covers the entire array rather than a subset.
To see what the compiler is using for the implicit copy, add "-Minfo=accel" to enable the compiler feedback messages.
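For reference, here is a minimal, self-contained sketch of the pattern with the "present" clause on the "parallel" construct. It is not your code: the module name demo_mod, the extra array res, the sizes, the initial values, and the body of routine_seq are all made up for illustration, and I use "copyin" instead of "create" so that the device copies are initialized.

```fortran
MODULE demo_mod
  IMPLICIT NONE
  INTEGER, PARAMETER :: wp = KIND(1.0D0)   ! assumed working precision
CONTAINS
  ! Hypothetical stand-in for the real routine: sums one column plus a scalar.
  SUBROUTINE routine_seq(col, scal, res)
    !$acc routine seq
    REAL (KIND=wp), DIMENSION(:), INTENT(IN)  :: col
    REAL (KIND=wp),               INTENT(IN)  :: scal
    REAL (KIND=wp),               INTENT(OUT) :: res
    INTEGER :: k
    res = scal
    DO k = 1, SIZE(col)
      res = res + col(k)
    END DO
  END SUBROUTINE routine_seq
END MODULE demo_mod

PROGRAM present_demo
  USE demo_mod
  IMPLICIT NONE
  INTEGER, PARAMETER :: iend = 4, jend = 3, kend = 5   ! made-up sizes
  REAL (KIND=wp), ALLOCATABLE :: var2(:,:), var3(:,:,:), var4(:,:), res(:,:)
  REAL (KIND=wp) :: var1
  INTEGER :: i, j

  ALLOCATE (var2(iend,jend), var3(iend,jend,kend), var4(iend,jend), res(iend,jend))
  var2 = 200.0_wp
  var3 = 1.0_wp
  var4 = 2.0_wp
  res  = 0.0_wp

  ! Create and initialize the device copies of the whole arrays.
  !$acc enter data copyin(var2,var3,var4,res)

  ! "present" tells the compiler not to generate an implicit copy;
  ! the runtime only looks the arrays up in the present table.
  !$acc parallel present(var2,var3,var4,res)
  !$acc loop gang vector collapse(2) private(var1)
  DO j = 1, jend
    DO i = 1, iend
      IF (var2(i,j) > 100.0_wp) THEN
        CALL routine_seq(var3(i,j,:), var4(i,j), var1)
        res(i,j) = var1
      END IF
    END DO
  END DO
  !$acc end parallel

  ! Bring the result back and release the device copies.
  !$acc update host(res)
  !$acc exit data delete(var2,var3,var4,res)

  PRINT *, 'res(1,1) =', res(1,1)
END PROGRAM present_demo
```

With PGI this could be built with something like "pgfortran -acc -Minfo=accel present_demo.f90" (the file name is hypothetical). Without "-acc" the directives are treated as comments and the program simply runs on the host, which is a quick way to check the logic before looking at the device side.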