cudafortrancufft

cuda fortran cufftPlanMany


I am having troubles using cufftPlanMany. After creating the plans and taking the forward and inverse FFTs, I could not get the original data back. Please, find attached a minimum version of the code.

program test_cufft
  use cudafor
  use cufft

  integer :: plan_r2c
  integer :: plan_c2r
  real,allocatable,dimension(:,:,:,:), device :: eta_d
  complex,allocatable,dimension(:,:,:,:), device :: etak_d

  nv = 4
  nx = 256
  ny = 512
  nz = 512
  nx21 = nx/2+1

  allocate( eta_d(nv,nx,ny,nz) )
  allocate( etak_d(nv,nx21,ny,nz) )

  batch = nv;
  rank = 3;
  n = (/ nx, ny, nz /);
  idist = nx*ny*nz;
  odist = nx21*ny*nz;
  inembed = (/ nx, ny, nz /);
  onembed = (/ nx21, ny, nz /);
  istride = 1;
  ostride = 1;

  istat = cufftPlanMany( plan_r2c, rank, n, inembed, istride, idist, &
                         onembed, ostride, odist, CUFFT_R2C, batch )
  istat = cufftPlanMany( plan_c2r, rank, n, onembed, ostride, odist, &
                         inembed, istride, idist, CUFFT_C2R, batch )

  ! Initialize eta_d

  istat = cufftExecR2C( plan_r2c, eta_d, etak_d )
  istat = cufftExecC2R( plan_c2r, etak_d, eta_d )
  eta_d = eta_d/idist

end program test_cufft

The problem is after I did the forward and inverse FFTs, I could not get the original data back. Please, what am I doing wrong? Should the ordering of data be eta_d(batch,nx,ny,nz) or eta_d(nx,ny,nz,batch)?


Solution

  • I would say the correct ordering is (nz, ny, nx, batch)

    But it's important to relate these to your array indexing and storage order as well.

    In CUFFT terminology, for a 3D transform(*) the nz direction is the fastest changing index, with typical usage (stride=1) being adjacent data in memory, corresponding to adjacent elements in a transform.

    This direction (I think of it as the elements along a row, i.e. the "z" index being the column index) is also the direction of a multidimensional transform that gets "reduced" in the complex domain, for the R2C/C2R transform types.

    With that in mind, I would rewrite your code this way:

    $ cat t4.cuf
    program test_cufft
      use cudafor
      use cufft
    
      integer :: plan_r2c
      integer :: plan_c2r
      real,allocatable,dimension(:,:,:,:), managed :: eta_d
      complex,allocatable,dimension(:,:,:,:), managed :: etak_d
      integer :: n(3), inembed(3), onembed(3),rank,istride,idist,ostride,odist,batch
    
      nv = 4
      nx = 8
      ny = 8
      nz = 4
      nz21 = nz/2+1
    
      allocate( eta_d(nz,ny,nx,nv) )
      allocate( etak_d(nz21,ny,nx,nv) )
    
      batch = nv;
      rank = 3;
      n = (/ nx, ny, nz /);
      idist = nx*ny*nz;
      odist = nx*ny*nz21;
      inembed = (/ nx, ny, nz /);
      onembed = (/ nx, ny, nz21 /);
      istride = 1;
      ostride = 1;
    
      istat = cufftPlanMany( plan_r2c, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, batch )
      istat = cufftPlanMany( plan_c2r, rank, n, onembed, ostride, odist, inembed, istride, idist, CUFFT_C2R, batch )
    
      ! Initialize eta_d
      eta_d(:,:,:,:) = 1.0
      eta_d(1,1,1,2) = 2.0
      istat = cufftExecR2C( plan_r2c, eta_d, etak_d )
      istat = cudaDeviceSynchronize()
      eta_d(:,:,:,:) = 0.0
      istat = cufftExecC2R( plan_c2r, etak_d, eta_d )
      istat = cudaDeviceSynchronize()
      eta_d = eta_d/idist
      print *,eta_d(1,1,1,1)
      print *,eta_d(1,1,1,2)
    end program test_cufft
    $ nvfortran t4.cuf -lcufft
    $ ./a.out
        1.000000
        2.000000
    $
    

    (NVIDIA HPC SDK 20.9, Tesla V100 GPU)

    and it seems to give expected results for my simple test case.

    (*) For a 2D transform, the ny dimension would be the most rapidly varying, and for a 1D transform the nx dimension (of course) is the most rapidly varying.

    The multi-dimensional transforms and advanced data layout sections of the CUFFT manual may also be useful reading.