
GFortran unformatted I/O throughput on NVMe SSDs


Please help me understand how I can improve sequential, unformatted I/O throughput with (G)Fortran, especially when working on NVMe SSDs.

I wrote a small test program (see the bottom of this post). It opens one or more files in parallel (OpenMP) and writes an array of random numbers into each. It then flushes the system caches (root required; otherwise the read test will most likely read from memory), reopens the files, and reads them back. Time is measured as wall time (trying to include only I/O-related time), and throughput is reported in MiB/s. The program loops until aborted.

The hardware I am using for testing is a Samsung 970 Evo Plus 1TB SSD, connected via 2 PCIe 3.0 lanes, so in theory it should be capable of ~1500MiB/s sequential reads and writes. Testing beforehand with "dd if=/dev/zero of=./testfile bs=1G count=1 oflag=direct" results in ~750MB/s. That is not great, but still better than what I get with GFortran. Depending on who you ask, dd should not be used for benchmarking anyway; this is just to make sure that the hardware is capable of more.

Results with my code tend to get better with larger file sizes, but even with 1GiB they cap out at around 200MiB/s write and 420MiB/s read. Using more threads (e.g. 4) increases write speeds a bit, but only to around 270MiB/s. I made sure to keep the benchmark runs short and to give the SSD time to recover between tests.

I was under the impression that it should be possible to saturate 2 PCIe 3.0 lanes' worth of bandwidth even with a single thread, at least when using unformatted I/O. The code does not seem to be CPU-limited: top shows less than 50% usage on a single core if I move the allocation and initialization of the "values" array out of the loop. That still does not bode well for overall performance, considering that I would like to see numbers at least 5 times higher.
I also tried access=stream in the open statements, but to no avail.

So what seems to be the problem?
Is my code wrong/unoptimized? Are my expectations too high?

Platform used:
openSUSE Leap 15.1, Kernel 4.12.14-lp151.28.36-default
2x AMD Epyc 7551, Supermicro H11DSI, Samsung 970 Evo Plus 1TB (2xPCIe 3.0)
gcc version 8.2.1, compiler options: -ffree-line-length-none -O3 -ffast-math -funroll-loops -flto

MODULE types
    implicit none
    save

    INTEGER, PARAMETER  :: I8B = SELECTED_INT_KIND(18)
    INTEGER, PARAMETER  :: I4B = SELECTED_INT_KIND(9)
    INTEGER, PARAMETER  :: SP = KIND(1.0)
    INTEGER, PARAMETER  :: DP = KIND(1.0d0)

END MODULE types

MODULE parameters
    use types
    implicit none
    save

    INTEGER(I4B) :: filesize ! file size in MiB
    INTEGER(I4B) :: nthreads ! number of threads for parallel execution
    INTEGER(I4B) :: alloc_size ! size of the allocated data field

END MODULE parameters



PROGRAM iometer
    use types
    use parameters
    use omp_lib

    implicit none

    CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
    CHARACTER(LEN=40)  :: dummy_char1
    CHARACTER(LEN=110) :: filename
    CHARACTER(LEN=10)  :: filenumber
    INTEGER(I4B) :: thread, tunit, n
    INTEGER(I8B) :: counti, countf, count_rate
    REAL(DP) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
    REAL(SP), DIMENSION(:), ALLOCATABLE :: values

    call system_clock(counti,count_rate)

    call getarg(1,directory_char)
    dummy_char1 = ' directory to test:'
    write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))

    call getarg(2,filesize_char)
    dummy_char1 = ' file size (MiB):'
    read(filesize_char,*) filesize
    write(*,'(A40,I12)') dummy_char1, filesize

    call getarg(3,nthreads_char)
    dummy_char1 = ' number of parallel threads:'
    read(nthreads_char,*) nthreads
    write(*,'(A40,I12)') dummy_char1, nthreads

    alloc_size = filesize * 262144

    dummy_char1 = ' allocation size:'
    write(*,'(A40,I12)') dummy_char1, alloc_size

    mib_written = real(alloc_size,kind=dp) * real(nthreads,kind=dp) / 1048576.0_dp
    mib_read = mib_written

    CALL OMP_SET_NUM_THREADS(nthreads)
    do while(.true.)
        !$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)

        thread = omp_get_thread_num()
        write(filenumber,'(I0.10)') thread
        filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'

        allocate(values(alloc_size))
        call random_seed()
        call RANDOM_NUMBER(values)
        tunit = thread + 100

        !$OMP BARRIER
        !$OMP MASTER
        call system_clock(counti)
        !$OMP END MASTER
        !$OMP BARRIER

        open(unit=tunit, file=trim(adjustl(filename)), status='replace', action='write', form='unformatted')
        write(tunit) values
        close(unit=tunit)

        !$OMP BARRIER
        !$OMP MASTER
        call system_clock(countf)
        telapsed_write = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
        write_speed = mib_written/telapsed_write
        !write(*,*) 'write speed (MiB/s): ', write_speed
        call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
        call system_clock(counti)
        !$OMP END MASTER
        !$OMP BARRIER

        open(unit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted')
        read(tunit) values
        close(unit=tunit)

        !$OMP BARRIER
        !$OMP MASTER
        call system_clock(countf)
        telapsed_read = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
        read_speed = mib_read/telapsed_read
        write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
        !$OMP END MASTER
        !$OMP BARRIER
        deallocate(values)
        !$OMP END PARALLEL

        call sleep(1)

    end do

END PROGRAM iometer

Solution

  • The mistake in your code is that the calculation of mib_written does not take into account the size of a real(sp) variable (4 bytes). Your results are therefore a factor of 4 too low. Calculate it, for example, as

    mib_written = filesize * nthreads
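
    If you would rather derive the figure from the element count than from the command-line argument, the same correction can be sketched as follows. The module and function names (io_units, mib_of_default_reals) are made up for illustration, and the sketch assumes the array holds default-kind reals, as "values" does in your program.

    ```fortran
    module io_units
       use iso_fortran_env, only: int64, real64
       implicit none
    contains
       ! Size in MiB of an array holding n_elements default-kind reals.
       ! (Hypothetical helper, not part of the original benchmark.)
       pure function mib_of_default_reals(n_elements) result(mib)
          integer(int64), intent(in) :: n_elements
          real(real64) :: mib
          real :: probe                      ! same kind as the "values" array
          integer(int64) :: bytes_per_element

          ! storage_size reports bits; assumes 8-bit bytes
          bytes_per_element = storage_size(probe, kind=int64) / 8_int64
          mib = real(n_elements * bytes_per_element, real64) / 1048576.0_real64
       end function mib_of_default_reals
    end module io_units
    ```

    With a file size of 1024 MiB and one thread, alloc_size is 1024 * 262144 = 268435456 elements. Your original formula divides that element count by 1048576 and reports 256 MiB, whereas multiplying by 4 bytes per element first gives the 1024 MiB that were actually written (assuming default reals are 4 bytes, i.e. no -fdefault-real-8).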
    

    Some minor nits, some specific to GFortran:

    Your test program with these fixes (and the parameters module folded into the main program) is below:

    PROGRAM iometer
      use iso_fortran_env
      use omp_lib
    
      implicit none
    
      CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
      CHARACTER(LEN=40)  :: dummy_char1
      CHARACTER(LEN=110) :: filename
      CHARACTER(LEN=10)  :: filenumber
      INTEGER :: thread, tunit
      INTEGER(int64) :: counti, countf, count_rate
      REAL(real64) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
      REAL, DIMENSION(:), ALLOCATABLE :: values
    
      INTEGER :: filesize ! file size in MiB
      INTEGER :: nthreads ! number of threads for parallel execution
      INTEGER(int64) :: alloc_size ! size of the allocated data field
    
    
      call system_clock(counti,count_rate)
    
      call get_command_argument(1, directory_char)
      dummy_char1 = ' directory to test:'
      write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
    
      call get_command_argument(2, filesize_char)
      dummy_char1 = ' file size (MiB):'
      read(filesize_char,*) filesize
      write(*,'(A40,I12)') dummy_char1, filesize
    
      call get_command_argument(3, nthreads_char)
      dummy_char1 = ' number of parallel threads:'
      read(nthreads_char,*) nthreads
      write(*,'(A40,I12)') dummy_char1, nthreads
    
      alloc_size = filesize * 262144_int64
    
      dummy_char1 = ' allocation size:'
      write(*,'(A40,I12)') dummy_char1, alloc_size
    
      mib_written = filesize * nthreads
      dummy_char1 = ' MiB written:'
      write(*, '(A40,g0)') dummy_char1, mib_written
      mib_read = mib_written
    
      CALL OMP_SET_NUM_THREADS(nthreads)
      !$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
      do while (.true.)
         thread = omp_get_thread_num()
         write(filenumber,'(I0.10)') thread
         filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
    
         if (.not. allocated(values)) then
            allocate(values(alloc_size))
            call RANDOM_NUMBER(values)
         end if
    
         open(newunit=tunit, file=filename, status='replace', action='write', form='unformatted', access='stream')
         !$omp barrier
         !$omp master
         call system_clock(counti)
         !$omp end master
         !$omp barrier
         write(tunit) values
         close(unit=tunit)
         !$omp barrier
         !$omp master
         call system_clock(countf)
    
         telapsed_write = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
         write_speed = mib_written/telapsed_write
         call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
    
         !$OMP END MASTER
    
         open(newunit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted', access='stream')
         !$omp barrier
         !$omp master
         call system_clock(counti)
         !$omp end master
         !$omp barrier
         read(tunit) values
         close(unit=tunit)
         !$omp barrier
         !$omp master
         call system_clock(countf)
    
         telapsed_read = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
         read_speed = mib_read/telapsed_read
         write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
         !$OMP END MASTER
    
         call sleep(1)
    
      end do
      !$OMP END PARALLEL
    
    END PROGRAM iometer
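
    The program above opens the files with access='stream' rather than the default sequential access. One visible difference is the on-disk layout: with sequential unformatted access GFortran wraps each record in record markers, while stream access writes only the data. A rough sketch that makes this measurable, assuming GFortran's default 4-byte record markers and 4-byte default reals; the module, function, and scratch-file names are made up for illustration.

    ```fortran
    module record_overhead
       use iso_fortran_env, only: int64
       implicit none
    contains
       ! Write n default reals to a scratch file using the given access mode
       ! ('sequential' or 'stream') and return the resulting file size in bytes.
       function written_file_size(access_mode, n) result(nbytes)
          character(len=*), intent(in) :: access_mode
          integer, intent(in) :: n
          integer(int64) :: nbytes
          real, allocatable :: values(:)
          integer :: u
          ! hypothetical scratch-file name, created in the current directory
          character(len=*), parameter :: fname = 'record_overhead_demo.tmp'

          allocate(values(n))
          values = 1.0

          open(newunit=u, file=fname, status='replace', action='write', &
               form='unformatted', access=access_mode)
          write(u) values
          close(u)

          ! Ask the runtime for the size of the file just written.
          inquire(file=fname, size=nbytes)

          ! Remove the scratch file again.
          open(newunit=u, file=fname, status='old')
          close(u, status='delete')
       end function written_file_size
    end module record_overhead
    ```

    Under those assumptions, written_file_size('stream', 1000) should return 4000 and the sequential variant 8 bytes more (one marker before and one after the record). For a single large write per file that overhead is negligible, which is consistent with access=stream making no noticeable difference in your tests.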