I was comparing the performance of a sum followed by an assignment of two arrays, in the form c=a+b, between a native Fortran type, real, and a derived data type that contains only one array of real. The class is very simple: it contains operators for addition and assignment and a destructor, as follows:
module type_mod
  use iso_fortran_env
  type :: class_t
    real(8), dimension(:,:), allocatable :: a
  contains
    procedure :: assign_type
    generic, public :: assignment(=) => assign_type
    procedure :: sum_type
    generic :: operator(+) => sum_type
    final :: destroy
  end type class_t
contains
  subroutine assign_type(lhs, rhs)
    class(class_t), intent(inout) :: lhs
    type(class_t), intent(in) :: rhs
    lhs % a = rhs % a
  end subroutine assign_type
  subroutine destroy(this)
    type(class_t), intent(inout) :: this
    if (allocated(this % a)) deallocate(this % a)
  end subroutine destroy
  function sum_type (lhs, rhs) result(res)
    class(class_t), intent(in) :: lhs
    type(class_t), intent(in) :: rhs
    type(class_t) :: res
    res % a = lhs % a + rhs % a
  end function sum_type
end module type_mod
The assign subroutine contains different modes of operation, just for the sake of benchmarking.
To test it against performing the same operations on a real array, I created the following module:
module subroutine_mod
  use type_mod, only: class_t
contains
  subroutine sum_real(a, b, c)
    real(8), dimension(:,:), intent(inout) :: a, b, c
    c = a + b
  end subroutine sum_real
  subroutine sum_type(a, b, c)
    type(class_t), intent(inout) :: a, b, c
    c = a + b
  end subroutine sum_type
end module subroutine_mod
Everything is executed in the program below, considering arrays of size (10000,10000) and repeating the operation 100 times:
program test
  use subroutine_mod
  integer :: i
  integer :: N = 100 ! Number of times to repeat the assign
  integer :: M = 10000 ! Size of the arrays
  real(8) :: tf, ts
  real(8), dimension(:,:), allocatable :: a, b, c
  type(class_t) :: a2, b2, c2
  allocate(a2%a(M,M), b2%a(M,M), c2%a(M,M))
  a2%a = 1.0d0
  b2%a = 2.0d0
  c2%a = 3.0d0
  allocate(a(M,M), b(M,M), c(M,M))
  a = 1.0d0
  b = 2.0d0
  c = 3.0d0
  ! Benchmark timing with cpu_time
  call cpu_time(ts)
  do i = 1, N
    call sum_type(a2, b2, c2)
  end do
  call cpu_time(tf)
  write(*,*) "Type : ", tf-ts
  call cpu_time(ts)
  do i = 1, N
    call sum_real(a, b, c)
  end do
  call cpu_time(tf)
  write(*,*) "Real : ", tf-ts
end program test
To my surprise, the operation with my derived data type consistently underperformed the operation with the Fortran arrays, by a factor of 2 with gfortran and a factor of 10 with ifort. For instance, using the CHECK_SIZE mode, which saves allocation time, I got the following timings compiling with the -O2 flag:
gfortran
ifort
Question
Is this normal behaviour? If so, are there any recommendations to achieve better performance?
Context
To provide some context, the type with a single array will be very useful for a code refactoring task, where we need to keep similar interfaces to a previous type.
Compiler versions
gfortran 9.4.0
ifort 2021.6.0 20220226

You are worried about allocation time, but you do a lot of allocations of arrays of shape [M,M] for the derived type, and almost none for the intrinsic type.
The only allocations for the intrinsic type are in the main program, for a, b and c. These are outside the timing loop.
For the derived type, you allocate a2%a, b2%a and c2%a (again outside the timing loop), but also res%a in the type-bound function sum_type, N times inside the timing loop.
Equally, inside the sum_real subroutine the assignment statement c=a+b involves no allocatable object, but inside sum_type the c in c=a+b is an allocatable array: the compiler checks whether c is allocated and, if so, whether its shape matches the right-hand side expression.
In summary: you are not comparing like with like. There's a lot of overhead in wrapping an intrinsic array as an allocatable component of a derived type.
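One way to remove the per-call allocation of the function result is to write the sum straight into an existing object. The following is only a sketch under assumptions: array_t stands in for class_t, and add_into is a hypothetical name that does not appear in the question. It should bring the derived-type case closer to sum_real, but that would need to be measured with your compilers.

```fortran
module add_into_mod
  implicit none
  private
  public :: array_t, add_into

  integer, parameter :: dp = kind(1.0d0)

  ! Hypothetical stand-in for class_t: one allocatable array component.
  type :: array_t
    real(dp), allocatable :: a(:,:)
  end type array_t

contains

  ! Store a + b in c without going through a function result.
  subroutine add_into(c, a, b)
    type(array_t), intent(inout) :: c
    type(array_t), intent(in)    :: a, b

    if (.not. allocated(c%a)) then
      ! First use only: give c%a the shape and bounds of a%a.
      allocate(c%a, mold=a%a)
    end if
    ! Assigning to the section c%a(:,:) rules out reallocation of c%a.
    ! The caller must guarantee that all three arrays have the same shape.
    c%a(:,:) = a%a + b%a
  end subroutine add_into

end module add_into_mod
```

In the timing loop, call add_into(c2, a2, b2) would then take the place of c2 = a2 + b2, at the cost of giving up the operator syntax.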
Tangential to your timing concerns is the "cleverness" of the subroutine assign. It's horrible.
Calling an argument lhs when it's associated with the right-hand side of the assignment statement is a little confusing, but the select case construct is confusing beyond a little.
In
  case (ASSUMED_SIZE)
    this % a = lhs % a
the assignment, under rules where the rest of the program makes any sense, invokes a couple of checks:
- Is this%a allocated? If not, allocate it to the shape of lhs%a.
- If it is allocated, does its shape match that of lhs%a? If not, deallocate it then allocate it to the shape of lhs%a.
In other words, these are the same checks and actions that are done manually in the CHECK_SIZE case.
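The checks described above can be written out by hand to make the hidden work visible. This is a minimal sketch, assuming a plain allocatable array rather than a component; assign_with_checks is a hypothetical name, and a compiler is free to implement intrinsic assignment differently internally.

```fortran
module realloc_sketch_mod
  implicit none
  private
  public :: assign_with_checks

  integer, parameter :: dp = kind(1.0d0)

contains

  ! Hand-written approximation of "dst = src" for an allocatable dst.
  subroutine assign_with_checks(dst, src)
    real(dp), allocatable, intent(inout) :: dst(:,:)
    real(dp), intent(in)                 :: src(:,:)

    if (.not. allocated(dst)) then
      ! Check 1: dst is not allocated, so allocate it to the shape of src.
      allocate(dst(size(src,1), size(src,2)))
    else if (any(shape(dst) /= shape(src))) then
      ! Check 2: dst is allocated but has the wrong shape.
      deallocate(dst)
      allocate(dst(size(src,1), size(src,2)))
    end if

    ! Only now copy the data; the section form cannot reallocate dst.
    dst(:,:) = src
  end subroutine assign_with_checks

end module realloc_sketch_mod
```

When the shapes already match, as in the benchmark loop, only the two tests are executed before the copy.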
The final subroutine does nothing of value, so the entire assign subroutine's execution can be replaced by this%a = lhs%a.
(Things would be different if the final subroutine had substantive effect, or if the compiler had been asked to ignore the rules of intrinsic assignment using -fno-realloc-lhs (gfortran) or -nostandard-realloc-lhs (ifort), for example, or if this%a(:,:)=lhs%a had been used.)
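To illustrate the difference between the two forms of assignment mentioned here, the sketch below puts them side by side. The names copy_realloc and copy_in_place are hypothetical, and the behaviour described assumes the compiler's default Fortran 2003 reallocation semantics are in effect.

```fortran
module assign_forms_mod
  implicit none
  private
  public :: copy_realloc, copy_in_place

  integer, parameter :: dp = kind(1.0d0)

contains

  ! Whole-array form: dst is allocated, or reallocated, to the shape
  ! of src whenever it is unallocated or has a different shape.
  subroutine copy_realloc(dst, src)
    real(dp), allocatable, intent(inout) :: dst(:,:)
    real(dp), intent(in)                 :: src(:,:)
    dst = src
  end subroutine copy_realloc

  ! Section form: dst keeps its existing allocation. It must already
  ! be allocated with the same shape as src; nothing is checked here.
  subroutine copy_in_place(dst, src)
    real(dp), allocatable, intent(inout) :: dst(:,:)
    real(dp), intent(in)                 :: src(:,:)
    dst(:,:) = src
  end subroutine copy_in_place

end module assign_forms_mod
```

The section form trades safety for speed: if the shapes do not conform, the program is invalid and the compiler is not required to diagnose it.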