I am doing a comparison between the Fortran and C programming languages for matrix operations.
This time I have written two files (matmul.c and matmul.f90) that both do the same thing, i.e. multiply two square matrices A and B of size N by N.
In both files I have implemented loop blocking and paid attention to efficient memory access for better performance. However, Fortran is significantly faster at this multiplication. When N is 1024, Fortran is about three times faster: C takes about 0.57 seconds to complete the matrix multiplication, while Fortran takes about 0.18 seconds.
Can anyone take a look at my code and tell me what can be done to reduce this difference?
I am running Windows 10 with an Intel i7-6700 CPU and 16 GB of RAM. For compilation I am using MinGW 14.2.
In order to compile the matmul.f90 file I have used this command in cmd:
gfortran -O3 -march=native -funroll-all-loops matmul.f90
In order to compile the matmul.c file I have used this command in cmd:
gcc -O3 -march=native -funroll-all-loops matmul.c
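Both commands write the executable to a.exe by default, so the second build overwrites the first; when keeping both programs around for benchmarking it helps to name the outputs explicitly (the names below are just examples):
gfortran -O3 -march=native -funroll-all-loops matmul.f90 -o matmul_f.exe
gcc -O3 -march=native -funroll-all-loops matmul.c -o matmul_c.exe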
Here is the Fortran code:
program matmul_fortran
    implicit none
    integer, parameter :: N = 1024        ! matrix size
    integer, parameter :: BLOCK_SIZE = 32 ! block size
    real(8), dimension(N, N) :: A, B, C
    integer :: i, j, k, i_block, j_block, k_block
    real(8) :: start, finish, temp
    ! initialize matrices A and B with random values
    call random_seed()
    call random_number(A)
    call random_number(B)
    C = 0.0 ! set matrix C to zero values
    call cpu_time(start)
    ! multiplication
    do i_block = 1, N, BLOCK_SIZE
        do j_block = 1, N, BLOCK_SIZE
            do k_block = 1, N, BLOCK_SIZE
                do i = i_block, min(i_block + BLOCK_SIZE - 1, N)
                    do j = j_block, min(j_block + BLOCK_SIZE - 1, N)
                        do k = k_block, min(k_block + BLOCK_SIZE - 1, N)
                            C(k, i) = C(k, i) + A(k, j)*B(j, i)
                        end do
                    end do
                end do
            end do
        end do
    end do
    call cpu_time(finish)
    print *, "Fortran Matrix Multiplication Time: ", finish - start, " seconds"
end program matmul_fortran
Here is the C code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 1024 // matrix size
#define BLOCK_SIZE 32 // block size
// function to initialize matrices with random values
void initialize_matrix(double *matrix) {
    for (int i = 0; i < N * N; i++) {
        matrix[i] = (double)rand() / RAND_MAX; // random values between 0 and 1
    }
}

int main() {
    double *A, *B, *C;
    clock_t start, end;
    A = (double *)malloc(N * N * sizeof(double));
    B = (double *)malloc(N * N * sizeof(double));
    C = (double *)malloc(N * N * sizeof(double));
    // set matrix C to zero values
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i * N + j] = 0.0;
    // initialize matrices
    srand(time(NULL));
    initialize_matrix(A);
    initialize_matrix(B);
    start = clock();
    // multiplication
    for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
        for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
            for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
                for (int i = i_block; i < i_block + BLOCK_SIZE && i < N; i++) {
                    for (int j = j_block; j < j_block + BLOCK_SIZE && j < N; j++) {
                        for (int k = k_block; k < k_block + BLOCK_SIZE && k < N; k++) {
                            C[i*N + k] += A[j*N + k]*B[i*N + j];
                        }
                    }
                }
            }
        }
    }
    end = clock();
    printf("C Matrix Multiplication Time: %.6f seconds\n", ((double)(end - start)) / CLOCKS_PER_SEC);
    free(A);
    free(B);
    free(C);
    return 0;
}
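For reference, a simple way to guard against the compiler optimizing away the timed loop is to print a checksum of C after the timing in both versions (the Fortran version could simply print SUM(C)). A minimal sketch for the C version, placed after end = clock():
// Sketch: print a checksum so the multiplication result is actually used
// and can be sanity-checked between runs.
double checksum = 0.0;
for (int i = 0; i < N * N; i++) {
    checksum += C[i];
}
printf("Checksum of C: %f\n", checksum);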
Thanks in advance!
@Adrian McCarthy had the right observation. There is a conditional that must be evaluated to terminate the blocked loops in C, and it is a compound expression involving an addition and the logical && operator. Fortran, on the other hand, just calls the min function once to determine the upper bound. While some compilers might be able to optimize this, others may not be able to.
When changing the loop nest to
for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
    for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
        for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
            int maxi = i_block + BLOCK_SIZE < N ? i_block + BLOCK_SIZE : N;
            for (int i = i_block; i < maxi; i++) {
                int maxj = j_block + BLOCK_SIZE < N ? j_block + BLOCK_SIZE : N;
                for (int j = j_block; j < maxj; j++) {
                    int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
                    for (int k = k_block; k < maxk; k++) {
                        C[i*N + k] += A[j*N + k]*B[i*N + j];
                    }
                }
            }
        }
    }
}
the performance for the GCC compiler improves dramatically, in my case from 0.68 s to 0.28 s. Fortran still remains slightly faster at 0.25 s, but the speeds are much closer now. Intel is still better on my machine at 0.18-0.19 s with -O3 -xHost.
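As a further, untested variant of the same idea: maxi, maxj and maxk depend only on i_block, j_block and k_block respectively, so they can be hoisted right after the block loops that define them. The compiler very likely does this hoisting itself, so it may not change the timing, but it keeps the bound computation out of the inner loop nest entirely:
for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
    int maxi = i_block + BLOCK_SIZE < N ? i_block + BLOCK_SIZE : N;
    for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
        int maxj = j_block + BLOCK_SIZE < N ? j_block + BLOCK_SIZE : N;
        for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
            int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
            for (int i = i_block; i < maxi; i++) {
                for (int j = j_block; j < maxj; j++) {
                    for (int k = k_block; k < maxk; k++) {
                        C[i*N + k] += A[j*N + k]*B[i*N + j];
                    }
                }
            }
        }
    }
}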
The new version enables GCC to do more vectorization. Although I did not examine the assembly, only the intermediate GIMPLE, I can see vector operations such as
<bb 7> [local count: 10406813]:
# ivtmp.131_127 = PHI <ivtmp.178_547(6), ivtmp.131_217(9)>
# ivtmp.139_438 = PHI <_22(6), ivtmp.139_172(9)>
_20 = MEM[(double *)_365 + ivtmp.131_127 * 8];
_390 = ivtmp.168_88 + _367;
_391 = _390 * 8;
vectp.57_387 = C_44(D) + _391;
vectp.60_395 = (double *) ivtmp.139_438;
vect_cst__404 = {_20, _20, _20, _20};
_352 = ivtmp.139_438 + 8192;
vectp.67_416 = (double *) _352;
D__lsm0.98_9 = MEM[(double *)_365 + 8B + ivtmp.131_127 * 8];
<bb 8> [local count: 56196792]:
# ivtmp.122_502 = PHI <0(7), ivtmp.122_512(8)>
vect__9.58_394 = MEM <vector(4) double> [(double *)vectp.57_387 + ivtmp.122_502 * 1];
vect__15.61_403 = MEM <vector(4) double> [(double *)vectp.60_395 + ivtmp.122_502 * 1];
vect__21.62_405 = vect__15.61_403 * vect_cst__404;
vect__22.63_406 = vect__9.58_394 + vect__21.62_405;
vect_cst__415 = {D__lsm0.98_9, D__lsm0.98_9, D__lsm0.98_9, D__lsm0.98_9};
vect__122.68_425 = MEM <vector(4) double> [(double *)vectp.67_416 + ivtmp.122_502 * 1];
vect__123.69_426 = vect_cst__415 * vect__122.68_425;
vect__124.70_427 = vect__22.63_406 + vect__123.69_426;
MEM <vector(4) double> [(double *)vectp.57_387 + ivtmp.122_502 * 1] = vect__124.70_427;
ivtmp.122_512 = ivtmp.122_502 + 32;
if (ivtmp.122_512 != 256)
goto <bb 8>; [83.33%]
else
goto <bb 9>; [16.67%]
which were not there before.
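For anyone who wants to look at the same thing: GIMPLE dumps like these can be produced with GCC's tree dump options, and -fopt-info-vec prints a short report of which loops were vectorized, e.g.
gcc -O3 -march=native -funroll-all-loops -fdump-tree-optimized -fopt-info-vec matmul.c
-fdump-tree-optimized writes the final GIMPLE to a file named like matmul.c.*.optimized (the exact number in the name varies by GCC version).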
From my admittedly uneducated examination of this GIMPLE and of the GIMPLE of the Fortran version, it seems to me that gfortran does more unrolling than gcc does for the C code, which might explain the remaining difference (a sketch of one way to experiment with unrolling on the C side follows the dump):
<bb 7> [local count: 267178624]:
# ivtmp.52_174 = PHI <ivtmp.79_246(6), ivtmp.52_196(7)>
# ivtmp.54_142 = PHI <_214(6), ivtmp.54_24(7)>
_126 = (void *) ivtmp.52_174;
_14 = MEM[(real(kind=8) *)_126 + -8200B];
_135 = MEM[(real(kind=8) *)_126 + -8192B];
vect_cst__93 = {_14, _14, _14, _14};
vect_cst__70 = {_135, _135, _135, _135};
_216 = (void *) ivtmp.67_176;
vect__6.26_146 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216];
_75 = (void *) ivtmp.54_142;
vect__11.29_8 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8200B];
vect__15.30_138 = vect__11.29_8 * vect_cst__93;
vect__16.31_43 = vect__15.30_138 + vect__6.26_146;
vect__126.36_3 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8B];
vect__125.37_4 = vect__126.36_3 * vect_cst__70;
vect__124.38_5 = vect__125.37_4 + vect__16.31_43;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216] = vect__124.38_5;
vect__6.26_114 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 32B];
vect__11.29_108 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8168B];
vect__15.30_49 = vect_cst__93 * vect__11.29_108;
vect__16.31_51 = vect__15.30_49 + vect__6.26_114;
vect__126.36_89 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 24B];
vect__125.37_88 = vect_cst__70 * vect__126.36_89;
vect__124.38_87 = vect__16.31_51 + vect__125.37_88;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 32B] = vect__124.38_87;
vect__6.26_59 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 64B];
vect__11.29_67 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8136B];
vect__15.30_123 = vect__11.29_67 * vect_cst__93;
vect__16.31_42 = vect__6.26_59 + vect__15.30_123;
vect__126.36_31 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 56B];
vect__125.37_30 = vect__126.36_31 * vect_cst__70;
vect__124.38_29 = vect__125.37_30 + vect__16.31_42;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 64B] = vect__124.38_29;
vect__6.26_159 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 96B];
vect__11.29_160 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8104B];
vect__15.30_161 = vect_cst__93 * vect__11.29_160;
vect__16.31_162 = vect__6.26_159 + vect__15.30_161;
vect__126.36_164 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 88B];
vect__125.37_165 = vect_cst__70 * vect__126.36_164;
vect__124.38_166 = vect__16.31_162 + vect__125.37_165;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 96B] = vect__124.38_166;
vect__6.26_181 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 128B];
vect__11.29_182 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8072B];
vect__15.30_183 = vect_cst__93 * vect__11.29_182;
vect__16.31_184 = vect__6.26_181 + vect__15.30_183;
vect__126.36_186 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 120B];
vect__125.37_187 = vect_cst__70 * vect__126.36_186;
vect__124.38_188 = vect__16.31_184 + vect__125.37_187;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 128B] = vect__124.38_188;
vect__6.26_203 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 160B];
vect__11.29_204 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8040B];
vect__15.30_205 = vect_cst__93 * vect__11.29_204;
vect__16.31_206 = vect__6.26_203 + vect__15.30_205;
vect__126.36_208 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 152B];
vect__125.37_209 = vect_cst__70 * vect__126.36_208;
vect__124.38_210 = vect__16.31_206 + vect__125.37_209;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 160B] = vect__124.38_210;
vect__6.26_225 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 192B];
vect__11.29_226 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8008B];
vect__15.30_227 = vect_cst__93 * vect__11.29_226;
vect__16.31_228 = vect__6.26_225 + vect__15.30_227;
vect__126.36_230 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 184B];
vect__125.37_231 = vect_cst__70 * vect__126.36_230;
vect__124.38_232 = vect__16.31_228 + vect__125.37_231;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 192B] = vect__124.38_232;
vect__6.26_22 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 224B];
vect__11.29_94 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -7976B];
vect__15.30_92 = vect_cst__93 * vect__11.29_94;
vect__16.31_91 = vect__6.26_22 + vect__15.30_92;
vect__126.36_71 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 216B];
vect__125.37_69 = vect_cst__70 * vect__126.36_71;
vect__124.38_68 = vect__125.37_69 + vect__16.31_91;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 224B] = vect__124.38_68;
ivtmp.52_196 = ivtmp.52_174 + 16;
ivtmp.54_24 = ivtmp.54_142 + 16384;
if (ivtmp.54_24 == ivtmp.69_195)
goto <bb 8>; [6.25%]
else
goto <bb 7>; [93.75%]
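If the remaining gap really is down to unrolling, one thing to experiment with on the C side (an untested sketch, not something I benchmarked) is asking GCC for explicit unrolling of the innermost loop with #pragma GCC unroll, available in GCC 8 and later; the factor 8 below is arbitrary. This is a drop-in replacement for the two innermost loops of the blocked nest above:
// Untested sketch: hoist the B element (invariant in k) and request explicit
// unrolling of the innermost loop (requires GCC 8+; the factor 8 is arbitrary).
for (int j = j_block; j < maxj; j++) {
    const double b_ij = B[i*N + j];
    #pragma GCC unroll 8
    for (int k = k_block; k < maxk; k++) {
        C[i*N + k] += A[j*N + k] * b_ij;
    }
}
Whether this helps will depend on the GCC version and the chosen factor, so it is worth measuring rather than assuming.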