Tags: c, matrix, fortran, matrix-multiplication

Why is there a large performance difference between C and Fortran for matrix multiplication?


I am doing a comparison between the Fortran and C programming languages for matrix operations.

This time I have written two files (matmul.c and matmul.f90) that both do the same thing, i.e. multiply matrices A and B, which are both square matrices of size N by N.

In both files I have implemented loop blocking and took care to access memory efficiently for better performance. However, Fortran is significantly faster for the multiplication of A and B. When N is equal to 1024, Fortran is about three times faster: C takes about 0.57 seconds to complete the multiplication, while Fortran takes about 0.18 seconds.

Can anyone take a look at my code and tell me what can be done to reduce this difference?

I am running Windows 10 with an Intel i7-6700 CPU and 16 GB of RAM. For compilation I am using MinGW 14.2.

To compile the matmul.f90 file I used this command in cmd:

gfortran -O3 -march=native -funroll-all-loops matmul.f90

To compile the matmul.c file I used this command in cmd:

gcc -O3 -march=native -funroll-all-loops matmul.c

Here is the Fortran code:

program matmul_fortran
    implicit none
    integer, parameter :: N = 1024   ! matrix size 
    integer, parameter :: BLOCK_SIZE = 32  ! block size 
    real(8), dimension(N, N) :: A, B, C
    integer :: i, j, k, i_block, j_block, k_block
    real(8) :: start, finish, temp

    ! initialize matrices A and B with random values
    call random_seed()
    call random_number(A)
    call random_number(B)
    C = 0.0  ! set matrix C to zero values

    call cpu_time(start)

    ! multiplication
    do i_block = 1, N, BLOCK_SIZE
        do j_block = 1, N, BLOCK_SIZE
            do k_block = 1, N, BLOCK_SIZE
                do i = i_block, min(i_block + BLOCK_SIZE - 1, N)
                    do j = j_block, min(j_block + BLOCK_SIZE - 1, N)
                        do k = k_block, min(k_block + BLOCK_SIZE - 1, N)
                            C(k, i) = C(k, i) + A(k, j)*B(j, i)
                        end do
                    end do
                end do
            end do
        end do
    end do

    call cpu_time(finish)

    print *, "Fortran Matrix Multiplication Time: ", finish - start, " seconds"

end program matmul_fortran

Here is the C code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024  // matrix size
#define BLOCK_SIZE 32  // block size

// function to initialize matrices with random values
void initialize_matrix(double *matrix) {
    for (int i = 0; i < N * N; i++) {
        matrix[i] = (double)rand() / RAND_MAX;  // Random values between 0 and 1
    }
}

int main() {
    double *A, *B, *C;
    clock_t start, end;

    A = (double *)malloc(N * N * sizeof(double));
    B = (double *)malloc(N * N * sizeof(double));
    C = (double *)malloc(N * N * sizeof(double));

    // set matrix C to zero values
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i * N + j] = 0.0; 

    // Initialize matrices
    srand(time(NULL));
    initialize_matrix(A);
    initialize_matrix(B);

    start = clock();

    // multiplication
    for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
        for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
            for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
                for (int i = i_block; i < i_block + BLOCK_SIZE && i < N; i++) {
                    for (int j = j_block; j < j_block + BLOCK_SIZE && j < N; j++) {
                        for (int k = k_block; k < k_block + BLOCK_SIZE && k < N; k++) {
                            C[i*N + k] += A[j*N + k]*B[i*N + j];
                        }
                    }
                }
            }
        }
    }
    
    end = clock();

    printf("C Matrix Multiplication Time: %.6f seconds\n", ((double)(end - start)) / CLOCKS_PER_SEC);

    free(A);
    free(B);
    free(C);

    return 0;
}

Thanks in advance!


Solution

  • @Adrian McCarthy had the right observation. There is a conditional that must be evaluated to terminate the blocked loops in C, and it is a compound expression involving an addition and the logical && operator. Fortran, on the other hand, just calls the min function once to determine the upper bound. While some compilers might be able to optimize this, others may not be able to.

    When changing the loop nest to

        for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
            for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
                for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
                    int maxi = i_block + BLOCK_SIZE < N ? i_block + BLOCK_SIZE : N;
                    for (int i = i_block; i < maxi; i++) {
                        int maxj = j_block + BLOCK_SIZE < N ? j_block + BLOCK_SIZE : N;
                        for (int j = j_block; j < maxj; j++) {
                            int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
                            for (int k = k_block; k < maxk; k++) {
                                C[i*N + k] += A[j*N + k]*B[i*N + j];
                            }
                        }
                    }
                }
            }
        }


    the performance with the GCC compiler improves dramatically, in my case from 0.68 s to 0.28 s. Fortran still remains slightly faster at 0.25 s, but the speeds are much closer now. The Intel compiler is still better on my machine, at 0.18-0.19 s with -O3 -xHost.
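
    As a possible further cleanup (untested on my side), the three upper bounds depend only on the block indices, so they can be hoisted out of the element loops entirely; a minimal sketch of that variant:

        for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
            int maxi = i_block + BLOCK_SIZE < N ? i_block + BLOCK_SIZE : N;
            for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
                int maxj = j_block + BLOCK_SIZE < N ? j_block + BLOCK_SIZE : N;
                for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
                    int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
                    /* element loops now see loop-invariant bounds only */
                    for (int i = i_block; i < maxi; i++)
                        for (int j = j_block; j < maxj; j++)
                            for (int k = k_block; k < maxk; k++)
                                C[i*N + k] += A[j*N + k]*B[i*N + j];
                }
            }
        }

    (With N = 1024 and BLOCK_SIZE = 32 the clamping never triggers anyway, since N is an exact multiple of BLOCK_SIZE.)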

    The new loop bounds enable GCC to do more vectorization. Although I did not examine the assembly, only the intermediate GIMPLE, I can see vector operations such as

      <bb 7> [local count: 10406813]:
      # ivtmp.131_127 = PHI <ivtmp.178_547(6), ivtmp.131_217(9)>
      # ivtmp.139_438 = PHI <_22(6), ivtmp.139_172(9)>
      _20 = MEM[(double *)_365 + ivtmp.131_127 * 8];
      _390 = ivtmp.168_88 + _367;
      _391 = _390 * 8;
      vectp.57_387 = C_44(D) + _391;
      vectp.60_395 = (double *) ivtmp.139_438;
      vect_cst__404 = {_20, _20, _20, _20};
      _352 = ivtmp.139_438 + 8192;
      vectp.67_416 = (double *) _352;
      D__lsm0.98_9 = MEM[(double *)_365 + 8B + ivtmp.131_127 * 8];
    
      <bb 8> [local count: 56196792]:
      # ivtmp.122_502 = PHI <0(7), ivtmp.122_512(8)>
      vect__9.58_394 = MEM <vector(4) double> [(double *)vectp.57_387 + ivtmp.122_502 * 1];
      vect__15.61_403 = MEM <vector(4) double> [(double *)vectp.60_395 + ivtmp.122_502 * 1];
      vect__21.62_405 = vect__15.61_403 * vect_cst__404;
      vect__22.63_406 = vect__9.58_394 + vect__21.62_405;
      vect_cst__415 = {D__lsm0.98_9, D__lsm0.98_9, D__lsm0.98_9, D__lsm0.98_9};
      vect__122.68_425 = MEM <vector(4) double> [(double *)vectp.67_416 + ivtmp.122_502 * 1];
      vect__123.69_426 = vect_cst__415 * vect__122.68_425;
      vect__124.70_427 = vect__22.63_406 + vect__123.69_426;
      MEM <vector(4) double> [(double *)vectp.57_387 + ivtmp.122_502 * 1] = vect__124.70_427;
      ivtmp.122_512 = ivtmp.122_502 + 32;
      if (ivtmp.122_512 != 256)
        goto <bb 8>; [83.33%]
      else
        goto <bb 9>; [16.67%]
    

    which were not there before.
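
    (For reference, dumps like this can be produced with GCC's tree-dump options; a minimal invocation, where the pass number embedded in the dump file name varies between GCC versions, would be something like

        gcc -O3 -march=native -funroll-all-loops -fdump-tree-optimized -fopt-info-vec-optimized matmul.c

    which writes the final GIMPLE to a matmul.c.*.optimized dump file and reports which loops were vectorized.)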

    From my (admittedly uneducated) examination of this GIMPLE and the GIMPLE of the Fortran version, it seems to me that gfortran does more unrolling than gcc does for the C code, which might explain the remaining difference; one way to ask gcc for more unrolling is sketched after the dump:

     <bb 7> [local count: 267178624]:
      # ivtmp.52_174 = PHI <ivtmp.79_246(6), ivtmp.52_196(7)>
      # ivtmp.54_142 = PHI <_214(6), ivtmp.54_24(7)>
      _126 = (void *) ivtmp.52_174;
      _14 = MEM[(real(kind=8) *)_126 + -8200B];
      _135 = MEM[(real(kind=8) *)_126 + -8192B];
      vect_cst__93 = {_14, _14, _14, _14};
      vect_cst__70 = {_135, _135, _135, _135};
      _216 = (void *) ivtmp.67_176;
      vect__6.26_146 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216];
      _75 = (void *) ivtmp.54_142;
      vect__11.29_8 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8200B];
      vect__15.30_138 = vect__11.29_8 * vect_cst__93;
      vect__16.31_43 = vect__15.30_138 + vect__6.26_146;
      vect__126.36_3 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8B];
      vect__125.37_4 = vect__126.36_3 * vect_cst__70;
      vect__124.38_5 = vect__125.37_4 + vect__16.31_43;
      MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216] = vect__124.38_5;
      vect__6.26_114 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 32B];
      vect__11.29_108 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8168B];
      vect__15.30_49 = vect_cst__93 * vect__11.29_108;
      vect__16.31_51 = vect__15.30_49 + vect__6.26_114;
      vect__126.36_89 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 24B];
      vect__125.37_88 = vect_cst__70 * vect__126.36_89;
      vect__124.38_87 = vect__16.31_51 + vect__125.37_88;
      MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 32B] = vect__124.38_87;
      vect__6.26_59 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 64B];
      vect__11.29_67 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8136B];
      vect__15.30_123 = vect__11.29_67 * vect_cst__93;
      vect__16.31_42 = vect__6.26_59 + vect__15.30_123;
      vect__126.36_31 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 56B];
      vect__125.37_30 = vect__126.36_31 * vect_cst__70;
      vect__124.38_29 = vect__125.37_30 + vect__16.31_42;
      MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 64B] = vect__124.38_29;
      vect__6.26_159 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 96B];
      vect__11.29_160 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8104B];
      vect__15.30_161 = vect_cst__93 * vect__11.29_160;
      vect__16.31_162 = vect__6.26_159 + vect__15.30_161;
      vect__126.36_164 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 88B];
      vect__125.37_165 = vect_cst__70 * vect__126.36_164;
      vect__124.38_166 = vect__16.31_162 + vect__125.37_165;
      MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 96B] = vect__124.38_166;
      vect__6.26_181 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 128B];
      vect__11.29_182 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8072B];
      vect__15.30_183 = vect_cst__93 * vect__11.29_182;
      vect__16.31_184 = vect__6.26_181 + vect__15.30_183;
      vect__126.36_186 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 120B];
      vect__125.37_187 = vect_cst__70 * vect__126.36_186;
      vect__124.38_188 = vect__16.31_184 + vect__125.37_187;
      MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 128B] = vect__124.38_188;
      vect__6.26_203 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 160B];
      vect__11.29_204 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8040B];
      vect__15.30_205 = vect_cst__93 * vect__11.29_204;
      vect__16.31_206 = vect__6.26_203 + vect__15.30_205;
      vect__126.36_208 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 152B];
      vect__125.37_209 = vect_cst__70 * vect__126.36_208;
      vect__124.38_210 = vect__16.31_206 + vect__125.37_209;
      MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 160B] = vect__124.38_210;
      vect__6.26_225 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 192B];
      vect__11.29_226 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8008B];
      vect__15.30_227 = vect_cst__93 * vect__11.29_226;
      vect__16.31_228 = vect__6.26_225 + vect__15.30_227;
      vect__126.36_230 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 184B];
      vect__125.37_231 = vect_cst__70 * vect__126.36_230;
      vect__124.38_232 = vect__16.31_228 + vect__125.37_231;
      MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 192B] = vect__124.38_232;
      vect__6.26_22 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 224B];
      vect__11.29_94 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -7976B];
      vect__15.30_92 = vect_cst__93 * vect__11.29_94;
      vect__16.31_91 = vect__6.26_22 + vect__15.30_92;
      vect__126.36_71 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 216B];
      vect__125.37_69 = vect_cst__70 * vect__126.36_71;
      vect__124.38_68 = vect__125.37_69 + vect__16.31_91;
      MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 224B] = vect__124.38_68;
      ivtmp.52_196 = ivtmp.52_174 + 16;
      ivtmp.54_24 = ivtmp.54_142 + 16384;
      if (ivtmp.54_24 == ivtmp.69_195)
        goto <bb 8>; [6.25%]
      else
        goto <bb 7>; [93.75%]
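
    If unrolling really is the remaining factor, one thing that might be worth trying (I have not benchmarked it) is to request unrolling of the innermost loop explicitly via the #pragma GCC unroll directive, available in GCC 8 and later, and to hoist the B element, which is invariant in the k loop, into a local variable:

        for (int j = j_block; j < maxj; j++) {
            /* invariant in the k loop; hoisting is safe here because
               A, B and C are separate allocations and cannot alias */
            double b = B[i*N + j];
            #pragma GCC unroll 8
            for (int k = k_block; k < maxk; k++) {
                C[i*N + k] += A[j*N + k] * b;
            }
        }

    Whether this actually closes the gap will depend on the GCC version and the -march target, so treat it as something to experiment with rather than a guaranteed fix.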