The Intel MKL LINPACK test reports implausibly high performance

I ran the Intel MKL LINPACK test on an Intel Core i7-14700K processor and got a peak performance of 557 GFLOPS, which seems unrealistically high.

Size   LDA    Align.  Average  Maximal
1000   1000   4       155.1099 216.8890
2000   2000   4       425.5128 459.9769
5000   5008   4       379.0532 393.7132
10000  10000  4       427.9537 435.6706
15000  15000  4       426.8314 427.5827
18000  18008  4       545.7857 549.8816
20000  20016  4       553.3485 553.5723
22000  22008  4       548.1379 552.2941
25000  25000  4       549.4231 555.0353
26000  26000  4       550.3011 554.8746
27000  27000  4       542.6011 542.6011
30000  30000  1       532.8780 532.8780
35000  35000  1       534.7904 534.7904
40000  40000  1       557.7524 557.7524
45000  45000  1       557.3916 557.3916

The 155 GFLOPS value for size 1000 seems plausible, but 557 GFLOPS seems too high. Does anybody have an idea how this could happen?

I used the following suite:

http://registrationcenter-download.intel.com/akdlm/irc_nas/9752/l_mklb_p_2018.3.011.tgz

The test was started using the following command:

./runme_xeon64

Solution

I can confirm these results for the 14700K. Using NumPy with the Intel oneAPI Math Kernel Library, I was able to achieve between 550 and 650 GFLOPS from Python, which adds significant overhead. To be clear, this was running on all cores, and the Intel BLAS libraries are very well optimized.

import numpy as np
from time import time_ns

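def benchCPU(A, B, C):
    # Run 20 rounds of two matrix multiplications each. Dividing C by its
    # maximum after every round keeps the values in float32 range.
    for i in range(20):
        print("Iteration: %d" % i)
        C = np.matmul(C, A)
        C = np.matmul(C, B)
        C = C / np.max(C)
    return 0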

if __name__ == '__main__':
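    samples = 7000  # matrix dimension n
    # Single-precision (float32) inputs; note that LINPACK reports double-precision results.
    A = np.random.rand(samples, samples).astype(np.float32)
    B = np.random.rand(samples, samples).astype(np.float32)
    C = np.random.rand(samples, samples).astype(np.float32)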

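    # Measure the overhead of two back-to-back timer calls so it can be subtracted later.
    t1 = time_ns()
    t2 = time_ns()
    tdly = t2 - t1

    # Warm-up multiplication outside the timed region; its result replaces the random C.
    C = np.matmul(A, B)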
    print("CPU Test")
    t1 = time_ns()
    benchCPU(A, B, C)
    t2 = time_ns()

    t_cpu = t2 - t1 - tdly

    # One n x n matrix multiplication takes 2n^3 - n^2 floating-point operations.
    # Each of the 20 iterations performs 2 multiplications; the cost of the
    # max and division is considered negligible.
    operations = 2*20*(2*samples**3 - samples**2)

    print("CPU Throughput: " + "%.3f" % ((operations/(t_cpu*1e-9))*1e-12) + " TFLOPS")
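
A rough way to judge whether figures like these are plausible is to compare them with a theoretical peak, computed as cores x clock x floating-point operations per cycle. The following is only a minimal sketch of that arithmetic, not a statement about the 14700K: the helper names and every input value (8 cores, 5.0 GHz, 256-bit vectors, two FMA units per core) are assumptions chosen for illustration and should be replaced with the real figures for the CPU under test. On a hybrid CPU, the estimate would be computed per core type and summed.

```python
def flops_per_cycle(vector_bits, element_bits, fma_units):
    """FLOPs per cycle for one core.

    Each vector lane performs one fused multiply-add per FMA unit per cycle,
    and an FMA counts as two floating-point operations.
    """
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units


def peak_gflops(cores, clock_ghz, per_cycle):
    """Theoretical peak in GFLOPS for a group of identical cores."""
    return cores * clock_ghz * per_cycle


# ASSUMED example inputs (hypothetical, not measured on any specific CPU):
# 8 cores, 5.0 GHz sustained clock, 256-bit vectors, 2 FMA units per core.
CORES = 8
CLOCK_GHZ = 5.0

fp64_peak = peak_gflops(CORES, CLOCK_GHZ, flops_per_cycle(256, 64, 2))
fp32_peak = peak_gflops(CORES, CLOCK_GHZ, flops_per_cycle(256, 32, 2))

print("FP64 peak: %.0f GFLOPS" % fp64_peak)
print("FP32 peak: %.0f GFLOPS" % fp32_peak)
```

Under these assumed inputs the single-precision peak is twice the double-precision one, because twice as many 32-bit elements fit in a vector. This matters when comparing a float32 NumPy benchmark with a double-precision LINPACK run.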