I ran the Intel MKL LINPACK benchmark on an Intel Core i7-14700K processor and got a peak performance of 557 GFLOPS, which seems quite unrealistic.
Size    LDA     Align.  Average (GFLOPS)  Maximal (GFLOPS)
1000    1000    4       155.1099          216.8890
2000    2000    4       425.5128          459.9769
5000    5008    4       379.0532          393.7132
10000   10000   4       427.9537          435.6706
15000   15000   4       426.8314          427.5827
18000   18008   4       545.7857          549.8816
20000   20016   4       553.3485          553.5723
22000   22008   4       548.1379          552.2941
25000   25000   4       549.4231          555.0353
26000   26000   4       550.3011          554.8746
27000   27000   4       542.6011          542.6011
30000   30000   1       532.8780          532.8780
35000   35000   1       534.7904          534.7904
40000   40000   1       557.7524          557.7524
45000   45000   1       557.3916          557.3916
The 155 GFLOPS value for size 1000 seems plausible, but 557 GFLOPS seems too high. Does anybody have an idea how this could happen?
I used the following suite:
http://registrationcenter-download.intel.com/akdlm/irc_nas/9752/l_mklb_p_2018.3.011.tgz
The test was started using the following command:
./runme_xeon64
I can verify these results for the 14700K. Using NumPy backed by the Intel oneAPI Math Kernel Library, I was able to achieve between 550 and 650 GFLOPS from Python, even though Python adds significant overhead. To be clear, this was running on all cores, and the Intel BLAS libraries are very well optimized.
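import numpy as np
from time import perf_counter_ns

ITERATIONS = 20

def benchCPU(A, B, C):
    # Each iteration does two n x n matrix multiplications, then rescales C
    # so its values stay within float32 range.
    for i in range(ITERATIONS):
        print("Iteration: %d" % i)
        C = np.matmul(C, A)
        C = np.matmul(C, B)
        C = C / np.max(C)
    return C

if __name__ == '__main__':
    samples = 7000
    # Note: float32 is single precision; LINPACK itself runs in double precision.
    A = np.random.rand(samples, samples).astype(np.float32)
    B = np.random.rand(samples, samples).astype(np.float32)
    C = np.random.rand(samples, samples).astype(np.float32)
    # Measure the overhead of the timer calls themselves.
    t1 = perf_counter_ns()
    t2 = perf_counter_ns()
    tdly = t2 - t1
    # Warm-up multiplication so library start-up is not included in the timing.
    C = np.matmul(A, B)
    print("CPU Test")
    t1 = perf_counter_ns()
    benchCPU(A, B, C)
    t2 = perf_counter_ns()
    t_cpu = t2 - t1 - tdly
    # One n x n matrix multiplication takes 2n^3 - n^2 operations. Each of the
    # iterations does two of them; the max and division are considered negligible.
    operations = 2*ITERATIONS*(2*samples**3 - samples**2)
    print("CPU Throughput: " + "%.3f" % ((operations/(t_cpu*1e-9))*1e-12) + " TFLOPS")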
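As a rough plausibility check, a measured figure can be compared against a back-of-the-envelope theoretical peak (cores x clock x FLOPs per cycle). The sketch below is only an illustration: the clock speeds and the per-cycle FLOP counts are assumed placeholder values, not measurements or official specifications, and the helper names `matmul_flops` and `peak_gflops` are made up for this example. Substitute the sustained clocks you actually observe under load. Also keep in mind that the script above uses float32, whose peak is roughly twice the double-precision peak that LINPACK exercises.

```python
def matmul_flops(n: int) -> int:
    """Floating-point operations in one dense n x n matrix product."""
    # n^2 output elements, each needing n multiplications and n - 1 additions.
    return 2 * n**3 - n**2


def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS for one group of identical cores."""
    return cores * ghz * flops_per_cycle


# Core counts of the i7-14700K: 8 performance cores and 12 efficiency cores.
P_CORES, E_CORES = 8, 12

# ASSUMPTIONS (illustrative only, replace with your own values):
ASSUMED_P_GHZ = 5.0               # assumed sustained P-core clock under load
ASSUMED_E_GHZ = 4.0               # assumed sustained E-core clock under load
ASSUMED_P_DP_FLOPS_PER_CYCLE = 16 # assumed: two 256-bit FMA units per P-core
ASSUMED_E_DP_FLOPS_PER_CYCLE = 8  # assumed: half the P-core width per E-core

dp_peak = (peak_gflops(P_CORES, ASSUMED_P_GHZ, ASSUMED_P_DP_FLOPS_PER_CYCLE)
           + peak_gflops(E_CORES, ASSUMED_E_GHZ, ASSUMED_E_DP_FLOPS_PER_CYCLE))
sp_peak = 2 * dp_peak  # single precision packs twice as many values per vector

measured_dp = 557.0  # best LINPACK result reported in the question, in GFLOPS

print("Assumed double-precision peak: %.0f GFLOPS" % dp_peak)
print("Assumed single-precision peak: %.0f GFLOPS" % sp_peak)
print("LINPACK result as a fraction of assumed DP peak: %.0f%%"
      % (100 * measured_dp / dp_peak))
print("FLOPs in one 7000 x 7000 matmul: %d" % matmul_flops(7000))
```

If the assumed inputs are anywhere near reality, a LINPACK result in the mid-500 GFLOPS range would sit well below the theoretical ceiling rather than above it, which is consistent with the measurements reported in this thread.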