I have recently been learning some HPC topics and found out that modern C/C++ compilers are able to detect places where optimization is applicable and apply corresponding techniques such as SIMD and loop unrolling, especially under the -O3 flag, trading compile time and object-file size for runtime performance.
It then occurred to me that CPython interprets and executes on the fly, so I assume it cannot afford to perform those compiler optimizations, since compile time for it is effectively runtime. So I ran the toy experiment below:
import time, random

n = 512
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]
C = [[0] * n for _ in range(n)]

def matMul(A, B, C):
    """ C = A .* B """
    for i in range(0, n - 4, 4):
        for j in range(0, n - 4, 4):
            for k in range(0, n - 4, 4):
                C[i][j] = A[i][k] * B[k][j]
                C[i + 1][j + 1] = A[i + 1][k + 1] * B[k + 1][j + 1]
                C[i + 2][j + 2] = A[i + 2][k + 2] * B[k + 2][j + 2]
                C[i + 3][j + 3] = A[i + 3][k + 3] * B[k + 3][j + 3]
                C[i + 4][j + 4] = A[i + 4][k + 4] * B[k + 4][j + 4]
    # return C

start = time.time()
matMul(A, B, C)
end = time.time()
print(f"Elapsed {end - start}")
With the loop unrolling, the program finishes within 3 seconds; without it, it takes almost 20 seconds.
Does that mean one needs to pay attention and manually apply those optimization techniques when writing Python code? Or does Python offer such optimizations under some special setting?
Loop unrolling is useful because it can (1) reduce the overhead spent managing the loop and (2) at the assembly level, let the processor run faster by eliminating branch penalties, keeping the instruction pipeline full, etc.
(2) doesn't really apply to an interpreted language implementation like Python - it's already doing lots of branching and decision-making at the assembly level. You might gain something from (1), but my gut feeling is that that time is often dwarfed by interpreter overhead. The golden rule of performance optimization is to first measure and confirm that the bottleneck is where you think it is.
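For instance, here is a minimal measurement sketch using the standard timeit module (the names plain_sum and unrolled_sum are made up for this example); measure on your own machine rather than trusting intuition:

import timeit

data = list(range(1_000_000))

def plain_sum(xs):
    # Straightforward loop: one iteration per element.
    total = 0
    for x in xs:
        total += x
    return total

def unrolled_sum(xs):
    # Manually unrolled by 4: a quarter as many loop iterations,
    # but every bytecode operation still pays interpreter overhead.
    total = 0
    limit = len(xs) - len(xs) % 4
    for i in range(0, limit, 4):
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
    for x in xs[limit:]:  # leftover tail when len(xs) % 4 != 0
        total += x
    return total

print("plain:   ", timeit.timeit(lambda: plain_sum(data), number=10))
print("unrolled:", timeit.timeit(lambda: unrolled_sum(data), number=10))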
Incidentally, your code has a bug:
C[i][j] = A[i][k] * B[k][j]
C[i + 1][j + 1] = A[i + 1][k + 1] * B[k + 1][j + 1]
C[i + 2][j + 2] = A[i + 2][k + 2] * B[k + 2][j + 2]
C[i + 3][j + 3] = A[i + 3][k + 3] * B[k + 3][j + 3]
C[i + 4][j + 4] = A[i + 4][k + 4] * B[k + 4][j + 4]
It processes cells (0, 0), (1, 1), (2, 2), (3, 3), and (4, 4) (even though (4, 4) will also be processed on the next iteration), but not (0, 1), (0, 2), (1, 0)... (That's the other reason for the golden rule of performance optimization: it's easy to make mistakes by optimizing code that doesn't need it.)
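For reference, a corrected version that matches your docstring (C = A .* B, an elementwise product) has to touch every (i, j) cell exactly once. Here is a minimal sketch (the name mat_elementwise is made up), unrolled by 4 along j with a tail loop for sizes not divisible by 4:

def mat_elementwise(A, B, C, n):
    # Elementwise product, manually unrolled by 4 along j.
    limit = n - n % 4
    for i in range(n):
        for j in range(0, limit, 4):
            C[i][j] = A[i][j] * B[i][j]
            C[i][j + 1] = A[i][j + 1] * B[i][j + 1]
            C[i][j + 2] = A[i][j + 2] * B[i][j + 2]
            C[i][j + 3] = A[i][j + 3] * B[i][j + 3]
        for j in range(limit, n):  # tail when n % 4 != 0
            C[i][j] = A[i][j] * B[i][j]

(If you actually meant a true matrix multiplication, you also need to accumulate over k with +=, which the original never does.)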
As @Mat said, the standard approach for Python in particular is to use NumPy, which uses an optimized C implementation.
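For example (a sketch assuming NumPy is installed):

import numpy as np

n = 512
A = np.random.rand(n, n)
B = np.random.rand(n, n)

C_elementwise = A * B  # elementwise product, like the docstring's A .* B
C_matmul = A @ B       # true matrix multiplication, typically backed by optimized BLAS

Both run their loops in compiled, vectorized code, so neither needs manual unrolling on the Python side.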
All the above applies to CPython, the standard Python implementation. There are other Python implementations and compilers, such as Cython, that offer their own optimizations; I'm less familiar with those.