python numpy machine-learning blas

Numpy vectorization and algorithmic complexity


I have read many times about vectorized code in numpy. I know for a fact that a Python for loop can be ~100x slower than an equivalent numpy operation. However, I thought that the power of numpy vectorization went beyond a mere translation of a for loop from Python to C/Fortran code.

I have read here and there about SIMD (Single Instruction, Multiple Data), BLAS (Basic Linear Algebra Subprograms) and other low-level stuff, without a clear understanding of what is going on under the hood. What I thought was that, thanks to parallelization at the CPU level, those libraries were able to perform the operations so that they scale in a sublinear fashion.

Maybe an example will help clarify this. Let's say we wish to compute the product of two matrices, and I want to check how increasing the number of rows of the first matrix affects the elapsed time (this operation has a lot to do with machine learning, if we think of the number of rows as the batch size, the left matrix as the data and the right matrix as the parameters of the model). Well, my naïve understanding was that, in this case, the total elapsed time would scale (up to some limit) sublinearly, so that, in principle, and as long as everything fits into RAM, I expected that increasing the batch size was always a good idea.

I've made some benchmarks and the situation is not what I expected. It looks like the growth is linear, and given that the number of operations is a linear function of the number of rows, it looks like the C/Fortran code that runs under the hood is merely doing for loops.

This is the code:

import numpy as np
import matplotlib.pyplot as plt
from time import time

k = 128
N = 100
time_info = {}
for size in list(range(100, 5000, 200)):
    A, B = np.random.random((size, k)), np.random.random((k, k))
    t = time()
    for i in range(N):
        np.dot(A, B)
    delta = time() - t
    time_info[size] = delta / N

x = np.array(list(time_info.keys()))
y = np.array(list(time_info.values())) * 1000
 
# Plotting the Graph
plt.plot(x, y)
plt.title("Elapsed time vs number of rows")
plt.xlabel("Number of rows of left matrix")
plt.ylabel("Time in miliseconds")
plt.show()

[Benchmark plot: elapsed time (ms) vs number of rows of the left matrix]

It looks like the trend is quite linear. By the way, I've checked np.show_config() and it shows that I have OpenBLAS.

So my questions are:

  • What is exactly the meaning of vectorization?
  • Is the benchmark reasonable, and is it what you would have expected?
  • If so, is there any noticeable optimization thanks to low level stuff like SIMD that should have an impact with vectorized operations? Or, maybe, will it only have an impact when you go from very small vectors/matrices to medium size vectors/matrices?

This last option could make sense only if the CPU is not fully occupied when the size is very small. For example, and correct me if I am saying something stupid, if we had an architecture capable of performing 8 mathematical operations in parallel, then you would expect that multiplying a (1,) vector by a (1,) vector will be as fast as multiplying an (8,) vector by a (1,) vector. So, for very small sizes, the gain will be huge (an 8x gain). But if you have vectors of thousands of elements, then this effect will be negligible and the time will scale linearly, because you will always have the CPU fully occupied. Does this naïve explanation make sense?


Solution

  • What is exactly the meaning of vectorization?

    The meaning changes depending on the context. In Python, and especially in Numpy code, vectorization is the use of native (C) code to speed up computations and avoid the overhead of Python, mainly the CPython interpreter. More specifically, the Numpy documentation defines it as follows:

    NumPy hands off array processing to C, where looping and computation are much faster than in Python. To exploit this, programmers using NumPy eliminate Python loops in favor of array-to-array operations. Vectorization can refer both to the C offloading and to structuring NumPy code to leverage it.

    This does not mean that a vectorized function is particularly well optimized, though in practice this is often the case for most Numpy functions.
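
    As a quick illustration of why this matters (a minimal sketch; the exact timings are machine-dependent), you can compare a pure-Python loop with the equivalent array-to-array operation:

    import numpy as np
    from time import time

    a = np.random.random(1_000_000)
    b = np.random.random(1_000_000)

    # Pure-Python loop: one interpreter iteration per element
    t = time()
    res_loop = [a[i] * b[i] for i in range(len(a))]
    print("Python loop:", time() - t)

    # Vectorized: a single call, the loop runs in compiled C code
    t = time()
    res_np = a * b
    print("Numpy      :", time() - t)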


  • Is the benchmark reasonable, and is it what you would have expected?

    Overall, yes, this is rather expected.

    A matrix multiplication is compute-bound, so the size of the input does not impact performance much: BLAS libraries use tiling so that the computation makes efficient use of the multi-level CPU caches.

    SIMD instructions are used for all the provided input sizes. They would only be left unused for very small matrices, but you cannot measure that using Numpy since the overhead of calling a BLAS function from CPython through Numpy is far bigger than the computational time of an 8x8 matrix multiplication.
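
    For instance, timing a tiny product mostly measures that call overhead rather than the arithmetic (no numbers given here on purpose, since they depend entirely on your machine):

    import numpy as np

    a = np.random.random((8, 8))
    b = np.random.random((8, 8))

    # The time reported here is dominated by the CPython/Numpy/BLAS call
    # overhead, not by the 8*8*8 multiply-adds themselves.
    %timeit a @ b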

    OpenBLAS uses multiple threads for matrices bigger than a predefined threshold. This threshold is relatively small, so a 128x128 matrix multiplication should already use multiple threads (at least, this is the case on my machine with OpenBLAS on Windows). The thing is that OpenBLAS does not perfectly parallelize the computation for small matrices, nor for very long-shaped matrices like the ones in your benchmark. This causes some cores to sit idle for a small fraction of the time and the computation to last longer than expected. That being said, this overhead is a roughly constant factor, so it is not much impacted by the input size in your specific benchmark either. You should see a small performance increase with bigger square matrices (relative to the input size). Alternatively, you can try other BLAS implementations like BLIS or the MKL.
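
    If you want to check which BLAS library is actually loaded and how many threads it uses, or force a single-threaded run for comparison, the threadpoolctl package can do this without rebuilding anything (a side suggestion of mine, not something used in your benchmark):

    import numpy as np
    from threadpoolctl import threadpool_info, threadpool_limits

    A = np.random.random((4900, 128))
    B = np.random.random((128, 128))

    # Report the loaded BLAS (e.g. OpenBLAS) and the number of threads it may use
    print(threadpool_info())

    # Run the same product with the BLAS layer restricted to a single thread
    with threadpool_limits(limits=1, user_api="blas"):
        A @ B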

    Thus, in the end, you should see a mostly linear curve. It might not be the case on some machines with another BLAS implementation or because of some low-level overheads (especially if the number of threads grows with the input size).
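
    One way to check this on your machine is to convert the measured times into a throughput: a dense (size, k) x (k, k) product costs about 2*size*k*k floating-point operations, so a compute-bound kernel should show a roughly constant GFLOP/s figure while the time grows linearly. Here is a sketch based on the benchmark from the question:

    import numpy as np
    from time import time

    k, N = 128, 100
    for size in range(100, 5000, 200):
        A = np.random.random((size, k))
        B = np.random.random((k, k))
        t = time()
        for _ in range(N):
            A @ B
        per_call = (time() - t) / N
        # 2*size*k*k flops per product; a compute-bound kernel keeps this roughly flat
        gflops = 2 * size * k * k / per_call / 1e9
        print(f"{size:5d} rows: {per_call * 1e3:7.3f} ms  ~{gflops:6.1f} GFLOP/s")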


  • If so, is there any noticeable optimization thanks to low level stuff like SIMD that should have an impact with vectorized operations? Or, maybe, will it only have an impact when you go from very small vectors/matrices to medium size vectors/matrices?

    See above. SIMD optimizations are not visible here because the computation operates on relatively large matrices and the benchmark measures the relative time for different matrix sizes. SIMD optimizations do significantly improve the speed of most Numpy functions. Most of these optimizations are performed by the compiler (i.e. not manually), though we need to help compilers do so. Some specific functions are manually optimized (this is ongoing work).

    If you want to see the impact of SIMD optimizations (as well as Numpy overhead and unrolling), then you can measure the time taken by an integer sum with different matrix shapes:

    import numpy as np

    dtype = np.uint64
    
    a = np.full((2, 16*1024*1024), 1, dtype=dtype)
    %timeit -n 10 np.sum(a, axis=0, dtype=dtype)       # 55.7 ms
    
    a = np.full((4, 8*1024*1024), 1, dtype=dtype)
    %timeit -n 10 np.sum(a, axis=0, dtype=dtype)       # 41.2 ms
    
    a = np.full((8, 4*1024*1024), 1, dtype=dtype)
    %timeit -n 10 np.sum(a, axis=0, dtype=dtype)       # 33.7 ms
    
    a = np.full((16, 2*1024*1024), 1, dtype=dtype)
    %timeit -n 10 np.sum(a, axis=0, dtype=dtype)       # 29.2 ms
    
    a = np.full((32, 1*1024*1024), 1, dtype=dtype)
    %timeit -n 10 np.sum(a, axis=0, dtype=dtype)       # 26.2 ms
    
    a = np.full((64, 512*1024), 1, dtype=dtype)
    %timeit -n 10 np.sum(a, axis=0, dtype=dtype)       # 19.8 ms
    
    a = np.full((128, 256*1024), 1, dtype=dtype)
    %timeit -n 10 np.sum(a, axis=0, dtype=dtype)       # 19.9 ms  <-- no speed-up
    

    While the total input size is the same, the performance is quite different. The impact of SIMD instructions is only responsible for a fraction of the overheads. The thing is that you cannot see the impact of SIMD optimizations independently of other optimizations, like unrolling, from Numpy code like this. Thus, this benchmark is not very rigorous. Note that this assumes the target function is vectorized, which should only be the case with recent versions of Numpy.

    For a better benchmark, you would need to rebuild Numpy without SIMD instructions (in fact, compilers will still generate SIMD instructions, but scalar ones, on mainstream x86/x86-64 processors).
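
    Before going that far, it is worth checking which SIMD extensions your Numpy build enabled and which ones the running CPU actually supports. A minimal check (np.show_runtime is assumed to be available, i.e. a recent Numpy release):

    import numpy as np

    # Build-time configuration: BLAS/LAPACK libraries and, on recent builds,
    # the SIMD extensions the binary was compiled with
    np.show_config()

    # Runtime report: SIMD extensions detected on the current CPU and the
    # thread pools in use (requires a recent Numpy version)
    np.show_runtime()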


    Note that most Numpy functions do not use multiple threads. In fact, the matrix multiplication as well as the other BLAS functions are, AFAIK, the only ones to be multi-threaded (LAPACK functions can also be multi-threaded thanks to internal BLAS function calls).