arrays · numpy · performance · row-major-order · column-major-order

Performance difference between C-Contiguous and Fortran-Contiguous arrays while using numpy functions


I was working through an example from a book comparing the performance of C-contiguous and Fortran-contiguous arrays, and I observed that the Fortran-contiguous array performed better than the C-contiguous array for row-wise operations when using np.sum(axis=1), i.e. the opposite of what should be expected. When using np.cumsum(axis=1), the C-contiguous array performed better, as expected theoretically. What could be the reason?

The Jupyter Notebook code:

import numpy as np
arr_c = np.ones((1000, 1000), order='C')
arr_f = np.ones((1000, 1000), order='F')

The observed results were the opposite of what should be expected theoretically when using np.sum(axis=1).

%timeit arr_c.sum(axis=1)

605 µs ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit arr_f.sum(axis=1)

398 µs ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

However, the results were as expected when using np.cumsum(axis=1).

%timeit arr_c.cumsum(axis=1)

4.01 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr_f.cumsum(axis=1)

15.6 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Solution

  • I don't know which book you are referring to, but I think either the book is wrong or you misread it (or it is in a particular context, and is talking about some other consideration than the one I have in mind). What people are usually taught in HPC lessons is that C order is better when the inner loop of your algorithm iterates over columns, and Fortran order is better when the inner loop iterates over rows.

    Why order changes performance

    Firstly (I suppose this is what your book says), it is important to understand why order matters. The reason is cache memory. If your algorithm accesses data in the order in which it is stored, then, most of the time, the data is already in the cache. If not, you will have to fetch a new block from memory far more often, which costs time.

    So, if memory contains M₀=0, M₁=1, M₂=2, M₃=3, M₄=4, M₅=5, M₆=6, M₇=7, M₈=8, M₉=9, M₁₀=10, M₁₁=11, M₁₂=12, M₁₃=13, M₁₄=14, M₁₅=15, then the algorithm

    s = 0
    for i in range(4):
        for j in range(4):
            s += M[i*4 + j]
    

    is more efficient than

    s = 0
    for i in range(4):
        for j in range(4):
            s += M[j*4 + i]
    

    because the first one accesses M₀, M₁, M₂, ..., M₁₅ in that order, while the second one accesses M₀, M₄, M₈, M₁₂, M₁, M₅, ...

    If your cache holds blocks of only 4 M places, then the first version has to load a new block only 4 times, while the second version has to fetch a new block on each of the 16 accesses.

    Of course, that is a theoretical example; real caches are much bigger than that. But that is the reason.
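    This cache effect is easy to observe from Python. The following sketch (my own, not from the book or the question) times summing one million float64 values stored adjacently against the same number of values spaced 16 elements (128 bytes) apart, so that in the strided case almost every access touches a new cache line:

    ```python
    import numpy as np
    import timeit

    n = 1_000_000
    dense = np.ones(n)               # 1 million adjacent float64 values (8 MB)
    sparse = np.ones(16 * n)[::16]   # 1 million values, each 128 bytes apart (view over 128 MB)

    t_dense = timeit.timeit(dense.sum, number=20)
    t_sparse = timeit.timeit(sparse.sum, number=20)
    print(f"dense: {t_dense:.4f} s, sparse: {t_sparse:.4f} s")  # sparse is markedly slower
    ```

    Both calls sum exactly the same number of elements; only the memory layout differs, so the gap is almost entirely the cost of extra cache-line fetches.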

    Now, assume those 16 memory places hold the 16 elements of a matrix M[row, column]. If you are summing just one column ("just one" is important here), say column 1,

    def sumColumn(col):
        s = 0
        for row in range(4):
            s += M[row, col]
        return s

    sumColumn(1)
    

    this would access memory M₄, M₅, M₆, M₇ if the matrix is in Fortran order (so M[i,j] is at memory Mᵢ₊₄ⱼ),

    but would access memory M₁, M₅, M₉, M₁₃ if the matrix is in C order (so M[i,j] is at memory M₄ᵢ₊ⱼ).
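    As a sanity check (my own sketch, not part of the original answer), these flat offsets can be recovered from an array's strides, which NumPy exposes in bytes:

    ```python
    import numpy as np

    # A 4x4 matrix holding 0..15, stored once in each order.
    M_c = np.arange(16).reshape(4, 4, order='C')  # M[i, j] lives at flat index 4*i + j
    M_f = np.arange(16).reshape(4, 4, order='F')  # M[i, j] lives at flat index i + 4*j

    def flat_offsets(a, col):
        """Flat element index of a[row, col] for each row, derived from the strides."""
        s0, s1 = (s // a.itemsize for s in a.strides)
        return [s0 * row + s1 * col for row in range(a.shape[0])]

    print(flat_offsets(M_c, 1))  # C order:       [1, 5, 9, 13]
    print(flat_offsets(M_f, 1))  # Fortran order: [4, 5, 6, 7]
    ```

    The `flat_offsets` helper is hypothetical (defined here for illustration); the offsets it prints match the M₁, M₅, M₉, M₁₃ versus M₄, M₅, M₆, M₇ patterns described above.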

    Why this does not apply directly to whole-matrix operations

    If the implementation of sum(axis=1) were equivalent to

    result = np.zeros((4,))
    for column in range(4):
        result[column] = sumColumn(column)
    

    that is, if I unroll it,

    result = np.zeros((4,))
    for column in range(4):
        for row in range(4):
            result[column] += M[row, column]
    

    then it is 4 times the same thing: 4 runs of an algorithm that is efficient in Fortran order and less so in C order.

    But nothing says that this is the implementation.

    It could also be

    result = np.zeros((4,))
    for row in range(4):
        for column in range(4):
            result[column] += M[row, column]
    

    and then you get an algorithm that is more efficient (cache-wise) in C order than in Fortran order.

    So, the point is, it is implementation dependent. Some implementations are better when data are in C order (because they iterate row first, column second, which is the memory order when data are in C order), and some are better when data are in Fortran order (because they iterate column first, row second, which is the memory order when data are in Fortran order). And you can bet that numpy uses the best implementation for each context, when possible.

    tl;dr

    Usually, when people talk about performance related to order, it is because of cache. And cache-wise, summing a single column is faster if data are in Fortran order (where all elements of the column are adjacent in memory).

    But that doesn't apply directly to summing all columns, as .sum(axis=1) or np.cumsum(axis=1) do, because you don't know whether the implementation iterates row first, column second, or column first, row second.
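    To close the loop on the single-column claim: the following quick check (my own sketch, with arbitrarily chosen sizes) times summing column 0 of a large matrix in each order. In the C-ordered array each element of the column sits one full row (32 KB here) away from the next, while in the Fortran-ordered array the whole column is contiguous:

    ```python
    import numpy as np
    import timeit

    arr_c = np.ones((4000, 4000), order='C')  # column 0 strided: one element per 32 KB row
    arr_f = np.ones((4000, 4000), order='F')  # column 0: 4000 adjacent float64 values

    t_c = timeit.timeit(lambda: arr_c[:, 0].sum(), number=200)
    t_f = timeit.timeit(lambda: arr_f[:, 0].sum(), number=200)
    print(f"C order: {t_c:.4f} s, Fortran order: {t_f:.4f} s")  # Fortran order should win
    ```

    Exact numbers depend on the machine, but on typical hardware the Fortran-ordered column sum is clearly faster, as the answer predicts for the single-column case.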