Tags: c, optimization, matrix-multiplication, neon, row-major-order

If C is row-major order, why does ARM intrinsic code assume column-major order?


I'm not sure where the best place to ask this is, but I am currently working with ARM intrinsics and am following this guide: https://developer.arm.com/documentation/102467/0100/Matrix-multiplication-example

However, the code there was written assuming that the arrays are stored in column-major order. I have always thought C arrays were stored row-major. Why did they assume this?

EDIT: For example, if instead of this:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    // A is n x k, B is k x m, C is n x m, all stored column-major:
    // element (row, col) of a matrix with R rows is at [R*col + row].
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

they had done this:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    // Same body, but with the j loop moved inner-most:
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int k_idx=0; k_idx < k; k_idx++) {
            for (int j_idx=0; j_idx < m; j_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

then the code would run faster due to the spatial locality of accessing C in the order C[0], C[1], C[2], C[3] instead of in the order C[0], C[2], C[1], C[3] (where C[0], C[1], C[2], C[3] are contiguous in memory).


Solution

  • You're not using C 2D arrays like C[i][j], so this isn't a question of how C stores anything: it's a question of how 2D indexing is done manually in this code, using n * idx_1 + idx_2, and of which index you loop over in the inner vs. outer loops.
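
    To make the two conventions concrete, here's a hypothetical illustration (not from the guide) of how element (i_idx, j_idx) of an n-row, m-column matrix M in one flat array is addressed under each convention:

      // Hypothetical accessors for an n x m matrix M stored in a flat array:
      float32_t get_rm(const float32_t *M, uint32_t m, uint32_t i_idx, uint32_t j_idx) {
          return M[m*i_idx + j_idx];   // row-major: rows are contiguous
      }
      float32_t get_cm(const float32_t *M, uint32_t n, uint32_t i_idx, uint32_t j_idx) {
          return M[n*j_idx + i_idx];   // column-major: columns are contiguous (the guide's choice)
      }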

    But the hard part of a matmul with both matrices non-transposed is that you need to make opposite choices for the two input matrices: a naive matmul has to stride through distant elements of one of the input matrices, so it's inherently screwed. That's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. (O(n^3) work over O(n^2) data - you want to get the most use out of it for every time you bring it into L1d cache, and/or into registers.)

    Loop interchange can speed things up by taking advantage of spatial locality in the inner-most loop, if you do it right.

    See the cache-blocked matmul example in Ulrich Drepper's What Every Programmer Should Know About Memory, which traverses contiguous memory in all 3 arrays in the inner few loops, picking as the inner-most loop the one index that is never used with a scale factor in any of the 3 matrices. For this code, that loop order looks like:

      for (int j_idx=0; j_idx < m; j_idx++) {
          for (int k_idx=0; k_idx < k; k_idx++) {
              for (int i_idx=0; i_idx < n; i_idx++) {
                  C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
              }
          }
      }
    

    Notice that B[k*j_idx + k_idx] is invariant over the inner loop, and that you're doing a simple dst[0..n] += const * src[0..n] operation over contiguous memory (which is easy to SIMD-vectorize). You're still doing 2 loads + 1 store for every FMA, though, so that's not going to max out your FP throughput.
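
    Since the question is about NEON anyway, here is a minimal sketch (my own code, not the guide's) of how that contiguous step might be vectorized; it assumes n is a multiple of 4, so a real version would need a scalar tail loop:

      #include <arm_neon.h>

      // One (j_idx, k_idx) step of the loop nest above:
      // C column j += (invariant B element) * A column k, over n contiguous floats.
      static void fma_column(float32_t *c_col, const float32_t *a_col,
                             float32_t b_elem, uint32_t n) {
          float32x4_t vb = vdupq_n_f32(b_elem);      // broadcast the invariant B element
          for (uint32_t i = 0; i < n; i += 4) {
              float32x4_t va = vld1q_f32(a_col + i); // contiguous load from A's column
              float32x4_t vc = vld1q_f32(c_col + i); // contiguous load from C's column
              vc = vfmaq_f32(vc, va, vb);            // vc += va * vb (fused multiply-add)
              vst1q_f32(c_col + i, vc);              // contiguous store back to C
          }
      }

    The inner i loop above would then become fma_column(&C[n*j_idx], &A[n*k_idx], B[k*j_idx + k_idx], n).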

    Separate from the cache access pattern, that loop order also avoids a long dependency chain into a single accumulator (an element of C). But that's not a real problem for an optimized implementation anyway: you can of course use multiple accumulators. FP math isn't strictly associative because of rounding error, but multiple accumulators are closer to pairwise summation and likely to have less rounding error than serially adding each element of the row x column dot product. The result will differ from adding in the order the standard simple C loop does, but it will usually be closer to the exact answer.
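
    For illustration (a sketch of the general technique, not anything from the guide), this is what multiple accumulators look like for a single dot product, assuming len is a multiple of 4:

      #include <arm_neon.h>   // for float32_t and the integer types

      // Four independent accumulators break the serial FP-add dependency chain;
      // the final reduction is pairwise, like a small summation tree.
      static float32_t dot_4acc(const float32_t *a, const float32_t *b, uint32_t len) {
          float32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
          for (uint32_t i = 0; i < len; i += 4) {
              acc0 += a[i]   * b[i];
              acc1 += a[i+1] * b[i+1];
              acc2 += a[i+2] * b[i+2];
              acc3 += a[i+3] * b[i+3];
          }
          return (acc0 + acc1) + (acc2 + acc3);
      }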


    Your proposed loop order i,k,j is the worst possible.

    You're striding through distant elements of 2 of the 3 matrices in the inner loop, including discontiguous access to C[], which is the opposite of what you said in your last paragraph.

    With j as the inner-most loop, you'd access C[0], C[n], C[2n], etc. on the first outer iteration, and likewise B[0], B[k], B[2k], etc. for B[], so that's really bad.

    Interchanging the i and j loops of the guide's i,j,k version (giving j,i,k order) would give you contiguous access to C[] in the middle loop instead of strided, and still rows of one matrix, columns of the other, in the inner-most loop, as sketched below. So that would be strictly an improvement: yes, you're right that this naive example is constructed even worse than it needs to be.
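
    A sketch of that interchanged version (mine, not the guide's); note that C[n*j_idx + i_idx] is loop-invariant in the inner loop, so it can be kept in a register accumulator:

      for (int j_idx=0; j_idx < m; j_idx++) {
          for (int i_idx=0; i_idx < n; i_idx++) {      // contiguous through C's column j
              float32_t acc = C[n*j_idx + i_idx];
              for (int k_idx=0; k_idx < k; k_idx++) {
                  // a row of A (strided by n), a column of B (contiguous)
                  acc += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
              }
              C[n*j_idx + i_idx] = acc;
          }
      }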

    But the key issue is the strided access to something in the inner loop: that's a performance disaster, and (as above) it's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. The only index that is never used with a scale factor is i, which is why i belongs in the inner-most loop.
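
    To make the cache-blocking idea concrete, here is a minimal tiled sketch of the same column-major matmul. This is my construction, not the guide's: it assumes n, m, and k are multiples of TILE, and a tuned version would also tile the i loop and pick TILE so the working set fits in L1d/L2:

      #include <arm_neon.h>   // for float32_t

      #define TILE 64         // hypothetical tile size; tune for your cache

      void matrix_multiply_tiled(const float32_t *A, const float32_t *B, float32_t *C,
                                 uint32_t n, uint32_t m, uint32_t k) {
          for (uint32_t j0 = 0; j0 < m; j0 += TILE) {
              for (uint32_t k0 = 0; k0 < k; k0 += TILE) {
                  // Each TILE x TILE block of B gets reused while the columns of
                  // A and C it touches are still hot in cache.
                  for (uint32_t j = j0; j < j0 + TILE; j++) {
                      for (uint32_t kk = k0; kk < k0 + TILE; kk++) {
                          float32_t b = B[k*j + kk];         // invariant over i
                          for (uint32_t i = 0; i < n; i++)   // contiguous in A and C
                              C[n*j + i] += A[n*kk + i] * b;
                      }
                  }
              }
          }
      }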