I have a 2D array a1[10000][100]
with 10000 rows and 100 columns, and also a 2D array a2[100][10000]
which is the transposed matrix of a1
.
Now I need to access 2 columns (eg. the 21th and the 71th columns) of a1
in the order of a1[0][20]
, a1[0][70]
, a1[1][20]
, a1[1][70]
, ..., a1[9999][20]
, a1[9999][70]
. Or I can also access a2
to achive the same goal (the order: a2[20][0]
, a2[70][0]
, a2[20][1]
, a2[70][1]
, ..., a2[20][9999]
, a2[70][9999]
). But the latter is much faster than the former. The related code is simplified as follows (size1
= 10000):
1 sum1 = 0;
2 for (i = 0; i < size1; ++i) {
3 x = a1[i][20];
4 y = a1[i][70];
5 sum1 = x + y;
6 } // this loop is slower
7 a_sum1[i] = sum1;
8
9 sum2 = 0;
10 for (i = 0; i < size1; ++i) {
11 x = a2[20][i];
12 y = a2[70][i];
14 sum2 = x + y;
15 } // this loop is faster
16 a_sum2[i] = sum2;
Accessing more rows (I have also tried 3, 4 rows rather than 2 rows in the example above) of a2
is also faster than accessing the same number of columns of a1
. Of course I can also replace Lines 3-5 (or Lines 11-14) with a loop (by using an extra array to store the column/row indexes to be accessed), it also gets the same result that the latter is faster than the former.
Why is the latter much faster than the former? I know something about cache lines but I have no idea of the reason for this case. Thanks.
You can benefit from the memory cache if you access addresses within the same cache line in a short amount of time. The explanation below assumes your arrays contain 4-byte integers.
In your first loop, your two memory accesses in the loop are 50*4 bytes apart, and the next iteration jumps forward 400 bytes. Every memory access here is a cache miss.
In the second loop, you still have two memory accesses that are 50*400 bytes apart, but on the next loop iteration you access addresses that are right next to the previously fetched value. Assuming the common 64-byte cache line size, you only have two cache misses every 16 iterations of the loop, the rest can be served from two cache lines loaded at the start of such a cycle.