I am trying to multiply two matrices in numpy with rather large dimensionality.
See the 3 methods below. I initialise the 3 matrices randomly to illustrate my problem. The first matrix, namely Y1[:,:,0], is a slice of a bigger 3d-array. The second is a .copy() of this slice, and the third is a standalone matrix of its own.
Why is the first multiplication so much slower than the other two?
import numpy as np
from time import time
Y1 = np.random.uniform(-1, 1, (5000, 1093, 201))
Y2 = Y1[:,:,0].copy()
Y3 = np.random.uniform(-1, 1, (5000, 1093))
W = np.random.uniform(-1, 1, (1093, 30))
# method 1: multiply the slice of the 3d-array directly
START = time()
Y1[:,:,0].dot(W)
END = time()
print(f"Method 1 : {END - START}")
# method 2: multiply the contiguous copy of that slice
START = time()
Y2.dot(W)
END = time()
print(f"Method 2 : {END - START}")
# method 3: multiply the standalone 2d-array
START = time()
Y3.dot(W)
END = time()
print(f"Method 3 : {END - START}")
The output times are roughly 34, 0.06, 0.06 seconds respectively.
I see the difference: Whereas the last two matrices are "real" 2d-arrays, the first one is a slice of my bigger 3d-array.
Is the subsetting Y1[:,:,0] what makes it so slow? I also noticed that creating the copy Y2 of that slice is quite slow.
In the end, I am given this 3d-array and have to repeatedly calculate the matrix product of slices of Y1 with a (potentially different) matrix W. Is there a better / faster way to do this?
Thanks in advance!
This is a cache problem. If you study how the cost depends on the size k of the third axis, you'll see a linear relation at first (k=1 => no difference, k=2 => method 1 costs twice as much, k=3 => three times as much, etc.), capped by a maximum (whether k=20 or k=30 doesn't really change the situation anymore).
That maximum cap depends on the sizes of the other axes.
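A minimal sketch of that experiment, with sizes reduced from yours so it runs quickly (the exact figures are machine-dependent):
import numpy as np
from time import time

W = np.random.uniform(-1, 1, (1093, 30))
for k in (1, 2, 3, 5, 10, 20, 30):
    # allocate a 3d-array whose third axis has size k, then time the slice product
    Y = np.random.uniform(-1, 1, (2000, 1093, k))
    start = time()
    Y[:, :, 0].dot(W)
    print(f"k={k:3d} : {time() - start:.4f} s")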
The thing is, matrix multiplication (and, basically, any operation on arrays) usually works iteratively, so data in memory are read one after another.
The first read costs a bit, because memory is slow. But when you access data in memory, a whole cache line (something like 64 or 128 bytes) is read and stored in cache. If the next operation uses the next number in the matrix, and this number happens to sit right next to the previous one in memory, it likely belongs to the same cache line, and it won't be necessary to read it from memory again: we already have it in the (way faster) cache.
This is a bit oversimplified, and it is not that obvious how it applies to matrix multiplication, because a matrix multiplication is not that sequential. But, basically, the more you use data that are close to each other in memory, the faster it goes. People often overlook this, thinking it is some kind of hacker optimization to grab a few extra nanoseconds. But the effect can be huge.
For a very small amount of data that fits entirely in cache (some kilobytes), and an algorithm complex enough to read it more than once (even a plain matrix multiplication qualifies), it doesn't really show, because every value ends up in cache after a few computation steps.
But if everything doesn't fit in cache, the wider the spacing between your data, the less you can reuse a cache line, and the more you have to reread data from memory, to the point where reading memory becomes the leading cost.
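A small illustration of this, not tied to matrix multiplication (a sketch; the sizes are arbitrary and the exact timings machine-dependent): summing the same number of values, once from a contiguous array and once from a view whose values are 512 bytes apart.
import numpy as np
from time import time

n = 1_000_000
a = np.random.uniform(-1, 1, n)             # n consecutive 8-byte values
b = np.random.uniform(-1, 1, n * 64)[::64]  # n values, but 64*8 = 512 bytes apart

start = time()
a.sum()
print(f"contiguous : {time() - start:.4f} s")

start = time()
b.sum()
print(f"strided    : {time() - start:.4f} s")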
So, your problem is that in
Y1 = np.random.uniform(-1, 1, (5000, 1093, 201))
each value of Y1[:,:,0] is separated from the next by at least 201x8 = 1608 bytes. So the cache is useless for Y1 (it is still used for W, but all methods are equal on that front): there is no chance of fast access to a value of Y1[:,:,0] thanks to the fact that we have already read a value close to it in memory, because they are all far from each other.
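You can check that directly on your arrays: the slice inherits the big strides, while the copy is packed.
Y1[:, :, 0].strides                  # (1757544, 1608): consecutive values are 1608 bytes apart
Y1[:, :, 0].flags['C_CONTIGUOUS']    # False: the slice is not one packed block of memory
Y2.strides                           # (8744, 8): in the copy, consecutive values are adjacent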
Another way to convince yourself that this is your problem (and maybe the solution, if you need one): just look at what would have happened with
Y1 = np.random.uniform(-1, 1, (201, 5000, 1093)).transpose(1,2,0)
This Y1 has exactly the same shape as yours: the same 5000x1093x201 array, and you keep the same sub-array Y1[:,:,0] of shape 5000x1093.
The difference between my Y1 and yours is invisible from a pure "maths" point of view; the only difference is where exactly, in physical memory, the data are stored.
In your Y1, Y1[i,j,0] is far from Y1[i+1,j,0], and far from Y1[i,j+1,0] (it is close to Y1[i,j,1], but that doesn't help in your case).
You can see it by looking at Y1.strides:
Y1.strides
# (1757544, 1608, 8)
which tells you how many bytes separate two consecutive values along each axis. You can see that the stride is bigger than a typical cache line along every axis but the last one, which is the one you don't use.
My Y1, on the other hand, has strides
(8744, 8, 43720000)
so along the first two axes, the ones Y1[:,:,0] actually uses, consecutive values are only 8744 and 8 bytes apart; the huge stride is on the axis you don't touch.
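A minimal sketch to compare the two layouts (sizes reduced from yours so it fits comfortably in memory; exact timings are machine-dependent):
import numpy as np
from time import time

W = np.random.uniform(-1, 1, (1093, 30))

# same-shape arrays, only the physical memory layout differs
Y_slow = np.random.uniform(-1, 1, (2000, 1093, 50))                     # slice [:,:,0] is strided
Y_fast = np.random.uniform(-1, 1, (50, 2000, 1093)).transpose(1, 2, 0)  # slice [:,:,0] is contiguous

start = time()
Y_slow[:, :, 0].dot(W)
print(f"strided slice    : {time() - start:.4f} s")

start = time()
Y_fast[:, :, 0].dot(W)
print(f"contiguous slice : {time() - start:.4f} s")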
Of course, the problem is that when you reduce your problem to the single part that is slow, you may conclude that you should allocate Y1 as I did.
But I suppose your code doesn't allocate 201 values along that axis just to never use 200 of them. In other words, some other, unshown, parts of your code probably use that 3rd axis.
So the boost you gain by ordering Y1 optimally for this part of the code might have to be paid for in other parts of the code by slower computation.
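If reordering the whole array is not an option, a possible middle ground (a sketch, assuming the same slice is reused with several different W matrices; list_of_W is a hypothetical placeholder for however you obtain them) is to pay for one contiguous copy and amortize it over the repeated products, which is essentially your Y2 approach:
Yk = np.ascontiguousarray(Y1[:, :, 0])    # one slow, strided copy...
for W in list_of_W:                       # hypothetical: your sequence of W matrices
    result = Yk.dot(W)                    # ...then every product runs at the fast speed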
Last remark: when timing this kind of computation, it is important to avoid running things only once, also because of cache. The first method is biased against, because it has to read W from memory, while the other two may find it already waiting in cache (probably not in your case, because your data are too big; but with smaller data you would have concluded that the first method is slower, whichever one comes first, just because it is the one that paid the cost of loading data into cache).
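For instance, something along these lines (a sketch using timeit, with a warm-up call so the measured runs don't pay the initial cache-loading cost alone):
from timeit import timeit

Y2.dot(W)                                      # warm-up: bring Y2 and W into cache once
t = timeit(lambda: Y2.dot(W), number=10) / 10  # average over 10 runs
print(f"average : {t:.4f} s")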