Why is explicit dimension expansion so space-inefficient compared to leveraging implicit NumPy broadcasting, when it technically does the same thing, i.e. copies the matrix over along a given dimension?
I have two arrays: X(500,3072) and X_train(5000,3072). I want to calculate the distances of all 500 points in X from the 5000 points in X_train. When I try to do this via explicit dimension expansion, it takes over 60 GB of space:
dists = np.linalg.norm(np.expand_dims(X, axis=1) - X_train, axis=2)
Whereas if I leverage NumPy's broadcasting, it gets done within MBs of space:
dists = np.square(X).sum(axis=1, keepdims=True) + np.square(X_train).sum(axis=1, keepdims=True).T - 2 * np.dot(X, X_train.T)
Why is explicit dimension expansion so space-inefficient despite such small matrices being used?
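For reference, the 60 GB figure matches a rough back-of-the-envelope estimate of the temporary that the subtraction has to materialize (this sketch assumes float64 data, 8 bytes per element):

import math
import numpy as np

# Shape of the temporary produced by np.expand_dims(X, axis=1) - X_train
shape = np.broadcast_shapes((500, 1, 3072), (5000, 3072))  # (500, 5000, 3072)

# Its size, assuming float64 (8 bytes per element)
size_gb = math.prod(shape) * 8 / 1e9
print(shape, f"{size_gb:.1f} GB")  # (500, 5000, 3072) 61.4 GB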
Tracking the dimensions in the two alternatives:
X(500,3072) and X_train(5000,3072).
(500,1,3072) - (1,5000,3072) => (500,5000,3072)
norm then squares this, sums over axis 2 (length 3072), and takes the sqrt, returning (500,5000).
That 3D intermediate array is quite large: 500 * 5000 * 3072 float64 values is roughly 61 GB, which is exactly the blow-up you are seeing.
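You can watch the temporary being materialized on a scaled-down example (row counts shrunk so it fits in memory; the shapes and float64 dtype are the only assumptions here):

import numpy as np

# Scaled-down stand-ins: 5 and 50 rows instead of 500 and 5000
X_small = np.random.rand(5, 3072)
X_train_small = np.random.rand(50, 3072)

# The subtraction materializes the full 3D temporary before norm ever runs
diff = np.expand_dims(X_small, axis=1) - X_train_small
print(diff.shape)   # (5, 50, 3072)
print(diff.nbytes)  # 5 * 50 * 3072 * 8 = 6144000 bytes

# Scaled back up to the real row counts:
print(500 * 5000 * 3072 * 8 / 1e9, "GB")  # 61.44 GB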
np.square(X).sum(axis=1, keepdims=True) + np.square(X_train).sum(axis=1, keepdims=True).T - 2 * np.dot(X, X_train.T)
np.square(X).sum: (500,3072) => (500,1)
np.square(X_train).sum then .T: (5000,3072) => (5000,1) => (1,5000)
their sum broadcasts to (500,5000)
dot: (500,3072) @ (3072,5000) => (500,5000)
net: (500,5000)
No intermediate here is ever bigger than (500,5000), which is about 20 MB in float64, so this version never leaves the MB range.
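As a quick sanity check (not part of the original snippets), here is a small-scale sketch confirming the two routes agree. Note that the one-liner produces squared distances, so the sketch adds a square root (clamping tiny negative round-off values) before comparing:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 3072))        # small stand-ins for the real arrays
X_train = rng.random((50, 3072))

# Route 1: explicit expansion -> (5, 50, 3072) temporary, then norm over axis 2
d1 = np.linalg.norm(np.expand_dims(X, axis=1) - X_train, axis=2)

# Route 2: squared-distance identity; sqrt added here to compare like with like
d2_sq = (np.square(X).sum(axis=1, keepdims=True)
         + np.square(X_train).sum(axis=1, keepdims=True).T
         - 2 * np.dot(X, X_train.T))
d2 = np.sqrt(np.maximum(d2_sq, 0))

print(d1.shape, d2.shape)   # (5, 50) (5, 50)
print(np.allclose(d1, d2))  # True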