I would like to compute the dot product (matmul) of the inner dimension of two 3D arrays. In the following example, I have an array of 10 2x3 matrices (X) and an array of 8 1x3 matrices (Y). The result Z should be a 10-element array of 8x2 matrices (you might also think of this as a 10 x 8 array of 2-d vectors).
import numpy as np

X = np.arange(0, 10 * 2 * 3).reshape(10, 2, 3)
Y = np.arange(0, 8 * 1 * 3).reshape(8, 1, 3)

# Write to Z, which holds the dot product of the inner dims of X and Y.
Z = np.empty((10, 8, 2))
for i, x_i in enumerate(X):
    for j, y_j in enumerate(Y):
        z = x_i @ y_j.T  # (2, 3) @ (3, 1) -> (2, 1)
        # Subscript z to flatten it.
        Z[i, j, :] = z[:, 0]
This is correct, but I would like a vectorized solution.
You could use einsum:
np.einsum('ijk,lmk->ilj', X, Y)
It produces an array whose shape has 3 axes:

i) the size of the first axis of X (here 10)
l) the size of the first axis of Y (here 8)
j) the size of the second axis of X (here 2)

m is just ignored (it has size 1), and k (the 3rd axis of both), since it is repeated, is used for the sum of products.
So, that is, for each i, l and j, result[i, l, j] is Σₖ X[i, j, k] * Y[l, m, k] (m being 0 anyway).
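As a quick sanity check (a small sketch reusing the X, Y and loop result Z from the question), the einsum output has the expected shape and matches the loop:

Z_einsum = np.einsum('ijk,lmk->ilj', X, Y)
print(Z_einsum.shape)               # (10, 8, 2)
print(np.array_equal(Z, Z_einsum))  # True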
You could also just use matmul, as you intended. It works with arrays of more than 2 dimensions (the operation is then element-wise, like + or *, over all axes but the last two, which behave like the 2D dot product).
But for that you need to adjust your shapes first.
Since you want a 10x8 array of results, those 10 and 8 must be on their own axes: axis 0 for size 10, axis 1 for size 8. So we need to reshape X as a (10, 1, 2, 3) array, and Y as a (1, 8, 1, 3) array, so that the "element-wise" behavior on the first two axes results in broadcasting. (You can do element-wise operations as long as each pair of axes either has matching sizes, or one of them has size 1, in which case it is broadcast. So here, the (10, 1) and (1, 8) sizes of the first two axes of X and Y broadcast to a (10, 8) result.)
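For instance (a minimal sketch of just that broadcasting rule, using throwaway arrays a and b):

a = np.ones((10, 1, 2, 3))
b = np.ones((1, 8, 1, 3))
print((a + b).shape)  # element-wise +, first two axes broadcast: (10, 8, 2, 3)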
And furthermore, to be able to take a dot product over the last two axes, they must follow the traditional (n, m) (m, k) pattern; here, (2, 3) (3, 1). So we also need to swap the last two axes of Y.
All together:
(X[:,None,...] @ (Y[None,...].transpose(0,1,3,2)))[...,0]
The [...,0] at the end plays the same role as the subscript in your loop: it drops the size-1 axis.
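Broken into steps (just a sketch with the same X and Y, to make the shapes visible):

Xb = X[:, None, ...]                     # (10, 1, 2, 3)
Yb = Y[None, ...].transpose(0, 1, 3, 2)  # (1, 8, 3, 1)
Z2 = (Xb @ Yb)[..., 0]                   # (10, 8, 2, 1) -> (10, 8, 2)
print(np.array_equal(Z, Z2))             # True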
Note that all these operations (except the matmul itself, of course) are almost free: no data is moved, copied or modified by X[:,None,...] or by .transpose(0,1,3,2). Only metadata (strides and shape) are changed, which is O(1) cost.
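You can check that these produce views sharing the original buffer, for example with np.shares_memory:

print(np.shares_memory(X, X[:, None, ...]))                     # True
print(np.shares_memory(Y, Y[None, ...].transpose(0, 1, 3, 2)))  # True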
Only the 10x8x2x3 multiplications are done on the data, which is of course needed. So performance-wise this is as vectorized as it can be.
Performance-wise, on your example (and on my computer), it is 21 μs vs 7 μs, in favor of the second solution.
But that could be different for other shapes. For example, with sizes (30, 50, 60) and (20, 1, 60), it is even: 3 ms vs 3 ms. So einsum could be faster in the long run.
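If you want to measure on your own shapes, here is a quick comparison with timeit (a sketch; the exact numbers will of course vary by machine):

import numpy as np
from timeit import timeit

X = np.random.rand(30, 50, 60)
Y = np.random.rand(20, 1, 60)

t_einsum = timeit(lambda: np.einsum('ijk,lmk->ilj', X, Y), number=100)
t_matmul = timeit(
    lambda: (X[:, None, ...] @ Y[None, ...].transpose(0, 1, 3, 2))[..., 0],
    number=100,
)
print(f"einsum: {t_einsum:.4f} s, matmul: {t_matmul:.4f} s")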
But well, both are completely "vectorized" anyway (all iterations are done in internal C code).