Tags: python, numpy, vectorization, vlad-vector

Vectorise VLAD computation in numpy


I was wondering whether it was possible to vectorise this implementation of VLAD computation.

For context:

feats = numpy array of shape (T, N, F), i.e. T images, each with N local descriptors of dimension F

kmeans = KMeans object from scikit-learn, fitted with K clusters
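
For example, a minimal setup along those lines (the sizes are made up, just for illustration) would be:

import numpy as np
from sklearn.cluster import KMeans

T, N, F, K = 100, 50, 128, 64         # hypothetical sizes
feats = np.random.randn(T, N, F)      # T images, each with N local descriptors of dimension F
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0)
kmeans.fit(feats.reshape(-1, F))      # codebook fitted on all descriptors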

Current method

k = kmeans.n_clusters # K
centers = kmeans.cluster_centers_ # (K, F)
vlad_feats = []

for feat in feats:
    # feat shape - (N, F) 
    cluster_label = kmeans.predict(feat) #(N,)
    vlad = np.zeros((k, feat.shape[1])) # (K, F)

    # computing the differences for all the clusters (visual words)
    for i in range(k):
        # if there is at least one descriptor in that cluster
        if np.sum(cluster_label == i) > 0:
            # add the differences
            vlad[i] = np.sum(feat[cluster_label == i, :] - centers[i], axis=0)
    vlad = vlad.flatten() # (K*F,)
    # L2 normalization
    vlad = vlad / np.sqrt(np.dot(vlad, vlad))
    vlad_feats.append(vlad)

vlad_feats = np.array(vlad_feats) # (T, K*F)

Getting the kmeans predictions as a batch is not a problem, since we can do the following:

feats2 = feats.reshape(-1, F) # (T*N, F)
labels = kmeans.predict(feats2) # (T*N, )

But I'm stuck at computing the differences from the cluster centers in a vectorised way.


Solution

  • While @MadPhysicist's answer is vectorized, I've found that it hurts performance.

    Below, looping is essentially a rewritten version of the OP's algorithm, and naivec employs vectorization through an exploded 4D tensor.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    
    def looping(kmeans: MiniBatchKMeans, local_tlf):
        k, (t, l, f) = kmeans.n_clusters, local_tlf.shape
        centers_kf = kmeans.cluster_centers_
        vlad_tkf = np.zeros((t, k, f))
        for vlad_kf, local_lf in zip(vlad_tkf, local_tlf):
            label_l = kmeans.predict(local_lf)
            for i in range(k):
                vlad_kf[i] = np.sum(local_lf[label_l == i] - centers_kf[i], axis=0)
            vlad_D = vlad_kf.ravel()
            vlad_D = np.sign(vlad_D) * np.sqrt(np.abs(vlad_D))
            vlad_D /= np.linalg.norm(vlad_D)
            vlad_kf[:,:] = vlad_D.reshape(k, f)
        return vlad_tkf.reshape(t, -1)
    
    
    def naivec(kmeans: MiniBatchKMeans, local_tlf):
        k, (t, l, f) = kmeans.n_clusters, local_tlf.shape
        centers_kf = kmeans.cluster_centers_
        labels_tl = kmeans.predict(local_tlf.reshape(-1,f)).reshape(t, l)
        mask_tlk = labels_tl[..., np.newaxis] == np.arange(k)
        local_tl1f = local_tlf[...,np.newaxis,:]
        delta_tlkf = local_tl1f - centers_kf # <-- easy to run out of memory
        vlad_tD = (delta_tlkf * mask_tlk[..., np.newaxis]).sum(axis=1).reshape(t, -1)
        vlad_tD = np.sign(vlad_tD) * np.sqrt(np.abs(vlad_tD))
        vlad_tD /= np.linalg.norm(vlad_tD, axis=1, keepdims=True)
        return vlad_tD
    

    Indeed, see below for a benchmark.

    np.random.seed(1234)
    # usually there are a lot more images than this
    t, l, f = 256, 128, 64
    X = np.random.randn(t, l, f)
    km = MiniBatchKMeans(n_clusters=16, n_init=10, random_state=0)
    km.fit(X.reshape(-1, f))
    
    result_looping = looping(km, X)
    result_naivec = naivec(km, X)
    
    %timeit looping(km, X) # ~200 ms
    %timeit naivec(km, X) # ~300 ms
    
    assert np.allclose(result_looping, result_naivec)
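
    To make the memory warning in naivec concrete, here is a rough estimate of the exploded delta_tlkf tensor, reusing t, l, f from the benchmark (assuming float64 features; the 256-cluster codebook is hypothetical, only meant to show the scaling):

    for k_ in (16, 256):
        # delta_tlkf holds t*l*k*f float64 values
        print(f"k={k_}: ~{t * l * k_ * f * 8 / 2**30:.2f} GiB")
    # k=16 (the benchmark) -> ~0.25 GiB; a hypothetical 256-cluster codebook -> ~4 GiB,
    # while the output itself is only t*k*f values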
    

    An idiomatic vectorization that avoids memory growing beyond the output size (asymptotically) would leverage a scatter reduction (np.add.at):

    def truvec(kmeans: MiniBatchKMeans, local_tlf):
        k, (t, l, f) = kmeans.n_clusters, local_tlf.shape
        centers_kf = kmeans.cluster_centers_
        labels_tl = kmeans.predict(local_tlf.reshape(-1,f)).reshape(t, l)
        
        vlad_tkf = np.zeros((t, k, f))
        M = t * k
        # offset the labels so every (image, cluster) pair gets a unique index in [0, M)
        labels_tl += np.arange(t)[:, np.newaxis] * k
        vlad_Mf = vlad_tkf.reshape(-1, f)
        # scatter-add each descriptor into its (image, cluster) cell, then subtract
        # count * center to turn the per-cell sums into sums of residuals
        np.add.at(vlad_Mf, labels_tl.ravel(), local_tlf.reshape(-1, f))
        counts_M = np.bincount(labels_tl.ravel(), minlength=M)
        vlad_tkf -= counts_M.reshape(t, k, 1) * centers_kf
        
        vlad_tD = vlad_tkf.reshape(t, -1)
        vlad_tD = np.sign(vlad_tD) * np.sqrt(np.abs(vlad_tD))
        vlad_tD /= np.linalg.norm(vlad_tD, axis=1, keepdims=True)
        return vlad_tD
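
    Reusing the benchmark setup from above, a quick consistency and timing check for truvec would be:

    result_truvec = truvec(km, X)
    assert np.allclose(result_looping, result_truvec)

    %timeit truvec(km, X) # ~200 ms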
    

    However, disappointingly, this also only gets us to about 200 ms of computation time. This boils down to two reasons:

    A performant vectorized version of the VLAD algorithm requires some sophisticated techniques to leverage contiguous array accesses. This version gets a 40% improvement over looping(), but requires a lot of setup; see my blog post on the approach here.
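
    For illustration only, here is a sketch of one way to get contiguous accesses (my own guess at the general idea, not necessarily what the blog post does): sort the descriptors by their flat (image, cluster) index so that rows belonging to the same cell are adjacent, then sum each run with np.add.reduceat instead of the unbuffered np.add.at. The name sorted_reduce is made up for this sketch, and the snippet reuses the imports and benchmark objects from above.

    def sorted_reduce(kmeans: MiniBatchKMeans, local_tlf):
        k, (t, l, f) = kmeans.n_clusters, local_tlf.shape
        centers_kf = kmeans.cluster_centers_
        local_Mf = local_tlf.reshape(-1, f)
        # flat (image, cluster) index in [0, t*k) for every descriptor
        flat_M = kmeans.predict(local_Mf) + np.repeat(np.arange(t), l) * k
        order = np.argsort(flat_M, kind='stable')
        flat_sorted = flat_M[order]
        # start of each run of identical indices in the sorted order
        starts = np.flatnonzero(np.r_[True, np.diff(flat_sorted) > 0])
        # contiguous segment sums replace the scatter-add
        sums_rf = np.add.reduceat(local_Mf[order], starts, axis=0)
        counts_M = np.bincount(flat_M, minlength=t * k)
        vlad_tkf = np.zeros((t, k, f))
        vlad_tkf.reshape(-1, f)[flat_sorted[starts]] = sums_rf
        vlad_tkf -= counts_M.reshape(t, k, 1) * centers_kf
        vlad_tD = vlad_tkf.reshape(t, -1)
        vlad_tD = np.sign(vlad_tD) * np.sqrt(np.abs(vlad_tD))
        vlad_tD /= np.linalg.norm(vlad_tD, axis=1, keepdims=True)
        return vlad_tD

    # same consistency check as before
    assert np.allclose(looping(km, X), sorted_reduce(km, X))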