Tags: python, numpy, indexing, binning, set-operations

Efficient bin-assignment in numpy


I have a very large 1D Python array x of somewhat repeating numbers, and along with it some data d of the same size.

x = np.array([48531, 62312, 23345, 62312, 1567, ..., 23345, 23345])
d = np.array([0    , 1    , 2    , 3    , 4   , ..., 99998, 99999])

In my context, "very large" means 10k...100k entries. Since many of them repeat, the number of unique entries is only about 5k...15k.

I would like to group the data into bins. This should be done by creating two objects. One is a matrix buffer b of data items taken from d. The other is a vector v of the unique x values that each buffer column refers to. Here's an example:

v =  [48531, 62312, 23345, 1567, ...]
b = [[0    , 1    , 2    , 4   , ...]
     [X    , 3    , ....., ...., ...]
     [ ...., ....., ....., ...., ...]
     [X    , X    , 99998, X   , ...]
     [X    , X    , 99999, X   , ...] ]

Since the number of occurrences of each unique value in x varies, some of the entries in the buffer b are invalid (indicated by a capital X, i.e. "don't care").


It's very easy to derive v in numpy:

v, n = np.unique(x, return_counts=True)  # yay, just 5ms

and we even get n, which is the number of valid entries within each column of b. Moreover, (np.max(n), v.shape[0]) gives the shape of the matrix b that needs to be allocated.

But how can b be generated efficiently? A for loop would do the job:

b = np.zeros((np.max(n), v.shape[0]))  # one column per unique value
for i in range(v.shape[0]):
    idx = np.flatnonzero(x == v[i])    # locations of the i-th unique value in x
    b[0:n[i], i] = d[idx]              # fill the top n[i] rows of column i

This loop iterates over all columns of b and extracts the indices idx by identifying all locations where x == v[i].

However, I don't like this solution because of the rather slow for loop (it takes about 50x longer than the unique command). I'd rather have the operation vectorized.


So one vectorized approach would be to create a matrix of matches where x == v and then run nonzero() on it along the columns. However, this matrix would require memory in the range of 150k x 15k entries, so about 8 GB on a 32-bit system.
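For illustration, here's a minimal sketch of that broadcast approach (assuming x, v, n and d as defined above); the match matrix alone is what makes it infeasible at full size:

mask = x[:, None] == v[None, :]        # match matrix, shape (len(x), len(v))
bin_idx, x_idx = np.nonzero(mask.T)    # transposed so matches come grouped by bin
pos = np.arange(x_idx.size) - np.searchsorted(bin_idx, bin_idx)  # position within bin
b = np.zeros((np.max(n), v.size), dtype=d.dtype)
b[pos, bin_idx] = d[x_idx]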

To me it sounds rather silly that the np.unique operation can even efficiently return the inverse indices, so that x = v[inv_indices], but that there is no way to get the v-to-x assignment lists for each bin in v. This should come almost for free while the function is scanning through x. Implementation-wise, the only challenge would be the unknown size of the resulting index matrix.
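As a quick check of that inverse-index relationship:

v, inv_indices = np.unique(x, return_inverse=True)
assert np.all(x == v[inv_indices])     # x is fully reconstructed from v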


Another way of phrasing this problem, assuming that the np.unique command is the method of choice for binning:

Given the three arrays x, v, inv_indices, where v contains the unique elements of x and x = v[inv_indices], is there an efficient way of generating the index vectors v_to_x[i] such that all(v[i] == x[v_to_x[i]]) for all bins i?

I shouldn't have to spend more time on this than on the np.unique command itself. And I'm happy to provide an upper bound for the number of items in each bin (say, e.g., 50).
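For what it's worth, one possible way to build such v_to_x lists with a single stable sort over inv_indices (a sketch, not necessarily the fastest option):

v, inv_indices, n = np.unique(x, return_inverse=True, return_counts=True)
order = np.argsort(inv_indices, kind='stable')  # indices into x, grouped by bin
v_to_x = np.split(order, np.cumsum(n)[:-1])     # one index vector per bin
assert all(np.all(v[i] == x[v_to_x[i]]) for i in range(v.size))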


Solution

  • I received the answer I was looking for by rephrasing the question; see here: python: vectorized cumulative counting

    by "cumulative counting" the inv_indices returned by np.unique() we receive the array indices of the sparse matrix so that

    c = cumcount(inv_indices)   # position of each element within its bin
    b[c, inv_indices] = d       # rows: position in bin, columns: bin index
    

    Cumulative counting as proposed in the thread linked above is very efficient; run times below 20 ms are realistic.
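
    For completeness, here's a minimal sketch of one possible vectorized cumcount (the linked thread discusses further variants; the helper name is taken from the snippet above):

    def cumcount(a):
        # out[i] = number of earlier occurrences of a[i] in a
        order = np.argsort(a, kind='stable')    # group equal values, keep original order
        s = a[order]
        starts = np.flatnonzero(np.r_[True, s[1:] != s[:-1]])  # group start positions
        sizes = np.diff(np.r_[starts, s.size])                 # group lengths
        counts_sorted = np.arange(s.size) - np.repeat(starts, sizes)
        out = np.empty_like(counts_sorted)
        out[order] = counts_sorted              # scatter back to original order
        return out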