python numpy

NumPy array of arrays into an array of lengths (vectorization)


I have a numpy array of arrays, where each sub-array represents a group of elements. I want to compute the length of each sub-array and store the result as a new numpy array.

Currently, I use a standard for loop, but I would like to vectorize this operation for better performance using only numpy.

Example

import numpy as np

indices = np.array([np.array([1, 2, 3]), np.array([4, 5]), np.array([6, 7, 8, 9])], dtype=object)
group_lengths = np.array([len(group) for group in indices])

print(group_lengths)

Expected output

[3 2 4]

I am looking for a fully vectorized solution that avoids the explicit loop. How can this be achieved with numpy?

Update

Thank you all for your insights and clarifications! I now understand that fully vectorizing my use case isn't possible. I'll stick with my current solution, as it seems to be the most efficient approach given the circumstances.
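For readers wondering why the loop cannot be removed: an object array only stores references to separate Python objects, so NumPy has to call len on each element one by one. The sketch below shows an alternative layout, under the assumption (not stated in the question) that the groups can be stored as one flat array plus an offsets array from the start. The names flat and offsets are made up for illustration. With that layout the lengths come from a single np.diff call.

```python
import numpy as np

# Hypothetical flat layout: all values in one array, plus group boundaries.
flat = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
offsets = np.array([0, 3, 5, 9])  # group i is flat[offsets[i]:offsets[i + 1]]

# Lengths are the differences between consecutive offsets, with no Python loop.
group_lengths = np.diff(offsets)
print(group_lengths)  # [3 2 4]

# The individual groups can still be recovered as views when needed.
groups = np.split(flat, offsets[1:-1])
```

This only helps if the data can be produced in this form. Converting an existing object array to it still requires one pass over the sub-arrays.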


Solution

  • Of the approaches timed below, np.fromiter is the fastest. It was proposed in the comments by Jérôme Richard. The timings were taken on the following dataset:

    import numpy as np

    # generate a big dataset: 100,000 groups of random length (4 to 499 elements)
    indices = np.array([np.random.randint(0, 10, size=np.random.randint(4, 500))
                        for _ in range(100000)
                        ], dtype=object)
    

    Using map:

    group_lengths = np.array(list(map(len, indices)))
    # 6.49 ms ± 57.5 μs per loop
    

    Using a list comprehension with len:

    group_lengths = np.array([len(group) for group in indices])
    # 8.19 ms ± 60.5 μs per loop
    

    Using np.vectorize:

    vfunc = np.vectorize(len)
    group_lengths = vfunc(indices)
    # 4.86 ms ± 23.2 μs per loop
    

    Using np.fromiter:

    group_lengths = np.fromiter(map(len, indices), dtype=object)
    # 3.53 ms ± 36.2 μs per loop
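
One caveat on the np.fromiter call above: with dtype=object the result is an array of Python int objects rather than native integers, and, as far as I know, object dtype in np.fromiter needs NumPy 1.23 or later. A minimal sketch of a variant that returns a native integer array is shown below. The choice of np.intp and the count argument are my assumptions, not part of the original comment, and this variant was not part of the timings above.

```python
import numpy as np

# Small stand-in for the ragged object array from the question.
indices = np.array(
    [np.array([1, 2, 3]), np.array([4, 5]), np.array([6, 7, 8, 9])],
    dtype=object,
)

# dtype=np.intp yields a native integer array;
# count tells NumPy the output size up front so it can allocate once.
group_lengths = np.fromiter(map(len, indices), dtype=np.intp, count=len(indices))

print(group_lengths)  # [3 2 4]
```

A native integer dtype keeps later numeric operations on group_lengths (sums, cumulative sums, comparisons) in compiled code instead of falling back to Python objects.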