I have a numpy array of arrays, where each sub-array represents a group of elements. I want to compute the length of each sub-array and store the result as a new numpy array.
Currently, I use a standard for loop, but I would like to vectorize this operation for better performance using only numpy.
Example
import numpy as np
indices = np.array([np.array([1, 2, 3]), np.array([4, 5]), np.array([6, 7, 8, 9])], dtype=object)
group_lengths = np.array([len(group) for group in indices])
print(group_lengths)
Expected output
[3 2 4]
I am looking for a fully vectorized solution that avoids the explicit loop. How can this be achieved with numpy?
Update
Thank you all for your insights and clarifications! I now understand that fully vectorizing my use case isn't possible. I'll stick with my current solution, as it seems to be the most efficient approach given the circumstances.
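For anyone wondering why this can't be vectorized, here is a minimal sketch of my understanding: an object array only stores references to separate Python objects, so numpy has to ask each one for its length individually. The flat/offsets layout below is a hypothetical alternative that I am not using in my code; it only shows what a loop-free length computation would require.

```python
import numpy as np

indices = np.array(
    [np.array([1, 2, 3]), np.array([4, 5]), np.array([6, 7, 8, 9])],
    dtype=object,
)
# Each element is an independent ndarray; numpy only holds references to them.
print(indices.dtype)     # object
print(type(indices[0]))  # <class 'numpy.ndarray'>

# Hypothetical alternative layout (an assumption, not part of my data):
# one flat array plus the start offset of every group.
flat = np.concatenate(list(indices))  # all 9 elements in one contiguous array
offsets = np.array([0, 3, 5, 9])      # group boundaries, written by hand here
group_lengths = np.diff(offsets)      # pure numpy, no Python-level loop
print(group_lengths)                  # [3 2 4]
```

With data stored like this, the lengths come straight from np.diff, but it only helps if the groups are built in that layout from the start.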
Currently, the best option is to use np.fromiter. This solution was proposed in the comments by Jérôme Richard:
# generate big dataset
indices = np.array([np.random.randint(0, 10, size=np.random.randint(4, 500))
                    for _ in range(100000)],
                   dtype=object)
Use the map function:
group_lengths = np.array(list(map(len, indices)))
# 6.49 ms ± 57.5 μs per loop
Use a list comprehension with the len function:
group_lengths = np.array([len(group) for group in indices])
# 8.19 ms ± 60.5 μs per loop
Use np.vectorize:
vfunc = np.vectorize(len)
group_lengths = vfunc(indices)
# 4.86 ms ± 23.2 μs per loop
Apply the function via np.fromiter:
group_lengths = np.fromiter(map(len, indices), dtype=object)
# 3.53 ms ± 36.2 μs per loop
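One caveat: with dtype=object the result is an object array of Python ints rather than a numeric array. A minimal sketch of a variant that returns an integer array is below. Using np.intp and passing count are my own choices, and I have not timed this version, so the numbers above do not apply to it.

```python
import numpy as np

indices = np.array(
    [np.array([1, 2, 3]), np.array([4, 5]), np.array([6, 7, 8, 9])],
    dtype=object,
)

# An integer dtype gives a regular numeric array; count lets numpy
# allocate the output once instead of growing it while iterating.
group_lengths = np.fromiter(map(len, indices), dtype=np.intp, count=len(indices))
print(group_lengths)  # [3 2 4]
```

The len calls still run one by one in Python; np.fromiter only avoids building an intermediate list.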