Say I have a Numpy array of 500 lists with random sizes ranging from 0 to 9:
import numpy as np
a = np.array([[i for i in range(np.random.randint(10))] for _ in range(500)], dtype=object)
Now I want to append a value 100 to indices [0,10,20,30,40,50]. I tried to apply a function to each list in the array:
func = np.vectorize(lambda x: x + [100])
a[[0,10,20,30,40,50]] = func(a[[0,10,20,30,40,50]])
but I get ValueError: setting an array element with a sequence.
Is there any way to broadcast an operation over all objects (of different sizes) in a NumPy array? In my case I usually have up to ~50,000 indices, and a plain for loop would be too slow. Would converting the array to a sparse matrix with equal-length rows be more efficient?
Setting up your array (slightly smaller)
In [32]: a = np.array([[i for i in range(np.random.randint(10))] for _ in range(100)], dtype=object)
In [33]: idx = [0,10,30,50]
By specifying the otypes, I can run your vectorized function:
In [34]: func = lambda x: x + [100]; vfunc = np.vectorize(func, otypes=[object])
In [36]: vfunc(a[idx])
Out[36]:
array([list([0, 1, 2, 3, 4, 5, 6, 100]),
list([0, 1, 2, 3, 4, 5, 6, 7, 100]), list([0, 1, 100]),
list([0, 1, 2, 3, 4, 100])], dtype=object)
In [37]: a[idx] = vfunc(a[idx])
In [38]: a[idx]
Out[38]:
array([list([0, 1, 2, 3, 4, 5, 6, 100]),
list([0, 1, 2, 3, 4, 5, 6, 7, 100]), list([0, 1, 100]),
list([0, 1, 2, 3, 4, 100])], dtype=object)
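Some background on why otypes makes the difference. As far as I understand it, without otypes np.vectorize calls the function on the first element to guess the output dtype. The returned list of ints looks like an integer array, so converting the remaining list results to that dtype fails with the ValueError from the question. A minimal sketch of that behaviour, assuming a reasonably recent NumPy (the names demo, plain and typed are illustrative only):

```python
import numpy as np

# Build a small ragged object array element by element, so NumPy never
# tries to stack the lists into a 2-D array.
demo = np.empty(4, dtype=object)
for i, n in enumerate([3, 0, 2, 5]):
    demo[i] = list(range(n))

func = lambda x: x + [100]

# Without otypes, vectorize infers a numeric dtype from the first result
# and then cannot store whole lists in it.
plain = np.vectorize(func)
try:
    plain(demo)
    raised = False
except ValueError:
    raised = True

# With otypes=[object], each returned list is stored as a single object.
typed = np.vectorize(func, otypes=[object])
result = typed(demo)

print("error without otypes:", raised)
print(result)
```

Since func builds a new list with x + [100], the lists in demo are left untouched; only the returned array holds the extended lists.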
The equivalent with iteration:
In [39]: for i in idx: a[i] = func(a[i])
In [40]: a[idx]
Out[40]:
array([list([0, 1, 2, 3, 4, 5, 6, 100, 100]),
list([0, 1, 2, 3, 4, 5, 6, 7, 100, 100]), list([0, 1, 100, 100]),
list([0, 1, 2, 3, 4, 100, 100])], dtype=object)
I can't time the assignment without playing games with deep copies (I don't want to grow each list many times). But timing just the append step:
In [41]: %%timeit
...: vfunc(a[idx])
...:
...:
19.4 μs ± 459 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [42]: %%timeit
...: for i in idx: func(a[i])
...:
...:
2 μs ± 57.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
The loop is quite a bit faster.
Since I specify object otypes, I could just as well use frompyfunc, and run faster:
In [43]: vofunc = np.frompyfunc(func,1,1)
In [44]: vofunc(a[idx])
Out[44]:
array([list([0, 1, 2, 3, 4, 5, 6, 100, 100, 100]),
list([0, 1, 2, 3, 4, 5, 6, 7, 100, 100, 100]),
list([0, 1, 100, 100, 100]), list([0, 1, 2, 3, 4, 100, 100, 100])],
dtype=object)
In [45]: %%timeit
...: vofunc(a[idx])
...:
...:
9.34 μs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Still iteration is faster.
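For reference, the two integer arguments to frompyfunc are the number of inputs and the number of outputs of the wrapped function, and the resulting ufunc always returns object dtype, which is why no otypes is needed. A small sketch (sample and out are illustrative names, not part of the session above):

```python
import numpy as np

func = lambda x: x + [100]

# frompyfunc(func, nin, nout): one input, one output.
vofunc = np.frompyfunc(func, 1, 1)

# Fill the object array one slot at a time to keep the lists ragged.
sample = np.empty(3, dtype=object)
for i, item in enumerate([[0, 1], [], [0, 1, 2]]):
    sample[i] = item

out = vofunc(sample)
print(out.dtype)  # the result is an object array of new lists
print(out)
```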
In some other cases vectorize/frompyfunc is closer in speed to iteration, even a bit faster for large samples. But it is never an order of magnitude faster.
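If you want to repeat the comparison on your own data outside IPython, something along these lines should work. It is only a sketch: the array size, the index spacing and the repeat count are arbitrary assumptions, and the helper loop is a name I made up. Because func returns new lists rather than mutating, the source lists do not grow between runs. Timings will vary by machine and NumPy version, so none are quoted here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # assumed size; scale up toward your real data

# Ragged object array, filled slot by slot.
a = np.empty(n, dtype=object)
for i in range(n):
    a[i] = list(range(rng.integers(10)))
idx = np.arange(0, n, 10)

func = lambda x: x + [100]
vfunc = np.vectorize(func, otypes=[object])
vofunc = np.frompyfunc(func, 1, 1)

def loop(arr, indices):
    # Plain Python iteration over the selected elements.
    return [func(arr[i]) for i in indices]

# All three approaches should produce the same lists.
out_loop = loop(a, idx)
out_vec = list(vfunc(a[idx]))
out_ufunc = list(vofunc(a[idx]))

timings = {
    "loop": timeit.timeit(lambda: loop(a, idx), number=20),
    "vectorize": timeit.timeit(lambda: vfunc(a[idx]), number=20),
    "frompyfunc": timeit.timeit(lambda: vofunc(a[idx]), number=20),
}
for name, seconds in timings.items():
    print(f"{name:>10}: {seconds / 20 * 1e6:.1f} us per call")
```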