python arrays list numpy array-broadcasting

How to broadcast operation to Numpy array of objects?

Say I have a Numpy array of 500 lists with random sizes ranging from 0 to 9:

import numpy as np
a = np.array([[i for i in range(np.random.randint(10))] for _ in range(500)], dtype=object)

Now I want to append the value 100 to the lists at indices [0,10,20,30,40,50]. I tried applying a function to each list in the array:

func = np.vectorize(lambda x: x + [100])
a[[0,10,20,30,40,50]] = func(a[[0,10,20,30,40,50]])

but I get ValueError: setting an array element with a sequence.

Is there any way I can broadcast operations to all objects (with different sizes) in a Numpy array? In my case I usually have up to ~50,000 indices, and I expect a normal for loop to be too slow. Would it be more efficient to convert the array to a sparse matrix with equal-length rows?

Solution

Setting up your array (slightly smaller)

In [32]: a = np.array([[i for i in range(np.random.randint(10))] for _ in range(100)], dtype=object)

In [33]: idx = [0,10,30,50]

By specifying the otypes, I can run your vectorized function:

In [34]: func = lambda x: x + [100]; vfunc = np.vectorize(func, otypes=[object])

In [36]: vfunc(a[idx])
Out[36]: 
array([list([0, 1, 2, 3, 4, 5, 6, 100]),
       list([0, 1, 2, 3, 4, 5, 6, 7, 100]), list([0, 1, 100]),
       list([0, 1, 2, 3, 4, 100])], dtype=object)

In [37]: a[idx] = vfunc(a[idx])

In [38]: a[idx]
Out[38]: 
array([list([0, 1, 2, 3, 4, 5, 6, 100]),
       list([0, 1, 2, 3, 4, 5, 6, 7, 100]), list([0, 1, 100]),
       list([0, 1, 2, 3, 4, 100])], dtype=object)
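
A self-contained sketch of why otypes matters here, under the assumption that the original error comes from dtype inference: without otypes, np.vectorize calls the function on the first element to guess the output dtype, gets a list of integers, and then cannot store whole lists in integer slots. The helper make_ragged and the seeded generator are illustrative additions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable


def make_ragged(n):
    """Hypothetical helper: 1-D object array holding lists of length 0..9."""
    arr = np.empty(n, dtype=object)
    for i in range(n):
        arr[i] = list(range(rng.integers(0, 10)))
    return arr


a = make_ragged(100)
idx = [0, 10, 30, 50]
before = [list(a[i]) for i in idx]  # copies of the originals, for comparison

func = lambda x: x + [100]

# Without otypes: the dtype is inferred from the first result and the call fails.
try:
    np.vectorize(func)(a[idx])
    inference_failed = False
except ValueError as exc:
    inference_failed = True
    print("without otypes:", exc)

# With otypes=[object]: each result is kept as a Python object.
vfunc = np.vectorize(func, otypes=[object])
a[idx] = vfunc(a[idx])
print(a[idx])
```

Filling the array element by element guarantees a 1-D object array even if all lists happen to have the same length.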

The equivalent with iteration (run on the already modified array, so each list gains a second 100):

In [39]: for i in idx: a[i] = func(a[i])

In [40]: a[idx]
Out[40]: 
array([list([0, 1, 2, 3, 4, 5, 6, 100, 100]),
       list([0, 1, 2, 3, 4, 5, 6, 7, 100, 100]), list([0, 1, 100, 100]),
       list([0, 1, 2, 3, 4, 100, 100])], dtype=object)
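
The loop above rebinds each slot to a newly built list. Because the elements are ordinary Python lists, an alternative that is not part of the original answer is to mutate them in place with list.append, which avoids copying each list. A minimal sketch on a small hand-made array:

```python
import numpy as np

# Small object array of lists, filled slot by slot.
a = np.empty(4, dtype=object)
for i, lst in enumerate([[0, 1], [], [0, 1, 2], [0]]):
    a[i] = lst

idx = [0, 2]
for i in idx:
    a[i].append(100)  # mutates the existing list; no new list is created

print(list(a))  # [[0, 1, 100], [], [0, 1, 2, 100], [0]]
```

In-place mutation is visible through every reference to the same list, so this only behaves like the rebinding loop when each slot holds its own list and idx has no repeated entries.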

I can't time the assignment without playing games with deep copies (I don't want to grow each list many times). But timing just the append step:

In [41]: %%timeit
    ...: vfunc(a[idx])
    ...: 
    ...: 
19.4 μs ± 459 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [42]: %%timeit
    ...: for i in idx: func(a[i])
    ...: 
    ...: 
2 μs ± 57.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

The loop is roughly ten times faster (2 μs versus 19.4 μs).
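
The timings above cover only the append step. One way to time the assignment as well, sketched here under the assumption that a fresh deep copy per run is acceptable, is to let timeit rebuild the working array in its setup, which is excluded from the measurement. The names base, assign_vectorized, assign_loop and time_on_fresh_copy are hypothetical.

```python
import copy
import timeit

import numpy as np

# Deterministic stand-in for the random array in the question.
base = np.empty(100, dtype=object)
for i in range(100):
    base[i] = list(range(i % 10))

idx = [0, 10, 30, 50]
func = lambda x: x + [100]
vfunc = np.vectorize(func, otypes=[object])


def assign_vectorized(arr):
    arr[idx] = vfunc(arr[idx])


def assign_loop(arr):
    for i in idx:
        arr[i] = func(arr[i])


def time_on_fresh_copy(fn, repeat=7):
    # setup runs before every repetition and is not timed, so with number=1
    # each measured call starts from an unmodified deep copy of base.
    timings = timeit.repeat(
        stmt="fn(work)",
        setup="work = copy.deepcopy(base)",
        repeat=repeat,
        number=1,
        globals={"fn": fn, "copy": copy, "base": base},
    )
    return min(timings)


t_vec = time_on_fresh_copy(assign_vectorized)
t_loop = time_on_fresh_copy(assign_loop)
print(f"vectorize: {t_vec * 1e6:.1f} us, loop: {t_loop * 1e6:.1f} us")
```

Single-call timings (number=1) are noisy for operations this short, so the numbers are best read as a rough comparison.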

Since I specify object otypes, I could just as well use frompyfunc, which runs faster than vectorize:

In [43]: vofunc = np.frompyfunc(func,1,1)

In [44]: vofunc(a[idx])
Out[44]: 
array([list([0, 1, 2, 3, 4, 5, 6, 100, 100, 100]),
       list([0, 1, 2, 3, 4, 5, 6, 7, 100, 100, 100]),
       list([0, 1, 100, 100, 100]), list([0, 1, 2, 3, 4, 100, 100, 100])],
      dtype=object)
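
For reference, the two integer arguments to np.frompyfunc are the number of inputs and the number of outputs, and the resulting ufunc always returns object arrays. A small sketch, on made-up data, checking that it produces the same lists as np.vectorize with otypes=[object]:

```python
import numpy as np

# Object array of lists with lengths 0..4, filled slot by slot.
a = np.empty(5, dtype=object)
for i in range(5):
    a[i] = list(range(i))

func = lambda x: x + [100]
vfunc = np.vectorize(func, otypes=[object])
vofunc = np.frompyfunc(func, 1, 1)  # 1 input, 1 output

r_vec = vfunc(a)
r_ufunc = vofunc(a)

print(r_ufunc.dtype)                  # object
print(list(r_vec) == list(r_ufunc))   # True
```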

In [45]: %%timeit
    ...: vofunc(a[idx])
    ...: 
    ...: 
9.34 μs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Iteration is still faster (2 μs versus 9.34 μs).

In some other cases vectorize/frompyfunc is closer in speed to iteration, even a bit faster for large samples. But it is never an order of magnitude faster.
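
To check how the three approaches compare at the scale mentioned in the question (about 50,000 indices), a benchmark along these lines could be run. It is only a sketch: the array contents and sizes are assumptions, and the timings depend on the machine and NumPy version, so no results are quoted here.

```python
import timeit

import numpy as np

n = 100_000
a = np.empty(n, dtype=object)
for i in range(n):
    a[i] = list(range(i % 10))  # deterministic lengths 0..9

idx = np.arange(0, n, 2)  # 50,000 indices

func = lambda x: x + [100]
vfunc = np.vectorize(func, otypes=[object])
vofunc = np.frompyfunc(func, 1, 1)


def loop():
    # Plain Python iteration, append step only (results discarded).
    for i in idx:
        func(a[i])


candidates = {
    "loop": loop,
    "vectorize": lambda: vfunc(a[idx]),
    "frompyfunc": lambda: vofunc(a[idx]),
}

results = {}
for name, fn in candidates.items():
    # Best of 3 repetitions, 5 calls each, reported per call.
    results[name] = min(timeit.repeat(fn, repeat=3, number=5)) / 5
    print(f"{name:>10}: {results[name] * 1e3:.2f} ms per call")
```

None of the candidates modifies a, so the same array can be reused across repetitions.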