
How to broadcast an operation over a NumPy array of objects?


Say I have a NumPy array of 500 lists with random lengths ranging from 0 to 9:

import numpy as np
a = np.array([[i for i in range(np.random.randint(10))] for _ in range(500)], dtype=object)

Now I want to append the value 100 to the lists at indices [0,10,20,30,40,50]. I tried applying a function to each list in the array:

func = np.vectorize(lambda x: x + [100])
a[[0,10,20,30,40,50]] = func(a[[0,10,20,30,40,50]])

but I get ValueError: setting an array element with a sequence.

Is there a way to broadcast an operation over all objects (of different sizes) in a NumPy array? I usually have up to ~50,000 indices, and a normal for loop would be too slow. Would it be more efficient to convert the array to a sparse matrix with equal-length rows?


Solution

  • Setting up your array (slightly smaller)

    In [32]: a = np.array([[i for i in range(np.random.randint(10))] for _ in range(100)], dtype=object)
    
    In [33]: idx = [0,10,30,50]
    

    By specifying the otypes, I can run your vectorized function:

    In [34]: func = lambda x: x + [100]; vfunc = np.vectorize(func, otypes=[object])
    
    In [36]: vfunc(a[idx])
    Out[36]: 
    array([list([0, 1, 2, 3, 4, 5, 6, 100]),
           list([0, 1, 2, 3, 4, 5, 6, 7, 100]), list([0, 1, 100]),
           list([0, 1, 2, 3, 4, 100])], dtype=object)
    
    In [37]: a[idx] = vfunc(a[idx])
    
    In [38]: a[idx]
    Out[38]: 
    array([list([0, 1, 2, 3, 4, 5, 6, 100]),
           list([0, 1, 2, 3, 4, 5, 6, 7, 100]), list([0, 1, 100]),
           list([0, 1, 2, 3, 4, 100])], dtype=object)
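
The role of otypes can be seen in a small standalone sketch. Without otypes, np.vectorize calls the function on the first element and infers the output dtype from that result; a list of integers suggests an integer dtype, and the later conversion of whole lists to integers fails. The array contents and the name append_100 below are made up for illustration.

```python
import numpy as np

# Object array of variable-length lists, filled element by element so NumPy
# never tries to interpret the nested lists as a 2-D array.
a = np.empty(4, dtype=object)
for i, n in enumerate([3, 0, 2, 5]):
    a[i] = list(range(n))


def append_100(x):
    """Return a new list with 100 appended; the input list is not modified."""
    return x + [100]


# Without otypes: the output dtype is inferred from the first result
# ([0, 1, 2, 100] looks like integers), so storing whole lists fails.
try:
    np.vectorize(append_100)(a)
    failed = False
except ValueError:
    failed = True

# With otypes=[object]: each returned list is stored as a single element.
result = np.vectorize(append_100, otypes=[object])(a)

print(failed)
print(result)
```

Because append_100 builds a new list each time, the original array a is left unchanged until the result is assigned back.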
    

    The equivalent with iteration (each list gains a second 100 because the assignment above already appended one):

    In [39]: for i in idx: a[i] = func(a[i])
    
    In [40]: a[idx]
    Out[40]: 
    array([list([0, 1, 2, 3, 4, 5, 6, 100, 100]),
           list([0, 1, 2, 3, 4, 5, 6, 7, 100, 100]), list([0, 1, 100, 100]),
           list([0, 1, 2, 3, 4, 100, 100])], dtype=object)
    

    Timing the assignment itself would require deep copies, since I don't want to grow each list many times over, so I time only the append step:

    In [41]: %%timeit
        ...: vfunc(a[idx])
        ...: 
        ...: 
    19.4 μs ± 459 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    
    In [42]: %%timeit
        ...: for i in idx: func(a[i])
        ...: 
        ...: 
    2 μs ± 57.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
    

    The loop is nearly ten times faster (2 μs versus 19.4 μs).

    Since I specify object otypes anyway, I could just as well use np.frompyfunc, which always returns object arrays and runs faster than np.vectorize:

    In [43]: vofunc = np.frompyfunc(func,1,1)
    
    In [44]: vofunc(a[idx])
    Out[44]: 
    array([list([0, 1, 2, 3, 4, 5, 6, 100, 100, 100]),
           list([0, 1, 2, 3, 4, 5, 6, 7, 100, 100, 100]),
           list([0, 1, 100, 100, 100]), list([0, 1, 2, 3, 4, 100, 100, 100])],
          dtype=object)
    
    In [45]: %%timeit
        ...: vofunc(a[idx])
        ...: 
        ...: 
    9.34 μs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    

    Iteration is still faster (2 μs versus 9.34 μs).
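
For reference, the frompyfunc variant can also be written as a standalone script that includes the assignment back into the array. This is a minimal sketch on a tiny, deterministic array (the contents and the name append_100 are illustrative, not from the session above).

```python
import numpy as np

# Small object array of lists: a[i] holds [0, 1, ..., i-1].
a = np.empty(6, dtype=object)
for i in range(6):
    a[i] = list(range(i))

idx = [0, 2, 4]

# frompyfunc(func, nin, nout): one input, one output, object dtype result.
append_100 = np.frompyfunc(lambda x: x + [100], 1, 1)

# Fancy indexing selects the lists; the object-array result is assigned back.
a[idx] = append_100(a[idx])

print(a)
```

Only the lists at the selected indices are replaced; the others are untouched.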

    In some other cases vectorize/frompyfunc comes closer to iteration in speed, and may even be a bit faster for large samples. But it is never an order of magnitude faster.
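
To check how the three approaches compare at the scale mentioned in the question, a rough benchmark along these lines could be used. It is only a sketch: the array size, the index selection, and the helper names with_loop, with_vectorize, and with_frompyfunc are assumptions, and the timings will depend on the machine and NumPy version. As in the session above, only the append step is timed, so the lists never grow between runs.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Object array of lists with random lengths from 0 to 9.
a = np.empty(n, dtype=object)
for i in range(n):
    a[i] = list(range(rng.integers(10)))

# Assumed workload: half of the positions, chosen at random.
idx = rng.choice(n, size=n // 2, replace=False)


def func(x):
    return x + [100]  # builds a new list; the original is not modified


vfunc = np.vectorize(func, otypes=[object])
vofunc = np.frompyfunc(func, 1, 1)


def with_loop():
    return [func(a[i]) for i in idx]


def with_vectorize():
    return vfunc(a[idx])


def with_frompyfunc():
    return vofunc(a[idx])


for f in (with_loop, with_vectorize, with_frompyfunc):
    # Best of 3 repeats, 5 calls each, reported per call.
    t = min(timeit.repeat(f, number=5, repeat=3)) / 5
    print(f"{f.__name__}: {t * 1e3:.2f} ms per call")
```

All three functions produce the same lists, so whichever is fastest on a given setup can be swapped in without changing the result.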