Tags: python, numpy, numpy-ufunc

Why is there no 'is' ufunc in numpy?


I can certainly do

a[a == 0] = something

that sets every entry of a that equals zero to something. Equivalently, I could write

a[np.equal(a, 0)] = something

Now, imagine a is an array of dtype=object. I cannot write a[a is None] because, of course, a itself isn't None. The intention is clear: I want the is comparison to be broadcast like any other ufunc. The list of available ufuncs in the docs contains nothing like an is-ufunc.

Why is there none and, more interesting to me, what would be a performant replacement?
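To make the problem concrete, here is a minimal sketch using an assumed example array (the variable names are illustrative only). Python evaluates `a is None` once, as an identity test on the array object itself, so it yields a single bool rather than a mask, whereas `==` is handed to NumPy and applied per element.

```python
import numpy as np

# Assumed example data: an object array containing one None.
a = np.array([1, None, 0], dtype=object)

# `is` cannot be overloaded, so this tests the array object itself.
whole_array_test = a is None      # a single Python bool, not a mask

# `==` is dispatched to NumPy and evaluated element by element.
elementwise_test = a == None      # noqa: E711, a boolean array

print(whole_array_test)
print(elementwise_test)
```

Note that `==` tests equality, not identity, so the two are only interchangeable when no element defines an unusual `__eq__`.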


Solution

  • Operations such as reshape and indexing don't depend on the dtype (beyond its itemsize), so they are as fast on object arrays as on any other. Everything else on an object dtype array runs at list-comprehension speed: NumPy iterates over the elements and applies an appropriate method to each one. Sometimes that method doesn't exist, as with np.sin, which fails on elements that have no sin method.

    To illustrate, consider the array from one of the comments:

    In [132]: a = np.array([1, None, 0, np.nan, ''])
    In [133]: a
    Out[133]: array([1, None, 0, nan, ''], dtype=object)
    

    The object array test:

    In [134]: a==None
    Out[134]: array([False,  True, False, False, False])
    In [135]: timeit a==None
    5.16 µs ± 73.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    An equivalent comprehension:

    In [136]: [x is None for x in a]
    Out[136]: [False, True, False, False, False]
    In [137]: timeit [x is None for x in a]
    1.52 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

    The comprehension is faster, even if we convert the result back to an array (not a cheap step):

    In [138]: timeit np.array([x is None for x in a])
    4.67 µs ± 95.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    Iterating over the list version of the array (a.tolist()) is faster still:

    In [139]: timeit np.array([x is None for x in a.tolist()])
    2.52 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    Let's look at the full assignment action:

    In [141]: a[[x is None for x in a.tolist()]]
    Out[141]: array([None], dtype=object)
    In [142]: %%timeit a1=a.copy()
         ...: a1[[x is None for x in a1.tolist()]] = np.nan
         ...: 
         ...: 
    4.03 µs ± 10 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    In [143]: %%timeit a1=a.copy()
         ...: a1[a1==None] = np.nan
         ...: 
         ...: 
    6.18 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    The usual caveat applies: these timings are for a 5-element array, and things might scale differently for larger ones.
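Since the timings above are for a tiny array, the following sketch may help check how the options scale. It also shows one way to get an actual broadcasting `is`: wrapping `operator.is_` with `np.frompyfunc`. This is an assumption-laden illustration rather than a recommendation from the answer above; the helper names (`is_`, `mask_ufunc`, `mask_listcomp`, `mask_eq`) and the array size are hypothetical, and no timing results are claimed here.

```python
import operator
import timeit

import numpy as np

# Wrap Python's identity test as a ufunc with 2 inputs and 1 output.
# frompyfunc ufuncs always return object arrays, hence the cast to bool.
is_ = np.frompyfunc(operator.is_, 2, 1)


def mask_ufunc(arr):
    """Identity test against None via the frompyfunc wrapper."""
    return is_(arr, None).astype(bool)


def mask_listcomp(arr):
    """Identity test against None via a comprehension over arr.tolist()."""
    return np.array([x is None for x in arr.tolist()], dtype=bool)


def mask_eq(arr):
    """Equality (not identity) test against None."""
    return arr == None  # noqa: E711


a = np.array([1, None, 0, np.nan, ''], dtype=object)
# Hypothetical larger test array: the same five values repeated 2000 times.
big = np.array(a.tolist() * 2000, dtype=object)

for fn in (mask_ufunc, mask_listcomp, mask_eq):
    seconds = timeit.timeit(lambda: fn(big), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e6:.1f} us per call")

# The full assignment, using the ufunc-style mask on a copy.
a1 = a.copy()
a1[mask_ufunc(a1)] = np.nan
```

The frompyfunc wrapper still calls a Python function once per element, so it should not be expected to behave like a compiled ufunc; its main appeal is that it broadcasts like one.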