
Migrating python2 mixed-type np.array operations to python3


I'm migrating from python2 to python3 and I'm facing an issue which I have simplified to this:

import numpy as np
a = np.array([1, 2, None])
(a > 0).nonzero()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: '>' not supported between instances of 'NoneType' and 'int' 

In reality I'm processing NumPy arrays with millions of elements and really need to keep the vectorized NumPy operation for performance. In Python 2 this worked fine and returned what I expected, because Python 2 allows ordering comparisons between None and integers. What is the best approach for migrating this?


Solution

  • One way to achieve the desired result is to use a lambda function with np.vectorize:

    >>> a = np.array([1, 2, None, 4, -1])
    >>> f = np.vectorize(lambda t: t and t>0)
    >>> np.where(f(a))
    (array([0, 1, 3], dtype=int64),)
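
The lambda above relies on truthiness, so it returns None (not False) for a None element, and np.vectorize infers the output dtype from the first element it evaluates unless told otherwise. A slightly more explicit variant might look like the sketch below. The helper name is_positive is hypothetical, and passing otypes=[bool] is one way to pin the output dtype, not the only one.

```python
import numpy as np

a = np.array([1, 2, None, 4, -1], dtype=object)

# Hypothetical helper: test for None explicitly instead of relying on truthiness.
def is_positive(t):
    return t is not None and t > 0

# otypes=[bool] fixes the output dtype so it does not depend on the first element.
f = np.vectorize(is_positive, otypes=[bool])

mask = f(a)                      # boolean mask, False wherever the element is None
indices = np.flatnonzero(mask)   # same indices as np.where(mask)[0]
print(indices)  # [0 1 3]
```

Note that np.vectorize still calls the Python function once per element, so this changes robustness, not speed.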
    

    Of course, if the array doesn't contain negative integers, you could just use np.where(a), as both None and 0 would evaluate to False:

    >>> a = np.array([1, 2, None, 4, 0])
    >>> np.where(a)
    (array([0, 1, 3], dtype=int64),)
    

    Another option is to first convert the array to the float dtype, which converts None to np.nan. Since np.nan > 0 evaluates to False, np.where(a > 0) can then be used as normal.

    >>> a = np.array([1, 2, None, 4, -1])
    >>> np.where(a.astype(float) > 0)
    (array([0, 1, 3], dtype=int64),)
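
To make the intermediate step visible, here is a minimal sketch of what the float conversion does to the None entries. The variable names are only illustrative. One caveat worth keeping in mind: float64 represents integers exactly only up to 2**53, so this assumes the values stay well below that.

```python
import numpy as np

a = np.array([1, 2, None, 4, -1], dtype=object)

as_float = a.astype(float)        # None is converted to nan
positive = as_float > 0           # any ordering comparison with nan is False
print(np.flatnonzero(positive))   # [0 1 3]

missing = np.isnan(as_float)      # positions that originally held None
print(np.flatnonzero(missing))    # [2]
```

The nan mask also gives a cheap way to recover where the None values were, should that be needed later.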
    

    Time comparison:

    [Plot: runtime of the three approaches against len(a)]

    So Bob's approach (the Bob kernel in the code below, which replaces None with 0 in a copy of the array), while not as easy on the eyes, is about twice as fast as the np.vectorize approach and slightly slower than the float conversion approach.

    Code to reproduce:

    import perfplot
    import numpy as np
    
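    # Element-wise test; None is falsy, so `t and t>0` short-circuits before comparing
    f = np.vectorize(lambda t: t and t>0)
    
    # Candidate values: the integers -10..10 plus None
    choices = list(range(-10,11)) + [None]
    
    def cdjb(arr):
        # np.vectorize approach
        return np.where(f(arr))
    
    def cdjb2(arr):
        # float conversion approach: None becomes nan, and nan > 0 is False
        return np.where(arr.astype(float) > 0)
    
    def Bob(arr):
        # replace None with 0 in a copy of the array, then compare
        deep_copy = np.copy(arr)
        deep_copy[deep_copy == None] = 0
        return (deep_copy > 0).nonzero()[0]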
    
    perfplot.show(
        setup=lambda n: np.random.choice(choices, size=n),
        n_range=[2**k for k in range(25)],
        kernels=[
            cdjb, cdjb2, Bob
            ],
        xlabel='len(a)',
        )
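
If perfplot is not installed, a rough comparison can be sketched with the standard library's timeit instead. This is only an approximation of the benchmark above: the function names, the array size of 100,000 and the repeat count are arbitrary assumptions, and the absolute numbers will vary by machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
choices = np.array(list(range(-10, 11)) + [None], dtype=object)
# Draw random positions and index into choices to get a mixed int/None object array.
arr = choices[rng.integers(0, len(choices), size=100_000)]

f = np.vectorize(lambda t: t is not None and t > 0, otypes=[bool])

def via_vectorize(a):
    return np.flatnonzero(f(a))

def via_float(a):
    # None becomes nan, and nan > 0 is False
    return np.flatnonzero(a.astype(float) > 0)

def via_fill(a):
    filled = a.copy()
    filled[filled == None] = 0  # element-wise comparison with None is intended
    return np.flatnonzero(filled > 0)

kernels = (via_vectorize, via_float, via_fill)

# Sanity check: all three approaches should agree before being timed.
results = [fn(arr) for fn in kernels]
assert all(np.array_equal(results[0], r) for r in results[1:])

for fn in kernels:
    seconds = timeit.timeit(lambda: fn(arr), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```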