Tags: python, numpy, numpy-ndarray

0-dimensional array problems with `numpy.vectorize`


`numpy.vectorize` conveniently converts a scalar function into a vectorized function that can be applied directly to arrays. However, when a single value is passed to the vectorized function, the output is a 0-dimensional array instead of the corresponding value type, which can cause errors when the result is used elsewhere due to typing issues. My question is: is there a mechanism in numpy that resolves this problem by automatically converting the 0-dimensional array return value to the corresponding data type?

To illustrate, here is an example:

@np.vectorize(excluded=(1, 2))
def rescale(
    value: float,
    srcRange: tuple[float, float],
    dstRange: tuple[float, float] = (0, 1),
) -> float:
    srcMin, srcMax = srcRange
    dstMin, dstMax = dstRange
    t = (value - srcMin) / (srcMax - srcMin)
    return dstMin + t * (dstMax - dstMin)

When calling the function above as rescale(5, (0, 10)), the return value is numpy.array(0.5) instead of just the value 0.5.

Currently I resolve this problem with a self-defined decorator:

def vectorize0dFix(func):
    def _func(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, np.ndarray) and result.shape == ():
            return result.item()
        return result
    return _func

But if this problem does cause trouble, there should be a mechanism in numpy that properly deals with it. I wonder whether there is one, or why there isn't.


Solution

  • Short answer: Index the result with an empty tuple, (), to unwrap 0-d arrays into scalars while leaving all other arrays untouched.

    Long answer:

    Following these answers to related questions: by indexing with an empty tuple (), you can systematically unwrap 0-d arrays into scalars while keeping all other arrays as they are.

    So, using the rescale() function decorated with @np.vectorize from your question, you can post-process your results accordingly, for example:

    with_scalar_input = rescale(5, (0, 10))[()]
    with_vector_input = rescale([5], (0, 10))[()]
    print(type(with_scalar_input))  # <class 'numpy.float64'>
    print(type(with_vector_input))  # <class 'numpy.ndarray'>
    
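    The empty-tuple index itself is what does the unwrapping: on a 0-d array it produces a scalar, while on any array with at least one dimension it produces a view of the array, unchanged. A quick sketch of just that behavior:

```python
import numpy as np

# Empty-tuple indexing yields a scalar out of a 0-d array...
scalar = np.array(0.5)[()]
print(type(scalar))  # <class 'numpy.float64'>

# ...but a view of the same data for anything with ndim > 0.
arr = np.array([0.5])
unwrapped = arr[()]
print(type(unwrapped))  # <class 'numpy.ndarray'>
```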

    I am not aware of any built-in NumPy mechanism that solves this edge case of @np.vectorize for you, so providing your own decorator is probably a viable way to go.

    Custom scalar-unwrapping @vectorize decorator

    Writing your own custom decorator that (a) accepts all arguments of and behaves exactly like @np.vectorize, but (b) appends the scalar unwrapping step, could look as follows:

    from functools import wraps
    import numpy as np
    
    def vectorize(*wa, **wkw):
        def decorator(f):
            @wraps(f)
            def wrap(*fa, **fkw):
                # Build the vectorized function, call it, unwrap a 0-d result.
                # Note: np.vectorize(f, ...) is rebuilt on every call here;
                # hoist it out of wrap() if call overhead matters.
                return np.vectorize(f, *wa, **wkw)(*fa, **fkw)[()]
            return wrap
        return decorator
    
    @vectorize(excluded=(1, 2))
    def rescale(value, srcRange, dstRange=(0, 1)):
        srcMin, srcMax = srcRange
        dstMin, dstMax = dstRange
        t = (value - srcMin) / (srcMax - srcMin)
        return dstMin + t * (dstMax - dstMin)
    
    with_scalar_input = rescale(5, (0, 10))
    with_vector_input = rescale([5], (0, 10))
    print(type(with_scalar_input))  # <class 'numpy.float64'>
    print(type(with_vector_input))  # <class 'numpy.ndarray'>
    

    If you don't care about docstring propagation (of which @functools.wraps takes care), the @vectorize decorator can be shortened to:

    import numpy as np
    
    vectorize = lambda *wa, **wkw: lambda f: lambda *fa, **fkw: \
                np.vectorize(f, *wa, **wkw)(*fa, **fkw)[()]
    
    @vectorize(excluded=(1, 2))
    def rescale(value, srcRange, dstRange=(0, 1)):
        srcMin, srcMax = srcRange
        dstMin, dstMax = dstRange
        t = (value - srcMin) / (srcMax - srcMin)
        return dstMin + t * (dstMax - dstMin)
    
    with_scalar_input = rescale(5, (0, 10))
    with_vector_input = rescale([5], (0, 10))
    print(type(with_scalar_input))  # <class 'numpy.float64'>
    print(type(with_vector_input))  # <class 'numpy.ndarray'>
    

    Caution: All approaches using (), as proposed above, produce a new edge case: if the input is provided as a 0-d NumPy array, such as np.array(5), the result will also be unwrapped into a scalar. Likewise, you might have noticed that the scalar results are NumPy scalars, <class 'numpy.float64'>, rather than native Python scalars, <class 'float'>. If either of these is not acceptable to you, then more elaborate type checking or post-processing will be necessary.
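    As a sketch of such post-processing, the helper below (unwrap() is a made-up name, not part of NumPy) addresses the second caveat by letting you choose between native Python scalars via .item() and NumPy scalars via (); the 0-d-input caveat would additionally require inspecting the inputs, which is omitted here:

```python
import numpy as np

def unwrap(result, *, native=True):
    # Hypothetical post-processing helper: turn 0-d arrays into scalars.
    # With native=True, .item() yields a plain Python scalar (e.g. float);
    # with native=False, empty-tuple indexing yields a NumPy scalar
    # (e.g. np.float64). Anything else is passed through unchanged.
    if isinstance(result, np.ndarray) and result.ndim == 0:
        return result.item() if native else result[()]
    return result

print(type(unwrap(np.array(0.5))))                # <class 'float'>
print(type(unwrap(np.array(0.5), native=False)))  # <class 'numpy.float64'>
print(type(unwrap(np.array([0.5]))))              # <class 'numpy.ndarray'>
```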

    Try to avoid @np.vectorize altogether

    As a final note: maybe try to avoid using @np.vectorize altogether, and instead write your code such that it works with both NumPy arrays and scalars.

    As to avoiding @np.vectorize: Its documentation states:

    The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

    As to adjusting your code accordingly: your given function rescale() is a good example of code that works correctly with both NumPy arrays and scalars; in fact, it already does so, without any adjustments! You just have to ensure that vector-valued input is given as a NumPy array (rather than, say, a plain Python list or tuple):

    import numpy as np
    
    def rescale(value, srcRange, dstRange=(0, 1)):
        srcMin, srcMax = srcRange
        dstMin, dstMax = dstRange
        t = (value - srcMin) / (srcMax - srcMin)
        return dstMin + t * (dstMax - dstMin)
    
    with_scalar_input = rescale(5, (0, 10))
    with_vector_input = rescale(np.asarray([5]), (0, 10))
    print(type(with_scalar_input))  # <class 'float'>
    print(type(with_vector_input))  # <class 'numpy.ndarray'>
    
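    Passing a plain Python list instead makes the arithmetic fail immediately, so this mistake is at least easy to catch; for example:

```python
import numpy as np

def rescale(value, srcRange, dstRange=(0, 1)):
    srcMin, srcMax = srcRange
    dstMin, dstMax = dstRange
    t = (value - srcMin) / (srcMax - srcMin)
    return dstMin + t * (dstMax - dstMin)

try:
    rescale([5], (0, 10))  # plain list: `list - int` is not defined
except TypeError as e:
    print("TypeError:", e)
```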

    Moreover, while producing exactly the same output for vector-type input¹, the version decorated with @np.vectorize is orders of magnitude slower:

    import numpy as np
    from timeit import Timer
    
    def rescale(value, srcRange, dstRange=(0, 1)):
        srcMin, srcMax = srcRange
        dstMin, dstMax = dstRange
        t = (value - srcMin) / (srcMax - srcMin)
        return dstMin + t * (dstMax - dstMin)
    
    vectorized = np.vectorize(rescale, excluded=(1, 2))
    
    a = np.random.normal(size=10000)
    assert (rescale(a, (0, 10)) == vectorized(a, (0, 10))).all()  # Same result?
    print("Unvectorized:", Timer(lambda: rescale(a, (0, 10))).timeit(100))
    print("Vectorized:", Timer(lambda: vectorized(a, (0, 10))).timeit(100))
    

    On my machine, this produces about 0.003 seconds for the unvectorized version and about 0.8 seconds for the vectorized version.

    In other words: we have more than a 250× speedup with the given, unvectorized function for a given 10,000-element array, while (if used carefully, i.e. by providing NumPy arrays rather than plain Python sequences for vector-type inputs) the function already produces scalar outputs for scalar inputs and vector outputs for vector inputs!

    The code above might not be the code that you are actually trying to vectorize; but in a lot of cases, a similar approach is possible.

    ¹) Again, the case of a 0-d vector input is special here, but you might want to check that for yourself.
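    A quick way to run that check yourself is sketched below; since the exact scalar/array types returned for 0-d input can differ between the two call styles (and between NumPy versions), only the values and the 0-d shape are asserted here:

```python
import numpy as np

def rescale(value, srcRange, dstRange=(0, 1)):
    srcMin, srcMax = srcRange
    dstMin, dstMax = dstRange
    t = (value - srcMin) / (srcMax - srcMin)
    return dstMin + t * (dstMax - dstMin)

vectorized = np.vectorize(rescale, excluded=(1, 2))

plain = rescale(np.array(5.0), (0, 10))   # plain call with a 0-d array input
vec = vectorized(np.array(5.0), (0, 10))  # vectorized call: 0-d ndarray output
print(type(plain), type(vec))             # the types may differ!
print(float(plain), float(vec))           # both represent the value 0.5
```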