python numpy vectorization array-broadcasting

Most 'Pythonic' way of forcing array-like behavior on non-array inputs?


I'm working on a personal project for computing and manipulating integer distributions, for application in systems like DnD and Pathfinder.

I'm looking to write functions that can query the probability mass function (PMF) and cumulative distribution function (CDF) for given inputs, in a way that works for both integer and array-like input.

I know that one can use N = np.asarray(N, dtype = int) for this purpose, and that is certainly the cleanest option when it works. However, the inner workings of my function (as implemented; maybe there's a clean workaround...) require the input to be at least 1-D, so I use N = np.atleast_1d(np.asarray(N, dtype = int)). Casting to int is important here, hence the intermediate step.

The function now works as expected, except when the input was originally just an int. In that case the return value is a float array with shape (1,), rather than a 0-d float array.
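To make the shape issue concrete, here is a minimal sketch. The multiplication by 0.5 is only a stand-in for the real computation, and the variable names are illustrative.

```python
import numpy as np

# A plain int becomes a 0-d array: shape ()
scalar_in = np.asarray(3, dtype=int)

# atleast_1d promotes it to a 1-element vector: shape (1,)
promoted = np.atleast_1d(scalar_in)

# Any elementwise computation keeps the promoted shape
result = promoted * 0.5

# A list input is already 1-D, so atleast_1d leaves its shape alone
list_in = np.atleast_1d(np.asarray([1, 2, 3], dtype=int))

print(scalar_in.shape, promoted.shape, result.shape, list_in.shape)
```

So the extra dimension only appears for scalar input, which is why the output shape differs from the input shape in that one case.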

My workaround was to cache the resulting shape of the innermost operation, and reshape at the end. A template of my function might look like this:

def my_func(N):
    N_shapecache = np.asarray(N, dtype = int)
    N = np.atleast_1d(N_shapecache)
    ret = do_stuff_that_requires_N_have_a_length(N)
    return ret.reshape(N_shapecache.shape)

This gives the intended behavior for things like print statements, where results for integer inputs look like plain floats when printed (the type is a 0-d array, but I think that's a good thing).

However, I think it's also a bit cumbersome, and perhaps there's a better way to handle this in the vast NumPy library: 90% of the time, after I write my own way to do something, I realize there's a built-in NumPy function that does that exact thing 40x faster. Searching didn't turn up what I was looking for, though (usually one of the two fixes I'm using in isolation, but not combined).

For the full context, here is a (mostly) minimal working code block that has my function.

import numpy as np

class integerDistribution:

    def __init__(self, min_val, probability_distribution, rtol = 1e-10):
        """Initializes the instance from a minimum value and a finite list of probabilities.

        Args:
            min_val : int
                The minimum value of our distribution. As an example, rolling a d20 
                would have a minimum value of 1, while the sum of 2d6 has a minimum 
                value of 2.
            probability_distribution : array-like of floats
                Each entry at index 'idx' in the array corresponds to the probability 
                of obtaining the value 'idx + min_val' from our distribution.
            rtol: float
                When initializing, we perform a sanity check that the sum of 
                probabilities is 1 to within a tolerance of rtol. Defaults to 1e-10

        Raises:
            ValueError: If probabilities do not sum to 1 within the specified
                tolerance (rtol)      
        """
        if not np.isclose(np.sum(probability_distribution), 1.0, rtol):
            raise ValueError(f"Probabilities summed to {np.sum(probability_distribution)}")

        self.min_val = int(min_val)
        self.max_val = self.min_val + len(probability_distribution) - 1
        self.values = self.min_val + np.arange(len(probability_distribution), dtype = int)
        self.probability_distribution = np.array(probability_distribution)
        

    
    def pmf(self, N):
        """Evaluates the probability mass function (PMF) at given value(s)

        The probability mass function is defined as p(N) = P(X = N), 
        where X is a random variable sampled from 'self', and N is the value 
        for which the probability is being calculated. 

        Args:
            N : int or list of int
                The value(s) for which to compute the probability. 
                Can be a single integer or a list/array of integers.

        Returns:
            p(N) : float or list of float
                The probability of obtaining each value in N when sampled from 'self'. 
                The return type matches the input type: a single float if N is an 
                integer, or a list/array of floats if N is a list/array of integers.
        """
        # Remember the original shape (() for a scalar) before promoting to 1-D.
        N_shapecache = np.asarray(N, dtype = int)
        N = np.atleast_1d(N_shapecache)
        p_N = np.zeros(N.shape)
        # Values outside [min_val, max_val] keep probability 0.
        valid_indices = np.where((self.min_val <= N) & (N <= self.max_val))
        p_N[valid_indices] = self.probability_distribution[N[valid_indices] - self.min_val]
        return p_N.reshape(N_shapecache.shape)

Solution

  • I don't think there's a built-in NumPy function for this.
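That said, for this particular PMF lookup the 1-D requirement can probably be avoided altogether. The sketch below is one possible approach, not a built-in: it clips the index into range and then masks with np.where, both of which accept 0-d input. The function pmf_lookup and its parameters are hypothetical names, written as a standalone function rather than a method.

```python
import numpy as np


def pmf_lookup(N, min_val, probabilities):
    """Hypothetical standalone PMF lookup that works on 0-d and n-d input."""
    N = np.asarray(N, dtype=int)
    probabilities = np.asarray(probabilities, dtype=float)

    # Position of each value inside the probability table
    offset = N - min_val
    in_range = (offset >= 0) & (offset < probabilities.size)

    # Clip so that out-of-range values still index safely;
    # their looked-up entries are discarded by np.where below.
    safe_offset = np.clip(offset, 0, probabilities.size - 1)
    return np.where(in_range, probabilities[safe_offset], 0.0)


d6 = np.full(6, 1 / 6)
print(pmf_lookup(3, 1, d6).shape)          # 0-d result for scalar input
print(pmf_lookup([0, 1, 6, 7], 1, d6))     # out-of-range values give 0
```

This only helps when the body can be written with whole-array operations; the shape-caching pattern from the question is more general.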

    I would rewrite this as a decorator. Instead of this:

    def my_func(N):
        N_shapecache = np.asarray(N, dtype = int)
        N = np.atleast_1d(N_shapecache)
        ret = do_stuff_that_requires_N_have_a_length(N)
        return ret.reshape(N_shapecache.shape)
    

    I would write this:

    import functools
    import numpy as np
    
    
    def my_decorator(func):
        @functools.wraps(func)
        def inner(N, *args, **kwargs):
            N = np.asarray(N, dtype = int)
            N_shapecache = N.shape
            N = np.atleast_1d(N)
            ret = func(N, *args, **kwargs)
            return ret.reshape(N_shapecache)
        return inner
    
    
    @my_decorator
    def my_func(N):
        return do_stuff_that_requires_N_have_a_length(N)
    

    This way, the decorator can be shared with any function which has similar behavior, and if you find a bug in the decorator, it can be fixed in one place.
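As a usage sketch, here is the same decorator logic applied to a small fair-d6 PMF. The names preserve_input_shape and d6_pmf are made up for illustration; the body of d6_pmf uses boolean-mask indexing and assignment, which is the kind of operation that motivates promoting scalar input to 1-D in the question.

```python
import functools

import numpy as np


def preserve_input_shape(func):
    """Hypothetical name; same logic as the decorator above."""
    @functools.wraps(func)
    def inner(N, *args, **kwargs):
        N = np.asarray(N, dtype=int)
        original_shape = N.shape          # () for scalar input
        ret = func(np.atleast_1d(N), *args, **kwargs)
        return np.asarray(ret).reshape(original_shape)
    return inner


@preserve_input_shape
def d6_pmf(N):
    """PMF of a fair six-sided die; N is always at least 1-D here."""
    probs = np.full(6, 1 / 6)
    out = np.zeros(N.shape)
    valid = (1 <= N) & (N <= 6)
    out[valid] = probs[N[valid] - 1]
    return out


print(d6_pmf(3).shape)                    # scalar in, 0-d array out
print(d6_pmf([0, 3, 7]).shape)            # 1-D in, 1-D out
print(d6_pmf([[1, 2], [7, 8]]).shape)     # 2-D in, 2-D out
```

One caveat: as written, the wrapper treats the first positional argument as N. Applied directly to a method such as pmf(self, N), it would try to convert self to an integer array, so the wrapper would need to accept self as its first parameter and pass it through.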

    Finally, I want to point out that NumPy/Python has three slightly different kinds of scalars:

    Right now, for all of these inputs, your code returns a 0-dimensional array. That's slightly unintuitive, but it might be fine for your application. You should think about what behavior you want here.
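A short sketch of what this means in practice, assuming the three kinds of scalars are a plain Python int, a NumPy scalar such as np.int64, and a 0-dimensional array. It also shows two standard ways to unwrap a 0-d result if a true scalar is preferred.

```python
import numpy as np

python_scalar = 3               # plain Python int
numpy_scalar = np.int64(3)      # NumPy scalar
zero_d_array = np.array(3)      # 0-dimensional array

# All three become 0-d arrays after np.asarray, so a shape-restoring
# wrapper returns a 0-d array for each of them.
for value in (python_scalar, numpy_scalar, zero_d_array):
    as_array = np.asarray(value, dtype=int)
    print(type(value).__name__, "->", type(as_array).__name__, as_array.shape)

# Unwrapping a 0-d result:
result = np.asarray(0.25).reshape(())
as_scalar = result[()]          # NumPy scalar (np.float64)
as_float = result.item()        # plain Python float
print(type(as_scalar).__name__, type(as_float).__name__)
```

If returning a scalar for scalar input is the desired behavior, one option is to apply `[()]` to the result: on a 0-d array it yields a NumPy scalar, and on an array with one or more dimensions it returns the array itself.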