python, numpy, data-analysis, exploratory-data-analysis

A more efficient function for calculating the proportion of observations that fall within a specified interval?


I've written a function that calculates the proportion of observations that fall within a specified interval. For example, if the observations are assessment marks, we can find the proportion of students who scored between, say, 70 and 100 marks. I've included a boolean parameter because in every interval except the last (whose upper bound is the largest observation), a value equal to the upper bound should be counted in the next interval. For example, if we're looking at marks between 50 and 70, we don't want to include 70. My function is:

import numpy as np

def compute_interval_proportion(observations, lower_bound, upper_bound, include_upper):
    """
    Calculates the proportion of observations that fall within a specified interval.
    If include_upper == True, then the upper bound is inclusive; otherwise not.
    """
    if include_upper:
        indices = np.where((observations >= lower_bound)
                           & (observations <= upper_bound))
    else:
        indices = np.where((observations >= lower_bound)
                           & (observations < upper_bound))

    count = len(observations[indices])
    proportion = round(count / len(observations), 3)

    return proportion

This function works, I think, but it feels a bit pedestrian (e.g. lots of parameters), and perhaps there is a sleeker or quicker way of doing it. Perhaps there is a way to avoid requiring the user to specify manually whether the upper bound should be included. Any suggestions?


Solution

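  • I've tried to simplify your function; the result is below. The main changes are that the upper bound is included automatically whenever it is at least the largest observation, and that the proportion is computed directly as the mean of a boolean mask: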

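    def compute_interval_proportion(observations, lower, upper):
        # observations is assumed to be a NumPy array
        if upper >= observations.max():
            # last interval: include the upper bound
            upper_cond = observations <= upper
        else:
            # any other interval: the upper bound belongs to the next interval
            upper_cond = observations < upper
        # the mean of a boolean mask is the fraction of True values
        proportion = ((observations >= lower) & upper_cond).mean()
        return proportion.round(3)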
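
If you need the proportions for several adjacent intervals at once, np.histogram may be worth a look as an alternative to the function above: its bins are half-open except for the last one, which also includes its upper edge. That matches the convention described in the question. The sketch below is only an illustration; the helper name interval_proportions, the edges and the marks are made up for the example.

```python
import numpy as np

def interval_proportions(observations, edges):
    """Proportion of observations in each interval between consecutive edges.

    np.histogram treats every bin as half-open, [left, right), except the
    last one, which is closed, [left, right].
    """
    observations = np.asarray(observations)
    counts, _ = np.histogram(observations, bins=edges)
    # Divide by the total number of observations, as in the question
    return np.round(counts / observations.size, 3)

# Hypothetical marks, purely for illustration
marks = np.array([35, 50, 62, 70, 70, 85, 100])
edges = [0, 50, 70, 100]  # intervals [0, 50), [50, 70), [70, 100]
print(interval_proportions(marks, edges))  # [0.143 0.286 0.571]
```

Note that observations lying outside the outermost edges are not counted in any bin but still contribute to the denominator, so the proportions only sum to 1 when the edges cover all of the data.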