python, data-analysis, maximize

Find the smallest range that contains a given percentage of values


I have a .csv file that contains file sizes. I need to find the interval [a, b] that contains a given majority of the files (75%, 80%, 85%, or 90%) while being as narrow as possible. I am using Python.

I know how to do this when I'm checking a specific percentile of the files, as in the code below, but I have no idea how to approach the optimization problem of finding the narrowest such interval.

percentile_80 = df['file_size'].quantile(0.8)
num_files = df.shape[0]
num_files_in_range = df[df['file_size'] <= percentile_80].shape[0]
percent_files_in_range = num_files_in_range / num_files * 100
range_start = df['file_size'].min()
range_end = percentile_80

Solution

  • Here's how I understand your question:

    You have a list of file sizes, and you're trying to find file sizes a and b such that 80% (or some other predetermined percentage) of the files have size s in the range [a,b], and |a-b| is minimized.

    I suspect there's no built-in pandas function for this, but it's not too bad to do manually:

    import math

    def minimum_size_range(file_sizes, percentage):
        # calculate how many files need to be in the range
        window_size = math.ceil(len(file_sizes) * percentage / 100)
    
        sorted_sizes = sorted(file_sizes)
    
        # initialize variables with worst-case values
        min_size, max_size = sorted_sizes[0], sorted_sizes[-1]
        min_interval = max_size - min_size
    
        # calculate interval for every window
        for i in range(len(sorted_sizes) - (window_size - 1)):
            lower, upper = sorted_sizes[i], sorted_sizes[i + (window_size - 1)]
            interval = upper - lower
    
            # if we found a new minimum interval, replace values
            if interval < min_interval:
                min_interval = interval
                min_size, max_size = lower, upper
    
        return min_size, max_size
    

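    Quick explanation: Since we know the desired percentage beforehand, we know how many files we want in our range. The narrowest range holding that many files always starts and ends at actual file sizes that sit next to each other in sorted order, so we can just sort our file sizes and find the window with the desired number of files that has the smallest range of sizes.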

    You should be able to call this like so:

    min_size, max_size = minimum_size_range(df['file_size'], 80)
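
    Since the question mentions several percentages (75, 80, 85, 90), here is a minimal sketch of the same sliding-window idea that sorts once and reuses the sorted list for each percentage. The name narrowest_ranges, the clamping of the window size to 1..n, and the sample data are my own assumptions for illustration, not part of the code above.

```python
import math


def narrowest_ranges(file_sizes, percentages):
    """Map each percentage to the (low, high) of its narrowest covering range."""
    sizes = sorted(file_sizes)  # sort once, reuse for every percentage
    n = len(sizes)
    if n == 0:
        raise ValueError("file_sizes must not be empty")

    result = {}
    for pct in percentages:
        # number of values the range must hold, clamped to 1..n (assumption)
        window = min(max(math.ceil(n * pct / 100), 1), n)
        # start index of the narrowest run of `window` consecutive sorted
        # values; min() keeps the first such run when several tie
        start = min(range(n - window + 1),
                    key=lambda i: sizes[i + window - 1] - sizes[i])
        result[pct] = (sizes[start], sizes[start + window - 1])
    return result


sample_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data
ranges = narrowest_ranges(sample_sizes, (75, 80, 85, 90))
for pct, (low, high) in ranges.items():
    print(f"{pct}%: [{low}, {high}]")
```

    With this made-up data the outlier 100 is left out at every level. With the DataFrame from the question, passing df['file_size'] in place of sample_sizes should work the same way.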