I have a .csv file that contains file sizes. I need to find the smallest possible interval [a, b] that contains a given majority of the files (75%, 80%, 85%, or 90%). I am using Python.
I know how to do this when checking a specific percentile of the files, but I have no idea how to solve this minimization problem.
percentile_80 = df['file_size'].quantile(0.8)
num_files = df.shape[0]
num_files_in_range = df[df['file_size'] <= percentile_80].shape[0]
percent_files_in_range = num_files_in_range / num_files * 100
range_start = df['file_size'].min()
range_end = percentile_80
Here's how I understand your question:
You have a list of file sizes, and you're trying to find file sizes a and b such that 80% (or some other predetermined percentage) of the files have size s in the range [a,b], and |a-b| is minimized.
I suspect there's no built-in pandas function for this, but it's not too bad to do manually:
import math

def minimum_size_range(file_sizes, percentage):
    # calculate how many files need to be in the range
    window_size = math.ceil(len(file_sizes) * percentage / 100)
    sorted_sizes = sorted(file_sizes)
    # initialize variables with worst-case values
    min_size, max_size = sorted_sizes[0], sorted_sizes[-1]
    min_interval = max_size - min_size
    # calculate interval for every window
    for i in range(len(sorted_sizes) - (window_size - 1)):
        lower, upper = sorted_sizes[i], sorted_sizes[i + (window_size - 1)]
        interval = upper - lower
        # if we found a new minimum interval, replace values
        if interval < min_interval:
            min_interval = interval
            min_size, max_size = lower, upper
    return min_size, max_size
Quick explanation: Since we know the desired percentage beforehand, we know how many files we want in our range. The narrowest interval holding that many files always covers consecutive values in sorted order, so we can sort the file sizes and slide a window of that many files across them, keeping the window with the smallest range of sizes.
You should be able to call this like so:
min_size, max_size = minimum_size_range(df['file_size'], 80)
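If you want to check all four percentages at once, here is a rough sketch of the same sliding-window idea using NumPy slicing instead of an explicit loop. The name minimum_size_range_np and the sample data are made up for illustration, and it assumes the percentage is between 0 and 100 and the input is not empty:

```python
import math

import numpy as np


def minimum_size_range_np(file_sizes, percentage):
    """Return (a, b), the narrowest interval holding `percentage` % of the values."""
    sizes = np.sort(np.asarray(file_sizes, dtype=float))
    if sizes.size == 0:
        raise ValueError("file_sizes is empty")
    # number of files that must fall inside the interval (at least 1)
    window_size = max(1, math.ceil(sizes.size * percentage / 100))
    # width of every window of `window_size` consecutive sorted values
    widths = sizes[window_size - 1:] - sizes[:sizes.size - window_size + 1]
    start = int(np.argmin(widths))  # on ties, the first narrowest window wins
    return sizes[start], sizes[start + window_size - 1]


# hypothetical sample data, only for illustration
sample_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
for pct in (75, 80, 85, 90):
    a, b = minimum_size_range_np(sample_sizes, pct)
    print(f"{pct}%: [{a}, {b}], width {b - a}")
```

With a DataFrame you would pass df['file_size'] instead of sample_sizes. Sorting dominates the cost either way, so this mostly helps readability and speed on large files rather than changing the result.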