python numpy scipy statistics beta-distribution

Beta distribution curves with Python and scipy to distribute man-days effort


The goal is to distribute man-days of effort (y axis) across a time period in days (x axis) according to predefined curve shapes: Bell, FrontLoad, BackLoad, U-shape.

These shapes can be modelled well with the Beta distribution by adjusting the 'a' and 'b' parameters.

My problem (I'm a weak mathematician) is converting the density values returned by scipy's beta PDF function (y1, y2, y3) into a list of percentage loads between 0 and 1, with one y-axis load for each period point x (day number). The sum of these percentage loads is supposed to be equal to 1 (100%). The x values are days and are always > 0.

The idea is then to iterate through the list of percentage values and multiply each one by the total man-days value to get the desired distribution of effort across the days.

The code snippet looks like this:

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

total_demand = 100  # total man-days demand
x = np.linspace(0, 1, 100)  # 100 days range. PDF function accepts values in range 0...1
y1 = beta.pdf(x, 2, 8)  # a=2, b=8. FrontLoad distribution
y2 = beta.pdf(x, 5, 5)  # a=5, b=5. Bell-shape distribution
y3 = beta.pdf(x, 8, 2)  # a=8, b=2. BackLoad distribution

y1_demands = [total_demand*y for y in y1]   # does not work: the y values are densities, not percentages,
                                            # and can be greater than 1

Density result graphs:


Solution

  • The value of the pdf at a single point isn't bounded to [0, 1]; only its integral over any subset of the range is. In this case you want your subsets to be days, so you need to integrate over each day. That's awkward with the pdf, but easy with the cdf, which is already the integral of the pdf: just take the running difference.

    y1 = np.diff(beta.cdf(x, 2, 8))  # a=2, b=8. FrontLoad distribution
    y2 = np.diff(beta.cdf(x, 5, 5))  # a=5, b=5. Bell-shape distribution
    y3 = np.diff(beta.cdf(x, 8, 2))  # a=8, b=2. BackLoad distribution
    

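    I can't attach images, but that should give you the graphs you expect. Note that np.diff returns one fewer value than its input, so with 100 points in x you get 99 daily loads; use np.linspace(0, 1, 101) if you need exactly 100 days.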

    Also this:

    y1_demands = [total_demand*y for y in y1]
    

    can just be:

    y1_demands = total_demand*y1
    

    numpy is designed to let you avoid explicit for loops, which are slow in Python. If you find yourself iterating through arrays, there's probably a much faster (and usually clearer) way to do it in numpy.
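
Putting the pieces above together, here is a minimal sketch of the cdf-difference approach for all four shapes. The names `n_days`, `edges`, `shapes` and `demands` are made up for illustration, and the U-shape parameters (a=0.5, b=0.5) are an assumption, since the question does not give values for that shape. It uses n_days + 1 bucket edges so that np.diff returns exactly n_days loads.

```python
import numpy as np
from scipy.stats import beta

total_demand = 100  # total man-days to distribute
n_days = 100        # number of daily buckets

# n_days buckets need n_days + 1 edges, because np.diff drops one element
edges = np.linspace(0, 1, n_days + 1)

# (a, b) pairs per shape; the U-shape values are an illustrative assumption
shapes = {
    "FrontLoad": (2, 8),
    "Bell": (5, 5),
    "BackLoad": (8, 2),
    "U-shape": (0.5, 0.5),
}

demands = {}
for name, (a, b) in shapes.items():
    # Share of the total falling inside each day = cdf difference across its edges
    loads = np.diff(beta.cdf(edges, a, b))
    # Vectorized scaling: no Python-level loop over the days
    demands[name] = total_demand * loads
```

Each array in `demands` has one entry per day and sums to `total_demand` (up to floating-point rounding), because the cdf runs from 0 at x=0 to 1 at x=1.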