python statistics probability

How can I generate a function to represent a collection of probabilities so that I can estimate between points


My particular issue is that I am trying to find a better way to forecast for my business.
I have a table of jobs and their likelihoods that looks like this:

job      likelihood (0 to 1)   potential hours
Job 1    0.4                   40
Job 2    0.5                   32

and so on for about 25 jobs.

What I want is to generate a curve over all the possible jobs. It should show a high likelihood of winning 10 hours, because that requires only a few small jobs or just one of the bigger ones. The likelihood should then decrease as we approach the sum of all hours across jobs, which is very unlikely because it depends on winning every job. The jobs are not connected in any way, so winning one job does not affect the probability of winning another.

I have this data brought into a dataframe. I am just unsure on the proper way to combine all of these probabilities so that I can check what is the probability of winning a certain number of hours. I expect that we will be able to calculate some particular points and then fit a curve. If I can get to the point where I can get those points then I can fit the curve.

ChatGPT was not helpful, because it was just trying to multiply all of the probabilities cumulatively, which I believe is not correct. I started to see what it would look like to create all possible combinations of winning 1 job, 2 jobs, 3 jobs, 4 jobs, 5 jobs, 6 jobs, and so on. Those probabilities are easy to calculate, and then I could fit all those data points. But I stopped because I figured there must be a more elegant way, and I think this is wrong anyway. If there are two totally exclusive combinations that each produce a 20% chance of getting 40 hours, then I shouldn't average them, because they are probabilities...


Solution

  • If I understand correctly, you have 25 independent jobs each with a probability of occurring and a payout value for each job.

    Let's look at a simplified scenario of only 5 jobs:

    jobs = ['job1', 'job2', 'job3', 'job4', 'job5']
    probabilities = [0.1, 0.1, 0.4, 0.6, 0.2]
    hours = [1, 10, 43, 2, 5]
    min_hours_desired = 10
    

    One outcome would be that job3 and job5 "succeed" and all the other jobs "fail". I might encode this scenario as 00101 for convenience.

    The probability of this scenario happening is: (1 - 0.1) * (1 - 0.1) * 0.4 * (1 - 0.6) * 0.2 = 0.02592

    and it has a total payout of: 0 + 0 + 43 + 0 + 5 = 48

    And this would be more than the desired 10 hours payout.
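The arithmetic for this single scenario can be checked with a few lines of Python. This is only a sketch of the calculation described above, using the same example lists; the loop variable names are illustrative.

```python
# Outcome encoded as 00101: job3 and job5 succeed, the rest fail.
probabilities = [0.1, 0.1, 0.4, 0.6, 0.2]
hours = [1, 10, 43, 2, 5]
scenario = "00101"

scenario_probability = 1.0
scenario_hours = 0
for won, p, h in zip(scenario, probabilities, hours):
    if won == "1":
        # Job succeeded: multiply by its probability and count its hours.
        scenario_probability *= p
        scenario_hours += h
    else:
        # Job failed: multiply by the complement of its probability.
        scenario_probability *= 1 - p

print(scenario_probability, scenario_hours)
```

The printed probability should be approximately 0.02592 (up to floating-point rounding), with a payout of 48 hours.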

    The brute force solution would be to iterate across all possible scenarios and calculate each scenario's probability and payout. Note that since each job either succeeds or fails, there would be 2^n total scenarios, where n is the number of jobs. In the encoding scheme I mentioned above, you can think of them as corresponding to the n-bit binary numbers from 0 to 2^n - 1.

    You would then find out which of those scenario payouts are above the outcome you want, and sum the probabilities of those scenarios (summing is valid because the scenarios are mutually exclusive).

    The implementation might be something like this:

    jobs = ['job1', 'job2', 'job3', 'job4', 'job5']
    probabilities = [0.1, 0.1, 0.4, 0.6, 0.2]
    hours = [1, 10, 43, 2, 5]
    min_hours_desired = 10
    
    # Generate arrangements of job outcomes:
    # [00000, 00001, 00010 ... 11110, 11111]
    # The first of these, 00000, corresponds to all jobs not happening.
    # And then 00001 corresponds to only job5 happening
    # 11111 corresponds to all jobs happening
    # etc.
    scenarios = []
    jobs_len = len(jobs)
    for i in range(2**jobs_len):
        scenario = format(i, 'b').zfill(jobs_len)
        scenarios.append(scenario)
    
    # Iterating through the scenarios, fill in the probability and payout.
    scenario_outcomes = []
    for scenario in scenarios:
        scenario_hours_won = 0
        scenario_probability = 1
        for j, b in enumerate(scenario):
            if b == '0':
                scenario_probability *= (1 - probabilities[j])
            else:
                scenario_probability *= probabilities[j]
                scenario_hours_won += hours[j]
        scenario_outcomes.append((scenario, scenario_probability, scenario_hours_won))
    
    # One of these will be ('00101', 0.025920000000000006, 48)
    for outcome in scenario_outcomes:
        print(outcome)
    
    prob_desired_hours = sum([o[1] for o in scenario_outcomes if o[2] > min_hours_desired])
    print(f'Probability of > {min_hours_desired} hours:', prob_desired_hours)
    
    prob_check = sum([o[1] for o in scenario_outcomes])
    print('Probability sum check adds to 1:', prob_check)
    
    
    

    To plot a curve (more a histogram) of outcome and probability, you could sum the scenario probabilities grouping by the scenario payouts. A quick and dirty way to get the data for this would be:

    import json
    possible_payouts = set(o[2] for o in scenario_outcomes)
    payout_probabilities = dict()
    for payout in possible_payouts:
        payout_probability = sum([o[1] for o in scenario_outcomes if o[2] == payout])
        payout_probabilities[payout] = payout_probability
    
    print(json.dumps(payout_probabilities, indent=2))
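
Since the question asks for the chance of winning at least a certain number of hours, the per-payout probabilities can also be accumulated from the largest total downwards. The following is a minimal, self-contained sketch on the 5-job example; it rebuilds the per-payout probabilities with `itertools.product` rather than reusing the variables above, and the name `at_least` is just an illustrative choice.

```python
from collections import defaultdict
from itertools import product

probabilities = [0.1, 0.1, 0.4, 0.6, 0.2]
hours = [1, 10, 43, 2, 5]

# Probability of each distinct total, via the same brute-force enumeration.
payout_probabilities = defaultdict(float)
for outcome in product([0, 1], repeat=len(hours)):
    prob = 1.0
    total = 0
    for won, p, h in zip(outcome, probabilities, hours):
        prob *= p if won else 1 - p
        total += h if won else 0
    payout_probabilities[total] += prob

# P(total hours >= threshold) for every reachable total,
# accumulated from the largest total down to zero.
at_least = {}
running = 0.0
for total in sorted(payout_probabilities, reverse=True):
    running += payout_probabilities[total]
    at_least[total] = running

for total in sorted(at_least):
    print(f"P(hours >= {total}) = {at_least[total]:.4f}")
```

Plotting the totals against `at_least` gives a decreasing step curve that starts at 1 for zero hours and ends at the probability of winning every job. Note that this uses "at least" (>=), whereas the earlier check used strictly greater than (>).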
    

    Note that this would take a lot longer for 25 jobs, since the algorithm scales like O(2^n): 25 jobs means 2^25 (about 33.5 million) scenarios. Nevertheless, I managed to run it in about a minute.
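
If the hours are whole numbers, one way to avoid the 2^n enumeration is to build the distribution one job at a time. This is a different technique from the brute force above (a dynamic-programming convolution), offered as a sketch under the assumptions that jobs are independent and hours are non-negative integers. The function name `hours_distribution` is hypothetical.

```python
def hours_distribution(probabilities, hours):
    """Return dist where dist[t] is P(total hours won == t).

    Assumes independent jobs and non-negative integer hours.
    """
    dist = [1.0]  # before any job is considered, the total is 0 with certainty
    for p, h in zip(probabilities, hours):
        new_dist = [0.0] * (len(dist) + h)
        for total, prob in enumerate(dist):
            new_dist[total] += prob * (1 - p)  # job lost: total unchanged
            new_dist[total + h] += prob * p    # job won: total grows by h
        dist = new_dist
    return dist


probabilities = [0.1, 0.1, 0.4, 0.6, 0.2]
hours = [1, 10, 43, 2, 5]
dist = hours_distribution(probabilities, hours)

min_hours_desired = 10
# Same strict ">" comparison as the brute-force version.
prob_more_than = sum(dist[min_hours_desired + 1:])
print(f"Probability of > {min_hours_desired} hours:", prob_more_than)
```

The work here grows with the number of jobs times the sum of all hours, rather than with 2^n, so 25 jobs should be cheap to compute. On the 5-job example it should agree with the brute-force result up to floating-point rounding.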