pythonpandasmathprobabilitycomb

Calculate probability from an excel file


I have an excel file with x rows with random numbers (from 1 till 25). The total number of columns (numbers) can vary (15, 16, 17, 18, 19 or 20):

enter image description here

I need python to tell me my probability of winning a lottery with the current set of rows (may be 30k rows with 15,16,17,18,19 or 20 numbers). The lottery is a 15 number draw from 25 numbers pool (1 to 25).

This is what I have so far:

import pandas as pd
from math import comb

# Load the Excel file
file_path = 'combinations.xlsx'
df = pd.read_excel(file_path)

# Define the total number of possible combinations for a 15-number draw from a pool of 25 numbers
total_combinations = comb(25, 15)

# Function to calculate the number of matches
def count_matches(row, drawn_numbers):
    return sum(1 for number in row if number in drawn_numbers)

# Since we don't have the drawn numbers, we'll assume a hypothetical draw
# For example, let's assume the drawn numbers are the first 15 numbers in the pool
drawn_numbers = list(range(1, 16))

# Calculate the total probability for the entire file
total_probability = 0
for index, row in df.iterrows():
    matches = count_matches(row.dropna().tolist(), drawn_numbers)
    probability = matches / total_combinations
    total_probability += probability

# Calculate the inverse probability
inverse_probability = 1 / total_probability if total_probability != 0 else float('inf')

# Print the results
print(f"Total probability for the entire file: {total_probability}")
print(f"The probability of winning is 1 in {int(inverse_probability):,}")

It runs without errors, however, the result is not correct. looking at the lottery official website and calculating probability manually, it is known that:

This is the result my code gives for a file with a simple row of 15 numbers: enter image description here

My file may contain different rows of 15, 16, 17,18,19 or 20 numbers, hence the need to calculate the probability of different combinations.

Any python magician available to help?


Solution

  • This requires a little bit of probabilistic reasoning.

    As the comments have suggested, given a length n of a row, the probability that this row will win the lottery is comb(n, 15) / comb(25, 15). Therefore, we can construct an array of probabilities, p, for rows of length 15-20:

    p = np.array([comb(n, 15) / comb(25,15) for n in range(15, 21)])
    

    Assuming that the rows are indeed randomly generated and independent from one another, we can calculate the probability that the sheet has a row that wins the lottery:

    This relies on DeMorgan's law and properties of independence

    Therefore, we need to count the number of rows of each length in the data sheet (how many rows with 15 entries, how many with 16, etc). Here is a function that can do that:

    def CountRowLength(df):
        """Returns a dictionary with keys corresponding to row length and values corresponding to the number of times they occur"""
        answer = dict()
        for _, row in df.iterrows():
            rowLength = len(row.dropna().tolist())
            if rowLength in answer.keys():
                answer[rowLength] += 1
            else:
                answer[rowLength] = 1
        return answer
    

    Therefore, all that remains is to convert this dictionary to an array/list and plug it into the formula from earlier:

    # Convert the dictionary to a list
    dictionary = CountRowLength(df)
    lengths = [dictionary[i] if i in dictionary.keys() else 0 for i in range(15,21)]
    
    # The formula from earlier
    total_probability = 1 - np.product([(1-p[i])**lengths[i] for i in range(len(p))])
    print(f"The probability that a row from this sheet has a lottery win is {total_probability}")
    

    Then you can take the reciprocal to find the odds.