Calculate probability from an excel file

I have an excel file with x rows with random numbers (from 1 till 25). The total number of columns (numbers) can vary (15, 16, 17, 18, 19 or 20):

I need python to tell me my probability of winning a lottery with the current set of rows (may be 30k rows with 15,16,17,18,19 or 20 numbers). The lottery is a 15 number draw from 25 numbers pool (1 to 25).

This is what I have so far:

import pandas as pd
from math import comb

# Load the Excel file
file_path = 'combinations.xlsx'
df = pd.read_excel(file_path)

# Define the total number of possible combinations for a 15-number draw from a pool of 25 numbers
total_combinations = comb(25, 15)

# Function to calculate the number of matches
def count_matches(row, drawn_numbers):
    return sum(1 for number in row if number in drawn_numbers)

# Since we don't have the drawn numbers, we'll assume a hypothetical draw
# For example, let's assume the drawn numbers are the first 15 numbers in the pool
drawn_numbers = list(range(1, 16))

# Calculate the total probability for the entire file
total_probability = 0
for index, row in df.iterrows():
    matches = count_matches(row.dropna().tolist(), drawn_numbers)
    probability = matches / total_combinations
    total_probability += probability

# Calculate the inverse probability
inverse_probability = 1 / total_probability if total_probability != 0 else float('inf')

# Print the results
print(f"Total probability for the entire file: {total_probability}")
print(f"The probability of winning is 1 in {int(inverse_probability):,}")

It runs without errors, however, the result is not correct. looking at the lottery official website and calculating probability manually, it is known that:

For a 15 numbers ticket (in my file, it would be 1 row with 15 numbers) there is 1 in 3.268.760 chance of winning
For a 16 numbers ticket (in my file, it would be 1 row with 16 numbers) there is 1 in 204.298 chance of winning
For a 17 numbers ticket (in my file, it would be 1 row with 17 numbers) there is 1 in 24.035 chance of winning
For a 18 numbers ticket (in my file, it would be 1 row with 18 numbers) there is 1 in 4.006 chance of winning
For a 19 numbers ticket (in my file, it would be 1 row with 19 numbers) there is 1 in 843 chance of winning
For a 20 numbers ticket (in my file, it would be 1 row with 20 numbers) there is 1 in 211 chance of winning

This is the result my code gives for a file with a simple row of 15 numbers:

My file may contain different rows of 15, 16, 17,18,19 or 20 numbers, hence the need to calculate the probability of different combinations.

Any python magician available to help?

Solution

This requires a little bit of probabilistic reasoning.

As the comments have suggested, given a length n of a row, the probability that this row will win the lottery is comb(n, 15) / comb(25, 15). Therefore, we can construct an array of probabilities, p, for rows of length 15-20:

p = np.array([comb(n, 15) / comb(25,15) for n in range(15, 21)])

Assuming that the rows are indeed randomly generated and independent from one another, we can calculate the probability that the sheet has a row that wins the lottery:

Therefore, we need to count the number of rows of each length in the data sheet (how many rows with 15 entries, how many with 16, etc). Here is a function that can do that:

def CountRowLength(df):
    """Returns a dictionary with keys corresponding to row length and values corresponding to the number of times they occur"""
    answer = dict()
    for _, row in df.iterrows():
        rowLength = len(row.dropna().tolist())
        if rowLength in answer.keys():
            answer[rowLength] += 1
        else:
            answer[rowLength] = 1
    return answer

Therefore, all that remains is to convert this dictionary to an array/list and plug it into the formula from earlier:

# Convert the dictionary to a list
dictionary = CountRowLength(df)
lengths = [dictionary[i] if i in dictionary.keys() else 0 for i in range(15,21)]

# The formula from earlier
total_probability = 1 - np.product([(1-p[i])**lengths[i] for i in range(len(p))])
print(f"The probability that a row from this sheet has a lottery win is {total_probability}")

Then you can take the reciprocal to find the odds.