I have an excel file with x rows with random numbers (from 1 till 25). The total number of columns (numbers) can vary (15, 16, 17, 18, 19 or 20):
I need python to tell me my probability of winning a lottery with the current set of rows (may be 30k rows with 15,16,17,18,19 or 20 numbers). The lottery is a 15 number draw from 25 numbers pool (1 to 25).
This is what I have so far:
import pandas as pd
from math import comb
# Load the Excel file
file_path = 'combinations.xlsx'
df = pd.read_excel(file_path)
# Define the total number of possible combinations for a 15-number draw from a pool of 25 numbers
total_combinations = comb(25, 15)
# Function to calculate the number of matches
def count_matches(row, drawn_numbers):
return sum(1 for number in row if number in drawn_numbers)
# Since we don't have the drawn numbers, we'll assume a hypothetical draw
# For example, let's assume the drawn numbers are the first 15 numbers in the pool
drawn_numbers = list(range(1, 16))
# Calculate the total probability for the entire file
total_probability = 0
for index, row in df.iterrows():
matches = count_matches(row.dropna().tolist(), drawn_numbers)
probability = matches / total_combinations
total_probability += probability
# Calculate the inverse probability
inverse_probability = 1 / total_probability if total_probability != 0 else float('inf')
# Print the results
print(f"Total probability for the entire file: {total_probability}")
print(f"The probability of winning is 1 in {int(inverse_probability):,}")
It runs without errors, however, the result is not correct. looking at the lottery official website and calculating probability manually, it is known that:
This is the result my code gives for a file with a simple row of 15 numbers:
My file may contain different rows of 15, 16, 17,18,19 or 20 numbers, hence the need to calculate the probability of different combinations.
Any python magician available to help?
This requires a little bit of probabilistic reasoning.
As the comments have suggested, given a length n of a row, the probability that this row will win the lottery is comb(n, 15) / comb(25, 15). Therefore, we can construct an array of probabilities, p, for rows of length 15-20:
p = np.array([comb(n, 15) / comb(25,15) for n in range(15, 21)])
Assuming that the rows are indeed randomly generated and independent from one another, we can calculate the probability that the sheet has a row that wins the lottery:
Therefore, we need to count the number of rows of each length in the data sheet (how many rows with 15 entries, how many with 16, etc). Here is a function that can do that:
def CountRowLength(df):
"""Returns a dictionary with keys corresponding to row length and values corresponding to the number of times they occur"""
answer = dict()
for _, row in df.iterrows():
rowLength = len(row.dropna().tolist())
if rowLength in answer.keys():
answer[rowLength] += 1
else:
answer[rowLength] = 1
return answer
Therefore, all that remains is to convert this dictionary to an array/list and plug it into the formula from earlier:
# Convert the dictionary to a list
dictionary = CountRowLength(df)
lengths = [dictionary[i] if i in dictionary.keys() else 0 for i in range(15,21)]
# The formula from earlier
total_probability = 1 - np.product([(1-p[i])**lengths[i] for i in range(len(p))])
print(f"The probability that a row from this sheet has a lottery win is {total_probability}")
Then you can take the reciprocal to find the odds.