python probability poisson probability-distribution

Poisson Distribution considering time left


I want to calculate the probability of each possible final result in a football game, given that the game is at minute n.

In this case I have expected goals of 2.69 for the home team and 1.12 for the away team, at minute 70, with a current score of 2-1.

Code

from scipy.stats import poisson
from itertools import product
import numpy as np
import pandas as pd

xgh = 2.69
xga = 1.12

minute = 70

hg, ag = 2,1
phs=[]
pas=[]
for i, l in zip(range(0, 6), range(0, 6)):
  ph = poisson.pmf(mu=xgh, k=i, loc=hg)
  phs.append(ph)
  pa = poisson.pmf(mu=xga, k=l, loc=ag)
  pas.append(pa)

prod_table = np.array([(i*j) for i, j in product(phs, pas)])
prod_table.shape = (6, 6)

prob_df = pd.DataFrame(prod_table, index=range(0,6), columns=range(0, 6))

This returns a probability of 2.21% for a 2-1 final result, which is pretty low. I would expect a high probability, considering that only 20 minutes are left.


Solution

  • Math considerations

    The Poisson distribution gives the probability that an event occurs k times in a given time frame, knowing that, on average, it occurs μ times in that same time frame.

    The Poisson distribution assumes that events are totally independent, so how many times the event has already occurred is irrelevant. It also assumes that events are uniformly distributed over time (if I may use this confusing word, since this is not a uniform distribution).

    Most of the time, Poisson is used to compute the probability of k events occurring in a time frame T, when we know that μ events occur on average in a time frame τ (the difference with the first sentence being that T and τ are not the same).

    But that is the easy part: since events are uniformly distributed, if μ events occur on average in a time frame τ, then μ×T/τ events should occur, on average, in a time frame T (understand: if we were to observe millions of time frames of length T, there would be, on average, μT/τ events in each of them).

    So, to compute the probability that the event occurs k times in time frame T, knowing that it occurs μ times on average in time frame τ, you just have to answer the question "what is the probability that the event occurs k times in time frame T, knowing that it occurs μT/τ times on average in that time frame?". Which is the question Poisson can answer.

    In python, that answer is poisson.pmf(k, μT/τ).
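
As a minimal sketch of that rescaling (the helper name `prob_k_events` and the example numbers are illustrative, not taken from the question), the scaled mean can be passed straight to `poisson.pmf` and compared with the closed-form Poisson formula:

```python
import math
from scipy.stats import poisson

def prob_k_events(k, mu, tau, T):
    """Probability of exactly k events in a window of length T,
    given mu events on average in a window of length tau.
    (Hypothetical helper, named for illustration only.)"""
    scaled_mu = mu * T / tau  # expected number of events in the window T
    return poisson.pmf(k, scaled_mu)

# Illustrative numbers: 3 events per 60 minutes on average, 20-minute window,
# so the scaled mean is 1 event.
p1 = prob_k_events(1, mu=3, tau=60, T=20)

# Closed form for comparison: exp(-lam) * lam**k / k!
lam = 3 * 20 / 60
closed_form = math.exp(-lam) * lam**1 / math.factorial(1)
print(p1, closed_form)
```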

    In your case, you know μ, the number of goals expected in a 90-minute time frame. You know that the time left to score is 20 minutes. If 2.69 goals are expected in 90 minutes, then 2.69×20/90 ≈ 0.5978 goals are expected in 20 minutes (at least, Poisson postulates that things work that way). Therefore, the probability that this team scores no further goal in that time frame is poisson.pmf(0, 0.5978). Or, using your keyword style, poisson.pmf(mu=0.5978, k=0). Or, using loc so that k is the total number of goals, poisson.pmf(mu=0.5978, k=2, loc=2) (but that is just cosmetic: a loc parameter just replaces k by k-loc).
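
A quick check of those numbers, as a sketch using only the values from the question (the variable names are mine):

```python
from scipy.stats import poisson

xgh = 2.69          # expected home goals over 90 minutes (from the question)
minutes_left = 20   # 90 - 70

mu_left = xgh * minutes_left / 90   # about 0.5978 goals expected in the remaining time

# Probability that the home team scores no further goal
p_no_goal = poisson.pmf(0, mu_left)

# Same probability on the "total goals" scale: k=2 with loc=2 means k - loc = 0
p_total_two = poisson.pmf(k=2, mu=mu_left, loc=2)

print(round(mu_left, 4), p_no_goal, p_total_two)
```

Both calls return the same value, which is the sense in which loc is only cosmetic here.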

    tl;dr solution

    So, long story short, you just need to scale down xgh and xga so that they reflect the expected number of goals in the remaining time.

    for i, l in zip(range(0, 6), range(0, 6)):
      ph = poisson.pmf(mu=xgh*(90-minute)/90, k=i, loc=hg)
      phs.append(ph)
      pa = poisson.pmf(mu=xga*(90-minute)/90, k=l, loc=ag)
      pas.append(pa)
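
Put together with the setup from the question, a self-contained version of that loop could look like the sketch below. It assumes, as above, that the expected goals refer to a full 90 minutes.

```python
from itertools import product

import numpy as np
from scipy.stats import poisson

xgh, xga = 2.69, 1.12   # expected goals over 90 minutes (assumption, see above)
minute = 70
hg, ag = 2, 1           # current score

phs, pas = [], []
for i, l in zip(range(0, 6), range(0, 6)):
    # k is the final number of goals; loc shifts it by the goals already scored
    phs.append(poisson.pmf(mu=xgh * (90 - minute) / 90, k=i, loc=hg))
    pas.append(poisson.pmf(mu=xga * (90 - minute) / 90, k=l, loc=ag))

prod_table = np.array([i * j for i, j in product(phs, pas)]).reshape(6, 6)

# Probability that the match ends 2-1 (no further goals on either side)
print(prod_table[2, 1])
```

With these inputs the 2-1 entry comes out at roughly 0.43, far above the 2.21% obtained with the unscaled means.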
    

    Other comments

    zip

    While at it, and since there is a python tag, some comments on the code

    for i, l in zip(range(0, 6), range(0, 6)):
        print(i,l)
    

    produces

    0 0
    1 1
    2 2
    3 3
    4 4
    5 5
    

    So it is quite strange not to use a single variable. Especially since there is no point in using different ranges: zip stops at the shortest iterable, so both variables always advance together. And I don't see under which circumstances we would need, for example, i to grow from 0 to 5 while l grows from 0 to 10.
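
For reference, this is how `zip` behaves when the two ranges differ in length (a small illustration, not part of the original code): it stops as soon as the shorter iterable is exhausted.

```python
# zip pairs items positionally and stops at the shortest input
pairs = list(zip(range(0, 6), range(0, 11)))

print(len(pairs))   # 6
print(pairs[-1])    # (5, 5)
```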

    So just

    for k in range(0, 6):
      ph = poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg)
      phs.append(ph)
      pa = poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag)
      pas.append(pa)
    

    I surmise, especially because of the subject of the next remark, that once upon a time there was a product instead of that zip, before you realized that it was computing the exact same pmf several times.

    Cross product

    That usage of product has probably then been reduced to the task of computing phs[i]×pas[j] for all i,j. That is a good usage of product.

    But, since you have 2 arrays, and you intend to build a numpy array from those phs[i]×pas[j], let numpy do the job. It will be more efficient at it.

    prod_table = np.array(phs).reshape(-1,1)*np.array(pas)
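
To see that the broadcasting version builds the same table as the `product` version, here is a small check with made-up probabilities (the values are illustrative only). `np.outer` is a third way to write the same thing.

```python
from itertools import product

import numpy as np

phs = [0.5, 0.3, 0.2]   # made-up values, for illustration only
pas = [0.6, 0.4]

# Original approach: flat list of products, then reshape
via_product = np.array([i * j for i, j in product(phs, pas)]).reshape(len(phs), len(pas))

# Broadcasting: a column vector times a row vector gives the full table
via_broadcast = np.array(phs).reshape(-1, 1) * np.array(pas)

# Equivalent outer product
via_outer = np.outer(phs, pas)

print(np.allclose(via_product, via_broadcast), np.allclose(via_broadcast, via_outer))
```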
    

    Getting arrays directly from Poisson

    Which leads to another optimization. If the goal is to turn phs and pas into arrays, so that we can multiply them (one as a row, the other as a column) to get the table, why not let scipy build those arrays directly? Like many numpy functions, pmf accepts a list (or any array-like) for k rather than a scalar, and then returns an array rather than a scalar.

    So

    phs=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg)
    pas=poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
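
A short sketch comparing the element-by-element calls with the single vectorised call (using the home-team numbers from the question; `np.arange(6)` is used here in place of `range(6)`):

```python
import numpy as np
from scipy.stats import poisson

mu = 2.69 * (90 - 70) / 90   # expected home goals in the remaining 20 minutes
hg = 2                       # home goals already scored

# One scalar call per value of k
looped = [poisson.pmf(k=k, mu=mu, loc=hg) for k in range(6)]

# One call with an array of k values; the result is a numpy array
vectorised = poisson.pmf(k=np.arange(6), mu=mu, loc=hg)

print(type(vectorised), np.allclose(looped, vectorised))
```

The first two entries are zero, since a final total below the two goals already scored is impossible.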
    

    So, altogether

    prod_table=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg).reshape(-1,1)*poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
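
Two sanity checks that can be run on the resulting table, as a sketch under the same assumptions (expected goals given per 90 minutes, table limited to 0..5 goals per team):

```python
import numpy as np
from scipy.stats import poisson

xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1
k = np.arange(6)

prod_table = (
    poisson.pmf(mu=xgh * (90 - minute) / 90, k=k, loc=hg).reshape(-1, 1)
    * poisson.pmf(mu=xga * (90 - minute) / 90, k=k, loc=ag)
)

# Rows are final home goals, columns are final away goals.
# Scorelines below the current 2-1 are impossible, so they must be zero.
assert np.all(prod_table[:hg, :] == 0)
assert np.all(prod_table[:, :ag] == 0)

# The table only covers 0..5 goals per team, so it sums to slightly less than 1.
total = prod_table.sum()
print(total)
```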
    

    Timings

    Optimisations    Time
    Without          1647 μs
    With              329 μs

    So, it is not just more compact and readable. It is also (almost exactly) 5 times faster.
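
The answer does not say how these timings were measured. One way to reproduce the comparison is with `timeit`, as in the sketch below; the function names are mine, and the absolute numbers will depend on the machine and library versions.

```python
import timeit
from itertools import product

import numpy as np
from scipy.stats import poisson

xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1

def table_with_loop():
    # Scalar pmf calls in a Python loop, then itertools.product for the table
    phs, pas = [], []
    for k in range(6):
        phs.append(poisson.pmf(mu=xgh * (90 - minute) / 90, k=k, loc=hg))
        pas.append(poisson.pmf(mu=xga * (90 - minute) / 90, k=k, loc=ag))
    return np.array([i * j for i, j in product(phs, pas)]).reshape(6, 6)

def table_vectorised():
    # Two vectorised pmf calls and one broadcast multiplication
    k = np.arange(6)
    ph = poisson.pmf(mu=xgh * (90 - minute) / 90, k=k, loc=hg)
    pa = poisson.pmf(mu=xga * (90 - minute) / 90, k=k, loc=ag)
    return ph.reshape(-1, 1) * pa

n = 50  # number of repetitions per measurement
t_loop = timeit.timeit(table_with_loop, number=n) / n
t_vec = timeit.timeit(table_vectorised, number=n) / n
print(f"loop: {t_loop * 1e6:.0f} us, vectorised: {t_vec * 1e6:.0f} us")
```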