I want to calculate the probability of each possible final result of a football game, given the state of the game at minute n.
In this case the expected goals are 2.69 for the home team and 1.12 for the away team, the game is in minute 70, and the current result is 2-1.
Code:
from scipy.stats import poisson
from itertools import product
import numpy as np
import pandas as pd
xgh = 2.69
xga = 1.12
minute = 70
hg, ag = 2,1
phs=[]
pas=[]
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga, k=l, loc=ag)
    pas.append(pa)
prod_table = np.array([(i*j) for i, j in product(phs, pas)])
prod_table.shape = (6, 6)
prob_df = pd.DataFrame(prod_table, index=range(0,6), columns=range(0, 6))
This returns a probability of 2.21% for a 2-1 final result, which is pretty low. I would expect a higher probability, considering that only 20 minutes are left.
The Poisson distribution gives the probability that an event occurs k times in a given time frame, knowing that, on average, it occurs μ times in that same time frame.
The postulate of the Poisson distribution is that events are totally independent, so how many times the event has already occurred is irrelevant. It also assumes that events occur at a constant average rate, that is, they are spread evenly over time (I avoid saying "uniformly distributed", since this is not a uniform distribution).
Most of the time, Poisson is used to compute the probability of k events in a time frame T, when we know that μ events occur on average in a time frame τ (the difference with the first sentence being that T and τ are not the same).
But that is the easy part: since events occur at a constant rate, if μ events occur on average in a time frame τ, then μ×T/τ events should occur, on average, in a time frame T (understand: if we were to observe millions of time frames of length T, there should be, on average, μT/τ events in each of them).
So, to compute the probability that the event occurs k times in time frame T, knowing that it occurs μ times on average in time frame τ, you just have to answer the question "what is the probability that the event occurs k times in time frame T, knowing that it occurs μT/τ times on average in that same time frame?", which is the question Poisson can answer.
In Python, that answer is poisson.pmf(k, μT/τ).
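As a minimal sketch of that rescaling: the helper name `prob_k_events` and the numbers in the example are made up for illustration; the only real API used is `scipy.stats.poisson.pmf`.

```python
import math

from scipy.stats import poisson


def prob_k_events(k, mu, tau, T):
    """Probability of exactly k events in a window of length T, given an
    average of mu events per window of length tau.
    (Hypothetical helper, named here for illustration only.)"""
    return poisson.pmf(k, mu * T / tau)


# Arbitrary example: 3 events per 60 minutes on average, 10-minute window.
# The rescaled mean is 3 * 10 / 60 = 0.5 events.
p_none = prob_k_events(0, mu=3, tau=60, T=10)

# For k = 0 the Poisson pmf reduces to exp(-mean), which gives a quick check.
print(p_none, math.exp(-0.5))
```

Note that only the ratio T/τ matters, so the two durations can be in any unit as long as it is the same one.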
In your case, you know μ, the number of goals expected in a 90-minute time frame, and you know that the time left to score is 20 minutes. If 2.69 goals are expected in 90 minutes, then 2.69×20/90 ≈ 0.5978 goals are expected in 20 minutes (at least, Poisson postulates that things work that way).
Therefore, the probability that this team scores no further goal in that time frame is poisson.pmf(0, 0.5978). Or, using your keyword style, poisson.pmf(mu=0.5978, k=0). Or, using loc so that k is the total number of goals, poisson.pmf(mu=0.5978, k=2, loc=2) (but that is just cosmetic: a loc parameter simply replaces k by k-loc).
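To make the role of loc concrete, here is a small check, assuming the same 2.69 expected goals and 20 remaining minutes as above. The variable names are mine.

```python
import numpy as np
from scipy.stats import poisson

mu_left = 2.69 * 20 / 90  # expected home goals in the last 20 minutes

# Without loc, k counts the additional goals.
p_additional = poisson.pmf(k=0, mu=mu_left)
# With loc=2, k counts the final total (2 already scored + 0 more).
p_total = poisson.pmf(k=2, mu=mu_left, loc=2)

print(round(mu_left, 4), p_additional, p_total)
print(np.isclose(p_additional, p_total))

# A final total below the current score is outside the shifted support,
# so its probability is 0.
print(poisson.pmf(k=1, mu=mu_left, loc=2))
```

Both calls describe the same event, so they return the same probability; loc only changes what k is counted from.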
So, long story short, you just need to scale down xgh and xga so that they reflect the expected number of goals in the remaining time.
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=l, loc=ag)
    pas.append(pa)
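As a sanity check of the scaled version, here is a self-contained sketch that computes only the probability of the game ending 2-1. It assumes, as above, a 90-minute game with no stoppage time; `remaining` and `p_2_1` are names I introduce for the example.

```python
from scipy.stats import poisson

xgh, xga = 2.69, 1.12  # expected goals over 90 minutes
minute = 70
hg, ag = 2, 1          # current score

remaining = (90 - minute) / 90  # fraction of the game still to play

# The game ends 2-1 only if neither team scores again.
p_home_stays = poisson.pmf(mu=xgh * remaining, k=hg, loc=hg)
p_away_stays = poisson.pmf(mu=xga * remaining, k=ag, loc=ag)
p_2_1 = p_home_stays * p_away_stays

print(f"P(final result 2-1) = {p_2_1:.1%}")
```

With the scaled means this comes out at roughly 43%, instead of the 2.21% obtained with the unscaled ones.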
While we are at it, and since there is a python tag, some comments on the code.
for i, l in zip(range(0, 6), range(0, 6)):
    print(i,l)
produces
0 0
1 1
2 2
3 3
4 4
5 5
So it is quite strange not to use a single variable, especially since there is no useful way to use different ranges here: zip stops at the shortest of its iterables, so they effectively have the same length. And I cannot see under which circumstances we would need, for example, i to grow from 0 to 5 while l grows from 0 to 10.
So just
for k in range(0, 6):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag)
    pas.append(pa)
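A quick way to convince yourself that nothing is lost with a single loop variable is sketched below. The list names `zipped` and `single` are mine; the values are the ones from the question.

```python
from scipy.stats import poisson

xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1
scale = (90 - minute) / 90

# Original style: two loop variables that always hold the same value.
zipped = [(poisson.pmf(mu=xgh * scale, k=i, loc=hg),
           poisson.pmf(mu=xga * scale, k=l, loc=ag))
          for i, l in zip(range(0, 6), range(0, 6))]

# Simplified style: one loop variable.
single = [(poisson.pmf(mu=xgh * scale, k=k, loc=hg),
           poisson.pmf(mu=xga * scale, k=k, loc=ag))
          for k in range(0, 6)]

print(zipped == single)

# zip stops at the shortest input: only 6 pairs, the last one being (5, 5).
pairs = list(zip(range(0, 6), range(0, 11)))
print(len(pairs), pairs[-1])
```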
I surmise, especially because of the subject of the next remark, that once upon a time there was a product instead of that zip, before you realized that it was computing the exact same pmf several times.
That usage of product was then probably reduced to the task of computing phs[i]×pas[j] for all i,j. That is a good usage of product.
But, since you have 2 lists, and you intend to build a numpy array from those phs[i]×pas[j], let numpy do the job. It will be more efficient at it.
prod_table = np.array(phs).reshape(-1,1)*np.array(pas)
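The following sketch checks that the broadcast product gives the same table as the product version. The names `via_product`, `via_broadcast` and `via_outer` are mine; `np.outer` is shown only as an equivalent spelling of the same operation.

```python
from itertools import product

import numpy as np
from scipy.stats import poisson

scale = (90 - 70) / 90
phs = [poisson.pmf(mu=2.69 * scale, k=k, loc=2) for k in range(6)]
pas = [poisson.pmf(mu=1.12 * scale, k=k, loc=1) for k in range(6)]

# Version from the question: flat list of products, reshaped afterwards.
via_product = np.array([i * j for i, j in product(phs, pas)]).reshape(6, 6)

# Broadcasting: a (6, 1) column times a (6,) row gives a (6, 6) table.
via_broadcast = np.array(phs).reshape(-1, 1) * np.array(pas)

# np.outer computes the same outer product.
via_outer = np.outer(phs, pas)

print(np.allclose(via_product, via_broadcast))
print(np.allclose(via_broadcast, via_outer))
```

In all three tables, the row index is the home team's final goal count and the column index is the away team's.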
Which leads to another optimization. If the goal is to turn phs and pas into arrays so that we can multiply them (one as a row, the other as a column) to get the table, why not let numpy build those arrays directly? Like many numpy-based functions, pmf accepts a sequence for k rather than a scalar, and then returns an array rather than a scalar.
So
phs=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg)
pas=poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
So, altogether
prod_table = (
    poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg).reshape(-1, 1)
    * poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
)
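Put together with the DataFrame from the question, a complete version could look like the sketch below. `max_goals` is a name I introduce for the table size (6 in the question), and the last line is only a rough check of how much probability the 6×6 table leaves out.

```python
import numpy as np
import pandas as pd
from scipy.stats import poisson

xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1
scale = (90 - minute) / 90
max_goals = 6  # the table covers final scores 0..5 for each team

phs = poisson.pmf(mu=xgh * scale, k=np.arange(max_goals), loc=hg)
pas = poisson.pmf(mu=xga * scale, k=np.arange(max_goals), loc=ag)

# Rows: final home goals, columns: final away goals.
prob_df = pd.DataFrame(phs.reshape(-1, 1) * pas,
                       index=range(max_goals), columns=range(max_goals))
print(prob_df.round(4))

# Scores below the current 2-1 are impossible, so those cells are 0.
print(prob_df.loc[1, 1])

# Probability mass not covered by the table (6 or more goals for a team).
print(1 - prob_df.to_numpy().sum())
```

Because of loc, the table is indexed by final score rather than by additional goals, so the 2-1 cell is `prob_df.loc[2, 1]`.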
Optimisations | Time (μs) |
---|---|
Without | 1647 |
With | 329 |
So it is not just more compact and readable; it is also (almost exactly) 5 times faster.
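If you want to reproduce this kind of measurement, here is one possible sketch using the standard timeit module. The function names and the repetition count are my own choices, and the absolute numbers will differ from machine to machine, so treat the timings above as indicative only.

```python
import timeit
from itertools import product

import numpy as np
from scipy.stats import poisson

xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1
scale = (90 - minute) / 90


def table_loop():
    """Python loop plus itertools.product (with scaled means)."""
    phs, pas = [], []
    for k in range(6):
        phs.append(poisson.pmf(mu=xgh * scale, k=k, loc=hg))
        pas.append(poisson.pmf(mu=xga * scale, k=k, loc=ag))
    table = np.array([i * j for i, j in product(phs, pas)])
    return table.reshape(6, 6)


def table_vectorized():
    """Vectorized pmf plus broadcasting."""
    phs = poisson.pmf(mu=xgh * scale, k=range(6), loc=hg)
    pas = poisson.pmf(mu=xga * scale, k=range(6), loc=ag)
    return phs.reshape(-1, 1) * pas


n = 200  # number of repetitions, arbitrary
t_loop = timeit.timeit(table_loop, number=n) / n
t_vec = timeit.timeit(table_vectorized, number=n) / n
print(f"loop: {t_loop * 1e6:.0f} us, vectorized: {t_vec * 1e6:.0f} us")
```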