I have a Pandas dataframe that looks like the following:
Race_ID Date Student_ID feature1
1 1/1/2023 3 0.02167131
1 1/1/2023 4 0.17349148
1 1/1/2023 6 0.08438952
1 1/1/2023 8 0.04143787
1 1/1/2023 9 0.02589056
1 1/1/2023 1 0.03866752
1 1/1/2023 10 0.0461553
1 1/1/2023 45 0.09212758
1 1/1/2023 23 0.10879326
1 1/1/2023 102 0.186921
1 1/1/2023 75 0.02990676
1 1/1/2023 27 0.02731904
1 1/1/2023 15 0.06020158
1 1/1/2023 29 0.06302721
3 17/4/2022 5 0.2
3 17/4/2022 2 0.1
3 17/4/2022 3 0.55
3 17/4/2022 4 0.15
I would like to create a new column using the following method:
import numpy as np
from scipy import integrate
from scipy.stats import norm
from scipy.optimize import least_squares
def integrandforpi_i(xi, ti, *theta):
    prod = 1
    # product over the remaining students: P(X_j > xi) for each j
    for t in theta:
        prod = prod * (1 - norm.cdf(xi - t))
    return prod * norm.pdf(xi - ti)
def pi_i(ti, *theta):
    # quad returns (value, error estimate); keep only the value
    return integrate.quad(integrandforpi_i, -np.inf, np.inf, args=(ti, *theta))[0]
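In other words, pi_i computes the integral

$$\pi_i(t_i;\theta) = \int_{-\infty}^{\infty} \phi(x - t_i) \prod_{j \ne i} \bigl(1 - \Phi(x - t_j)\bigr)\,dx,$$

where $\phi$ and $\Phi$ are the standard normal pdf and cdf. Equivalently, if each student's performance is $X_k \sim N(t_k, 1)$, then $\pi_i$ is the probability that student $i$'s value is the smallest of all.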
For each Race_ID, the value for each Student_ID in the new column is given by solving a system of nonlinear equations using least_squares in scipy.optimize, as follows. For example, for race 1 there are 14 students in the race, so we have to solve the following system of 14 nonlinear equations, with each unknown bounded between -2 and 2:
def equations(params):
    t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = params
    return (pi_i(t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.02167131,
            pi_i(t2,t1,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.17349148,
            pi_i(t3,t2,t1,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.08438952,
            pi_i(t4,t2,t3,t1,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.04143787,
            pi_i(t5,t2,t3,t4,t1,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.02589056,
            pi_i(t6,t2,t3,t4,t5,t1,t7,t8,t9,t10,t11,t12,t13,t14) - 0.03866752,
            pi_i(t7,t2,t3,t4,t5,t6,t1,t8,t9,t10,t11,t12,t13,t14) - 0.0461553,
            pi_i(t8,t2,t3,t4,t5,t6,t7,t1,t9,t10,t11,t12,t13,t14) - 0.09212758,
            pi_i(t9,t2,t3,t4,t5,t6,t7,t8,t1,t10,t11,t12,t13,t14) - 0.10879326,
            pi_i(t10,t2,t3,t4,t5,t6,t7,t8,t9,t1,t11,t12,t13,t14) - 0.186921,
            pi_i(t11,t2,t3,t4,t5,t6,t7,t8,t9,t10,t1,t12,t13,t14) - 0.02990676,
            pi_i(t12,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t1,t13,t14) - 0.02731904,
            pi_i(t13,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t1,t14) - 0.06020158,
            pi_i(t14,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t1) - 0.06302721)
res = least_squares(equations, (1,1,1,1,1,1,1,1,1,1,1,1,1,1), bounds = ((-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2), (2,2,2,2,2,2,2,2,2,2,2,2,2,2)))
t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = res.x
Solving gives t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = [1.38473533 0.25616609 0.6935956 1.07314877 1.30201502 1.10781642 1.01839475 0.64349646 0.54630158 0.20719836 1.23347391 1.27642131 0.879412 0.83338882]
Similarly, for race 3, there are 4 students competing, so we have to solve the following system of 4 nonlinear equations:
def equations(params):
    t1,t2,t3,t4 = params
    return (pi_i(t1,t2,t3,t4) - 0.2,
            pi_i(t2,t1,t3,t4) - 0.1,
            pi_i(t3,t2,t1,t4) - 0.55,
            pi_i(t4,t2,t3,t1) - 0.15)
res = least_squares(equations, (1,1,1,1), bounds = ((-2,-2,-2,-2), (2,2,2,2)))
t1,t2,t3,t4 = res.x
which gives t1,t2,t3,t4 = [0.9209873 1.37615468 0.12293934 1.11735818].
Hence the desired outcome looks like:
Race_ID Date Student_ID feature1 new_column
1 1/1/2023 3 0.02167131 1.38473533
1 1/1/2023 4 0.17349148 0.25616609
1 1/1/2023 6 0.08438952 0.6935956
1 1/1/2023 8 0.04143787 1.07314877
1 1/1/2023 9 0.02589056 1.30201502
1 1/1/2023 1 0.03866752 1.10781642
1 1/1/2023 10 0.0461553 1.01839475
1 1/1/2023 45 0.09212758 0.64349646
1 1/1/2023 23 0.10879326 0.54630158
1 1/1/2023 102 0.186921 0.20719836
1 1/1/2023 75 0.02990676 1.23347391
1 1/1/2023 27 0.02731904 1.27642131
1 1/1/2023 15 0.06020158 0.879412
1 1/1/2023 29 0.06302721 0.83338882
3 17/4/2022 5 0.2 0.9209873
3 17/4/2022 2 0.1 1.37615468
3 17/4/2022 3 0.55 0.12293934
3 17/4/2022 4 0.15 1.11735818
I have no idea how to generate the new column. Also, my actual dataframe is much larger, with many races, so I would also like to ask: is there any way to speed up the computation? Thanks a lot.
Here is how to parametrize and automate the fit for each group.
First we load your dataset:
import io
import numpy as np
import pandas as pd
from scipy import integrate, stats, optimize
data = pd.read_fwf(io.StringIO("""Race_ID Date Student_ID feature1
1 1/1/2023 3 0.02167131
1 1/1/2023 4 0.17349148
1 1/1/2023 6 0.08438952
1 1/1/2023 8 0.04143787
1 1/1/2023 9 0.02589056
1 1/1/2023 1 0.03866752
1 1/1/2023 10 0.0461553
1 1/1/2023 45 0.09212758
1 1/1/2023 23 0.10879326
1 1/1/2023 102 0.186921
1 1/1/2023 75 0.02990676
1 1/1/2023 27 0.02731904
1 1/1/2023 15 0.06020158
1 1/1/2023 29 0.06302721
3 17/4/2022 5 0.2
3 17/4/2022 2 0.1
3 17/4/2022 3 0.55
3 17/4/2022 4 0.15 """))
We slightly rewrite your functions:
def integrand(x, ti, *theta):
    product = 1.
    # P(X_j > x) for every other student j
    for t in theta:
        product = product * (1 - stats.norm.cdf(x - t))
    return product * stats.norm.pdf(x - ti)
def integral(ti, *theta):
    # quad returns (value, abserr); keep only the value
    return integrate.quad(integrand, -np.inf, np.inf, args=(ti, *theta))[0]
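As a quick sanity check (my own example, not from the original post): with equal strengths every student is equally likely to come first, so for four identical students the integral should return 1/4:

# four students with identical parameters: each wins with probability 1/4
print(integral(0.0, 0.0, 0.0, 0.0))  # ~0.25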
Then we parametrize the system of equations:
def change_order(parameters, i):
    # move the i-th parameter to the front, keeping the others after it
    return [parameters[i]] + parameters[0:i] + parameters[i+1:]
def system(parameters, t):
    parameters = parameters.tolist()
    values = np.full(len(t), np.nan)
    for i in range(len(parameters)):
        # the integrand is symmetric in theta, so only the leading
        # parameter matters and reordering the list in place is safe
        parameters = change_order(parameters, i)
        values[i] = integral(*parameters) - t[i]
    return values
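For instance, we can reproduce the 4-student race from the question by hand (a small check; the target values are the race-3 probabilities above):

# t holds the observed win probabilities for race 3
t = np.array([0.2, 0.1, 0.55, 0.15])
u = np.ones_like(t)
res = optimize.least_squares(system, u, bounds=[-2*u, 2*u], args=(t,))
print(res.x)  # close to [0.92, 1.38, 0.12, 1.12]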
At this stage we can solve any system of length n. Now we create a function that allows us to use groupby in pandas:
def solver(x):
    # target probabilities for this race
    t = x["feature1"].values
    # start from ones, with bounds of [-2, 2] in every dimension
    u = np.ones_like(t)
    solution = optimize.least_squares(system, u, bounds=[-2*u, 2*u], args=(t,))
    return solution.x
And we apply it to the dataframe:
data["new_column"] = data.groupby("Race_ID").apply(solver, include_groups=False).explode().values
Which leads to:
Race_ID Date Student_ID feature1 new_column
0 1 1/1/2023 3 0.021671 1.383615
1 1 1/1/2023 4 0.173491 0.25823
2 1 1/1/2023 6 0.084390 0.695116
3 1 1/1/2023 8 0.041438 1.073675
4 1 1/1/2023 9 0.025891 1.301445
5 1 1/1/2023 1 0.038668 1.108209
6 1 1/1/2023 10 0.046155 1.019114
7 1 1/1/2023 45 0.092128 0.645103
8 1 1/1/2023 23 0.108793 0.548053
9 1 1/1/2023 102 0.186921 0.209302
10 1 1/1/2023 75 0.029907 1.23329
11 1 1/1/2023 27 0.027319 1.276228
12 1 1/1/2023 15 0.060202 0.880537
13 1 1/1/2023 29 0.063027 0.855987
14 3 17/4/2022 5 0.200000 0.920987
15 3 17/4/2022 2 0.100000 1.376155
16 3 17/4/2022 3 0.550000 0.122939
17 3 17/4/2022 4 0.150000 1.117358
Indeed the whole operation is rather slow for high-dimensional groups. There are a few things you can do to speed up the process: loosen the quad and least_squares tolerances, supply a better initial guess, or parallelize over races, since each group is solved independently (see the sketch below). But it will still be a heavy operation if your dataset is huge or has high-dimensional groups.
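For example, since each race is an independent least-squares problem, the groups can be distributed across processes. Here is a minimal sketch, assuming joblib is installed (it is not part of the original answer):

from joblib import Parallel, delayed

# reuses solver and data defined above; one worker per CPU core
results = Parallel(n_jobs=-1)(
    delayed(solver)(group) for _, group in data.groupby("Race_ID")
)
# results follow groupby order, so they align with the rows as before
data["new_column"] = np.concatenate(results)

Because the races do not share any state, the speedup should be close to linear in the number of cores.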