python pandas dataframe scipy scipy-optimize

New column in Pandas dataframe using least squares from scipy.optimize

I have a Pandas dataframe that looks like the following:

Race_ID   Date           Student_ID      feature1  
1         1/1/2023       3               0.02167131     
1         1/1/2023       4               0.17349148     
1         1/1/2023       6               0.08438952     
1         1/1/2023       8               0.04143787     
1         1/1/2023       9               0.02589056
1         1/1/2023       1               0.03866752     
1         1/1/2023       10              0.0461553     
1         1/1/2023       45              0.09212758
1         1/1/2023       23              0.10879326      
1         1/1/2023       102             0.186921      
1         1/1/2023       75              0.02990676      
1         1/1/2023       27              0.02731904      
1         1/1/2023       15              0.06020158      
1         1/1/2023       29              0.06302721                         
3         17/4/2022      5               0.2     
3         17/4/2022      2               0.1     
3         17/4/2022      3               0.55     
3         17/4/2022      4               0.15

I would like to create a new column using the following method:

Define the following function using integrals

import numpy as np
from scipy import integrate
from scipy.stats import norm
import scipy.integrate as integrate
from scipy.optimize import fsolve
from scipy.optimize import least_squares

def integrandforpi_i(xi, ti, *theta):
  prod = 1
  for t in theta:
    prod = prod * (1 - norm.cdf(xi - t))
  return prod * norm.pdf(xi - ti)

def pi_i(ti, *theta):
  return integrate.quad(integrandforpi_i, -np.inf, np.inf, args=(ti, *theta))[0]

for each Race_ID, the value for each Student_ID in the new column is given by solving a system of nonlinear equations using least_squares in scipy.optimize as follows:

For example, for race 1, there are 14 students in the race and we will have to solve the following system of 14 nonlinear equations and we restrict the bound in between -2 and 2:

def equations(params):
    t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = params
    return (pi_i(t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.02167131,
            pi_i(t2,t1,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.17349148,
            pi_i(t3,t2,t1,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.08438952,
            pi_i(t4,t2,t3,t1,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.04143787,
            pi_i(t5,t2,t3,t4,t1,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.02589056,
            pi_i(t6,t2,t3,t4,t5,t1,t7,t8,t9,t10,t11,t12,t13,t14) - 0.03866752,
            pi_i(t7,t2,t3,t4,t5,t6,t1,t8,t9,t10,t11,t12,t13,t14) - 0.0461553,
            pi_i(t8,t2,t3,t4,t5,t6,t7,t1,t9,t10,t11,t12,t13,t14) - 0.09212758,
            pi_i(t9,t2,t3,t4,t5,t6,t7,t8,t1,t10,t11,t12,t13,t14) - 0.10879326,
            pi_i(t10,t2,t3,t4,t5,t6,t7,t8,t9,t1,t11,t12,t13,t14) - 0.186921,
            pi_i(t11,t2,t3,t4,t5,t6,t7,t8,t9,t10,t1,t12,t13,t14) - 0.02990676,
            pi_i(t12,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t1,t13,t14) - 0.02731904,
            pi_i(t13,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t1,t14) - 0.06020158,
            pi_i(t14,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t13,t14,t1) - 0.06302721)

res = least_squares(equations, (1,1,1,1,1,1,1,1,1,1,1,1,1,1), bounds = ((-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2), (2,2,2,2,2,2,2,2,2,2,2,2,2,2)))

t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = res.x

Solving gives t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = [1.38473533 0.25616609 0.6935956 1.07314877 1.30201502 1.10781642 1.01839475 0.64349646 0.54630158 0.20719836 1.23347391 1.27642131 0.879412 0.83338882]

Similarly, for race 2, there are 4 students competing and we will have to solve the following system of 4 nonlinear equations:

def equations(params):
    t1,t2,t3,t4 = params
    return (pi_i(t1,t2,t3,t4) - 0.2,
            pi_i(t2,t1,t3,t4) - 0.1,
            pi_i(t3,t2,t1,t4) - 0.55,
            pi_i(t4,t2,t3,t1) - 0.15)

res = least_squares(equations, (1,1,1,1), bounds = ((-2,-2,-2,-2), (2,2,2,2)))

t1,t2,t3,t4 = res.x

which gives t1,t2,t3,t4 = [0.9209873 1.37615468 0.12293934 1.11735818].

Hence the desired outcome looks like

Race_ID   Date           Student_ID      feature1       new_column
1         1/1/2023       3               0.02167131     1.38473533
1         1/1/2023       4               0.17349148     0.25616609
1         1/1/2023       6               0.08438952     0.6935956
1         1/1/2023       8               0.04143787     1.07314877
1         1/1/2023       9               0.02589056     1.30201502
1         1/1/2023       1               0.03866752     1.10781642
1         1/1/2023       10              0.0461553      1.01839475
1         1/1/2023       45              0.09212758     0.64349646
1         1/1/2023       23              0.10879326     0.54630158
1         1/1/2023       102             0.186921       0.20719836
1         1/1/2023       75              0.02990676     1.23347391
1         1/1/2023       27              0.02731904     1.27642131 
1         1/1/2023       15              0.06020158     0.879412 
1         1/1/2023       29              0.06302721     0.83338882                    
3         17/4/2022      5               0.2            0.9209873
3         17/4/2022      2               0.1            1.37615468
3         17/4/2022      3               0.55           0.12293934
3         17/4/2022      4               0.15           1.11735818

I have no idea how to generate the new column. Also, my actual dataframe is much larger with many races so I would like to ask is there any way to speed up the computation too, thanks a lot.

Solution

Here is how to parametrize and automatize your regression for each group.

First we load your dataset:

import io
import numpy as np
import pandas as pd
from scipy import integrate, stats, optimize

data = pd.read_fwf(io.StringIO("""Race_ID   Date           Student_ID      feature1  
1         1/1/2023       3               0.02167131     
1         1/1/2023       4               0.17349148     
1         1/1/2023       6               0.08438952     
1         1/1/2023       8               0.04143787     
1         1/1/2023       9               0.02589056
1         1/1/2023       1               0.03866752     
1         1/1/2023       10              0.0461553     
1         1/1/2023       45              0.09212758
1         1/1/2023       23              0.10879326      
1         1/1/2023       102             0.186921      
1         1/1/2023       75              0.02990676      
1         1/1/2023       27              0.02731904      
1         1/1/2023       15              0.06020158      
1         1/1/2023       29              0.06302721                         
3         17/4/2022      5               0.2     
3         17/4/2022      2               0.1     
3         17/4/2022      3               0.55     
3         17/4/2022      4               0.15   """))

We slightly rewrite your functions:

def integrand(x, ti, *theta):
    product = 1.
    for t in theta:
        product = product * (1 - stats.norm.cdf(x - t))
    return product * stats.norm.pdf(x - ti)

def integral(ti, *theta):
    return integrate.quad(integrand, -np.inf, np.inf, args=(ti, *theta))[0]

Then we parametrize the system of equations:

def change_order(parameters, i):
    return [parameters[i]] + parameters[0:i] + parameters[i+1:]

def system(parameters, t):
    parameters = parameters.tolist()
    values = np.full(len(t), np.nan)
    for i in range(len(parameters)):
        parameters = change_order(parameters, i)
        values[i] = integral(*parameters) - t[i]
    return values

At this stage we can solve any system of length n, now we create a function that allow us to use groupby in pandas:

def solver(x):
    t = x["feature1"].values
    u = np.ones_like(t)
    solution = optimize.least_squares(system, u, bounds=[-2*u, 2*u], args=(t,))
    return solution.x

And we apply it to the dataframe:

data["new_column"] = data.groupby("Race_ID").apply(solver, include_groups=False).explode().values

Which leads to:

    Race_ID       Date  Student_ID  feature1 new_column
0         1   1/1/2023           3  0.021671   1.383615
1         1   1/1/2023           4  0.173491    0.25823
2         1   1/1/2023           6  0.084390   0.695116
3         1   1/1/2023           8  0.041438   1.073675
4         1   1/1/2023           9  0.025891   1.301445
5         1   1/1/2023           1  0.038668   1.108209
6         1   1/1/2023          10  0.046155   1.019114
7         1   1/1/2023          45  0.092128   0.645103
8         1   1/1/2023          23  0.108793   0.548053
9         1   1/1/2023         102  0.186921   0.209302
10        1   1/1/2023          75  0.029907    1.23329
11        1   1/1/2023          27  0.027319   1.276228
12        1   1/1/2023          15  0.060202   0.880537
13        1   1/1/2023          29  0.063027   0.855987
14        3  17/4/2022           5  0.200000   0.920987
15        3  17/4/2022           2  0.100000   1.376155
16        3  17/4/2022           3  0.550000   0.122939
17        3  17/4/2022           4  0.150000   1.117358

Indeed the whole operation is rather slow for high dimensional groups.

There are few things you can do to speed up the whole process:

use numba for the solving step;
parallelize group solving;

But it still is an heavy operation if your dataset is huge or have high dimensional groups.