pandas dataframe group-by apply

Pandas dataframe groupby apply function with variable number of arguments


I have a pandas dataframe that looks like

import pandas as pd

data = {
  "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
  "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
  "theta": [8,9,2,12,4,5,30,3,2,1,50]
}

df = pd.DataFrame(data)

and I have a function f(thetai, *theta) = thetai ** 2 + (the sum of the other thetas in the same race). I want to apply it to the theta column of the dataframe, grouped by Race_ID, and store the result in a new column called feature.

So we have

For student 1 in Race 2, the value is 8^2 + 9+2+12+4

For student 2 in Race 2, the value is 9^2 + 8+2+12+4

For student 3 in Race 2, the value is 2^2 + 8+9+12+4

etc.
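
To make the variable-argument part concrete, here is a minimal sketch of the toy function described above. It assumes the "other thetas" are simply summed, as in the worked values; the variable name `others` is only illustrative.

```python
def f(thetai, *theta):
    """Toy function: square the student's own theta, then add the others."""
    return thetai ** 2 + sum(theta)


# Student 1 in Race 2: own theta is 8, the others are 9, 2, 12, 4.
print(f(8, 9, 2, 12, 4))  # 91

# The same call when the other thetas arrive as a list of any length.
others = [5, 3, 2, 1, 50]
print(f(30, *others))  # 961
```

The `*others` unpacking is what lets one function serve races with different numbers of students.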

I know about the groupby and apply methods but I don't know how to apply the methods when the number of arguments can vary.

So the desired outcome looks like

data = {
  "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
  "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
  "theta": [8,9,2,12,4,5,30,3,2,1,50],
  "feature": [91,107,37,167,47,111,961,97,93,91,2541]
}

df = pd.DataFrame(data)

Edit: My actual function f is a lot more complicated; I used the example above only to demonstrate the main hurdle, namely the variable number of arguments. Here is my actual function f:

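import numpy as np
from scipy import integrate
from scipy.stats import norm


def integrand(xi, thetai, *theta):
  S = 0
  for tj in theta:
    prod = 1
    for t in theta:
      # skip tj itself (and any theta numerically equal to it)
      if abs(t - tj) < 1e-10:
        continue
      prod = prod * (1 - norm.cdf(xi + thetai - t))
    S = S + norm.cdf(xi + thetai - tj) * prod
  return S * norm.pdf(xi)


def f(thetai, *theta):
  # integrate over xi; quad returns (value, abs_error), keep the value
  return (integrate.quad(integrand, -np.inf, np.inf, args=(thetai, *theta)))[0]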

Solution

  • Just make an intermediate column holding the sum of the other thetas in the same race.

    theta_sum = df.groupby("Race_ID").sum()["theta"].to_dict()
    # (theta sum of race_id) - self theta
    df["theta_feature"] = df.apply(
        lambda x: theta_sum[x["Race_ID"]] - x["theta"], axis=1
    )
    df["feature"] = df.apply(
        lambda x: x["theta"] ** 2 + x["theta_feature"], axis=1
    )
    

    If you don't want to keep the intermediate column, you can drop it with

    df.drop(columns="theta_feature", inplace=True)
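
For the toy function only, the same numbers can also be computed without a row-wise apply and without the intermediate column. This is a sketch under the assumption that f really is "theta squared plus the sum of the others"; the name `race_total` is illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Race_ID": [2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5],
    "Student_ID": [1, 2, 3, 4, 5, 9, 10, 2, 3, 6, 5],
    "theta": [8, 9, 2, 12, 4, 5, 30, 3, 2, 1, 50],
})

# Per-race sum of theta, broadcast back to every row of that race.
race_total = df.groupby("Race_ID")["theta"].transform("sum")

# theta ** 2 plus the sum of the *other* thetas in the same race.
df["feature"] = df["theta"] ** 2 + (race_total - df["theta"])

print(df["feature"].tolist())
# [91, 107, 37, 167, 47, 111, 961, 97, 93, 91, 2541]
```

This shortcut relies on the sum being separable into "race total minus own value", so it does not carry over to a function that needs each of the other thetas individually.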
    

    Edit: To pass the other thetas to your actual function f (which will give a different result for the feature column, since that function wasn't described initially), you can use apply as follows:

    def calc_grpd_thetas(x):
        _df = df[(df['Race_ID'] == x["Race_ID"]) & (df.index != x.name)]
        thetas = list(_df['theta'])
        return f(x["theta"], *thetas)
    
    df["feature"] = df.apply(calc_grpd_thetas, axis=1)
    

    which will net you the resulting dataframe (shown as an image in the original answer).
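
As an alternative sketch of the row-wise apply above, the leave-one-out call can be done inside a groupby so that each race is sliced only once. The helper name `leave_one_out` and the stand-in function `toy_f` are hypothetical; `toy_f` is used here only so the output can be checked against the desired values in the question, and the real f could be passed in its place.

```python
import pandas as pd


def leave_one_out(func):
    """Wrap func(own, *others) for use with a SeriesGroupBy.transform call."""
    def per_group(s):
        values = s.tolist()
        # For position i, pass values[i] first and every other value after it.
        out = [
            func(v, *values[:i], *values[i + 1:])
            for i, v in enumerate(values)
        ]
        return pd.Series(out, index=s.index)
    return per_group


def toy_f(thetai, *theta):
    # Stand-in for the real f: own theta squared plus the other thetas.
    return thetai ** 2 + sum(theta)


df = pd.DataFrame({
    "Race_ID": [2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5],
    "Student_ID": [1, 2, 3, 4, 5, 9, 10, 2, 3, 6, 5],
    "theta": [8, 9, 2, 12, 4, 5, 30, 3, 2, 1, 50],
})

df["feature"] = df.groupby("Race_ID")["theta"].transform(leave_one_out(toy_f))

print(df["feature"].tolist())
# [91, 107, 37, 167, 47, 111, 961, 97, 93, 91, 2541]
```

Because the exclusion is done by position within the group rather than by index label, this also behaves sensibly if the dataframe index contains duplicates. The function is still called once per row, so an expensive f (such as one that integrates numerically) remains the dominant cost.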