I have a pandas dataframe that looks like
import pandas as pd
data = {
    "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
    "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
    "theta": [8,9,2,12,4,5,30,3,2,1,50]
}
df = pd.DataFrame(data)
and I have a function f(thetai, *theta) = thetai ** 2 + the other thetas in the same race, which I want to apply to the theta column of the dataframe, grouped by Race_ID, to create a new column called feature.
So we have
For student 1 in Race 2, the value is 8^2 + 9+2+12+4
For student 2 in Race 2, the value is 9^2 + 8+2+12+4
For student 3 in Race 2, the value is 2^2 + 8+9+12+4
etc.
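To make the arithmetic above concrete, here is a minimal sketch of the simple f in plain Python, checked against Race 2. The list name race_2 is only for illustration and is not part of the dataframe.

```python
def f(thetai, *theta):
    """Simple f: own theta squared plus the sum of the other thetas."""
    return thetai ** 2 + sum(theta)

race_2 = [8, 9, 2, 12, 4]  # thetas of Race 2 from the dataframe above

# For each position i, pass theta_i first and the remaining thetas as varargs.
values = [f(t, *(race_2[:i] + race_2[i + 1:])) for i, t in enumerate(race_2)]
print(values)  # [91, 107, 37, 167, 47]
```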
I know about the groupby and apply methods, but I don't know how to use them when the number of arguments can vary.
So the desired outcome looks like
data = {
    "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
    "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
    "theta": [8,9,2,12,4,5,30,3,2,1,50],
    "feature": [91,107,37,167,47,111,961,97,93,91,2541]
}
df = pd.DataFrame(data)
Edit: My actual function f is a lot more complicated; I was just using the example here to demonstrate the main hurdle, namely the variable-argument problem. Here is my actual function f:
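import numpy as np
from scipy import integrate
from scipy.stats import norm

def integrand(xi, thetai, *theta):
    S = 0
    for tj in theta:
        prod = 1
        for t in theta:
            # skip tj itself (and any theta numerically equal to it)
            if abs(t - tj) < 1e-10:
                continue
            prod = prod * (1 - norm.cdf(xi + thetai - t))
        S = S + norm.cdf(xi + thetai - tj) * prod
    return S * norm.pdf(xi)

def f(thetai, *theta):
    # integrate over xi on the whole real line; quad returns (value, abserr)
    return (integrate.quad(integrand, -np.inf, np.inf, args=(thetai, *theta)))[0]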
Just make an intermediate column holding the sum of the other thetas.
# total theta per race, as {Race_ID: sum}
theta_sum = df.groupby("Race_ID")["theta"].sum().to_dict()
# (theta sum of race_id) - self theta
df["theta_feature"] = df.apply(
    lambda x: theta_sum[x["Race_ID"]] - x["theta"], axis=1
)
df["feature"] = df.apply(
    lambda x: x["theta"] ** 2 + x["theta_feature"], axis=1
)
If you don't want to keep the intermediate column, you can drop it with
df.drop(columns="theta_feature", inplace=True)
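The same intermediate-sum idea can probably be written without row-wise apply and without a helper column. The following is a sketch only, assuming the simple f from the question (thetai ** 2 plus the other thetas); the name race_total is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Race_ID": [2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5],
    "Student_ID": [1, 2, 3, 4, 5, 9, 10, 2, 3, 6, 5],
    "theta": [8, 9, 2, 12, 4, 5, 30, 3, 2, 1, 50],
})

# Per-race total of theta, broadcast back to every row of that race.
race_total = df.groupby("Race_ID")["theta"].transform("sum")

# own theta squared + (race total - own theta)
# = own theta squared + sum of the other thetas in the race
df["feature"] = df["theta"] ** 2 + (race_total - df["theta"])
```

This only works because the simple f depends on the other thetas through their sum; it does not carry over to a function that needs the individual values.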
Edit: To pass the thetas to the function f (which will give a different result for the feature column, since that function wasn't described in the original question), you can use apply as follows:
def calc_grpd_thetas(x):
    # all other rows of the same race (the current row is excluded by index)
    _df = df[(df['Race_ID'] == x["Race_ID"]) & (df.index != x.name)]
    thetas = list(_df['theta'])
    return f(x["theta"], *thetas)

df["feature"] = df.apply(calc_grpd_thetas, axis=1)
which will net you this
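A per-group variant of the same idea avoids filtering the whole dataframe once per row. This is only a sketch: it uses the simple f from the question as a stand-in for the real one (an assumption, so that the expected numbers are known), and the helper name feature_for_group is hypothetical.

```python
import pandas as pd

def f(thetai, *theta):
    # Stand-in for the real f (assumption): own theta squared plus the others.
    return thetai ** 2 + sum(theta)

def feature_for_group(thetas):
    # thetas is the "theta" Series of one race. Each row is excluded by
    # position, so duplicate theta values in a race are still passed to f.
    vals = thetas.tolist()
    out = [f(v, *(vals[:i] + vals[i + 1:])) for i, v in enumerate(vals)]
    return pd.Series(out, index=thetas.index)

df = pd.DataFrame({
    "Race_ID": [2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5],
    "Student_ID": [1, 2, 3, 4, 5, 9, 10, 2, 3, 6, 5],
    "theta": [8, 9, 2, 12, 4, 5, 30, 3, 2, 1, 50],
})

# transform calls feature_for_group once per race and aligns the result by index.
df["feature"] = df.groupby("Race_ID")["theta"].transform(feature_for_group)
```

With this stand-in f the feature column matches the desired outcome in the question; swapping in the real f only changes the function being called.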