I have a highly imbalanced dataset and I want to assign weights to my observations by month.
For instance, if an observation is from January 2022, I'll give it a weight of 1/5,
if it's from March 2022, I'll give it 1/3,
and so on.
feature_1  date        weights
117        2016-11-12  0.015
...        ...         ...
123        2022-01-01  0.2
234        2022-01-02  0.2
...        ...         ...
345        2022-05-31  1.0
I'm using CatBoostClassifier,
and I believe I can pass a list of weights for all my data to the weight
parameter, so it would look something like this:
model.fit(Pool(X_train,y_train,weight=train_weight))
The problem is that I can't think of an elegant way to build the weights column/list.
For now, I split my dataframe by month like this:
g = X_train.groupby(pd.Grouper(key='date', freq='M'))
dfs = [group for _,group in g]
and built the weights column like this:
for i, df in enumerate(dfs):
    weight = []
    for val in dfs[i].iterrows():
        weight.append(1 / (len(dfs)+2 - i))
    dfs[i]['weight'] = weight
Given the following toy dataframe:
from datetime import datetime
import pandas as pd
df = pd.DataFrame(
    {
        "feature_1": [117, 123, 234, 345],
        "date": ["2016-11-12", "2022-01-01", "2022-01-02", "2022-05-31"],
    }
)
df["date"] = pd.to_datetime(df["date"])
Define a helper function to calculate weights:
def weight(current_date, previous_date):
    try:
        wgt = round(
            1
            / (
                (current_date.year - previous_date.year) * 12
                + current_date.month
                - previous_date.month
            ),
            3,
        )
    except ZeroDivisionError:
        wgt = 1
    return wgt
And so, assuming the most recent date is 31 May 2022:
df["weight"] = df["date"].apply(lambda x: weight(datetime(2022, 5, 31), x))
print(df)
# Output
   feature_1       date  weight
0        117 2016-11-12   0.015
1        123 2022-01-01   0.250
2        234 2022-01-02   0.250
3        345 2022-05-31   1.000
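Note that the helper above gives January 2022 a weight of 0.25 (1/4), whereas the question asks for 1/5 for January and 1/3 for March. If that is the scheme you are after, one possible variant is to add 1 to the month difference, which also removes the need for the ZeroDivisionError fallback. The sketch below does this in a vectorized way, without apply. It rests on two assumptions of mine: the reference month is taken as the latest date in the data, and the weight is 1 / (months back + 1). The name months_back is just illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "feature_1": [117, 123, 234, 345],
        "date": pd.to_datetime(
            ["2016-11-12", "2022-01-01", "2022-01-02", "2022-05-31"]
        ),
    }
)

# Assumption: the reference month is the most recent date in the data.
latest = df["date"].max()

# Number of calendar months between each row and the reference month
# (0 for rows in the reference month itself).
months_back = (latest.year - df["date"].dt.year) * 12 + (
    latest.month - df["date"].dt.month
)

# Assumption: weight = 1 / (months_back + 1), so the latest month gets 1,
# the month before gets 1/2, and so on. No division by zero can occur.
df["weight"] = (1 / (months_back + 1)).round(3)

print(df)
```

On the toy dataframe this should give weights of 0.015, 0.2, 0.2 and 1.0, matching the table in the question. The resulting column can then be passed as the weight argument, as in the question's model.fit(Pool(...)) call.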