I am using SMOTENC to solve an unbalanced classification problem.
import pandas as pd
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(input_table_1_df, test_size=0.25, stratify=input_table_1_df["Target_Variable_SX_FASCIA_1"])
###### SMOTE ######
# Create features table and target table
df_x = df_train.loc[ : , df_train.columns != "Target_Variable_SX_FASCIA_1"]
df_y = df_train.drop(["Target_Variable_SX_FASCIA_1"], axis=1)
# From pandas to numpy arrays
from imblearn.over_sampling import SMOTENC
df_X=df_x.to_numpy()
df_Y=df_y.to_numpy()
column_name_x=list(df_x.columns)
column_name_y=list(df_y.columns)
# Resampling
smote_nc = SMOTENC(categorical_features=[0,1,2,3,4,5], random_state=0,sampling_strategy=.2)
X_resampled, Y_resampled = smote_nc.fit_resample(df_X, df_Y)
X_resampled_df= pd.DataFrame(X_resampled,columns=column_name_x)
Y_resampled_df= pd.DataFrame(Y_resampled,columns=column_name_y)
Training_set_Passivi_Fascia_1 = pd.concat([X_resampled_df, Y_resampled_df], axis=1)
I got the following error at line:
X_resampled, Y_resampled = smote_nc.fit_resample(df_X, df_Y)
TypeError: '<' not supported between instances of 'int' and 'str'
I understand that this is a matter of variable types, but I cannot figure out how to solve this error. I already tried to:
Other useful information: the first 6 variables of the dataset are strings; the others are doubles or integers.
Just ask if you need further information.
Thanks in advance.
It would be helpful if you could print the head of df_x and df_y.
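As a rough illustration of that check, the sketch below runs the same two statements from the question on a tiny made-up frame. The columns `cat_a` and `num_a` and their values are hypothetical stand-ins for the real data.

```python
import pandas as pd

TARGET = "Target_Variable_SX_FASCIA_1"

# Hypothetical stand-in for df_train (made-up columns and values).
df_train = pd.DataFrame({
    "cat_a": ["x", "y", "x", "z"],
    "num_a": [1.5, 2.0, 0.3, 4.1],
    TARGET: [0, 1, 0, 0],
})

# The same two statements as in the question.
df_x = df_train.loc[:, df_train.columns != TARGET]
df_y = df_train.drop([TARGET], axis=1)

# Inspect what actually ended up in each table.
print(df_x.head())
print(df_y.head())
print(df_y.dtypes)
```

Printing `df_y.dtypes` alongside the head makes it easy to see whether the target table contains the columns you expect.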
Here is what I can infer from this line:
df_y = df_train.drop(["Target_Variable_SX_FASCIA_1"], axis=1)
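This drops the target variable and keeps the predictors, so df_y ends up holding the features (including the string columns) rather than the labels. That is likely why fit_resample ends up comparing int and str values. Assuming "Target_Variable_SX_FASCIA_1" is the column name of the target variable, it should be: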
df_y = df_train["Target_Variable_SX_FASCIA_1"].values