I have an imbalanced dataset with 200 million rows from class 0 and 8,000 rows from class 1. I followed two different approaches to build a model.
Code:
import ....
df = pd.read_csv(...)
label = df['target']
es = ft.EntitySet(id='maintable')
es = es.entity_from_dataframe(entity_id='maintable',dataframe=df,make_index=True,
index='index',time_index='date_info',variable_types={'personal_id': ft.variable_types.Categorical,
'category_id': ft.variable_types.Categorical, 'name': ft.variable_types.Categorical})
es.normalize_entity(base_entity_id='maintable',new_entity_id='personal_id')
es.normalize_entity(base_entity_id='maintable',new_entity_id='category_id')
es.normalize_entity(base_entity_id='maintable',new_entity_id='name')
fm, features = ft.dfs(entityset=es,target_entity='maintable',max_depth=3)
fm = fm.set_index(label.index)
fm['target'] = label
X = fm[fm.columns.difference(['target'])]
y = fm['target']
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42,stratify=y,test_size=0.2)
rf = RandomForestClassifier(random_state=42,n_jobs=-1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
#print results
.....
Code:
import ....
df = pd.read_csv(...)
# I merged randomly generated (with Java) train and test sets so that featuretools creates features for both. The 'test_data' column is binary (1 for the test set, 0 for the train set), so I can separate the two sets again for fitting the model and predicting.
label = df[['target','test_data']]
es = ft.EntitySet(id='maintable')
es = es.entity_from_dataframe(entity_id='maintable',dataframe=df,make_index=True,
index='index',time_index='date_info',variable_types={'personal_id': ft.variable_types.Categorical,
'category_id': ft.variable_types.Categorical, 'name': ft.variable_types.Categorical})
es.normalize_entity(base_entity_id='maintable',new_entity_id='personal_id')
es.normalize_entity(base_entity_id='maintable',new_entity_id='category_id')
es.normalize_entity(base_entity_id='maintable',new_entity_id='name')
fm, features = ft.dfs(entityset=es,target_entity='maintable',max_depth=3)
fm = fm.set_index(label.index)
fm[['target','test_data']] = label
df_train = fm.loc[fm['test_data'] == 0]
df_test = fm.loc[fm['test_data'] == 1]
# Drop the 'test_data' column because I don't need it anymore
df_train = df_train.drop(['test_data'],axis=1)
df_test = df_test.drop(['test_data'],axis=1)
X_train = df_train[df_train.columns.difference(['target'])]
y_train = df_train['target']
X_test = df_test[df_test.columns.difference(['target'])]
y_test = df_test['target']
rf = RandomForestClassifier(random_state=42,n_jobs=-1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
#print results
Now the interesting part begins. Here are the results of the two approaches.
Approach 1 (class 0 is negative, class 1 is positive):
TN:6306
FP:94
TP:1385
FN:215
Approach 2:
TN:576743
FP:63257
TP:361
FN:2839
The first result is pretty good for me, but the second one is terrible. How is this possible? I know I am using less class 1 data to train the model in the second approach, but it should not differ that much. I mean, it is worse than a coin flip. The subsets are randomly generated in both approaches, and I tried many different subsets, but the results are pretty much the same as above. Any kind of help is appreciated.
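To put the two confusion matrices on the same scale, here is a small sketch that turns the counts above into rates. The helper name `rates` is only illustrative; the numbers are exactly the ones listed above.

```python
def rates(tn, fp, tp, fn):
    """Derive a few standard rates from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),          # share of predicted positives that are correct
        "recall": tp / (tp + fn),             # true positive rate
        "fpr": fp / (fp + tn),                # false positive rate
        "positive_share": (tp + fn) / (tn + fp + tp + fn),  # class 1 share of the test set
    }

approach_1 = rates(tn=6306, fp=94, tp=1385, fn=215)
approach_2 = rates(tn=576743, fp=63257, tp=361, fn=2839)

for name, r in [("Approach 1", approach_1), ("Approach 2", approach_2)]:
    print(name, {k: round(v, 4) for k, v in r.items()})
```

With these counts, recall in the second approach (about 0.11) is close to its false positive rate (about 0.10), which is roughly what chance-level predictions look like. The class 1 share of the test set also drops from 20% to about 0.5%, so precision is not directly comparable between the two runs.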
Edit: I may have an idea, but I am not sure. I am using train_test_split in the first approach, so the train and test sets share some personal_id values, whereas in the second approach the train and test sets have completely different personal_id values. When the model encounters a personal_id that it has not seen before, it cannot predict correctly and labels the row as the majority class. If this is the case, then the features are being created specifically for the given categorical values (overfitting), and the model gets confused whenever it encounters a new value in any categorical column. How can I overcome this issue?
Edit 2: I tested the idea above and got weird results. First, I removed the personal_id column from the dataset, and that produced a better model. Then I tested my second approach with the constraint that every personal_id in the train set also appears in the test set. I expected a better model, but it was worse than before. I am really confused.
I agree that the model possibly overfit and failed to generalize to new personal_id values. I suggest passing the labels in with the cutoff times to get a more structured training and testing set. I'll go through a quick example using this data.
index name personal_id category_id date_info target
0 0 Samuel 3 C 2021-07-15 0
1 1 Samuel 3 C 2021-07-15 0
2 2 Samuel 3 C 2021-07-15 0
3 3 Samuel 3 C 2021-07-15 0
4 4 Rosanne 2 C 2021-05-11 0
.. ... ... ... ... ... ...
95 95 Donia 1 C 2020-09-27 1
96 96 Donia 1 C 2020-09-27 1
97 97 Fleming 1 A 2021-06-15 1
98 98 Fred 1 C 2021-02-28 0
99 99 Giacomo 1 A 2021-06-19 1
[100 rows x 6 columns]
First, create cutoff times based on the time index that also include the target column. Make sure to drop the target column from the original data.
target = df[['date_info', 'index', 'target']]
df.drop(columns='target', inplace=True)
Then, you can structure the entity set as usual.
import featuretools as ft
es = ft.EntitySet(id='maintable')
es = es.entity_from_dataframe(
entity_id='maintable',
dataframe=df,
index='index',
time_index='date_info',
variable_types={
'personal_id': ft.variable_types.Categorical,
'category_id': ft.variable_types.Categorical,
'name': ft.variable_types.Categorical
},
)
es.normalize_entity(base_entity_id='maintable', new_entity_id='personal_id', index='personal_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='category_id', index='category_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='name', index='name')
Now, in the DFS call, you can pass in the target cutoff times. This approach will not use the target column to build features and ensures that the target column will remain aligned with the feature matrix.
fm, fd = ft.dfs(entityset=es, target_entity='maintable', max_depth=3, cutoff_time=target)
personal_id category_id name DAY(date_info) ... name.NUM_UNIQUE(maintable.MONTH(date_info)) name.NUM_UNIQUE(maintable.WEEKDAY(date_info)) name.NUM_UNIQUE(maintable.YEAR(date_info)) target
index ...
59 1 C Fred 28 ... 1 1 1 0
35 1 A Giacomo 19 ... 1 1 1 1
82 3 B Laverna 17 ... 1 1 1 0
25 2 C Rosanne 11 ... 1 1 1 0
23 1 A Giacomo 19 ... 1 1 1 1
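As a rough illustration of what a cutoff time does (this is not how Featuretools implements it), the sketch below computes one aggregate by hand: for each row, the number of rows with the same `personal_id` that are dated at or before that row's cutoff. The toy frame, the `cutoff` column name, and the helper `count_at_cutoff` are made up for this example.

```python
import pandas as pd

# Toy data with the same column names as the example; the values are made up.
df = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "personal_id": [1, 1, 2, 1],
    "date_info": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-02-15", "2021-03-01"]
    ),
})

# One cutoff per instance; here each row's own date is used as its cutoff.
cutoffs = df[["index", "date_info"]].rename(columns={"date_info": "cutoff"})

def count_at_cutoff(data, cutoffs):
    """Count same-person rows dated at or before each instance's cutoff."""
    person = data.set_index("index")["personal_id"]
    counts = {}
    for idx, cutoff in zip(cutoffs["index"], cutoffs["cutoff"]):
        visible = data[
            (data["personal_id"] == person.loc[idx])
            & (data["date_info"] <= cutoff)
        ]
        counts[idx] = len(visible)
    return pd.Series(counts, name="count_at_cutoff")

feature = count_at_cutoff(df, cutoffs)
print(feature.to_dict())
```

Without a cutoff, every row for `personal_id` 1 would get a count of 3, including rows dated earlier than some of the data being counted. With the cutoff, each row only sees data available up to its own time.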
Then, you can split the feature matrix into a training and testing set.
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(fm, test_size=.2, shuffle=False)
y_train, y_test = X_train.pop('target'), X_test.pop('target')
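If the worry is that the same `personal_id` ends up on both sides of the split, a group-aware split is one way to test that. Below is a minimal sketch on a made-up frame using scikit-learn's `GroupShuffleSplit`; it is an alternative to the time-ordered split above, not part of Featuretools, and the toy values are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Made-up feature matrix: 'personal_id' is the grouping key, 'x' a feature.
toy_fm = pd.DataFrame({
    "personal_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "x": range(10),
    "target": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Split so that all rows of a given personal_id land on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(toy_fm, toy_fm["target"], groups=toy_fm["personal_id"])
)

train, test = toy_fm.iloc[train_idx], toy_fm.iloc[test_idx]
X_tr, y_tr = train.drop(columns=["target"]), train["target"]
X_te, y_te = test.drop(columns=["target"]), test["target"]

# No personal_id should appear in both sets.
shared = set(X_tr["personal_id"]) & set(X_te["personal_id"])
print("personal_ids in both sets:", shared)
```

Note that `GroupShuffleSplit` does not stratify by class, so with a heavily imbalanced target it is worth checking the class counts on each side after splitting.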
For AutoML, you can use EvalML to find the best ML pipeline and graph a confusion matrix.
from evalml import AutoMLSearch
from evalml.model_understanding.graphs import graph_confusion_matrix
automl = AutoMLSearch(
X_train=X_train,
y_train=y_train,
problem_type='binary',
allowed_model_families=['random_forest'],
)
automl.search()
y_pred = automl.best_pipeline.predict(X_test)
graph_confusion_matrix(y_test, y_pred).show()
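If you prefer the raw counts over a plot, scikit-learn's `confusion_matrix` returns them directly. A small sketch, with made-up labels standing in for `y_test` and `y_pred`:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels; in practice these would be y_test and y_pred.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels [0, 1], ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
print(f"TN:{tn} FP:{fp} TP:{tp} FN:{fn}")
```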
You can find similar machine learning examples on the linked page. Let me know if you found this helpful.