Tags: python, feature-extraction, feature-engineering, featuretools

Features Created by FeatureTools Build Inconsistent Models


I have an imbalanced dataset with 200 million samples from class 0 and 8,000 samples from class 1. I followed two different approaches to build a model.

  1. Randomly sample a new dataset with a 1:4 ratio, i.e. 32,000 samples from class 0 and 8,000 from class 1. Then use featuretools to generate features (70 features in my case), split the dataset into train and test sets with test_size=0.2, stratified on the target, build a model with the Random Forest algorithm, and predict on the test set.

Code:

import pandas as pd
import featuretools as ft
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(...)
label = df['target']
es = ft.EntitySet(id='maintable')

es = es.entity_from_dataframe(entity_id='maintable', dataframe=df, make_index=True,
                              index='index', time_index='date_info',
                              variable_types={'personal_id': ft.variable_types.Categorical,
                                              'category_id': ft.variable_types.Categorical,
                                              'name': ft.variable_types.Categorical})

es.normalize_entity(base_entity_id='maintable', new_entity_id='personal_id', index='personal_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='category_id', index='category_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='name', index='name')

fm, features = ft.dfs(entityset=es, target_entity='maintable', max_depth=3)

# Re-attach the label to the feature matrix
fm = fm.set_index(label.index)
fm['target'] = label

X = fm[fm.columns.difference(['target'])]
y = fm['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.2)

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# print results
.....
  2. Split all of the class 1 data: 60% goes to the train set and 40% to the test set. The class ratio in the train set is the same as in the first approach (1:4), but in the test set it is 1:200. Use featuretools (70 features created again), build a model with the Random Forest algorithm, and predict on the test set.

Code:

import pandas as pd
import featuretools as ft
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(...)
# I merged the randomly generated (with Java) train and test sets so that featuretools creates
# features on both at once. I added a column 'test_data' that takes two binary values
# (1 for the test set, 0 for the train set) so I can separate train and test rows again
# for fitting the model and predicting.
label = df[['target', 'test_data']]
es = ft.EntitySet(id='maintable')

es = es.entity_from_dataframe(entity_id='maintable', dataframe=df, make_index=True,
                              index='index', time_index='date_info',
                              variable_types={'personal_id': ft.variable_types.Categorical,
                                              'category_id': ft.variable_types.Categorical,
                                              'name': ft.variable_types.Categorical})

es.normalize_entity(base_entity_id='maintable', new_entity_id='personal_id', index='personal_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='category_id', index='category_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='name', index='name')

fm, features = ft.dfs(entityset=es, target_entity='maintable', max_depth=3)

# Re-attach the label and the train/test indicator to the feature matrix
fm = fm.set_index(label.index)
fm[['target', 'test_data']] = label

df_train = fm.loc[fm['test_data'] == 0]
df_test = fm.loc[fm['test_data'] == 1]

# Drop the 'test_data' column because I don't need it anymore
df_train = df_train.drop(['test_data'], axis=1)
df_test = df_test.drop(['test_data'], axis=1)

X_train = df_train[df_train.columns.difference(['target'])]
y_train = df_train['target']

X_test = df_test[df_test.columns.difference(['target'])]
y_test = df_test['target']

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# print results

Now the interesting part begins for me. Here are the results of the two approaches.

Approach 1 (class 0 is negative, class 1 is positive):

TN: 6306, FP: 94, FN: 215, TP: 1385

Approach 2:

TN: 576743, FP: 63257, FN: 2839, TP: 361

The first result is pretty good for me, but the second one is terrible. How is this possible? I know I am using less class 1 data to train the model in the second approach, but it should not differ that much; it is worse than a coin flip. The subsets are randomly generated in both approaches and I tried many different subsets, but the results are pretty much the same as above. Any kind of help is appreciated.
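
To put numbers on the gap, here is a small sketch that just plugs the counts above into the standard formulas for class 1 precision and recall:

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(tp=1385, fp=94, fn=215))     # approach 1: ~0.94 precision, ~0.87 recall
print(precision_recall(tp=361, fp=63257, fn=2839))  # approach 2: ~0.006 precision, ~0.11 recall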

Edit: I may have an idea, but I am not sure... I am using train_test_split in the first approach, so the train and test sets share some personal_ids, whereas in the second approach the train and test sets have completely different personal_ids. When the model encounters a personal_id it didn't see before, it cannot predict correctly and labels it with the majority class. If this is the case, then the features are being created exactly for the given categorical values (overfitting). Again, when the model encounters a different value in any categorical column, it just gets confused. How can I overcome such an issue? (One way to test this is sketched below.)
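
A minimal sketch of such a test, assuming X and y from the first approach are still a DataFrame/Series with numerically encoded features and that personal_id is still available as a column of fm: a group-aware split guarantees that no personal_id appears in both train and test.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GroupShuffleSplit

# Every personal_id lands entirely in either the train or the test split.
groups = fm['personal_id']  # assumes personal_id is still a column of the feature matrix
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
rf.fit(X.iloc[train_idx], y.iloc[train_idx])
print(confusion_matrix(y.iloc[test_idx], rf.predict(X.iloc[test_idx])))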

Edit 2: I tested the idea mentioned above and got weird results. First I removed the personal_id column from the dataset, but that produced a better model. Then I changed my second approach so that the personal_ids that appear in the train set also appear in the test set. I thought I would get a better model, but it was worse than before. I am really confused...


Solution

  • I agree the model possibly overfitted and failed to generalize to new personal_ids. I suggest passing the labels in with the cutoff times to get a more structured training and testing set. I'll go through a quick example using this data.

        index     name  personal_id category_id   date_info  target
    0       0   Samuel            3           C  2021-07-15       0
    1       1   Samuel            3           C  2021-07-15       0
    2       2   Samuel            3           C  2021-07-15       0
    3       3   Samuel            3           C  2021-07-15       0
    4       4  Rosanne            2           C  2021-05-11       0
    ..    ...      ...          ...         ...         ...     ...
    95     95    Donia            1           C  2020-09-27       1
    96     96    Donia            1           C  2020-09-27       1
    97     97  Fleming            1           A  2021-06-15       1
    98     98     Fred            1           C  2021-02-28       0
    99     99  Giacomo            1           A  2021-06-19       1
    
    [100 rows x 6 columns]
    

    First, create cutoff times based on the time index that also include the target column. Make sure to drop the target column from the original data.

    target = df[['date_info', 'index', 'target']]
    df.drop(columns='target', inplace=True)
    

    Then, you can structure the entity set as usual.

    import featuretools as ft
    
    es = ft.EntitySet(id='maintable')
    es = es.entity_from_dataframe(
        entity_id='maintable',
        dataframe=df,
        index='index',
        time_index='date_info',
        variable_types={
            'personal_id': ft.variable_types.Categorical,
            'category_id': ft.variable_types.Categorical,
            'name': ft.variable_types.Categorical
        },
    )
    es.normalize_entity(base_entity_id='maintable', new_entity_id='personal_id', index='personal_id',)
    es.normalize_entity(base_entity_id='maintable', new_entity_id='category_id', index='category_id')
    es.normalize_entity(base_entity_id='maintable', new_entity_id='name', index='name')
    

    Now, in the DFS call, you can pass in the target cutoff times. This approach will not use the target column to build features and ensures that the target column will remain aligned with the feature matrix.

    fm, fd = ft.dfs(entityset=es, target_entity='maintable', max_depth=3, cutoff_time=target)
    
           personal_id category_id     name  DAY(date_info)  ...  name.NUM_UNIQUE(maintable.MONTH(date_info))  name.NUM_UNIQUE(maintable.WEEKDAY(date_info))  name.NUM_UNIQUE(maintable.YEAR(date_info))  target
    index                                                    ...
    59               1           C     Fred              28  ...                                            1                                              1                                           1       0
    35               1           A  Giacomo              19  ...                                            1                                              1                                           1       1
    82               3           B  Laverna              17  ...                                            1                                              1                                           1       0
    25               2           C  Rosanne              11  ...                                            1                                              1                                           1       0
    23               1           A  Giacomo              19  ...                                            1                                              1                                           1       1
    

    Then, you can split the feature matrix into a training and testing set.

    from sklearn.model_selection import train_test_split
    
    X_train, X_test = train_test_split(fm, test_size=.2, shuffle=False)
    y_train, y_test = X_train.pop('target'), X_test.pop('target')
    

    For AutoML, you can use EvalML to find the best ML pipeline and graph a confusion matrix.

    from evalml import AutoMLSearch
    from evalml.model_understanding.graphs import graph_confusion_matrix
    
    automl = AutoMLSearch(
        X_train=X_train,
        y_train=y_train,
        problem_type='binary',
        allowed_model_families=['random_forest'],
    )
    automl.search()
    y_pred = automl.best_pipeline.predict(X_test)
    graph_confusion_matrix(y_test, y_pred).show()
    

    (confusion matrix plot produced by graph_confusion_matrix)
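
    If you prefer to evaluate with plain scikit-learn instead of AutoML, a minimal sketch would be to one-hot encode the categorical features with ft.encode_features and reuse a Random Forest as in the question (the fm and fd variables below follow the example above):

    import featuretools as ft
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, classification_report
    from sklearn.model_selection import train_test_split

    # Separate the label, then encode categorical features so scikit-learn can consume them.
    y = fm.pop('target')
    fm_enc, fd_enc = ft.encode_features(fm, fd)

    X_train, X_test, y_train, y_test = train_test_split(fm_enc, y, test_size=.2, shuffle=False)

    rf = RandomForestClassifier(random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))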

    You can find similar machine learning examples in the linked page. Let me know if you found this helpful.