python  cross-validation  decision-tree  topological-sort

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input


I'm trying to reproduce this GitHub project on Topological Data Analysis (TDA) on my machine.

My steps:


Background:

  1. Feature selection

To decide which attributes belong to which group, we created a correlation matrix. It showed two large groups within which player attributes were strongly correlated with each other. We therefore split the attributes into two groups: one summarising a player's attacking characteristics and the other the defensive ones. Finally, since goalkeepers have completely different statistics from the other players, we took only their overall rating into account. The 24 features used for each player are listed below:

Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"

Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"

Goalkeeper: "overall_rating"

From this set of features, we then computed, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defense attributes.

Finally, for each team in a given match, we compute the mean and the standard deviation for the attack and the defense from these stats of the team's players, as well as the best attack and best defense.

In this way a match is described by 14 features, 7 per team (GK overall rating, best attack, std attack, mean attack, best defense, std defense, mean defense), which map the match to a point in feature space according to the characteristics of the two teams.
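The aggregation described above can be sketched roughly as follows. This is only an illustration, not the project's code: the function name `team_features`, the toy ratings, and the choice of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def team_features(attack_scores, defense_scores, gk_rating):
    """Collapse per-player scores into the 7 per-team features.

    attack_scores / defense_scores hold one value per non-goalkeeper
    player, each already the mean of that player's attack or defense
    attributes.
    """
    return {
        "best_attack": max(attack_scores),
        "avg_attack": mean(attack_scores),
        "std_attack": pstdev(attack_scores),  # assumption: population std
        "best_defense": max(defense_scores),
        "avg_defense": mean(defense_scores),
        "std_defense": pstdev(defense_scores),
        "gk": gk_rating,
    }


# Toy numbers (hypothetical), three outfield players per team for brevity.
home = team_features([70, 80, 90], [60, 65, 70], gk_rating=82)
away = team_features([60, 75, 75], [70, 72, 80], gk_rating=78)

# A match row is the home features followed by the away features.
match_row = [*home.values(), *away.values()]
print(len(match_row))
```

Each team contributes 7 values, so the concatenated row has the 14 match features that appear as columns 0 to 13 of the loaded dataset.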


  2. Feature extraction

The aim of TDA is to capture the structure of the space underlying the data. In our project, we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. Thus, we explored the data space looking for this kind of correlation.


Methods:

def get_best_params():
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output

    return top_feat_params, top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]

    return x_train_with_topo, y_train


def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST',y_test.shape)

    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)

    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10  # number of test rows processed per batch
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        # Shrink the last batch if fewer than `shift` rows remain
        if i + shift > len(x_test):
            shift = len(x_test) - i
        # Stack the current test batch under all rows seen so far
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        # Keep only the diagrams of the test rows (the last `shift` ones)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        # Processed test rows join the reference set used for later batches
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    # Append the topological columns to the original test columns
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test

def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams
    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df

The code I'm running:

best_pipeline_params, best_model_feat_params = get_best_params()

# best_pipeline_params -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}

pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
#                              SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
#                             ('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])

x_train, y_train = load_dataset()

# x_train.shape ->  (2565, 19)
# y_train.shape -> (2565,)

x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)

# x_test.shape -> (380, 24)

rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)  # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')

But I'm getting the error:

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.

Loaded dataset (X_train plus the label column):

Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64

Please note that the first 14 columns are the features that describe the match, and the remaining 5 feature columns (everything except label) are the topological features, which have already been extracted.

The problem seems to arise when the code reaches extract_x_test_features() and extract_features_for_prediction(), which should extract the topological features and stack them onto the dataset.

My guess is that, since x_train already has topological features, 5 more get added and so I end up with 24 features.

I'm not sure, though. I'm still trying to wrap my head around this project and how the prediction is being made here.
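A shape-only trace of extract_features_for_prediction() may help narrow this down. The sketch below uses random arrays in place of the real data and a stand-in, `fake_topological_features`, for the Giotto step; `cols_per_metric` is a hypothetical knob, not a parameter of the project. Because the function stacks `x_test` rows under the 14-column `x_train_no_topo`, `x_test` itself must have 14 columns, so a final width of 24 implies 10 topological columns, that is 2 per metric rather than 1. Why each metric would yield 2 columns (for example, one per homology dimension) is an assumption that should be checked against the installed library version.

```python
import numpy as np

N_METRICS = 5  # bottleneck, wasserstein, landscape, betti, heat


def fake_topological_features(n_rows, cols_per_metric):
    """Stand-in for extract_topological_features: only the shape matters."""
    per_metric = [np.zeros((n_rows, cols_per_metric)) for _ in range(N_METRICS)]
    return np.concatenate(per_metric, axis=1)


def final_width(x_test, cols_per_metric):
    """Width of x_test after the topological columns are appended."""
    topo = fake_topological_features(len(x_test), cols_per_metric)
    return np.concatenate([x_test, topo], axis=1).shape[1]


x_train_no_topo = np.random.rand(2565, 14)
x_test = np.random.rand(380, 14)

# Stacking rows only works when both arrays have the same number of columns.
np.concatenate([x_train_no_topo, x_test[:10]])

print(final_width(x_test, cols_per_metric=1))  # 14 + 5 * 1
print(final_width(x_test, cols_per_metric=2))  # 14 + 5 * 2
```

If this reading is right, the extra columns come from the per-metric feature step rather than from x_train, since x_train is cut to its first 14 columns before it is used.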


How do I fix the mismatch using the code above?


NOTES:

1 - x_train and y_test are numpy.ndarray objects, not DataFrames

2 - This question is completely reproducible if one clones or downloads the project from the following link:

Github Link


Solution

  • Returning a slice with 19 features here:

    def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
       (...)
       return final_x_test[:, :19]
    

    This got rid of the error, and the test ran.


    I still don't get the gist of it, though.

    I will award the bounty to anyone who explains to me the idea behind the test set in the context of this project, as used in the project notebook, which can be found here:

    Project Notebook
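One caveat about the slice, offered as a sketch rather than a verified diagnosis: `[:, :19]` keeps whichever topological columns happen to come first, and these need not be the five the model was trained on. The column labels below are hypothetical, and the layout assumes that each metric contributes two adjacent columns, in the order in which extract_topological_features() concatenates the metrics.

```python
metrics = ["bottleneck", "wasserstein", "landscape", "betti", "heat"]

# Hypothetical labels for the 14 match features.
match_cols = [f"match_{i}" for i in range(14)]

# Assumption: two adjacent columns per metric (suffixes "a" and "b").
topo_cols = [f"{m}_{s}" for m in metrics for s in ("a", "b")]

all_cols = match_cols + topo_cols

# Columns 14..18 are the topological ones that survive the [:, :19] slice.
kept = all_cols[14:19]
print(kept)
```

Under that assumption the slice keeps both bottleneck columns, both wasserstein columns, and one landscape column, and drops betti and heat entirely, so the error disappears but the features no longer line up with the training columns.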