I'm trying to reproduce this GitHub project on Topological Data Analysis (TDA) on my machine.
My steps:
Background:
To decide which attributes belong to which group, we created a correlation matrix. It showed two large groups of player attributes that were strongly correlated with each other. We therefore split the attributes into two groups, one summarising a player's attacking characteristics and the other the defensive ones. Finally, since goalkeepers have completely different statistics from the other players, we took into account only their overall rating. Below are the 24 features used for each player:
Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
Goalkeeper: "overall_rating"
From this set of features, we then computed, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defense attributes.
Finally, for each team in a given match, we compute the mean and the standard deviation for the attack and the defense from these stats of the team's players, as well as the best attack and best defense.
In this way, each match is described by 14 features, 7 per team (GK overall value, best attack, std attack, mean attack, best defense, std defense, mean defense), which map the match to a point in feature space according to the characteristics of the two teams.
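To make the aggregation concrete, here is a minimal sketch of how I understand it, not the project's actual code. The helper names `team_features` and `match_features`, the column order, the use of the population standard deviation, and the random ratings are all my own assumptions:

```python
import numpy as np

def team_features(attack, defense, gk_overall):
    """Summarise one team (hypothetical helper, not from the project).

    attack, defense: one value per outfield player, each being that
    player's mean over the attack or defense attributes.
    gk_overall: the goalkeeper's overall rating.
    """
    attack = np.asarray(attack, dtype=float)
    defense = np.asarray(defense, dtype=float)
    return np.array([
        attack.max(), defense.max(),    # best attack, best defense
        attack.mean(), defense.mean(),  # mean attack, mean defense
        attack.std(), defense.std(),    # std attack, std defense (population std assumed)
        gk_overall,                     # goalkeeper overall rating
    ])

def match_features(home, away):
    """Concatenate the 7 home-team and 7 away-team features."""
    return np.concatenate([team_features(*home), team_features(*away)])

# Made-up ratings for 10 outfield players per team
rng = np.random.default_rng(0)
home = (rng.uniform(40, 90, 10), rng.uniform(40, 90, 10), 80.0)
away = (rng.uniform(40, 90, 10), rng.uniform(40, 90, 10), 75.0)
x = match_features(home, away)
print(x.shape)  # (14,)
```

Each match then becomes one 14-dimensional point, which matches the first 14 columns of the loaded dataset listed further down.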
The aim of TDA is to capture the structure of the space underlying the data. In our project, we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. Thus, we explored the data space looking for this kind of correlation.
Methods:
def get_best_params():
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output
    return top_feat_params, top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]
    return x_train_with_topo, y_train

def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST', y_test.shape)
    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)
    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        print(range(0, len(x_test), shift))
        if i + shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test

def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams

    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df

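As far as I can tell from reading extract_features_for_prediction, the test matches are processed in batches of shift = 10. Each batch is appended to the training rows plus all earlier test batches before the pipeline runs, and only the diagrams of the last shift rows are turned into features. The sketch below reproduces just the index arithmetic of that loop; `batch_slices` is a hypothetical helper of mine, and it uses `min` instead of mutating `shift` on the last batch, which should give the same slices:

```python
def batch_slices(n_rows, shift=10):
    """Return the (start, stop) row ranges visited by the batching loop."""
    slices = []
    for start in range(0, n_rows, shift):
        size = min(shift, n_rows - start)  # the last batch may be shorter
        slices.append((start, start + size))
    return slices

print(len(batch_slices(380)))  # 38
print(batch_slices(25))        # [(0, 10), (10, 20), (20, 25)]
```

So for my 380 test matches the pipeline is refitted 38 times, each time on a slightly larger point cloud.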
Working code:
best_pipeline_params, best_model_feat_params = get_best_params()
# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}
pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
# SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
#('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])
x_train, y_train = load_dataset()
# x_train.shape -> (2565, 19)
# y_train.shape -> (2565,)
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)
# x_test.shape -> (380, 24)
rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids) # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')
But I'm getting the error:
ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
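As I understand it, the message comes from scikit-learn's input validation: a fitted estimator remembers how many columns it saw in fit (exposed as n_features_in_ in reasonably recent versions) and rejects any other width at prediction time. A minimal reproduction on synthetic data; the arrays are random stand-ins with the same widths as mine, not the project's data, and the estimator named in the message may differ between scikit-learn versions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x_train = rng.normal(size=(60, 19))    # stand-in: 14 match + 5 topological columns
y_train = rng.integers(0, 3, size=60)  # three outcome classes
x_test = rng.normal(size=(5, 24))      # 24 columns, like my x_test

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(x_train, y_train)
print(model.n_features_in_)  # 19

try:
    model.predict_proba(x_test)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

So the model itself is fine; the question is only why x_test comes out 5 columns wider than x_train.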
Loaded dataset (X_train):
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 home_best_attack 2565 non-null float64
1 home_best_defense 2565 non-null float64
2 home_avg_attack 2565 non-null float64
3 home_avg_defense 2565 non-null float64
4 home_std_attack 2565 non-null float64
5 home_std_defense 2565 non-null float64
6 gk_home_player_1 2565 non-null float64
7 away_avg_attack 2565 non-null float64
8 away_avg_defense 2565 non-null float64
9 away_std_attack 2565 non-null float64
10 away_std_defense 2565 non-null float64
11 away_best_attack 2565 non-null float64
12 away_best_defense 2565 non-null float64
13 gk_away_player_1 2565 non-null float64
14 bottleneck_metric 2565 non-null float64
15 wasserstein_metric 2565 non-null float64
16 landscape_metric 2565 non-null float64
17 betti_metric 2565 non-null float64
18 heat_metric 2565 non-null float64
19 label 2565 non-null float64
Please note that the first 14 columns are the features that describe the match, and that the 5 remaining features (minus label) are the topological ones, that are already extracted.
The problem seems to arise when the code reaches extract_x_test_features() and extract_features_for_prediction(), which should compute the topological features and stack them onto the test set.
Since X_train already has topological features, it adds 5 more and so I end up with 24 features.
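One way to test this hypothesis: the last step of extract_features_for_prediction is a column-wise concatenation, so the output width is the width of the test input plus the width of the topological block. The sketch below shows two hypothetical splits that both give 24; I have not verified which one applies here, and `stacked_width` is an illustrative helper, not project code:

```python
import numpy as np

def stacked_width(x_test, topo_features):
    """Mimic the final np.concatenate(..., axis=1) and return the column count."""
    return np.concatenate([x_test, topo_features], axis=1).shape[1]

n_rows = 380
# Case A (hypothetical): the test input already has 19 columns and 5 are appended
case_a = stacked_width(np.zeros((n_rows, 19)), np.zeros((n_rows, 5)))
# Case B (hypothetical): the test input has 14 columns and 10 are appended
case_b = stacked_width(np.zeros((n_rows, 14)), np.zeros((n_rows, 10)))
print(case_a, case_b)  # 24 24
```

Printing new_players_df_stats.shape and the shape returned by each amplitude.fit_transform(diagrams) call should tell the two cases apart.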
I'm not sure, though. I'm just trying to wrap my head around this project and how the prediction is being made here.
How do I fix the mismatch using the code above?
NOTES:
1 - x_train and y_test are not dataframes but numpy.ndarray
2 - This question is completely reproducible if one clones or downloads the project from the following link:
Returning a slice with 19 features here:
def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    (...)
    return final_x_test[:, :19]
This got rid of the error and the test ran.
I still don't get the gist of it, though.
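One thing that worries me about the slice: since final_x_test is built as the test input followed by the new topological columns, taking the first 19 columns keeps whatever comes first. Under the assumption (not verified) that the test input already has 19 columns, the slice simply returns the input and throws away the freshly computed features, as this synthetic sketch shows:

```python
import numpy as np

rng = np.random.default_rng(0)
x_in = rng.normal(size=(4, 19))  # hypothetical 19-column test input
topo = rng.normal(size=(4, 5))   # hypothetical newly computed topological features

final_x_test = np.concatenate([x_in, topo], axis=1)  # 24 columns
sliced = final_x_test[:, :19]

print(final_x_test.shape)            # (4, 24)
print(np.array_equal(sliced, x_in))  # True
```

So the error goes away, but the model may not be seeing the columns it was trained on.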
I will grant the bounty to anyone who explains to me the idea behind the test set in the context of this project, in the project notebook, which can be found here: