I am very new to machine learning, and this is my first project, which I am working on as part of a college course. I chose UK football (soccer) matches and decided to use a Random Forest.
Using several sources, I have gathered 20 years' worth of data on these matches, cleaned the data and built my model.
However, I am stuck. How do I actually get the model to make predictions for future matches?
Thanks
I've tried loading the model and passing it a CSV file with only the Date, Home_Team and Away_Team columns populated, leaving the other columns blank for the model to predict. Is this the correct way to do it?
Update:
Thanks. Here is the code used to build the model:
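from sklearn.ensemble import RandomForestClassifier
# The model was never instantiated in the posted snippet; default hyperparameters are assumed here.
rf = RandomForestClassifier()
# Chronological split: earlier matches for training, later ones for testing.
# >= keeps matches played on 2012-06-01 itself, which the original < / > pair dropped.
train = matches[matches["Date"] < '2012-06-01']
test = matches[matches["Date"] >= '2012-06-01']
predictors = ['Home_Team', 'Away_Team', 'HT_Winner', 'FT_Winner', 'match_result', 'ht_match_result', 'HomeShots', 'AwayShots', 'HomeCorners', 'AwayCorners']
rf.fit(train[predictors], train["FT_Winner"])
preds = rf.predict(test[predictors])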
New CSV for future predictions:
import pandas as pd
new_data_df = pd.read_csv(..)
predictions = model.predict(new_data_df)
The updated CSV contains all of the same columns, with only the Date, Home_Team and Away_Team columns populated, as this is the only information currently available; the other columns are the ones I want the model to predict. But when I try to get predictions for the new CSV, I get the following:
"Feature names must be in the same order as they were in fit.\n"
ValueError(message)
ValueError: The feature names should match those that were passed during fit.
Your issue is with the input features. You trained the model on the columns in the predictors list, so at prediction time new_data_df must supply those same columns, in the same order. If new_data_df has all of the predictor columns, you can use the following line of code for predictions:
predictions = model.predict(new_data_df[predictors])
Just remember: for prediction, you always need to pass all of the features you passed during training, leaving out only the target (result) column.