I have a TPOT regressor set up to predict stock prices on a dataset (after some feature engineering), and I ran into an issue when the XGBoost regressor was involved, I would receive an error that said:
feature_names mismatch:
...and it would then display the list of column names for my dataset. A solution was brought up for this issue over at Github that suggested to convert the dataframes of the X features and Y label to a Numpy Array during the train_test_split to take care of it, so that's what I did, but now I receive an error:
X_train, X_test, Y_train, Y_test = train_test_split(X.values, Y.values, test_size = test_size, random_state = seed)
print('[INFO] Printing the shapes of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
[INFO] Printing the shapes of the training/testing feature/label sets...
(1374, 68)
(459, 68)
(1374,)
(459,)
Best pipeline: ExtraTreesRegressor(DecisionTreeRegressor(input_matrix, max_depth=1, min_samples_leaf=9, min_samples_split=11), bootstrap=False, max_features=0.8500000000000001, min_samples_leaf=1, min_samples_split=9, n_estimators=100)
Traceback (most recent call last):
File "main2.py", line 656, in <module>
predictions = best_model.predict(X_test)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\tpot\base.py", line 921, in predict
return self.fitted_pipeline_.predict(features)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\metaestimators.py", line 116, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\pipeline.py", line 422, in predict
return self.steps[-1][-1].predict(Xt, **predict_params)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\ensemble\forest.py", line 693, in predict
X = self._validate_X_predict(X)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\ensemble\forest.py", line 359, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\tree\tree.py", line 402, in _validate_X_predict
% (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 68 and input n_features is 69
The issue at Github is now closed, but I'm hoping someone here can explain what I'm missing here? As you can see there's 68 feature columns and 1 label column. And you'll also see that the model this time doesn't even use XGBoost, but I'd like to be able to have any model it comes up with work with the .predict()
function.
UPDATE WTIH CODE
Ok I'm seriously stuck here. I've posted a working code below to duplicate the error. Let me know what you see. Input stock ticker CLVS. I've added print shapes of the dataframes and arrays throughout the entire process and it still says the shapes are fine so what am I not seeing? You'll need Pandas 0.23 (yes old version) and TPOT and DASK installed. Thanks:
def main():
# 1. Input a stock ticker
ticker_input = input('Which stock ticker would you like to predict?') # Start with CLVS for testing
print('Getting the historical data for: ',ticker_input)
# 2. Download the historical daily data
# Import dependencies
from datetime import datetime
from pandas_datareader import data as web
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
import seaborn as sns
import matplotlib.pyplot as plt
import random
import os
import numpy as np
import time
# Downloading historical data as dataframe
ex = 'yahoo'
start = datetime(2000, 1, 1)
end = datetime.now()
dataset = web.DataReader(ticker_input, ex, start, end) #.reset_index()
# 3. Construct the dataframe from the historical data
# Only use the Adj Close, and use the open price
# of the current day. Then shift all the other
# data 1 day to make the dataset include the
# previous day's values for each.
# (This is because on the trading day, we won't know what the
# High or Low or Close or Volume is, but we would
# know the Open.)
dataset = dataset.drop(['Close'],axis=1)
dataset['PrevOpen'] = dataset['Open'].shift(1)
dataset['PrevHigh'] = dataset['High'].shift(1)
dataset['PrevLow'] = dataset['Low'].shift(1)
dataset['PrevAdjClose'] = dataset['Adj Close'].shift(1)
dataset['PrevVol'] = dataset['Volume'].shift(1)
dataset = dataset.drop(['High'],axis=1)
dataset = dataset.drop(['Low'],axis=1)
dataset = dataset.drop(['Volume'],axis=1)
# Add in moving averages based on Opening prices
dataset['9MA'] = dataset['Open'].rolling(window=9).mean()
dataset['20MA'] = dataset['Open'].rolling(window=20).mean()
# Get which industry the stock is in to get the industry performance data
from bs4 import BeautifulSoup
import requests
headers = requests.utils.default_headers()
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
# Get the industry name of the stock
url = 'https://finance.yahoo.com/quote/' + ticker_input + '/profile'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('p', {'class' :'D(ib) Va(t)'})
industry = table.findAll('span')
indust = industry[3].text
print(indust)
print('Getting Industry ETF historical data...')
# Then get historical data for that industry's ETF
if indust == "Biotechnology":
etf_ticker = "IBB"
elif indust == "Specialty Retail":
etf_ticker = "XRT"
elif indust == "Oil & Gas E&P":
etf_ticker = "XOP"
ex = 'yahoo'
etf_df = web.DataReader(etf_ticker, ex, start, end)
dataset['PrevIndOpen'] = etf_df['Open'].shift(1)
dataset['PrevIndHigh'] = etf_df['High'].shift(1)
dataset['PrevIndLow'] = etf_df['Low'].shift(1)
dataset['PrevIndClose'] = etf_df['Adj Close'].shift(1)
dataset['PrevIndVol'] = etf_df['Volume'].shift(1)
# Reshape the dataframe to put Adj Close at the far right
# so when we export the predictions dataset, the predictions
# column will be right next to it for easier analysis
dataset = dataset[['Open','9MA','20MA','PrevOpen','PrevHigh','PrevLow','PrevAdjClose','PrevVol','PrevIndOpen','PrevIndHigh','PrevIndLow','PrevIndClose','PrevIndVol','Adj Close']]
# Disable the Future Warnings that repeat "needlessly" (for now)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")
# 5. Explore the inital dataset
# Show the shape of the dataset
print("[INFO] features shape : {}".format(dataset.shape))
# Print the feature names
print("[INFO] dataset names : {}".format(dataset.columns))
# Convert the dataframe into a Pandas dataframe and print the first 5 rows
df = pd.DataFrame(dataset)
print("[INFO] df type : {}".format(type(df)))
print("[INFO] df shape: {}".format(df.shape))
print(df.head())
# Specify the column names and print
df.columns = dataset.columns
#print('[INFO] df shape with features:')
#print(df.head())
# This prints the same as above
# Find any columns with missing values? If you find them, you either have to:
# 1. Replace the missing value with a large negative number (e.g. -999).
# 2. Replace the missing value with mean of the column.
# 3. Replace the missing value with median of the column.
# Because of our 1 day shift, the first row will have empty values,
# so we'll drop them as one day won't make much difference in our entire model
print('[INFO] Checking for any columns with no values...')
df = df.dropna(how='any')
print(pd.isnull(df).any())
# Ensure numeric datatypes of the dataframe.
# If a column has different datatype such as string or character,
# we need to map that column to a numeric datatype such as integer
# or float. For this dataset, the Date index column is one.
print('[INFO] Feature types:')
print(df.dtypes)
# Print a statistical summary of the dataset for reference
print('[INFO] Print a statistical summary of dataset:')
print(df.describe())
# # Reset the index column for FeatureTools to use Date as the index, then it'll revert it back after feature stuff is done
# df = df.reset_index()
# This is not good way to drop the rows here because if there are any
# nan values in the middle of the dataset, those will get lost too.
# Need to work with this
df = df.dropna()
print(df)
# 4. Hold out a prediction dataset to predict on later
prediction_df = df.tail(90).copy()
df = df.iloc[:-90,:].copy() # subtracting 90 rows/days from the dataset to use as the predictions dataset later
# 7. Split the dataset into features (X) and target (Y)
# Split into features (x) and target (y) and print the shapes of them
X = df.drop("Adj Close", axis=1)
Y = df["Adj Close"]
print('Shape of features: ', X.shape)
print('Shape of target: ', Y.shape)
# Standardize the data. Commenting this out until you can figure out how to
# unscale the prediction dataset for review
#from sklearn.preprocessing import StandardScaler, MinMaxScaler
#scaler = MinMaxScaler().fit(X)
#scaled_X = scaler.transform(X)
print('Printing X and Y shape :')
print(X.shape)
print(Y.shape)
# 8. Split dataset into training and validation data
# Split the data into training and testing data and print their shapes
from sklearn.model_selection import train_test_split
seed = 9
test_size = 0.25
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
print('[INFO] Printing the shapes of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
X_train=X_train.values
X_test=X_test.values
Y_train=Y_train.values
Y_test=Y_test.values
print('[INFO] Printing the arrays of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
# 9. Start a TPOT Auto Regression to find the best Regression model and export feature importances
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from tpot import TPOTRegressor
import os
# Create a custom config dictionary for TPOT to use.
# I've made this list full of Regressors that use the
# .feature_importances_ attribute. How to implement XGBoost
# into the plotting of feature importances below? IF XGBOOST is
# present in the final model, then plot one way, ELSE, plot the
# way it is now?
tpot_config = {
'sklearn.ensemble.ExtraTreesRegressor': {
'n_estimators': [100],
'max_features': np.arange(0.05, 1.01, 0.05),
'min_samples_split': range(2, 21),
'min_samples_leaf': range(1, 21),
'bootstrap': [True, False]
},
'sklearn.tree.DecisionTreeRegressor': {
'max_depth': range(1, 11),
'min_samples_split': range(2, 21),
'min_samples_leaf': range(1, 21)
},
'sklearn.ensemble.RandomForestRegressor': {
'n_estimators': [100],
'max_features': np.arange(0.05, 1.01, 0.05),
'min_samples_split': range(2, 21),
'min_samples_leaf': range(1, 21),
'bootstrap': [True, False]
},
# Preprocesssors
'sklearn.preprocessing.Binarizer': {
'threshold': np.arange(0.0, 1.01, 0.05)
},
'sklearn.decomposition.FastICA': {
'tol': np.arange(0.0, 1.01, 0.05)
},
'sklearn.cluster.FeatureAgglomeration': {
'linkage': ['ward', 'complete', 'average'],
'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']
},
'sklearn.preprocessing.MaxAbsScaler': {
},
'sklearn.preprocessing.MinMaxScaler': {
},
'sklearn.preprocessing.Normalizer': {
'norm': ['l1', 'l2', 'max']
},
'sklearn.kernel_approximation.Nystroem': {
'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'],
'gamma': np.arange(0.0, 1.01, 0.05),
'n_components': range(1, 11)
},
'sklearn.decomposition.PCA': {
'svd_solver': ['randomized'],
'iterated_power': range(1, 11)
},
'sklearn.preprocessing.PolynomialFeatures': {
'degree': [2],
'include_bias': [False],
'interaction_only': [False]
},
'sklearn.kernel_approximation.RBFSampler': {
'gamma': np.arange(0.0, 1.01, 0.05)
},
'sklearn.preprocessing.RobustScaler': {
},
'sklearn.preprocessing.StandardScaler': {
},
'tpot.builtins.ZeroCount': {
},
'tpot.builtins.OneHotEncoder': {
'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
'sparse': [False],
'threshold': [10]
},
# Selectors
'sklearn.feature_selection.SelectFwe': {
'alpha': np.arange(0, 0.05, 0.001),
'score_func': {
'sklearn.feature_selection.f_regression': None
}
},
'sklearn.feature_selection.SelectPercentile': {
'percentile': range(1, 100),
'score_func': {
'sklearn.feature_selection.f_regression': None
}
},
'sklearn.feature_selection.VarianceThreshold': {
'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]
},
'sklearn.feature_selection.SelectFromModel': {
'threshold': np.arange(0, 1.01, 0.05),
'estimator': {
'sklearn.ensemble.ExtraTreesRegressor': {
'n_estimators': [100],
'max_features': np.arange(0.05, 1.01, 0.05)
}
}
}
}
# Cross Validation folds to run
folds = 10
# Start the TPOT regression
best_model = TPOTRegressor(use_dask=True,n_jobs=-1,config_dict=tpot_config, cv=folds,
generations=5, population_size=20, verbosity=2, random_state=seed) #memory='./PipelineCache', memory='auto',
best_model.fit(X_train, Y_train)
# Export the TPOT pipeline if you want to use it for anything later
if os.path.exists('./Exported Pipelines'):
pass
else:
os.mkdir('./Exported Pipelines')
best_model.export('./Exported Pipelines/' + ticker_input + '-prediction-pipeline.py')
# Extract what the best pipeline was and fit it to the training set
# to get an idea of the most important features used by the model
exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1]
# Train the `exctracted_best_model` using the training/vildation set.
# You need to use the whole dataset in order to get feature importance for all the
# features in your dataset.
exctracted_best_model.fit(X_train, Y_train)
# plot model's feature importance and save the plot for later
feature_importance = exctracted_best_model.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, df.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.savefig("feature_importance.png")
plt.clf()
plt.close()
print(X_test.shape)
# 10. See the stats of the validation predictions from the tuned model and export more plots
# Make predictions using the tuned model and display error metrics
# R2 and Explained Variance, best is 1
predictions = best_model.predict(X_test)
print('=============================')
print("TPOT's final score on testing dataset is : ", best_model.score(X_test, Y_test))
print('=============================')
print("[INFO] MSE on test set : {}".format(round(mean_squared_error(Y_test, predictions), 3)))
print('[INFO] R2 Score on test set : {}'.format(round(r2_score(Y_test, predictions), 3)))
print('[INFO] Explained Variance Score on test set : {}'.format(round(explained_variance_score(Y_test, predictions), 3)))
# Plot between predictions and Y_test
x_axis = np.array(range(0, predictions.shape[0]))
plt.plot(x_axis, predictions, linestyle="--", marker="o", alpha=0.7, color='r', label="predictions")
plt.plot(x_axis, Y_test, linestyle="--", marker="o", alpha=0.7, color='g', label="Y_test")
plt.xlabel('Row number')
plt.ylabel('PRICE')
plt.title('Predictions vs Y_test')
plt.legend(loc='lower right')
plt.savefig("predictions_vs_ytest.png")
plt.clf()
plt.close()
# 11. Use the model on the held-out prediction dataset
# Now, run the model on the prediction dataset
features = prediction_df.drop(['Adj Close'], axis=1)
labels = prediction_df['Adj Close']
# Fit the model to the prediction_df and predict the labels
#tpot.fit(features, labels)
results = best_model.predict(features)
predictions_list = []
for preds in results:
predictions_list.append(preds)
prediction_df['Predictions'] = predictions_list
prediction_df.to_csv('Final Predictions Performance.csv', index=True)
print('============================')
print("[INFO] MSE on prediction set : {}".format(round(mean_squared_error(labels, results), 3)))
print('[INFO] R2 Score on prediction set : {}'.format(round(r2_score(labels, results), 3)))
print('[INFO] Explained Variance Score on prediction set : {}'.format(round(explained_variance_score(labels, results), 3)))
# 12. Review the exported .csv file of the predictions, and review all your plots
print('DONE!')
if __name__ == "__main__":
main()
It looks like I may have found a solution. I've run a few models using XGBRegressor and RandomDecisionTrees and it seems to be working.
Just had to turn make "X_train=X_train.values", and "X_test=X_test.values", but leave the Y's alone as a dataframe because when I changed both groups, I got an error. So I'm leaving it as this for now.