python python-3.x pandas numpy tpot

Converting Pandas DF to Numpy Array gives me a # of features error when trying to predict?


I have a TPOT regressor set up to predict stock prices on a dataset (after some feature engineering). I ran into an issue whenever the XGBoost regressor was involved: I would receive an error that said:

feature_names mismatch:

...followed by the list of column names for my dataset. A solution suggested in a GitHub issue was to convert the X feature and Y label dataframes to NumPy arrays during the train_test_split, so that's what I did, but now I receive a different error:

X_train, X_test, Y_train, Y_test = train_test_split(X.values, Y.values, test_size = test_size, random_state = seed)
print('[INFO] Printing the shapes of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)


[INFO] Printing the shapes of the training/testing feature/label sets...
(1374, 68)
(459, 68)
(1374,)
(459,)

Best pipeline: ExtraTreesRegressor(DecisionTreeRegressor(input_matrix, max_depth=1, min_samples_leaf=9, min_samples_split=11), bootstrap=False, max_features=0.8500000000000001, min_samples_leaf=1, min_samples_split=9, n_estimators=100)

Traceback (most recent call last):
  File "main2.py", line 656, in <module>
    predictions = best_model.predict(X_test)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\tpot\base.py", line 921, in predict
    return self.fitted_pipeline_.predict(features)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\metaestimators.py", line 116, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\pipeline.py", line 422, in predict
    return self.steps[-1][-1].predict(Xt, **predict_params)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\ensemble\forest.py", line 693, in predict
    X = self._validate_X_predict(X)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\ensemble\forest.py", line 359, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\tree\tree.py", line 402, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 68 and input n_features is 69

The GitHub issue is now closed, but I'm hoping someone here can explain what I'm missing. As you can see, there are 68 feature columns and 1 label column. You'll also notice that this time the pipeline doesn't even use XGBoost, but I'd like any model TPOT comes up with to work with the .predict() function.
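
Here's roughly the kind of sanity check I've been running before calling .predict() (a sketch only; check_feature_widths is just a throwaway debugging helper of mine, not part of the script below, and 'Adj Close' is the label column name used in that script):

import numpy as np

def check_feature_widths(X, X_train, X_test, label_name='Adj Close'):
    # The original feature set and both splits should all have the same width
    print('X columns      :', np.asarray(X).shape[1])
    print('X_train columns:', np.asarray(X_train).shape[1])
    print('X_test columns :', np.asarray(X_test).shape[1])
    # If X is still a DataFrame, make sure the label hasn't crept back into it
    if hasattr(X, 'columns'):
        print('Label still in features?', label_name in X.columns)

check_feature_widths(X, X_train, X_test)

Every width it prints comes back the same for me, which is why the extra column in the error message has me stumped.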

UPDATE WITH CODE

OK, I'm seriously stuck here. I've posted working code below that reproduces the error; let me know what you see. Use the stock ticker CLVS as input. I've printed the shapes of the dataframes and arrays throughout the entire process, and they all look fine, so what am I not seeing? You'll need Pandas 0.23 (yes, an old version), TPOT, and Dask installed. Thanks:

def main():


    # 1. Input a stock ticker
    ticker_input = input('Which stock ticker would you like to predict?') # Start with CLVS for testing
    print('Getting the historical data for: ',ticker_input)










    # 2. Download the historical daily data
    # Import dependencies
    from datetime import datetime
    from pandas_datareader import data as web
    import pandas as pd
    pd.options.display.float_format = '{:,.2f}'.format
    import seaborn as sns
    import matplotlib.pyplot as plt
    import random
    import os
    import numpy as np
    import time
    # Downloading historical data as dataframe
    ex = 'yahoo'
    start = datetime(2000, 1, 1)
    end = datetime.now()
    dataset = web.DataReader(ticker_input, ex, start, end) #.reset_index()









    # 3. Construct the dataframe from the historical data
    # Only use the Adj Close, and use the open price
    # of the current day. Then shift all the other
    # data 1 day to make the dataset include the 
    # previous day's values for each. 

    # (This is because on the trading day, we won't know what the 
    # High or Low or Close or Volume is, but we would
    # know the Open.)
    dataset = dataset.drop(['Close'],axis=1)
    dataset['PrevOpen'] = dataset['Open'].shift(1)
    dataset['PrevHigh'] = dataset['High'].shift(1)
    dataset['PrevLow'] = dataset['Low'].shift(1)
    dataset['PrevAdjClose'] = dataset['Adj Close'].shift(1)
    dataset['PrevVol'] = dataset['Volume'].shift(1)

    dataset = dataset.drop(['High'],axis=1)
    dataset = dataset.drop(['Low'],axis=1)
    dataset = dataset.drop(['Volume'],axis=1)

    # Add in moving averages based on Opening prices
    dataset['9MA'] = dataset['Open'].rolling(window=9).mean()
    dataset['20MA'] = dataset['Open'].rolling(window=20).mean()



    # Get which industry the stock is in to get the industry performance data
    from bs4 import BeautifulSoup
    import requests
    headers = requests.utils.default_headers() 
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    # Get the industry name of the stock
    url = 'https://finance.yahoo.com/quote/' + ticker_input + '/profile'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    table = soup.find('p', {'class' :'D(ib) Va(t)'})
    industry = table.findAll('span')
    indust = industry[3].text
    print(indust)
    print('Getting Industry ETF historical data...')
    # Then get historical data for that industry's ETF
    if indust == "Biotechnology":
        etf_ticker = "IBB"
    elif indust == "Specialty Retail":
        etf_ticker = "XRT"
    elif indust == "Oil & Gas E&P":
        etf_ticker = "XOP"
    ex = 'yahoo'
    etf_df = web.DataReader(etf_ticker, ex, start, end)
    dataset['PrevIndOpen'] = etf_df['Open'].shift(1)
    dataset['PrevIndHigh'] = etf_df['High'].shift(1)
    dataset['PrevIndLow'] = etf_df['Low'].shift(1)
    dataset['PrevIndClose'] = etf_df['Adj Close'].shift(1)
    dataset['PrevIndVol'] = etf_df['Volume'].shift(1)





    # Reshape the dataframe to put Adj Close at the far right
    # so when we export the predictions dataset, the predictions
    # column will be right next to it for easier analysis
    dataset = dataset[['Open','9MA','20MA','PrevOpen','PrevHigh','PrevLow','PrevAdjClose','PrevVol','PrevIndOpen','PrevIndHigh','PrevIndLow','PrevIndClose','PrevIndVol','Adj Close']]










    # Disable the Future Warnings that repeat "needlessly" (for now)
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    warnings.filterwarnings("ignore")









    # 4. Explore the initial dataset
    # Show the shape of the dataset
    print("[INFO] features shape : {}".format(dataset.shape))

    # Print the feature names
    print("[INFO] dataset names : {}".format(dataset.columns))

    # Copy the dataset into a new Pandas DataFrame and print the first 5 rows
    df = pd.DataFrame(dataset)
    print("[INFO] df type : {}".format(type(df)))
    print("[INFO] df shape: {}".format(df.shape))
    print(df.head())

    # Specify the column names and print
    df.columns = dataset.columns
    #print('[INFO] df shape with features:')
    #print(df.head())
    # This prints the same as above

    # Find any columns with missing values? If you find them, you either have to:
    # 1. Replace the missing value with a large negative number (e.g. -999).
    # 2. Replace the missing value with mean of the column.
    # 3. Replace the missing value with median of the column.
    # Because of our 1 day shift, the first row will have empty values,
    # so we'll drop them as one day won't make much difference in our entire model
    print('[INFO] Checking for any columns with no values...')
    df = df.dropna(how='any')
    print(pd.isnull(df).any())


    # Ensure numeric datatypes of the dataframe.
    # If a column has different datatype such as string or character, 
    # we need to map that column to a numeric datatype such as integer 
    # or float. For this dataset, the Date index column is one.
    print('[INFO] Feature types:')
    print(df.dtypes)

    # Print a statistical summary of the dataset for reference
    print('[INFO] Print a statistical summary of dataset:')
    print(df.describe())




    # # Reset the index column for FeatureTools to use Date as the index, then it'll revert it back after feature stuff is done
    # df = df.reset_index()


    # This is not a good way to drop the rows here, because if there are any
    # nan values in the middle of the dataset, those will get lost too.
    # Need to work with this
    df = df.dropna()
    print(df)


    # 5. Hold out a prediction dataset to predict on later
    prediction_df = df.tail(90).copy()
    df = df.iloc[:-90,:].copy() # subtracting 90 rows/days from the dataset to use as the predictions dataset later








    # 7. Split the dataset into features (X) and target (Y)
    # Split into features (x) and target (y) and print the shapes of them
    X = df.drop("Adj Close", axis=1)
    Y = df["Adj Close"]
    print('Shape of features: ', X.shape)
    print('Shape of target: ', Y.shape)
    # Standardize the data. Commenting this out until you can figure out how to
    # unscale the prediction dataset for review
    #from sklearn.preprocessing import StandardScaler, MinMaxScaler
    #scaler = MinMaxScaler().fit(X)
    #scaled_X = scaler.transform(X)

    print('Printing X and Y shape :')
    print(X.shape)
    print(Y.shape)







    # 8. Split dataset into training and validation data
    # Split the data into training and testing data and print their shapes
    from sklearn.model_selection import train_test_split
    seed = 9
    test_size = 0.25
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
    print('[INFO] Printing the shapes of the training/testing feature/label sets...')
    print(X_train.shape)
    print(X_test.shape)
    print(Y_train.shape)
    print(Y_test.shape)

    X_train=X_train.values
    X_test=X_test.values
    Y_train=Y_train.values
    Y_test=Y_test.values
    print('[INFO] Printing the arrays of the training/testing feature/label sets...')
    print(X_train.shape)
    print(X_test.shape)
    print(Y_train.shape)
    print(Y_test.shape)







    # 9. Start a TPOT Auto Regression to find the best Regression model and export feature importances
    from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
    from tpot import TPOTRegressor
    import os
    # Create a custom config dictionary for TPOT to use.
    # I've made this list full of Regressors that use the
    # .feature_importances_ attribute. How to implement XGBoost
    # into the plotting of feature importances below? IF XGBOOST is 
    # present in the final model, then plot one way, ELSE, plot the
    # way it is now?
    tpot_config = {



        'sklearn.ensemble.ExtraTreesRegressor': {
            'n_estimators': [100],
            'max_features': np.arange(0.05, 1.01, 0.05),
            'min_samples_split': range(2, 21),
            'min_samples_leaf': range(1, 21),
            'bootstrap': [True, False]
        },



        'sklearn.tree.DecisionTreeRegressor': {
            'max_depth': range(1, 11),
            'min_samples_split': range(2, 21),
            'min_samples_leaf': range(1, 21)
        },

        'sklearn.ensemble.RandomForestRegressor': {
            'n_estimators': [100],
            'max_features': np.arange(0.05, 1.01, 0.05),
            'min_samples_split': range(2, 21),
            'min_samples_leaf': range(1, 21),
            'bootstrap': [True, False]
        },


        # Preprocessors
        'sklearn.preprocessing.Binarizer': {
            'threshold': np.arange(0.0, 1.01, 0.05)
        },

        'sklearn.decomposition.FastICA': {
            'tol': np.arange(0.0, 1.01, 0.05)
        },

        'sklearn.cluster.FeatureAgglomeration': {
            'linkage': ['ward', 'complete', 'average'],
            'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']
        },

        'sklearn.preprocessing.MaxAbsScaler': {
        },

        'sklearn.preprocessing.MinMaxScaler': {
        },

        'sklearn.preprocessing.Normalizer': {
            'norm': ['l1', 'l2', 'max']
        },

        'sklearn.kernel_approximation.Nystroem': {
            'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'],
            'gamma': np.arange(0.0, 1.01, 0.05),
            'n_components': range(1, 11)
        },

        'sklearn.decomposition.PCA': {
            'svd_solver': ['randomized'],
            'iterated_power': range(1, 11)
        },

        'sklearn.preprocessing.PolynomialFeatures': {
            'degree': [2],
            'include_bias': [False],
            'interaction_only': [False]
        },

        'sklearn.kernel_approximation.RBFSampler': {
            'gamma': np.arange(0.0, 1.01, 0.05)
        },

        'sklearn.preprocessing.RobustScaler': {
        },

        'sklearn.preprocessing.StandardScaler': {
        },

        'tpot.builtins.ZeroCount': {
        },

        'tpot.builtins.OneHotEncoder': {
            'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
            'sparse': [False],
            'threshold': [10]
        },


        # Selectors
        'sklearn.feature_selection.SelectFwe': {
            'alpha': np.arange(0, 0.05, 0.001),
            'score_func': {
                'sklearn.feature_selection.f_regression': None
            }
        },

        'sklearn.feature_selection.SelectPercentile': {
            'percentile': range(1, 100),
            'score_func': {
                'sklearn.feature_selection.f_regression': None
            }
        },

        'sklearn.feature_selection.VarianceThreshold': {
            'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]
        },

        'sklearn.feature_selection.SelectFromModel': {
            'threshold': np.arange(0, 1.01, 0.05),
            'estimator': {
                'sklearn.ensemble.ExtraTreesRegressor': {
                    'n_estimators': [100],
                    'max_features': np.arange(0.05, 1.01, 0.05)
                }
            }
        }

    }

    # Cross Validation folds to run
    folds   = 10
    # Start the TPOT regression
    best_model = TPOTRegressor(use_dask=True,n_jobs=-1,config_dict=tpot_config, cv=folds, 
                               generations=5, population_size=20, verbosity=2, random_state=seed) #memory='./PipelineCache',       memory='auto',
    best_model.fit(X_train, Y_train)

    # Export the TPOT pipeline if you want to use it for anything later
    if os.path.exists('./Exported Pipelines'):
        pass
    else:
        os.mkdir('./Exported Pipelines')
    best_model.export('./Exported Pipelines/' + ticker_input + '-prediction-pipeline.py')

    # Extract what the best pipeline was and fit it to the training set
    # to get an idea of the most important features used by the model
    exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1]
    # Train the `exctracted_best_model` using the training/validation set.
    # You need to use the whole dataset in order to get feature importance for all the
    # features in your dataset.
    exctracted_best_model.fit(X_train, Y_train)

    # plot model's feature importance and save the plot for later
    feature_importance = exctracted_best_model.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos        = np.arange(sorted_idx.shape[0]) + .5
    plt.barh(pos, feature_importance[sorted_idx], align='center')
    plt.yticks(pos, df.columns[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    plt.savefig("feature_importance.png")
    plt.clf()
    plt.close()






    print(X_test.shape)



    # 10. See the stats of the validation predictions from the tuned model and export more plots
    # Make predictions using the tuned model and display error metrics
    # R2 and Explained Variance, best is 1
    predictions = best_model.predict(X_test)
    print('=============================')
    print("TPOT's final score on testing dataset is : ", best_model.score(X_test, Y_test))
    print('=============================')
    print("[INFO] MSE on test set : {}".format(round(mean_squared_error(Y_test, predictions), 3)))
    print('[INFO] R2 Score on test set : {}'.format(round(r2_score(Y_test, predictions), 3)))
    print('[INFO] Explained Variance Score on test set : {}'.format(round(explained_variance_score(Y_test, predictions), 3)))

    # Plot between predictions and Y_test
    x_axis = np.array(range(0, predictions.shape[0]))
    plt.plot(x_axis, predictions, linestyle="--", marker="o", alpha=0.7, color='r', label="predictions")
    plt.plot(x_axis, Y_test, linestyle="--", marker="o", alpha=0.7, color='g', label="Y_test")
    plt.xlabel('Row number')
    plt.ylabel('PRICE')
    plt.title('Predictions vs Y_test')
    plt.legend(loc='lower right')
    plt.savefig("predictions_vs_ytest.png")
    plt.clf()
    plt.close()










    # 11. Use the model on the held-out prediction dataset
    # Now, run the model on the prediction dataset
    features = prediction_df.drop(['Adj Close'], axis=1)
    labels = prediction_df['Adj Close']
    # Fit the model to the prediction_df and predict the labels
    #tpot.fit(features, labels)
    results = best_model.predict(features)
    predictions_list = []
    for preds in results:
        predictions_list.append(preds)
    prediction_df['Predictions'] = predictions_list
    prediction_df.to_csv('Final Predictions Performance.csv', index=True)
    print('============================')
    print("[INFO] MSE on prediction set : {}".format(round(mean_squared_error(labels, results), 3)))
    print('[INFO] R2 Score on prediction set : {}'.format(round(r2_score(labels, results), 3)))
    print('[INFO] Explained Variance Score on prediction set : {}'.format(round(explained_variance_score(labels, results), 3)))










    # 12. Review the exported .csv file of the predictions, and review all your plots
    print('DONE!')


if __name__ == "__main__":
    main()

Solution

  • It looks like I may have found a solution. I've run a few models using XGBRegressor and RandomDecisionTrees and it seems to be working.

    I just had to make X_train = X_train.values and X_test = X_test.values, but leave the Y's alone as dataframes, because when I converted both groups I got an error. So I'm leaving it like this for now.
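
    In other words, the split section now looks roughly like this (a sketch using the same variable names as the script above; only the feature sets get converted):

        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)

        # Convert only the feature sets to NumPy arrays...
        X_train = X_train.values
        X_test = X_test.values
        # ...and leave Y_train / Y_test as pandas objects, since converting
        # them as well is what triggered the error for me.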