python machine-learning ensemble-learning boosting

Low training (~64%) and test accuracy (~14%) with 5 different models

I'm struggling to find a learning algorithm that works for my dataset.

I am working with a typical regression problem. There are 6 features in the dataset that I am concerned with, and about 800 data points. The features and the predicted values have a high non-linear correlation, so the features are not useless (as far as I understand). The predicted values have a bimodal distribution, so I disregarded linear models pretty quickly.

So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and xgb regressor. The training data returns about 64% accuracy and the test data returns 11%-14%. Both numbers scare me haha. I tried tuning the parameters for the random forest, but nothing seems to make a drastic difference.

Function to tune the parameters

from sklearn.model_selection import GridSearchCV

def hyperparatuning(model, train_features, train_labels, param_grid = {}):
    # Exhaustive search over param_grid using 3-fold cross-validation
    grid_search = GridSearchCV(estimator = model, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_

Function to evaluate the model

import numpy as np

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100*np.mean(errors/test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees. '.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%. '.format(accuracy))

I expected the output to be at least, ya know, acceptable, but instead I got 64% on the training data and 12-14% on the test data. It is a real horror to look at these numbers!

Solution

There are several issues with your question.

For starters, you are trying to use accuracy in what seems to be a regression problem, which is meaningless.

Although you don't provide the exact models (it would arguably be a good idea), this line in your evaluation function

errors = abs(predictions - test_labels)

is actually the basis of the mean absolute error (MAE), although you should take its mean, as the name implies. MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next

accuracy = 100 - mape

does not actually hold, nor is it used in practice.
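As a rough sketch of what reporting proper regression metrics could look like instead: the helper name `regression_report` and the toy numbers are hypothetical, not taken from your code. It computes MAE, RMSE and R² by hand with NumPy; scikit-learn's `sklearn.metrics` module offers `mean_absolute_error`, `mean_squared_error` and `r2_score` for the same quantities.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R^2 for a set of predictions (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))            # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))     # root mean squared error
    ss_res = np.sum(residuals ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                  # coefficient of determination

    return {"mae": float(mae), "rmse": float(rmse), "r2": float(r2)}

# Toy values, made up purely for illustration
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
metrics = regression_report(y_true, y_pred)
print(metrics)
```

Unlike the "accuracy" above, each of these numbers has a standard interpretation: MAE and RMSE are in the units of the target, and R² compares the model against always predicting the mean.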

It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use; here is a partial list from Wikipedia:

It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.

For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
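Both drawbacks are easy to reproduce with a few numbers. A minimal sketch in plain Python, where `ape` is a hypothetical helper and the values are made up for illustration:

```python
def ape(actual, forecast):
    """Absolute percentage error of a single forecast, in percent."""
    return 100 * abs(forecast - actual) / abs(actual)

# Asymmetry: a non-negative forecast that is too low is capped at 100% error...
low = ape(actual=100, forecast=0)
# ...while a forecast that is too high has no upper limit.
high = ape(actual=100, forecast=500)
print(low, high)

# With a large enough error, "100 - MAPE" even becomes negative.
pseudo_accuracy = 100 - high
print(pseudo_accuracy)

# A zero actual value makes the percentage error undefined.
try:
    ape(actual=0, forecast=5)
    zero_is_defined = True
except ZeroDivisionError:
    zero_is_defined = False
print(zero_is_defined)
```

With NumPy arrays the same division does not raise; it yields `inf` or `nan` together with a runtime warning, which then silently propagates into the mean.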