deep-learning pytorch regression prediction

Deep learning predictions with similar values

Many of my deep learning predictions have nearly the same value, which shows up as a horizontal line in the correlation plot.

I generated a small dataset that reproduces the problem (data), but my real dataset is much larger. That is why the layers are so big, but I get the same problem if I shrink them to fit this simplified case.

If I predict the target values with another algorithm, such as random forest, I get an R of 0.4 on this small dataset. With the full dataset, if I run the deep learning model and then remove all the values from the horizontal line, I get an R similar to that of random forest. I don't know why the model does not predict the samples on the horizontal line in the same way. Do you have any clue?

Here is code that reproduces the problem, along with some correlation plots:

import torch, torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

var='target'
data = pd.read_csv('data800.csv', index_col=0)

train_dataset = data.sample(frac=0.8,random_state=1)
test_dataset = data.drop(train_dataset.index)

train_labels = train_dataset.pop(var)
test_labels = test_dataset.pop(var)

model = nn.Sequential(nn.Linear(train_dataset.shape[1], 1024), nn.ReLU(), nn.BatchNorm1d(1024),                   
                      nn.Linear(1024, 128), nn.ReLU(),  nn.BatchNorm1d(128),
                      nn.Linear(128, 64), nn.ReLU(),  nn.BatchNorm1d(64),
                      nn.Linear(64, 1))
optim = torch.optim.Adam(model.parameters(), 0.01)

for epoch in range(200):
    yhat = model(torch.tensor(train_dataset.values).to(torch.float32))
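    # Pass the underlying NumPy array; building a tensor directly from a pandas Series is not reliable
    loss = nn.MSELoss()(yhat.ravel(), torch.tensor(train_labels.values).to(torch.float32))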
    optim.zero_grad()
    loss.backward()
    optim.step()
    yhatt=model(torch.tensor(test_dataset.values).to(torch.float32))
    yhatt = yhatt.detach().numpy()
    score = np.corrcoef(test_labels, yhatt.reshape(test_labels.shape))
    if epoch % 20 == 0:
        print('epoch', epoch, '| loss:', loss.item(), '| R:', score[0,1])

yhat = model(torch.tensor(test_dataset.values).to(torch.float32))

yhat = yhat.detach().numpy()
plt.scatter(test_labels, yhat)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.axis('equal')
plt.axis('square')
_ = plt.plot([-1000, 1000], [-1000, 1000])
plt.show()
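One way to quantify the horizontal line described above is to look for the most frequent predicted value and compare the correlation with and without the samples that land on it. The sketch below is only an illustration on synthetic data: the helper `find_plateau`, the rounding tolerance, and the generated arrays are assumptions, not part of the original script or dataset.

```python
import numpy as np

def find_plateau(y_pred, decimals=3):
    """Return the most frequent (rounded) prediction and a mask of samples that hit it."""
    rounded = np.round(np.asarray(y_pred, dtype=float), decimals)
    values, counts = np.unique(rounded, return_counts=True)
    plateau_value = values[np.argmax(counts)]
    return plateau_value, rounded == plateau_value

# Synthetic stand-in data (NOT the data from the question)
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
y_pred = 0.5 * y_true + rng.normal(scale=0.5, size=200)
y_pred[:60] = 1.0  # hypothetical plateau: 60 samples collapse to one value

plateau_value, on_plateau = find_plateau(y_pred)

# Pearson R on all samples versus only the samples off the plateau
r_all = np.corrcoef(y_true, y_pred)[0, 1]
r_rest = np.corrcoef(y_true[~on_plateau], y_pred[~on_plateau])[0, 1]

print(f"plateau at {plateau_value:.3f} covers {on_plateau.mean():.0%} of samples")
print(f"R (all) = {r_all:.2f}, R (plateau removed) = {r_rest:.2f}")
```

With real predictions, `y_true` and `y_pred` would be replaced by `test_labels.values` and `yhat.ravel()`; inspecting the feature values of the samples selected by the mask may hint at what they have in common.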

Solution

I think the issue was that each feature typically had a mixed distribution. ML algorithms generally work best when the features are symmetrically distributed and on a similar scale. I transformed each feature to a uniform distribution by replacing its values with their percentiles, which flattens the distribution.
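As a compact illustration of the percentile idea, the sketch below maps a skewed feature to roughly uniform values in [0, 1] using the empirical distribution of the training data. It is a simplified variant built on `np.searchsorted`, not the binning code used in this answer; the helper name `percentile_rank` and the log-normal sample data are assumptions made for the example.

```python
import numpy as np

def percentile_rank(train_col, values):
    """Map values to [0, 1] using the empirical CDF of the training column."""
    sorted_train = np.sort(np.asarray(train_col, dtype=float))
    # side="right" counts how many training values are <= each input value
    counts = np.searchsorted(sorted_train, np.asarray(values, dtype=float), side="right")
    return counts / sorted_train.size

# Hypothetical skewed feature (log-normal), split into train and test
rng = np.random.default_rng(0)
train = rng.lognormal(size=1000)
test = rng.lognormal(size=200)

# The ranks are always computed against the training data
train_u = percentile_rank(train, train)
test_u = percentile_rank(train, test)

print("raw train: min/median/max =", train.min(), np.median(train), train.max())
print("transformed train: min/median/max =", train_u.min(), np.median(train_u), train_u.max())
```

Test values outside the training range simply map to 0 or 1, which plays the same role as the open-ended first and last bins in a binning approach.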

The model then converged better. I also tweaked the architecture: it initially stepped up from ~50 features to 1024, so I changed to a tapered architecture that gradually scales down from the input feature size. That also improved the results. The final train RMSE was 0.14 and the test-set r was 0.42. Code below.

import torch, torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

var = 'target'
data = pd.read_csv('data800.csv', index_col=0)

train_dataset = data.sample(frac=0.8, random_state=1)
test_dataset = data.drop(train_dataset.index)

train_labels = train_dataset.pop(var)
test_labels = test_dataset.pop(var)

# Flatten each feature's distribution by replacing every value with its percentile bin
train_dataset_transformed = train_dataset.copy()
test_dataset_transformed = test_dataset.copy()

bin_res = 0.2  # bin width, in percentile points
eval_percentiles = np.arange(bin_res, 100, bin_res)

for feature in train_dataset.columns:
    # Bin edges are estimated from the training data only
    percentiles = np.percentile(train_dataset[feature], eval_percentiles).tolist()
    bins = [-np.inf] + percentiles + [np.inf]

    # Apply the same edges to both train and test data;
    # labels=False returns the integer bin index instead of an interval
    train_dataset_transformed[feature] = pd.cut(
        train_dataset[feature], bins=bins, labels=False
    ).astype(np.float32)

    test_dataset_transformed[feature] = pd.cut(
        test_dataset[feature], bins=bins, labels=False
    ).astype(np.float32)

#Hist before and after:
# plt.hist(train_dataset.iloc[:, 0])
# plt.hist(train_dataset_transformed.iloc[:, 0], bins=100)
n_feat = train_dataset.shape[1]

# Seed before building the model so the weight initialization is reproducible
torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(n_feat, n_feat), nn.ReLU(), nn.BatchNorm1d(n_feat),                   
    nn.Linear(n_feat, n_feat // 2), nn.ReLU(), nn.BatchNorm1d(n_feat // 2),                   
    # nn.Linear(n_feat // 2, n_feat // 2), nn.ReLU(),  nn.BatchNorm1d(n_feat // 2),
    nn.Linear(n_feat // 2, n_feat // 4), nn.ReLU(),  nn.BatchNorm1d(n_feat // 4),
    # nn.Linear(n_feat // 4, n_feat // 4), nn.ReLU(),  nn.BatchNorm1d(n_feat // 4),
    nn.Linear(n_feat // 4, 1)
)

optim = torch.optim.Adam(model.parameters(), 0.01)

# Standardize the transformed features using statistics from the training data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(train_dataset_transformed)

X_train = scaler.transform(train_dataset_transformed)
X_test = scaler.transform(test_dataset_transformed)

#Convert to tensors
X_train = torch.tensor(X_train).float()
y_train = torch.tensor(train_labels.values).float()

X_test = torch.tensor(X_test).float()
y_test = torch.tensor(test_labels.values).float()

torch.manual_seed(0)
for epoch in range(1770):
    yhat = model(X_train)

    loss = nn.MSELoss()(yhat.ravel(), y_train)
    optim.zero_grad()
    loss.backward()
    optim.step()

    with torch.no_grad():
        yhatt = model(X_test)
        score = np.corrcoef(y_test, yhatt.ravel())
        if epoch % 30 == 0:
            print('epoch', epoch, '| loss:', loss.item(), '| R:', score[0, 1])

yhat = model(X_test)
yhat = yhat.detach().numpy()
plt.scatter(test_labels, yhat)
ax_lims = plt.gca().axis()
plt.plot([0, 100], [0, 100], 'k:', label='y=x')
plt.gca().axis(ax_lims)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.legend()
plt.show()
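To put the architecture change in perspective, the following sketch compares the number of weights and biases in the linear layers of the two networks. It assumes exactly 50 input features (the answer only says "~50"), ignores the BatchNorm parameters, and uses a hypothetical helper, `linear_param_count`.

```python
def linear_param_count(widths):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

n_feat = 50  # assumption: the answer only states "~50 features"

# Layer widths of the original network and of the tapered one
original_widths = [n_feat, 1024, 128, 64, 1]
tapered_widths = [n_feat, n_feat, n_feat // 2, n_feat // 4, 1]

original_params = linear_param_count(original_widths)
tapered_params = linear_param_count(tapered_widths)

print("original:", original_params)  # 191745
print("tapered: ", tapered_params)   # 4150
```

Under these assumptions the tapered network has far fewer parameters to fit, which may be part of why it behaves better on a small dataset.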