With deep learning, many of my predictions come out with nearly the same value, which produces a horizontal line in the correlation plot.
I generated a small dataset that reproduces the problem (data), but my real dataset is much larger. That is why the layers are so big; I get the same problem if I shrink them to fit this simplified case.
If I predict the target values with another algorithm such as random forest, I get an R of 0.4 on this small dataset. With the full dataset, if I run the deep learning method and then remove all the points on the horizontal line, I get an R similar to that of random forest. I don't know why the samples on the horizontal line are not predicted in the same way as the rest. Do you have any clue?
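To make the "remove the horizontal line and recompute R" check concrete, here is a minimal sketch on synthetic data (not my real dataset). The helper name `on_horizontal_line`, the tolerance `tol`, and the made-up numbers are assumptions for illustration only.

```python
import numpy as np

def on_horizontal_line(y_pred, tol=1e-3):
    """Boolean mask of predictions within tol of the most common predicted value."""
    # Bucket predictions on a grid of width tol and find the most populated bucket.
    buckets = np.round(y_pred / tol).astype(np.int64)
    values, counts = np.unique(buckets, return_counts=True)
    mode_value = values[np.argmax(counts)] * tol
    return np.abs(y_pred - mode_value) <= tol

# Synthetic stand-in: predictions track the target, except for a block of
# samples that all collapse to the same value (the "horizontal line").
rng = np.random.default_rng(0)
y_true = rng.normal(0.0, 1.0, 400)
y_pred = y_true + 0.5 * rng.normal(0.0, 1.0, 400)
y_pred[:150] = 0.3

mask = on_horizontal_line(y_pred)
r_all = np.corrcoef(y_true, y_pred)[0, 1]
r_rest = np.corrcoef(y_true[~mask], y_pred[~mask])[0, 1]
print('points on the line:', mask.sum(), '| R all:', r_all, '| R without line:', r_rest)
```

In this toy case R goes up once the collapsed points are dropped, which is the same pattern I see on the full dataset.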
Here is code that reproduces the problem, along with some correlation plots:
import torch, torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

var = 'target'
data = pd.read_csv('data800.csv', index_col=0)

# 80/20 train/test split, then separate the target column from the features
train_dataset = data.sample(frac=0.8, random_state=1)
test_dataset = data.drop(train_dataset.index)
train_labels = train_dataset.pop(var)
test_labels = test_dataset.pop(var)

model = nn.Sequential(nn.Linear(train_dataset.shape[1], 1024), nn.ReLU(), nn.BatchNorm1d(1024),
                      nn.Linear(1024, 128), nn.ReLU(), nn.BatchNorm1d(128),
                      nn.Linear(128, 64), nn.ReLU(), nn.BatchNorm1d(64),
                      nn.Linear(64, 1))
optim = torch.optim.Adam(model.parameters(), 0.01)

for epoch in range(200):
    # Full-batch training step
    yhat = model(torch.tensor(train_dataset.values).to(torch.float32))
    loss = nn.MSELoss()(yhat.ravel(), torch.tensor(train_labels.values).to(torch.float32))
    optim.zero_grad()
    loss.backward()
    optim.step()

    # Correlation between test labels and test predictions
    yhatt = model(torch.tensor(test_dataset.values).to(torch.float32))
    yhatt = yhatt.detach().numpy()
    score = np.corrcoef(test_labels, yhatt.reshape(test_labels.shape))
    if epoch % 20 == 0:
        print('epoch', epoch, '| loss:', loss.item(), '| R:', score[0, 1])

# Correlation plot: predictions vs. true values on the test set
yhat = model(torch.tensor(test_dataset.values).to(torch.float32))
yhat = yhat.detach().numpy()
plt.scatter(test_labels, yhat)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.axis('equal')
plt.axis('square')
_ = plt.plot([-1000, 1000], [-1000, 1000])
plt.show()
I think the issue was that each feature typically had a mixed distribution. ML algorithms generally work best when the features are symmetrically distributed and on a similar scale. I transformed each feature to an approximately uniform distribution by replacing each value with its percentile, which flattens the distribution.
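As a minimal illustration of the percentile idea on a single column, here is a NumPy-only sketch. The function names, the synthetic bimodal feature, and `n_bins=500` (roughly a 0.2-percentile resolution) are assumptions for the example, not part of the dataset in the question.

```python
import numpy as np

def fit_percentile_edges(train_values, n_bins=500):
    """Estimate interior percentile cut points from the training data only."""
    eval_percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.percentile(train_values, eval_percentiles)

def to_percentile_bin(values, edges):
    """Replace each value with the index of the percentile bin it falls into."""
    # side='left' gives right-inclusive bins: edges[i-1] < x <= edges[i]
    return np.searchsorted(edges, values, side='left').astype(np.float32)

# Hypothetical mixed (bimodal, skewed) feature standing in for one real column
rng = np.random.default_rng(0)
train_col = np.concatenate([rng.exponential(1.0, 500), rng.normal(50.0, 2.0, 500)])
test_col = np.concatenate([rng.exponential(1.0, 100), rng.normal(50.0, 2.0, 100)])

edges = fit_percentile_edges(train_col)
train_flat = to_percentile_bin(train_col, edges)
test_flat = to_percentile_bin(test_col, edges)  # reuses the train-derived edges

# After the transform, the training values spread roughly evenly over the bins
counts, _ = np.histogram(train_flat, bins=10, range=(0, 500))
print(counts)
```

The cut points are estimated from the training column and then reused for the test column, so no information from the test set leaks into the transform.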
With that change the model converged better. I then also tweaked the architecture. It was initially stepping up from ~50 features to 1024 units, so I changed it to a tapered architecture that gradually scales down from the input feature size. That also improved the results. The final train RMSE was 0.14, and the test-set r was 0.42. Code below.
import torch, torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

var = 'target'
data = pd.read_csv('data800.csv', index_col=0)
train_dataset = data.sample(frac=0.8, random_state=1)
test_dataset = data.drop(train_dataset.index)
train_labels = train_dataset.pop(var)
test_labels = test_dataset.pop(var)

# Flatten distribution by replacing each value with its percentile
train_dataset_transformed = train_dataset.copy()
test_dataset_transformed = test_dataset.copy()

for feature in train_dataset.columns:
    # Percentiles estimated from train data
    bin_res = 0.2
    eval_percentiles = np.arange(bin_res, 100, bin_res)
    percentiles = [
        np.percentile(train_dataset[feature], p)
        for p in eval_percentiles
    ]

    # Apply to both train and test data
    train_dataset_transformed[feature] = pd.cut(
        train_dataset[feature],
        bins=[-np.inf] + percentiles + [np.inf],
        labels=False
    ).astype(np.float32)

    test_dataset_transformed[feature] = pd.cut(
        test_dataset[feature],
        bins=[-np.inf] + percentiles + [np.inf],
        labels=False
    ).astype(np.float32)

# Hist before and after:
# plt.hist(train_dataset.iloc[:, 0])
# plt.hist(train_dataset_transformed.iloc[:, 0], bins=100)

# Tapered architecture: widths shrink gradually from the input feature size
n_feat = train_dataset.shape[1]
model = nn.Sequential(
    nn.Linear(n_feat, n_feat), nn.ReLU(), nn.BatchNorm1d(n_feat),

    nn.Linear(n_feat, n_feat // 2), nn.ReLU(), nn.BatchNorm1d(n_feat // 2),
    # nn.Linear(n_feat // 2, n_feat // 2), nn.ReLU(), nn.BatchNorm1d(n_feat // 2),

    nn.Linear(n_feat // 2, n_feat // 4), nn.ReLU(), nn.BatchNorm1d(n_feat // 4),
    # nn.Linear(n_feat // 4, n_feat // 4), nn.ReLU(), nn.BatchNorm1d(n_feat // 4),

    nn.Linear(n_feat // 4, 1)
)
optim = torch.optim.Adam(model.parameters(), 0.01)

# Scale
scaler = StandardScaler().fit(train_dataset_transformed)
X_train = scaler.transform(train_dataset_transformed)
X_test = scaler.transform(test_dataset_transformed)

# Convert to tensors
X_train = torch.tensor(X_train).float()
y_train = torch.tensor(train_labels.values).float()
X_test = torch.tensor(X_test).float()
y_test = torch.tensor(test_labels.values).float()

torch.manual_seed(0)
for epoch in range(1770):
    yhat = model(X_train)
    loss = nn.MSELoss()(yhat.ravel(), y_train)
    optim.zero_grad()
    loss.backward()
    optim.step()

    with torch.no_grad():
        yhatt = model(X_test)
        score = np.corrcoef(y_test, yhatt.ravel())

    if epoch % 30 == 0:
        print('epoch', epoch, '| loss:', loss.item(), '| R:', score[0, 1])

# Predictions vs. true values on the test set
yhat = model(X_test)
yhat = yhat.detach().numpy()
plt.scatter(test_labels, yhat)
ax_lims = plt.gca().axis()
plt.plot([0, 100], [0, 100], 'k:', label='y=x')
plt.gca().axis(ax_lims)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.legend()
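The tapering rule used for the layer widths above can also be written as a small helper, which makes it easier to try other depths. This is only a sketch: the function names are hypothetical, and `n_feat = 50` is an assumed stand-in for the "~50 features" in the data.

```python
def tapered_widths(n_feat, n_hidden=3):
    """Hidden-layer widths n_feat, n_feat // 2, n_feat // 4, ... (never below 1)."""
    return [max(1, n_feat // (2 ** i)) for i in range(n_hidden)]

def layer_shapes(n_feat, n_hidden=3):
    """(in_features, out_features) pairs for each linear layer, ending in one output."""
    dims = [n_feat] + tapered_widths(n_feat, n_hidden) + [1]
    return list(zip(dims[:-1], dims[1:]))

shapes = layer_shapes(50)
print(shapes)  # [(50, 50), (50, 25), (25, 12), (12, 1)]
```

Each pair corresponds to one `nn.Linear(in_features, out_features)` in the model above, with `nn.ReLU()` and `nn.BatchNorm1d(out_features)` after every layer except the last.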