I am training an MLP on a tabular dataset, the pendigits dataset. The problem is that the training loss and accuracy are more or less stable, while the validation and test loss and accuracy are completely constant. The pendigits dataset contains 10 classes. My code is exactly the same as in other experiments I did, for example on MNIST or CIFAR10, which work correctly. The only things that change are the dataset, from MNIST/CIFAR10 to pendigits, and the network, from a ResNet-18 to a simple MLP. Below are the training function and the network:
def train(net, loaders, optimizer, criterion, epochs=100, dev=dev, save_param=True, model_name="only-pendigits"):
    torch.manual_seed(myseed)
    try:
        net = net.to(dev)
        print(net)
        # Initialize history
        history_loss = {"train": [], "val": [], "test": []}
        history_accuracy = {"train": [], "val": [], "test": []}
        # Process each epoch
        for epoch in range(epochs):
            # Initialize epoch variables
            sum_loss = {"train": 0, "val": 0, "test": 0}
            sum_accuracy = {"train": 0, "val": 0, "test": 0}
            # Process each split
            for split in ["train", "val", "test"]:
                # Process each batch
                for (input, labels) in loaders[split]:
                    # Move to CUDA
                    input = input.to(dev)
                    labels = labels.to(dev)
                    # Reset gradients
                    optimizer.zero_grad()
                    # Compute output
                    pred = net(input)
                    #labels = labels.long()
                    loss = criterion(pred, labels)
                    # Update loss
                    sum_loss[split] += loss.item()
                    # Check parameter update
                    if split == "train":
                        # Compute gradients
                        loss.backward()
                        # Optimize
                        optimizer.step()
                    # Compute accuracy
                    _, pred_labels = pred.max(1)
                    batch_accuracy = (pred_labels == labels).sum().item() / input.size(0)
                    # Update accuracy
                    sum_accuracy[split] += batch_accuracy
            scheduler.step()
            # Compute epoch loss/accuracy
            epoch_loss = {split: sum_loss[split] / len(loaders[split]) for split in ["train", "val", "test"]}
            epoch_accuracy = {split: sum_accuracy[split] / len(loaders[split]) for split in ["train", "val", "test"]}
            # Update history
            for split in ["train", "val", "test"]:
                history_loss[split].append(epoch_loss[split])
                history_accuracy[split].append(epoch_accuracy[split])
            # Print info
            print(f"Epoch {epoch+1}:",
                  f"TrL={epoch_loss['train']:.4f},",
                  f"TrA={epoch_accuracy['train']:.4f},",
                  f"VL={epoch_loss['val']:.4f},",
                  f"VA={epoch_accuracy['val']:.4f},",
                  f"TeL={epoch_loss['test']:.4f},",
                  f"TeA={epoch_accuracy['test']:.4f},",
                  f"LR={optimizer.param_groups[0]['lr']:.5f},")
    except KeyboardInterrupt:
        print("Interrupted")
    finally:
        # Plot loss
        plt.title("Loss")
        for split in ["train", "val", "test"]:
            plt.plot(history_loss[split], label=split)
        plt.legend()
        plt.show()
        # Plot accuracy
        plt.title("Accuracy")
        for split in ["train", "val", "test"]:
            plt.plot(history_accuracy[split], label=split)
        plt.legend()
        plt.show()
The network:
# Text network
class TextNN(nn.Module):
    # Constructor
    def __init__(self):
        # Call parent constructor
        super().__init__()
        torch.manual_seed(myseed)
        self.relu = nn.ReLU()
        self.linear1 = nn.Linear(16, 128)  # 16 input columns
        self.linear2 = nn.Linear(128, 128)
        self.linear3 = nn.Linear(128, 32)
        self.linear4 = nn.Linear(32, 10)

    def forward(self, tab):
        tab = self.linear1(tab)
        tab = self.relu(tab)
        tab = self.linear2(tab)
        tab = self.relu(tab)
        tab = self.linear3(tab)
        tab = self.relu(tab)
        tab = self.linear4(tab)
        return tab

model = TextNN()
print(model)
Could the model be so simple that it does not learn anything? I do not think so. I suspect there is some error either in the training (although the function is exactly the same as the one I use for MNIST or CIFAR10, which works correctly) or in the data loading. Below is how I load the dataset:
pentrain = pd.read_csv("pendigits.tr.csv")
pentest = pd.read_csv("pendigits.te.csv")

class TextDataset(Dataset):
    """Tabular and Image dataset."""

    def __init__(self, excel_file, transform=None):
        self.excel_file = excel_file
        #self.tabular = pd.read_csv(excel_file)
        self.tabular = excel_file
        self.transform = transform

    def __len__(self):
        return len(self.tabular)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        tabular = self.tabular.iloc[idx, 0:]
        y = tabular["class"]
        tabular = tabular[['input1', 'input2', 'input3', 'input4', 'input5', 'input6', 'input7',
                           'input8', 'input9', 'input10', 'input11', 'input12', 'input13',
                           'input14', 'input15', 'input16']]
        tabular = tabular.tolist()
        tabular = torch.FloatTensor(tabular)
        if self.transform:
            tabular = self.transform(tabular)
        return tabular, y

penditrain = TextDataset(excel_file=pentrain, transform=None)

train_size = int(0.80 * len(penditrain))
val_size = int(len(penditrain) - train_size)

pentrain, penval = random_split(penditrain, (train_size, val_size))

pentest = TextDataset(excel_file=pentest, transform=None)
Everything is loaded correctly; indeed, if I print an example:
text_x, label_x = pentrain[0]
print(text_x.shape, label_x)
text_x
torch.Size([16]) 1
tensor([ 48., 74., 88., 95., 100., 100., 78., 75., 66., 49., 64., 23.,
32., 0., 0., 1.])
And these are my dataloaders:
# Define generator
generator = torch.Generator()
generator.manual_seed(myseed)
# Define loaders
from torch.utils.data import DataLoader
train_loader = DataLoader(pentrain, batch_size=128, num_workers=2, drop_last=True, shuffle=True, generator=generator)
val_loader = DataLoader(penval, batch_size=128, num_workers=2, drop_last=False, shuffle=False, generator=generator)
test_loader = DataLoader(pentest, batch_size=128, num_workers=2, drop_last=False, shuffle=False, generator=generator)
I have been stuck on this problem for two days, and I cannot figure out what is wrong...
EDIT: Basically, if I add print(list(net.parameters())) at the beginning of each epoch, I see that the weights never change, which is why the loss and accuracy remain constant. Why are the weights not changing?
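For reference, this is roughly the check I ran, written up as a minimal sketch (the snapshot/weights_changed helpers are just names I made up for this post, not part of my training code):

def snapshot(net):
    # Detached copies of all parameters, so later in-place updates don't affect them
    return [p.detach().clone() for p in net.parameters()]

def weights_changed(before, net):
    # True if any parameter tensor differs from its snapshot
    return any(not torch.equal(b, p.detach()) for b, p in zip(before, net.parameters()))

before = snapshot(model)
# ... run one training epoch here ...
print("weights changed:", weights_changed(before, model))

In my case this always prints False, no matter how many epochs I run.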
EDIT 2: The problem is exactly the same with another dataset, e.g. sklearn's digits.
EDIT 3: I see online that a simple MLP like the one I am using obtains good results on these datasets. I compared my training function with online notebooks, and the steps are the same. Moreover, my training function works on other datasets like MNIST. So I do not know where the problem is...
I solved it... The mistake was that I was calling model = TextNN() again after instantiating the optimizer, so the weights were never updated... Everything else was fine; the optimizer was simply updating the parameters of another (unused) model.
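In other words, the order matters: the optimizer only updates the parameters it was handed at construction time. A minimal sketch of the mistake versus the fix (the SGD call and learning rate here are just placeholders, not my actual settings):

# Buggy order: the optimizer holds references to the parameters of the FIRST model,
# but training then runs on the SECOND one, so its weights never change.
model = TextNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # placeholder optimizer
model = TextNN()  # re-creating the model here detaches it from the optimizer

# Correct order: create the model once, then build the optimizer from its parameters.
model = TextNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)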