Tags: python, tensorflow, machine-learning, pytorch, neural-network

Investigating discrepancies in TensorFlow and PyTorch performance


In my pursuit of mastering PyTorch neural networks, I've attempted to replicate an existing TensorFlow architecture, but I've run into a significant performance gap. While TensorFlow achieves rapid learning within 25 epochs, PyTorch requires at least 250 epochs for comparable generalization. Despite carefully aligning the architectures of both networks and scrutinizing the code, I've been unable to find anything else to improve. Can anyone shed light on what might be amiss here?

Below, I present the full Python code for both implementations, along with the CLI output and the resulting plots.

Reproducibility: As I prefer not to share the original dataset, I've attached a piece of code at the end of this post that emulates it instead. The generated data_inverter.csv can be used to reproduce the observed behavior.

PyTorch code:

# Data handling and plotting imports
import pandas as pd
import matplotlib.pyplot as plt

# Machine learning imports
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics import max_error, mean_absolute_error, mean_squared_error

# Loading dataset
df_data = pd.read_csv("./data_inverter.csv", names=["pvt", "edge", "slew", "load", "delay"])

# Selecting subset of data based on specific conditions
df_select = df_data[(df_data["pvt"] == "PtypV1500T027") & (df_data["edge"] == "rise")]

# Splitting features and target variable
X = df_select.drop(["pvt", "edge", "delay"], axis='columns')
y = df_select["delay"]

# Scaling input features using Min-Max scaling
slew_scaler = MinMaxScaler()
load_scaler = MinMaxScaler()

X_scaled = X.copy()
X_scaled["slew"] = slew_scaler.fit_transform(X_scaled.slew.values.reshape(-1, 1))
X_scaled["load"] = load_scaler.fit_transform(X_scaled.load.values.reshape(-1, 1))

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.1, random_state=42)

# Converting data to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train.values)
y_train_tensor = torch.FloatTensor(y_train.values).view(-1, 1)
X_test_tensor = torch.FloatTensor(X_test.values)
y_test_tensor = torch.FloatTensor(y_test.values).view(-1, 1)

# Setting random seed for reproducibility
torch.manual_seed(42)

# Defining neural network architecture
model = torch.nn.Sequential(
    torch.nn.Linear(X_train_tensor.shape[1], 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
    torch.nn.ELU()
)

# Loss function and optimizer
criterion = torch.nn.MSELoss()
criterion_val = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

# Training the model
num_epochs = 25
progress = {'loss': [], 'mae': [], 'mse': [], 'val_loss': [], 'val_mae': [], 'val_mse': []}

for epoch in range(num_epochs):
    # Forward pass
    y_predict = model(X_train_tensor)
    loss = criterion(y_predict, y_train_tensor)

    # Backward and optimize
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Validation
    with torch.no_grad():
        model.eval()
        y_test_predict = model(X_test_tensor)
        loss_val = criterion_val(y_test_predict, y_test_tensor)

    model.train()

    # Record progress
    progress['loss'].append(loss.item())
    progress['mae'].append(mean_absolute_error(y_train_tensor, y_predict.detach().numpy()))
    progress['mse'].append(mean_squared_error(y_train_tensor, y_predict.detach().numpy()))
    progress['val_loss'].append(loss_val.item())
    progress['val_mae'].append(mean_absolute_error(y_test_tensor, y_test_predict.detach().numpy()))
    progress['val_mse'].append(mean_squared_error(y_test_tensor, y_test_predict.detach().numpy()))

    print("Epoch %i/%i   -   loss: %0.5F" % (epoch, num_epochs, loss.item()))

# Displaying model summary
print(model)

# Plotting training progress
df_progress = pd.DataFrame(progress)
df_progress.plot()
plt.title("Model training progress: DNN PyTorch")
plt.tight_layout()
plt.show()

# Making predictions on the testing set
with torch.no_grad():
    model.eval()
    y_predict_tensor = model(X_test_tensor)
    y_predict = y_predict_tensor.numpy()

# Displaying model performance metrics
print("Model performance metrics: DNN PyTorch")
print("MAX error:", max_error(y_test_tensor, y_predict))
print("MAE error:", mean_absolute_error(y_test_tensor, y_predict))
print("MSE error:", mean_squared_error(y_test_tensor, y_predict, squared=False))

plt.scatter(y_test, y_predict)
plt.scatter(y_test, y_test, marker='.')
plt.title("Model predictions: DNN PyTorch")
plt.tight_layout()
plt.show()

TensorFlow code:

# Data handling and plotting imports
import pandas as pd
import matplotlib.pyplot as plt

# Machine learning imports
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics import max_error, mean_absolute_error, mean_squared_error

# Loading dataset
df_data = pd.read_csv("./data_inverter.csv", names=["pvt", "edge", "slew", "load", "delay"])

# Selecting subset of data based on specific conditions
df_select = df_data[(df_data["pvt"] == "PtypV1500T027") & (df_data["edge"] == "rise")]

# Splitting features and target variable
X = df_select.drop(["pvt", "edge", "delay"], axis='columns')
y = df_select["delay"]

# Scaling input features using Min-Max scaling
slew_scaler = MinMaxScaler()
load_scaler = MinMaxScaler()

X_scaled = X.copy()
X_scaled["slew"] = slew_scaler.fit_transform(X_scaled.slew.values.reshape(-1, 1))
X_scaled["load"] = load_scaler.fit_transform(X_scaled.load.values.reshape(-1, 1))

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.1, random_state=42)

# Converting data to TensorFlow tensors
X_train_tensor = tf.constant(X_train.values, dtype=tf.float32)
y_train_tensor = tf.constant(y_train.values, dtype=tf.float32)
X_test_tensor = tf.constant(X_test.values, dtype=tf.float32)
y_test_tensor = tf.constant(y_test.values, dtype=tf.float32)

# Setting random seed for reproducibility
tf.keras.utils.set_random_seed(42)

# Defining neural network architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_dim=X_train_tensor.shape[1]),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='elu')
])

# Compiling the model
model.compile(
    loss=tf.keras.losses.MeanSquaredError(),  # Using Mean Squared Error loss function
    optimizer=tf.keras.optimizers.Adam(),  # Using Adam optimizer
    metrics=['mae', 'mse']  # Using Mean Absolute Error and Mean Squared Error as metrics
)

# Training the model
progress = model.fit(X_train_tensor, y_train_tensor, validation_data=(X_test_tensor, y_test_tensor), epochs=25)

# Evaluating model performance on the testing set
model.evaluate(X_test_tensor, y_test_tensor, verbose=2)

# Displaying model summary
print(model.summary())

# Plotting training progress
pd.DataFrame(progress.history).plot()
plt.title("Model training progress: DNN TensorFlow")
plt.tight_layout()
plt.show()

# Making predictions on the testing set
y_predict = model.predict(X_test_tensor)

# Displaying model performance metrics
print("Model performance metrics: DNN TensorFlow")
print("MAX error:", max_error(y_test_tensor, y_predict))
print("MAE error:", mean_absolute_error(y_test_tensor, y_predict))
print("MSE error:", mean_squared_error(y_test_tensor, y_predict, squared=False))

plt.scatter(y_test, y_predict)
plt.scatter(y_test, y_test, marker='.')
plt.title("Model predictions: DNN TensorFlow")
plt.tight_layout()
plt.show()

CLI output of PyTorch model performance metrics after 25 epochs:

Sequential(
  (0): Linear(in_features=2, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=128, bias=True)
  (3): ReLU()
  (4): Linear(in_features=128, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=32, bias=True)
  (7): ReLU()
  (8): Linear(in_features=32, out_features=16, bias=True)
  (9): ReLU()
  (10): Linear(in_features=16, out_features=1, bias=True)
  (11): ELU(alpha=1.0)
)
Model performance metrics: DNN PyTorch
MAX error: 1.2864852
MAE error: 0.3353702
RMSE error: 0.42874745

CLI output of TensorFlow model performance metrics after 25 epochs:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 128)               384       
                                                                 
 dense_1 (Dense)             (None, 128)               16512     
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dense_3 (Dense)             (None, 32)                2080      
                                                                 
 dense_4 (Dense)             (None, 16)                528       
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 27777 (108.50 KB)
Trainable params: 27777 (108.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
6/6 [==============================] - 0s 750us/step
Model performance metrics: DNN TensorFlow
MAX error: 0.013849139
MAE error: 0.0029576812
RMSE error: 0.0036013061

[Figure: PyTorch training progress]

[Figure: TensorFlow training progress]

[Figure: PyTorch scatter plot (orange = target against itself, blue = target against prediction)]

[Figure: TensorFlow scatter plot (orange = target against itself, blue = target against prediction)]


Additional info (in response to questions and comments):

torch.optim.Adam: the default learning rate is 0.001.

tf.keras.optimizers.Adam: the default learning rate is likewise 0.001.
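For reference, pinning the rate explicitly (equivalent to the defaults) would look like this in both frameworks:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # PyTorch
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)    # TensorFlow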


Here's the PyTorch model performance after 250 epochs:

Sequential(
  (0): Linear(in_features=2, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=128, bias=True)
  (3): ReLU()
  (4): Linear(in_features=128, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=32, bias=True)
  (7): ReLU()
  (8): Linear(in_features=32, out_features=16, bias=True)
  (9): ReLU()
  (10): Linear(in_features=16, out_features=1, bias=True)
  (11): ELU(alpha=1.0)
)
Model performance metrics: DNN PyTorch
MAX error: 0.025619686
MAE error: 0.006687804
RMSE error: 0.008531998

[Figure: PyTorch training progress, 250 epochs]

[Figure: PyTorch scatter plot, 250 epochs]


If you want to reproduce the issue, you can use this code to emulate the dataset:

import csv
import math

x_values = [0.003, 0.00354604, 0.00546274, 0.00912297, 0.0148254, 0.0228266, 0.0333551, 0.0466191, 0.0628111, 0.0821111, 0.104689, 0.130705, 0.160313, 0.193659, 0.230886, 0.272128, 0.317517, 0.36718, 0.42124, 0.479818, 0.54303, 0.61099, 0.683809, 0.761595, 0.844455, 0.932492, 1.02581, 1.1245, 1.22868, 1.33842, 1.45383, 1.57501, 1.70203, 1.835, 1.974]
y_values = [0.001, 0.00102008, 0.00109058, 0.0012252, 0.00143494, 0.00172922, 0.00211646, 0.0026043, 0.00319984, 0.0039097, 0.0047401, 0.00569697, 0.00678594, 0.00801243, 0.00938161, 0.0108985, 0.0125679, 0.0143945, 0.0163828, 0.0185373, 0.0208622, 0.0233618, 0.0260401, 0.028901, 0.0319486, 0.0351866, 0.0386187, 0.0422487, 0.0460802, 0.0501166, 0.0543615, 0.0588182, 0.0634902, 0.0683808, 0.0734931, 0.0788305, 0.0843961, 0.0901929, 0.0962242, 0.102493, 0.109002, 0.115755, 0.122753, 0.130001, 0.137502, 0.145257, 0.153269, 0.161543, 0.170079, 0.178881]
z_values = [[math.sqrt(5*(x+0.25)) * math.sqrt(3*(y+0.005)) for y in y_values] for x in x_values]

with open("./data_inverter.csv", 'w') as fid:
    writer = csv.writer(fid)

    for i in range(len(x_values)):
        for j in range(len(y_values)):
            writer.writerow(["PtypV1500T027", "rise", x_values[i], y_values[j], z_values[i][j]])

Solution

  • The difference is that TensorFlow's model.fit defaults to mini-batching* (with a batch size of 32; see the documentation of model.fit), while your PyTorch training loop does full-batch training*. As a result, your PyTorch model performs only 25 weight updates, while the TensorFlow model performs (N/32)*25 updates (where N is your number of samples), and is therefore able to find a better local minimum. With the emulated dataset (35 × 50 = 1750 rows, of which 1575 land in the training split), that is roughly 50 updates per epoch, i.e. about 1250 updates in total versus PyTorch's 25.
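    One way to confirm this would be to force Keras into full-batch mode; training should then crawl just like your PyTorch loop. A sketch along these lines, reusing the variables from your TensorFlow script:

    # Hypothetical check: a single weight update per epoch, mirroring the PyTorch loop
    progress = model.fit(X_train_tensor, y_train_tensor,
                         validation_data=(X_test_tensor, y_test_tensor),
                         batch_size=len(X_train), epochs=25)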

    By implementing mini-batching, you get similar results in PyTorch:

    batch_size = 32
    for epoch in range(num_epochs):
        batch_predictions = list()
        # Mini-batching: split the training set into chunks of 32 samples
        for x_batch, y_true in zip(
            torch.split(X_train_tensor, batch_size, dim=0),
            torch.split(y_train_tensor, batch_size, dim=0),
        ):
            # Forward pass
            y_predict_batch = model(x_batch)
            loss = criterion(y_predict_batch, y_true)
            batch_predictions.append(y_predict_batch)
            # Backward and optimize (one weight update per mini-batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        y_predict = torch.concat(batch_predictions, dim=0)

        # Validation
        with torch.no_grad():
            model.eval()
            y_test_predict = model(X_test_tensor)
            loss_val = criterion_val(y_test_predict, y_test_tensor)

        model.train()

        # Record progress
        progress['loss'].append(loss.item())
        progress['mae'].append(mean_absolute_error(y_train_tensor, y_predict.detach().numpy()))
        progress['mse'].append(mean_squared_error(y_train_tensor, y_predict.detach().numpy()))
        progress['val_loss'].append(loss_val.item())
        progress['val_mae'].append(mean_absolute_error(y_test_tensor, y_test_predict.detach().numpy()))
        progress['val_mse'].append(mean_squared_error(y_test_tensor, y_test_predict.detach().numpy()))

        print("Epoch %i/%i   -   loss: %0.5f" % (epoch + 1, num_epochs, loss.item()))

    [Figure: Evolution of the loss in PyTorch while using mini-batching]

    [Figure: Prediction results after 25 epochs while using mini-batching]

    I would suggest using the torch.utils.data module for the mini-batching rather than my hand-rolled implementation, as sketched below.
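    A minimal sketch along those lines, reusing the tensors defined above (the validation and progress bookkeeping are omitted for brevity):

    from torch.utils.data import TensorDataset, DataLoader

    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    # shuffle=True also reshuffles the samples every epoch, as Keras does by default
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

    for epoch in range(num_epochs):
        model.train()
        for x_batch, y_batch in train_loader:
            y_predict_batch = model(x_batch)
            loss = criterion(y_predict_batch, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()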

    *: for the difference between batching and mini-batching, see this question: What is the meaning of a 'mini-batch' in deep learning?


    You could theoretically compensate for the larger batch size by using a larger learning rate in PyTorch; I get not-completely-terrible results with a learning rate of 0.02. With a bit of tuning (such as using SGD and a learning-rate scheduler), you could probably get better results, but mini-batching is just much easier in this case.
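    For example, such tuning could look like the sketch below (the learning rate, momentum, and schedule are illustrative guesses, not tuned values):

    # Illustrative only: full-batch SGD with momentum and a step-decay schedule
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

    for epoch in range(num_epochs):
        y_predict = model(X_train_tensor)
        loss = criterion(y_predict, y_train_tensor)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # halve the learning rate every 50 epochs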