Due to a work requirement, I switched from my Windows machine to a MacBook Pro with an Apple M3 Pro (36 GB) running macOS Sonoma 14.5, and I noticed something very strange. I managed to reproduce the issue in a small sample script.
When I import pandas before tensorflow/keras, the script freezes. With the imports in the opposite order, it works.
The script:
import numpy as np
import os
import pandas as pd
from tensorflow.keras import layers, models
print("Creating simple model...")
try:
    model = models.Sequential([
        layers.Input(shape=(10,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='linear')
    ])
    print("Model created successfully.")
except Exception as e:
    print(f"Error creating model: {e}")
x_train = np.random.rand(100, 10)
y_train = np.random.rand(100, 1)
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
# Train the model
try:
    model.fit(x_train, y_train, epochs=5, batch_size=32)
    print("Model training completed successfully.")
except Exception as e:
    print(f"Error during training: {e}")
Running it gives me the following output:
Creating simple model...
2024-05-31 18:04:07.639131: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3 Pro
2024-05-31 18:04:07.639149: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 36.00 GB
2024-05-31 18:04:07.639154: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 13.50 GB
2024-05-31 18:04:07.639170: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-05-31 18:04:07.639186: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
The script freezes at this point and has to be terminated. When I swap the import order to
from tensorflow.keras import layers, models
import pandas as pd
I get the following:
Creating simple model...
2024-05-31 18:07:18.879661: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3 Pro
2024-05-31 18:07:18.879680: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 36.00 GB
2024-05-31 18:07:18.879685: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 13.50 GB
2024-05-31 18:07:18.879705: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-05-31 18:07:18.879717: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Model created successfully.
Epoch 1/5
2024-05-31 18:07:19.269585: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - loss: 0.1177
Epoch 2/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1078
Epoch 3/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0932
Epoch 4/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1008
Epoch 5/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0865
Model training completed successfully.
Note that I don't even use pandas in the script. For comparison, I also import os without using it anywhere, and that import has no effect on the behaviour.
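Because import order only matters the first time a module is loaded in a process, comparing the two orders reliably needs a fresh interpreter for every run. A minimal sketch, assuming a hypothetical helper exit_code_within (not part of any library), that runs a snippet in a child process and treats "still running after the timeout" as a hang:

```python
import subprocess
import sys
from typing import Optional


def exit_code_within(code: str, timeout: float) -> Optional[int]:
    """Hypothetical helper: run `code` in a fresh Python interpreter.

    Returns the child's exit code, or None if it was still running after
    `timeout` seconds (treated as a hang; subprocess.run kills the child).
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,  # keep the child's log output out of the console
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.returncode
```

Passing the sample script as code, once with pandas imported first and once with tensorflow imported first (with a generous timeout such as 120 seconds), would be expected on the affected setup to give None for the first order and 0 for the second, matching the behaviour described above.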
Here is the pip list output for my environment:
Package Version
---------------------------- -----------
absl-py 2.1.0
astunparse 1.6.3
Bottleneck 1.3.7
cachetools 5.3.3
certifi 2024.2.2
charset-normalizer 3.3.2
db-dtypes 1.2.0
flatbuffers 24.3.25
gast 0.5.4
google-api-core 2.19.0
google-auth 2.29.0
google-cloud-bigquery 3.23.1
google-cloud-core 2.4.1
google-crc32c 1.5.0
google-pasta 0.2.0
google-resumable-media 2.7.0
googleapis-common-protos 1.63.0
grpcio 1.64.0
grpcio-status 1.62.2
h5py 3.11.0
idna 3.7
importlib_metadata 7.1.0
joblib 1.4.2
keras 3.3.3
libclang 18.1.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
ml-dtypes 0.3.2
namex 0.0.8
numexpr 2.8.7
numpy 1.26.4
opt-einsum 3.3.0
optree 0.11.0
packaging 24.0
pandas 2.2.1
pip 24.0
proto-plus 1.23.0
protobuf 4.25.3
pyarrow 16.1.0
pyasn1 0.6.0
pyasn1_modules 0.4.0
Pygments 2.18.0
python-dateutil 2.9.0.post0
pytz 2024.1
requests 2.32.3
rich 13.7.1
rsa 4.9
scikit-learn 1.4.2
scipy 1.11.4
setuptools 69.5.1
six 1.16.0
tensorboard 2.16.2
tensorboard-data-server 0.7.2
tensorflow 2.16.1
tensorflow-io-gcs-filesystem 0.37.0
tensorflow-macos 2.16.1
tensorflow-metal 1.1.0
termcolor 2.4.0
threadpoolctl 3.5.0
tqdm 4.66.4
typing_extensions 4.12.0
tzdata 2024.1
urllib3 2.2.1
Werkzeug 3.0.3
wheel 0.43.0
wrapt 1.16.0
zipp 3.19.0
Suggestion from the comments (@Ze'ev Ben-Tsvi):
import numpy as np
import os
import pandas as pd
from tensorflow.keras import layers, models
print("Creating simple model...")
try:
    print("Initializing Sequential model...")
    model = models.Sequential()
    print("Adding input layer...")
    model.add(layers.Input(shape=(10,)))
    print("Adding first Dense layer...")
    model.add(layers.Dense(64, activation='relu'))
    print("Adding output Dense layer...")
    model.add(layers.Dense(1, activation='linear'))
    print("Model created successfully.")
except Exception as e:
    print(f"Error creating model: {e}")
x_train = np.random.rand(100, 10)
y_train = np.random.rand(100, 1)
# Compile the model
try:
    print("Compiling model...")
    model.compile(optimizer='adam', loss='mean_squared_error')
    print("Model compiled successfully.")
except Exception as e:
    print(f"Error during compilation: {e}")
# Train the model
try:
    print("Training model...")
    model.fit(x_train, y_train, epochs=5, batch_size=32)
    print("Model training completed successfully.")
except Exception as e:
    print(f"Error during training: {e}")
The output of this script is:
Initializing Sequential model...
Adding input layer...
Adding first Dense layer...
Adding output Dense layer...
Model created successfully.
Compiling model...
Model compiled successfully.
Training model...
Epoch 1/5
Written this way, execution gets a little further: it no longer hangs at models.Sequential, but at model.fit instead.
Swapping the import order again (tensorflow first, then pandas), I get:
Creating simple model...
Initializing Sequential model...
Adding input layer...
Adding first Dense layer...
Adding output Dense layer...
Model created successfully.
Compiling model...
Model compiled successfully.
Training model...
Epoch 1/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.4620
Epoch 2/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 636us/step - loss: 0.3263
Epoch 3/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.2322
Epoch 4/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 629us/step - loss: 0.1395
Epoch 5/5
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 690us/step - loss: 0.1251
Model training completed successfully.
The main issue is that I never get an exception, not even when wrapping each import individually in a try/except block. Either something is swallowing the errors, or none are raised at all.
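A hang raises nothing, so there is no exception for try/except to catch. One way to see where a process is stuck without stepping through a debugger is the standard-library faulthandler module: faulthandler.dump_traceback_later(timeout, exit=True) starts a watchdog that, if the process is still alive after timeout seconds, prints the Python stack of every thread to stderr and exits. Its two lines can simply be pasted at the top of the repro script; the sketch below (the helper stack_on_hang is hypothetical) wraps that in a child process instead. The timeout has to be longer than a normal, non-hanging run, and the innermost frame shown is the last Python function entered before the process blocked, even if the blocking itself happens in compiled code.

```python
import subprocess
import sys

# Prepended to the script under test. If the process is still alive after
# `timeout` seconds, faulthandler dumps the Python stack of every thread to
# stderr and then exits the process.
WATCHDOG = "import faulthandler\nfaulthandler.dump_traceback_later({timeout}, exit=True)\n"


def stack_on_hang(code: str, timeout: float) -> str:
    """Hypothetical helper: run `code` with the watchdog armed and return its stderr."""
    result = subprocess.run(
        [sys.executable, "-c", WATCHDOG.format(timeout=timeout) + code],
        capture_output=True,
        text=True,
        timeout=timeout + 30,  # safety net in case the watchdog itself never fires
    )
    return result.stderr
```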
After spending hours debugging this, I found a reasonably satisfying solution that does not rely on the import-order workaround. I'm posting it as an answer in case someone else runs into the same issue.
Downgrading TensorFlow to version 2.15.0 resolved the issue, allowing the script to run regardless of the import order of pandas and tensorflow.
pip install tensorflow==2.15.0
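After the downgrade it may be worth confirming which versions actually ended up installed, since the pip list above shows tensorflow, tensorflow-macos and tensorflow-metal side by side. A minimal sketch (the helper name installed_versions and the package selection are assumptions) that reads versions from package metadata via the standard-library importlib.metadata, without importing the packages themselves and therefore without any risk of triggering the hang:

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Packages relevant to this issue, taken from the pip list in the question.
PACKAGES = ("tensorflow", "tensorflow-macos", "tensorflow-metal", "keras", "pandas", "numpy")


def installed_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Hypothetical helper: map each distribution name to its installed version, or None."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)  # reads metadata only, no import
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:20} {version or 'not installed'}")
```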
Context: the freeze occurred in the quick_execute function in TensorFlow's execute.py and, for reasons I could not determine, only when pandas was imported before tensorflow:
def quick_execute(op_name, num_outputs, inputs, attrs, ctx, name=None):
    device_name = ctx.device_name
    try:
        ctx.ensure_initialized()
        tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name, inputs, attrs, num_outputs)
    except core._NotOkStatusException as e:
        if name is not None:
            e.message += " name: " + name
        raise core._status_to_exception(e) from None
    except TypeError as e:
        keras_symbolic_tensors = [x for x in inputs if _is_keras_symbolic_tensor(x)]
        if keras_symbolic_tensors:
            raise core._SymbolicException(
                "Inputs to eager execution function cannot be Keras symbolic "
                "tensors, but found {}".format(keras_symbolic_tensors))
        raise e
    return tensors
The function call that froze was:
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name, inputs, attrs, num_outputs)
I couldn't determine why this call freezes, because the debugger would not let me step any further into it (TFE_Py_Execute is implemented in TensorFlow's compiled C++ extension, so there is no Python code left to step into). Downgrading to TensorFlow 2.15.0 avoids the issue, though.