I am trying to use a TPU with Google Cloud's Kubernetes Engine (GKE). My code returns several errors when I try to initialize the TPU, and all other operations run only on the CPU. To run this program, I transfer a Python file from my Docker Hub workspace to Kubernetes, then execute it on a single preemptible v2 TPU. The TPU uses TensorFlow 2.3, which is the latest version supported for Cloud TPUs to the best of my knowledge. (I get an error saying the version is not yet supported when I try to use TensorFlow 2.4 or 2.5.)
When I run my code, Google Cloud sees the TPU but fails to connect to it and instead uses the CPU. It returns this error:
tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (resnet-tpu-fxgz7): /proc/driver/nvidia/version does not exist
tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2299995000 Hz
tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561fb2112c20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}
tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001
TPU name grpc://10.8.16.2:8470
The errors seem to indicate that TensorFlow needs NVIDIA packages installed, but I understood from the Google Cloud TPU documentation that I shouldn't need tensorflow-gpu for a TPU. I tried tensorflow-gpu anyway and received the same error, so I am not sure how to fix this problem. I've deleted and recreated my cluster and TPU numerous times, but I can't seem to make any progress. I'm relatively new to Google Cloud, so I may be missing something obvious, but any help would be greatly appreciated.
This is the Python script I am trying to run:
import tensorflow as tf
import os
import sys
# Parse the TPU name argument
tpu_name = sys.argv[1]
tpu_name = tpu_name.replace('--tpu=', '')
print("TPU name", tpu_name)
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name) # TPU detection
tpu_name = 'grpc://' + str(tpu.cluster_spec().as_dict()['worker'][0])
print("TPU name", tpu_name)
tf.config.experimental_connect_to_cluster(tpu)  # connect this process to the remote TPU workers
tf.tpu.experimental.initialize_tpu_system(tpu)  # initialize the TPU devices
Here is the YAML configuration for my Kubernetes Job (with placeholders for my real workspace name and image for this post):
apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  template:
    metadata:
      name: test
      annotations:
        tf-version.cloud-tpus.google.com: "2.3"
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: regcred
      containers:
        - name: test
          image: my_workspace/image
          command: ["/bin/bash","-c","pip3 install cloud-tpu-client tensorflow==2.3.0 && python3 DebugTPU.py --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)"]
          resources:
            limits:
              cloud-tpus.google.com/preemptible-v2: 8
  backoffLimit: 0
There are actually no errors in the workload you've provided or in the logs. A few comments which I think might help:

pip install tensorflow, as you have noted, installs tensorflow-gpu. By default it tries to run GPU-specific initialization and fails (failed call to cuInit: UNKNOWN ERROR (303)), so it falls back to local CPU execution. This would be an error if you were trying to develop on a GPU VM, but in a typical CPU workload it doesn't matter. Essentially tensorflow == tensorflow-gpu, and without a GPU available it is equivalent to tensorflow-cpu with additional error messages. Installing tensorflow-cpu instead would make these warnings go away.
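For example, the only change needed in the Job spec above would be the package name in the container command (an untested sketch; tensorflow-cpu 2.3.0 should be available on PyPI):
          command: ["/bin/bash","-c","pip3 install cloud-tpu-client tensorflow-cpu==2.3.0 && python3 DebugTPU.py --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)"]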
It doesn't matter whether the container uses tensorflow-gpu or tensorflow-cpu, as long as it's the same TF version as the TPU server. Your workload here is successfully connecting to the TPU server, indicated by:
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}
tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001
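If you want to confirm that ops actually run on the TPU rather than on the container's CPU, a small sanity check after initialization should work with TF 2.3. This is only a sketch (I haven't run it against your cluster), and the hard-coded endpoint is just the worker address from your logs, which your script already reads from the --tpu argument:

import tensorflow as tf

# Resolve and initialize the TPU the same way DebugTPU.py does.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver('grpc://10.8.16.2:8470')
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

# A preemptible v2 slice should report 8 TPU logical devices.
print(tf.config.list_logical_devices('TPU'))

# Run a trivial computation through TPUStrategy so the matmul executes on the
# TPU workers instead of the local CPU.
strategy = tf.distribute.experimental.TPUStrategy(tpu)

@tf.function
def matmul_fn(x):
    return tf.matmul(x, x)

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(strategy.run(matmul_fn, args=(a,)))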