Tags: tensorflow, google-cloud-platform, tpu, google-cloud-tpu

GCP and TPU: experimental_connect_to_cluster gives no response


I am trying to use a TPU on GCP with TensorFlow 2.1 and the Keras API. Unfortunately, I am stuck right after creating the TPU node: it seems that my VM "sees" the TPU but cannot connect to it.

The code I am using:

import tensorflow as tf

# TPU_name is the name of the TPU node as created on GCP
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_name)
print('Running on TPU ', resolver.master())
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
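
For context, once the strategy is created, the plan is to build the Keras model under its scope. A toy sketch of that intended usage (the model is a hypothetical placeholder, not my real one):

# Hypothetical toy model; any Keras model built under strategy.scope()
# will have its variables placed on the TPU.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])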

The code hangs at the third line (the experimental_connect_to_cluster call): I receive a few log messages and then nothing, so I do not know what the issue could be. I therefore suspect a connection problem between the VM and the TPU; a way to test that is sketched after the log output below.

The log messages:

2020-04-22 15:46:25.383775: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-04-22 15:46:25.992977: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-04-22 15:46:26.042269: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5636e4947610 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-22 15:46:26.042403: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-22 15:46:26.080879: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
E0422 15:46:26.263937297    2263 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"@1587570386.263923266","description":"SO_REUSEPORT unavailable on compiling system","file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":166}
2020-04-22 15:46:26.269134: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.163.38.90:8470}
2020-04-22 15:46:26.269192: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:32263}
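
To check whether this is a pure network problem, here is a minimal reachability test (independent of TensorFlow) against the worker endpoint shown in the GrpcChannelCache line above:

import socket

# Worker endpoint from the "Initialize GrpcChannelCache" log line above;
# 8470 is the TPU worker's gRPC port.
tpu_host, tpu_port = '10.163.38.90', 8470

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)  # fail fast instead of hanging like connect_to_cluster
try:
    sock.connect((tpu_host, tpu_port))
    print('TCP connection to the TPU worker succeeded')
except OSError as e:  # socket.timeout is a subclass of OSError
    print('Cannot reach the TPU worker:', e)
finally:
    sock.close()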

Moreover, I am using the "Deep Learning" image from GCP, so I should not need to install anything, right?

Does anyone have the same issue with TF 2.1? P.S.: the same code works fine on Kaggle and Colab.


Solution

  • Trying to reproduce this, I used ctpu up --zone=europe-west4-a --disk-size-gb=50 --machine-type=n1-standard-8 --tf-version=2.1 to create the VM and the TPU together. Then I ran your code, and it succeeded:

    taylanbil@taylanbil:~$ python3 run.py 
    Running on TPU  grpc://10.240.1.2:8470
    2020-04-28 19:18:32.597556: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    2020-04-28 19:18:32.627669: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000189999 Hz
    2020-04-28 19:18:32.630719: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x471b980 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2020-04-28 19:18:32.630759: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    2020-04-28 19:18:32.665388: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
    2020-04-28 19:18:32.665439: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33355}
    2020-04-28 19:18:32.683216: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
    2020-04-28 19:18:32.683268: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33355}
    2020-04-28 19:18:32.690405: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:33355
    taylanbil@taylanbil:~$ cat run.py 
    import tensorflow as tf
    TPU_name='taylanbil'
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_name)
    print('Running on TPU ', resolver.master())
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.experimental.TPUStrategy(resolver)
    

    How did you create your TPU resources? Can you double-check that there is no version mismatch between the TPU node's software version and the TensorFlow version installed on the VM?
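
    On the VM side, a minimal check of the client version, assuming you then compare it against the TPU node's software version (shown by gcloud compute tpus describe <tpu-name> --zone=<zone>):

    import tensorflow as tf

    # The client's TF version should match the TPU node's software version
    # (both 2.1 here); a mismatch can cause the connection to hang.
    print('Client TF version:', tf.__version__)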