Tags: python, google-cloud-platform, nlp, virtual-machine, tpu

Training model on TPU VM aborts with core dump


I am trying to train a T5X model on the Winograd schema challenge. When I run my training script, it aborts with the following error:

2022-08-21 17:27:01.141608: F ./tensorflow/core/tpu/tpu_library_init_fns.inc:101] TpuEmbeddingEngine_ConfigureCommunication not available in this library.
Aborted (core dumped) 

Any idea of what is going on?


Solution

  • We faced this issue yesterday as well. What fixed it for us was setting the TPU software version to tpu-vm-base when provisioning the TPU node.
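
For reference, here is a rough sketch of what provisioning with that software version might look like. It only builds and prints a `gcloud compute tpus tpu-vm create` command; the helper name `tpu_vm_create_command`, the TPU name `my-tpu-vm`, the zone `us-central1-a`, and the accelerator type `v3-8` are placeholders I made up for illustration, not values from the question, so substitute your own.

```python
import shlex


def tpu_vm_create_command(name, zone, accelerator_type, version="tpu-vm-base"):
    """Build the gcloud argv for creating a TPU VM with a given software version.

    Hypothetical helper for illustration; it does not call gcloud itself.
    """
    return [
        "gcloud", "compute", "tpus", "tpu-vm", "create", name,
        f"--zone={zone}",
        f"--accelerator-type={accelerator_type}",
        # The software version is the part that mattered for us.
        f"--version={version}",
    ]


# Placeholder name, zone, and accelerator type (assumptions, replace with yours).
cmd = tpu_vm_create_command("my-tpu-vm", "us-central1-a", "v3-8")

# Print a shell-safe command line you can review before running it.
print(shlex.join(cmd))
```

If the printed command looks right, you can paste it into a shell or pass `cmd` to `subprocess.run(cmd, check=True)`. As far as I know the software version is chosen when the TPU VM is created, so an existing TPU VM would probably need to be deleted and recreated.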