tensorflow, google-cloud-platform, google-compute-engine, google-cloud-tpu

Check TPU workload/utilization


I am training a model, and when I open the TPU in the Google Cloud Platform console, it shows me a CPU utilization figure (for the TPU, I suppose). It is extremely low (about 0.07%), so maybe it is actually the VM's CPU? I am wondering whether the training is really running properly, or whether the TPUs are just that powerful.

Is there any other way to check the TPU usage? Maybe with a ctpu command?


Solution

  • I would recommend using the TPU profiling tools that plug into TensorBoard. A good tutorial on installing and using these tools can be found here.

    You'll run the profiler while your TPU is training. It will add an extra tab to your TensorBoard with TPU-specific profiling information. Among the most useful:

    Based on these metrics, the profiler will suggest ways to start optimizing your model to train well on a TPU. You can also dig into the more sophisticated profiling tools like a trace viewer, or a list of the most expensive graph operations.

    For some guidelines on performance tuning (in addition to those ch_mike already linked) you can look at the TPU performance guide.
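
To the "maybe with a command?" part of the question: here is a minimal sketch of how you might script the profiler's command-line tool from Python. It assumes the `cloud-tpu-profiler` pip package is installed (it provides `capture_tpu_profile`) and that the `--tpu`, `--monitoring_level`, `--logdir` and `--duration_ms` flags behave as I remember them from the Cloud TPU tools documentation, so please check `capture_tpu_profile --help` for your version. The helper names, the TPU name `my-tpu` and the bucket path are made up for illustration.

```python
import shutil
import subprocess


def build_profile_command(tpu_name, monitoring_level=None, logdir=None, duration_ms=None):
    """Build the argument list for the capture_tpu_profile CLI.

    Flag names are assumptions based on the Cloud TPU tools docs;
    verify them with `capture_tpu_profile --help`.
    """
    if not tpu_name:
        raise ValueError("tpu_name must be a non-empty string")

    cmd = ["capture_tpu_profile", f"--tpu={tpu_name}"]

    # Monitoring mode periodically prints utilization figures to the
    # terminal instead of writing a full profile for TensorBoard.
    if monitoring_level is not None:
        if monitoring_level not in (1, 2):
            raise ValueError("monitoring_level must be 1 or 2")
        cmd.append(f"--monitoring_level={monitoring_level}")

    # Profile-capture mode writes trace data under logdir for TensorBoard.
    if logdir is not None:
        cmd.append(f"--logdir={logdir}")
    if duration_ms is not None:
        cmd.append(f"--duration_ms={int(duration_ms)}")

    return cmd


def run_profile_command(cmd):
    """Run the command, failing early if the CLI is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(
            f"{cmd[0]} not found; is cloud-tpu-profiler installed?"
        )
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical TPU name; only prints the command, does not run it.
    print(" ".join(build_profile_command("my-tpu", monitoring_level=1)))
```

Run the resulting command (or call `run_profile_command`) while training is in progress. With a monitoring level set you should see utilization printed in the terminal; with `logdir` pointing at your model directory, e.g. `build_profile_command("my-tpu", logdir="gs://my-bucket/model", duration_ms=2000)`, the capture should show up in the TensorBoard profile tab instead.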