google-cloud-tpu

TPU VM vs. VM instance - usage


I just started learning to use Google TPUs and am confused about the TPU instance (also called a TPU resource or TPU VM) versus the VM instance.

I followed the Google Cloud guide and created a TPU VM, where I cloned my GitHub repo, created a conda environment, and installed the additional packages needed for training.

Just as I thought I was done with the setup, I saw various tutorials that discuss creating a VM instance and linking the created TPU instance to that VM instance. But I could not find more details about this in the Google Cloud documentation.

It would be great if someone could explain: how are we supposed to use the TPU VM and VM instances, together or separately? What is the connection between the two (from a workflow point of view)?

Background info if needed: I will run PyTorch code using XLA on TPUs.

Many many thanks in advance!


Solution

  • Creating a user VM is only needed for the TPU Node architecture. The TPU VM architecture comes with its own VM that you, as the user, can SSH into and run your ML workloads on.

    Differences between TPU VM and TPU Node architecture are described here: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu-arch

    If using TPU VM architecture, please follow guides and tutorials that are specific to Cloud TPU VM like this one: https://cloud.google.com/tpu/docs/run-calculation-pytorch
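    In other words, with the TPU VM architecture there is no separate user VM to create or link: you create the TPU VM and SSH into it directly. A minimal sketch of that workflow with the standard `gcloud` commands (the resource name, zone, accelerator type, and runtime version below are placeholder examples — check the Cloud TPU docs for the values available to you and for the runtime image matching your PyTorch/XLA release):

    ```shell
    # Create a TPU VM (no separate user VM needed).
    # Accelerator type and runtime version are examples; pick ones
    # offered in your zone and compatible with your PyTorch/XLA version.
    gcloud compute tpus tpu-vm create my-tpu-vm \
      --zone=us-central1-b \
      --accelerator-type=v3-8 \
      --version=tpu-ubuntu2204-base

    # SSH straight into the TPU VM and run your training code there
    # (clone your repo, set up conda, etc., inside this VM).
    gcloud compute tpus tpu-vm ssh my-tpu-vm --zone=us-central1-b

    # When finished, delete the TPU VM to stop billing.
    gcloud compute tpus tpu-vm delete my-tpu-vm --zone=us-central1-b
    ```

    The tutorials that describe creating a VM instance and pointing it at a TPU are following the older TPU Node workflow, where the TPU is a separate network-attached resource and your code runs on the user VM.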