I have access to one preemptible Cloud TPU v3-32 and want to train my LM on it. However, since it is preemptible, I can't attach a persistent disk to it in read-write mode, as mentioned in the docs.
My dataset is around 100GB.
Here is what I tried, but none of it worked:
Preprocessing and caching the data on another VM, saving it to a persistent disk, and then attaching the PD to the TPU in read-only mode: this fails with a write-permission error as soon as my code tries to lock the lock file.
Using GCS buckets and tf.data to stream the data: the problem here is caching; the cache needs about 250 GB of disk space, which is not available. The failing pattern, in simplified form, is shown below.
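Roughly what the streaming pipeline did (bucket and cache paths are placeholders):

```python
import tensorflow as tf

# Simplified version of the streaming pipeline (bucket path is a placeholder).
files = tf.io.gfile.glob("gs://my-bucket/data/train-*.tfrecord")
ds = tf.data.TFRecordDataset(files)

# cache() with a filename materializes the whole dataset on local disk;
# here that cache would be ~250 GB, more than the TPU VM's disk can hold.
ds = ds.cache("/tmp/train_cache")
```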
I am using JAX/Flax, and the script is available here: SCRIPT
A TPU v3-32 has 4 hosts (each with 8 TPU cores attached), each with 340 GB of DRAM and about 100 GB of disk storage. So if you wanted to shard your dataset 4 ways, you could save one shard on each of the 4 hosts.
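A minimal sketch of that per-host sharding, assuming you have already copied one ~25 GB shard to each host's local disk (the path is hypothetical):

```python
import jax
import tensorflow as tf

# All 4 hosts run the same script; jax.process_index() identifies the host,
# so each one opens only its own local shard.
shard_id = jax.process_index()  # 0..3 on a v3-32
ds = tf.data.TFRecordDataset(f"/data/train-shard-{shard_id:02d}.tfrecord")
```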
But I recommend storing your dataset in a GCS bucket and using distributed tf.data (or other options) to read, map, prefetch, and batch in parallel on each host (each host only needs to process 1/4 of the dataset per epoch), as sketched below.
https://www.tensorflow.org/guide/data_performance
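A minimal sketch of that per-host pipeline, assuming TFRecord files in the bucket (the bucket path, `parse_example`, and the batch size are all placeholders):

```python
import jax
import tensorflow as tf

def parse_example(record):
    # Placeholder decoder; replace with your actual feature spec.
    return tf.io.parse_single_example(
        record, {"text": tf.io.FixedLenFeature([], tf.string)}
    )

per_host_batch_size = 8  # placeholder

# Every host lists the same bucket but keeps only its 1/4 of the files,
# so nothing needs to fit on local disk and no .cache() is required.
files = tf.data.Dataset.list_files("gs://my-bucket/data/train-*.tfrecord", shuffle=False)
files = files.shard(num_shards=jax.process_count(), index=jax.process_index())

ds = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(per_host_batch_size, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
```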
https://github.com/google/seqio is another option to consider.
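If you go the seqio route, a rough sketch (the task name, file pattern, and vocabulary path are all placeholders, and this assumes TFRecords with a single `text` feature); seqio handles tokenization and per-host sharding via `shard_info`:

```python
import functools
import jax
import seqio
import tensorflow as tf

# Register a task backed by TFRecords in GCS (all paths are placeholders).
seqio.TaskRegistry.add(
    "my_lm_task",
    source=seqio.TFExampleDataSource(
        split_to_filepattern={"train": "gs://my-bucket/data/train-*.tfrecord"},
        feature_description={"text": tf.io.FixedLenFeature([], tf.string)},
    ),
    preprocessors=[
        functools.partial(seqio.preprocessors.rekey, key_map={"targets": "text"}),
        seqio.preprocessors.tokenize,
    ],
    output_features={
        "targets": seqio.Feature(
            seqio.SentencePieceVocabulary("gs://my-bucket/spm/sp.model")
        )
    },
)

# Each host asks only for its own 1/4 of the data.
ds = seqio.get_mixture_or_task("my_lm_task").get_dataset(
    sequence_length={"targets": 1024},
    split="train",
    shard_info=seqio.ShardInfo(
        index=jax.process_index(), num_shards=jax.process_count()
    ),
)
```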