I’ve managed to port a version of my TensorFlow model to a Graphcore IPU and run it with data parallelism. However, the full-size model won’t fit on a single IPU, so I’m looking for strategies to implement model parallelism.
I’ve not had much luck so far in finding information about model parallelism approaches, apart from https://www.graphcore.ai/docs/targeting-the-ipu-from-tensorflow#sharding-a-graph in the Targeting the IPU from TensorFlow guide, in which the concept of sharding is introduced.
Is sharding the recommended approach for splitting my model across multiple IPUs? Are there more resources I can refer to?
Sharding consists of partitioning the model across multiple IPUs so that each IPU computes part of the graph. However, this approach is generally recommended only for niche use cases involving multiple models in a single graph, e.g. ensembles.
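To see why sharding on its own wastes hardware, here is a minimal plain-Python sketch (no Graphcore SDK involved; the four-IPU split and unit stage times are illustrative assumptions, not Graphcore APIs). With sharding, a batch flows through the shards sequentially, so only one IPU is busy at any time:

```python
NUM_IPUS = 4  # hypothetical: model split into 4 shards

def sharded_schedule(num_batches):
    """Return (time_step, active_ipu) pairs for sharded execution.

    Each batch visits the shards in order, one shard per time step,
    so at any step exactly one IPU is doing work.
    """
    schedule = []
    t = 0
    for _ in range(num_batches):
        for ipu in range(NUM_IPUS):  # shards run strictly one after another
            schedule.append((t, ipu))
            t += 1
    return schedule

sched = sharded_schedule(num_batches=2)
total_steps = sched[-1][0] + 1                       # 8 steps for 2 batches
busy_fraction = len(sched) / (total_steps * NUM_IPUS)
print(total_steps, busy_fraction)                    # 8 0.25
```

With four shards, average utilisation is 1/4: three of the four IPUs sit idle at every step, which is the cost pipelining is designed to recover.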
A different approach to model parallelism across multiple IPUs is pipelining. The model is still split into multiple compute stages on multiple IPUs, but the stages execute in parallel, with the outputs of each stage feeding the inputs of the next. Pipelining keeps the hardware better utilised during execution, which leads to better throughput and latency than sharding.
Therefore, pipelining is the recommended method to parallelise a model across multiple IPUs.
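The utilisation gain can be made concrete with the same kind of plain-Python sketch (again an illustrative four-stage split with unit stage times, not Graphcore code). In a pipeline, batch `b` enters stage `s` at step `b + s`, so after a short fill phase all stages work on different batches at once:

```python
NUM_IPUS = 4  # hypothetical: model split into 4 pipeline stages

def pipelined_schedule(num_batches):
    """Return (time_step, active_ipu, batch) triples for pipelined execution.

    Batch b reaches stage s at step b + s, so consecutive batches
    overlap across stages instead of running one at a time.
    """
    schedule = []
    for b in range(num_batches):
        for s in range(NUM_IPUS):
            schedule.append((b + s, s, b))
    return schedule

sched = pipelined_schedule(num_batches=8)
total_steps = max(t for t, _, _ in sched) + 1        # 8 + 4 - 1 = 11 steps
busy_fraction = len(sched) / (total_steps * NUM_IPUS)
print(total_steps, round(busy_fraction, 2))          # 11 0.73
```

Eight batches finish in 11 steps at ~73% utilisation, versus 32 steps at 25% for the sharded schedule above; with more batches in flight, pipeline utilisation approaches 100% (minus the fill/drain overhead).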
You can find more details on pipelined training in this section of the Targeting the IPU from TensorFlow guide.
A more comprehensive review of those two model parallelism approaches is provided in this dedicated guide.
You could also consider using IPUPipelineEstimator: it is a variant of the IPUEstimator that automatically handles most aspects of running a (pipelined) program on an IPU. Here you can find a code example showing how to use the IPUPipelineEstimator to train a simple CNN on the CIFAR-10 dataset.