python tensorflow machine-learning input sharding

What is sharding in machine learning and how to do sharding in TensorFlow?


What is sharding in the context of machine learning specifically (a more generic question is asked here), and how is it implemented in TensorFlow?

What does sharding refer to, and why do we need it at all, when speaking about the data pipeline in machine learning?


Solution

  • In TensorFlow, the tf.data.Dataset method shard() creates a Dataset that includes only 1/num_shards of the original dataset. shard() is deterministic: the Dataset produced by A.shard(n, i) contains all elements of A whose index mod n = i.

    import tensorflow as tf

    # Ten elements, 0..9, split into three shards by index mod 3.
    A = tf.data.Dataset.range(10)
    B = A.shard(num_shards=3, index=0)
    list(B.as_numpy_iterator())  # [0, 3, 6, 9]
    C = A.shard(num_shards=3, index=1)
    list(C.as_numpy_iterator())  # [1, 4, 7]
    D = A.shard(num_shards=3, index=2)
    list(D.as_numpy_iterator())  # [2, 5, 8]
    

    Important caveats: Be sure to shard before you use any randomizing operator (such as shuffle).
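
    To see why the order matters, here is a minimal plain-Python sketch. It does not use tf.data: the helpers shard and shuffled are hypothetical stand-ins that model the index mod n rule of Dataset.shard and a per-worker shuffle, and it assumes each worker shuffles with a different seed.

```python
import random


def shard(elements, num_shards, index):
    """Plain-Python stand-in for Dataset.shard: keep positions where i % num_shards == index."""
    return [x for i, x in enumerate(elements) if i % num_shards == index]


def shuffled(elements, seed):
    """Return a shuffled copy, standing in for one worker's shuffle step."""
    copy = list(elements)
    random.Random(seed).shuffle(copy)
    return copy


def coverage(shards):
    """All elements seen by all workers, sorted, duplicates kept."""
    return sorted(x for s in shards for x in s)


data = list(range(10))
num_workers = 3

# Recommended order: shard first, then shuffle each worker's own shard.
shard_then_shuffle = [
    shuffled(shard(data, num_workers, w), seed=w) for w in range(num_workers)
]

# Risky order: each worker shuffles the full data with its own seed, then shards.
shuffle_then_shard = [
    shard(shuffled(data, seed=w), num_workers, w) for w in range(num_workers)
]

print(coverage(shard_then_shuffle))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(coverage(shuffle_then_shard))  # 10 values, but possibly with repeats and gaps
```

    When each worker shuffles before sharding, the shards are cut from different orderings, so some elements can be seen twice and others not at all. Sharding first guarantees that the shards are disjoint and together cover the dataset.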

    Generally it is best if the shard operator is used early in the dataset pipeline. For example, when reading from a set of TFRecord files, shard before converting the dataset to input samples. This avoids reading every file on every worker. The following is an example of an efficient sharding strategy within a complete pipeline:
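
    A minimal sketch of such a pipeline, sharding at the file level before any records are read. The function make_worker_dataset, the part-*.tfrecord file names, and the cycle_length and shuffle_buffer values are assumptions made for illustration. The records are kept as raw bytes; a real pipeline would likely add a map() step with its own parsing function.

```python
import os
import tempfile

import tensorflow as tf


def make_worker_dataset(file_pattern, num_workers, worker_index, shuffle_buffer=100):
    """Build one worker's input pipeline, sharding file names before reading."""
    # List files in a deterministic order so every worker sees the same sequence.
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Shard the file names: each worker only opens its own subset of files.
    files = files.shard(num_shards=num_workers, index=worker_index)
    # Read the records of this worker's files.
    records = files.interleave(tf.data.TFRecordDataset, cycle_length=2)
    # Randomize only after sharding.
    return records.shuffle(shuffle_buffer)


# Demo data: four tiny TFRecord files with three records each.
tmp_dir = tempfile.mkdtemp()
for file_id in range(4):
    path = os.path.join(tmp_dir, f"part-{file_id}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for record_id in range(3):
            writer.write(f"file{file_id}-rec{record_id}".encode())

pattern = os.path.join(tmp_dir, "part-*.tfrecord")

# Collect the records seen by each of two workers.
worker_records = [
    sorted(r.decode() for r in make_worker_dataset(pattern, 2, i).as_numpy_iterator())
    for i in range(2)
]
for i, recs in enumerate(worker_records):
    print(f"worker {i}: {recs}")
```

    With four files and two workers, each worker opens two files and never touches the other two, which is the point of sharding before the read step.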

    Autosharding a dataset over a set of workers means that each worker is assigned a subset of the entire dataset (if the right tf.data.experimental.AutoShardPolicy is set).

    This ensures that, at each step, the workers together process a global batch made up of non-overlapping dataset elements.

    Example of setting autosharding options:

    import tensorflow as tf

    dataset = tf.data.Dataset.from_tensors(([1.],[1.])).repeat(64).batch(16)
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
    dataset = dataset.with_options(options)
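
    To check that the policy was attached, the options can be read back from the dataset. This is a minimal sketch; with_auto_shard_policy is a hypothetical helper that wraps the calls shown above.

```python
import tensorflow as tf


def with_auto_shard_policy(dataset, policy):
    """Return the dataset with the given autoshard policy attached (hypothetical helper)."""
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = policy
    return dataset.with_options(options)


dataset = tf.data.Dataset.from_tensors(([1.0], [1.0])).repeat(64).batch(16)
dataset = with_auto_shard_policy(dataset, tf.data.experimental.AutoShardPolicy.DATA)

# Read the policy back from the dataset's options.
applied_policy = dataset.options().experimental_distribute.auto_shard_policy
print(applied_policy)

# Iterating locally is unaffected by the policy: 64 elements in batches of 16.
num_batches = sum(1 for _ in dataset)
print(num_batches)  # 4
```

    The policy only comes into play once the dataset is handed to a tf.distribute strategy that spreads input across workers; iterating the dataset locally, as here, still yields every batch.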
    

    There is no autosharding in multi-worker training with ParameterServerStrategy.

    Autosharding options are