Why does PyTorch create another repo called TorchData for the similar/new Dataset and DataLoader instead of adding them to the existing PyTorch repo? What's the difference between Dataset and DataPipe? Thanks.
TorchData is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. It aims to provide composable Iterable-style and Map-style building blocks called DataPipes that work well out of the box with PyTorch's DataLoader. It contains functionality to reproduce many different datasets in TorchVision and TorchText, including loading, parsing, caching, and several other utilities (e.g. hash checking).
DataPipe is simply a renaming and repurposing of the PyTorch Dataset for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied.
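To illustrate that "access function in, transformed access function out" idea for the Iterable style, here is a small sketch of a custom IterDataPipe. SquareDataPipe is a hypothetical name used only for illustration, not a class that ships with TorchData.

```python
# Hypothetical DataPipe: wraps a source DataPipe's __iter__ and
# returns a new iterator with a small transformation applied.
from torchdata.datapipes.iter import IterDataPipe, IterableWrapper

class SquareDataPipe(IterDataPipe):  # illustrative name, not a TorchData class
    def __init__(self, source_datapipe):
        self.source_datapipe = source_datapipe

    def __iter__(self):
        # Delegate to the source's access function, transforming each element.
        for x in self.source_datapipe:
            yield x * x

source = IterableWrapper([1, 2, 3])
print(list(SquareDataPipe(source)))  # [1, 4, 9]
```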