pytorch dataset dataloader pytorch-dataloader data-pipeline

Why did PyTorch create another data repo, TorchData?


Why did PyTorch create a separate repo called TorchData for similar/new Dataset and DataLoader functionality instead of adding it to the existing PyTorch repo? What's the difference between a Dataset and a DataPipe? Thanks.


Solution

  • TorchData is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines.

    It aims to provide composable iterable-style and map-style building blocks called DataPipes that work well out of the box with PyTorch's DataLoader. It contains functionality to reproduce many different datasets in TorchVision and TorchText, including loading, parsing, caching, and several other utilities (e.g. hash checking).

    DataPipe is simply a renaming and repurposing of the PyTorch Dataset for composed usage. A DataPipe takes in some access function over Python data structures (__iter__ for IterDataPipes, __getitem__ for MapDataPipes) and returns a new access function with a slight transformation applied.
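
To make the "wraps one access function and returns another" idea concrete, here is a minimal pure-Python sketch. It does not use TorchData itself: the class names (RangeSource, IterMapper, IterFilter, ListSource, IndexMapper) are hypothetical stand-ins invented for illustration, and the real library's classes and signatures may differ.

```python
# Illustrative only: these classes are hypothetical and are NOT TorchData's API.
# They mimic the composition pattern described above using plain Python.


class RangeSource:
    """Iterable-style source: exposes __iter__ over the integers 0..n-1."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))


class IterMapper:
    """Wraps another iterable and applies fn to each item it yields."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        for item in self.source:
            yield self.fn(item)


class IterFilter:
    """Wraps another iterable and keeps only items where predicate is true."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def __iter__(self):
        for item in self.source:
            if self.predicate(item):
                yield item


class ListSource:
    """Map-style source: exposes __getitem__ and __len__ over a list."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, index):
        return self.items[index]

    def __len__(self):
        return len(self.items)


class IndexMapper:
    """Wraps a map-style source and applies fn on each indexed lookup."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __getitem__(self, index):
        return self.fn(self.source[index])

    def __len__(self):
        return len(self.source)


# Iterable-style composition: square each number, then keep the even squares.
iter_pipe = IterFilter(
    IterMapper(RangeSource(6), lambda x: x * x),
    lambda x: x % 2 == 0,
)
iter_result = list(iter_pipe)

# Map-style composition: upper-case each string on lookup.
map_pipe = IndexMapper(ListSource(["a", "b", "c"]), str.upper)
map_result = [map_pipe[i] for i in range(len(map_pipe))]
```

Each wrapper only changes how items are accessed, so stages can be stacked in any order and the outermost object still looks like an ordinary iterable-style or map-style dataset.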