amazon-web-services amazon-s3 pytorch amazon-sagemaker torchvision

Using Torchvision ImageFolder on AWS S3


I'm working with AWS S3 and trying to deploy an SSL model that loads a dataset from a bucket I have defined on S3. The DL framework I'm using is PyTorch; more concretely, I'm using Torchvision to load the image dataset from S3. However, when I try to load the images from S3 with torchvision.datasets.ImageFolder, it raises an error.

Apparently, this is not possible (the post is from 2019): https://discuss.pytorch.org/t/can-i-use-torchvision-dataset-and-dataloader-with-aws-s3/34096

But I would like to know whether there is an option other than the one described here (from 2021): https://aws.amazon.com/blogs/machine-learning/announcing-the-amazon-s3-plugin-for-pytorch/

EDIT:

I'm also trying to address this problem using Amazon's 'new' S3 connector for PyTorch:

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-s3-connector-pytorch/

https://github.com/awslabs/s3-connector-for-pytorch


Solution

  • This depends on how you're implementing it. The AWS-provided S3 connector for PyTorch should work, but since you'd prefer not to use it, try this:

    The simplest solution is to download the dataset to a local directory on your training instance and instantiate your ImageFolder with that local directory. If you're training in a Jupyter notebook, you can use the AWS CLI's s3 sync command (aws s3 sync). If you're doing this from inside a SageMaker training job, you can pass the S3 location as a TrainingInput object when you call fit (see this and this).
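As a rough illustration of the "download first, then point ImageFolder at the local copy" approach, here is a minimal sketch. It uses boto3 instead of the CLI sync command, which is a substitution on my part; the helper names (local_path_for_key, download_prefix, build_dataset) and the bucket/prefix arguments are hypothetical, and it assumes the objects under the prefix are already laid out as one subfolder per class, as ImageFolder expects.

```python
import os


def local_path_for_key(key: str, prefix: str, dest_dir: str) -> str:
    """Map an S3 object key to a local path, keeping the class subfolders."""
    # Drop the common prefix so "train/cats/a.jpg" becomes "cats/a.jpg".
    relative = key[len(prefix):] if key.startswith(prefix) else key
    return os.path.join(dest_dir, *relative.lstrip("/").split("/"))


def download_prefix(bucket: str, prefix: str, dest_dir: str) -> int:
    """Download every object under s3://bucket/prefix; return the file count."""
    import boto3  # lazy import: the path helper above needs only the stdlib

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip "folder" placeholder objects
                continue
            target = local_path_for_key(key, prefix, dest_dir)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, key, target)
            count += 1
    return count


def build_dataset(bucket: str, prefix: str, dest_dir: str):
    """Copy the S3 prefix locally, then load the local copy with ImageFolder."""
    from torchvision import datasets, transforms  # lazy import, as above

    download_prefix(bucket, prefix, dest_dir)
    # ImageFolder only understands local paths, so it gets dest_dir, not an S3 URI.
    return datasets.ImageFolder(dest_dir, transform=transforms.ToTensor())
```

Something like build_dataset("my-bucket", "train/", "/tmp/train") (placeholder names) would then return a dataset you can wrap in a DataLoader as usual. Inside a SageMaker training job you would normally not need the download step: as far as I know, with the default File input mode each channel is copied to /opt/ml/input/data/<channel name> before your script starts, so ImageFolder can be pointed at that directory directly.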