python image gcloud open-source google-cloud-vertex-ai

What is the easiest way to create a toy image dataset on GCP Vertex AI?


Toy datasets are useful to share reproducible issues. I would like to easily create image datasets on Vertex AI from open-source data.

For example, Keras provides some public data sets (boston_housing, cifar10, cifar100, fashion_mnist, imdb, mnist, reuters).

How can I easily load one of them into a Vertex AI image dataset, for example with gcloud commands and/or a Python script?


Solution

  • Assuming you have GCP credentials to perform the following actions, a single-label image dataset can be created on Vertex AI with the following commands.

    1. Connect to your GCP project and activate a Cloud Shell console (button at the top right).
    2. Install the cifar2png Python package and export the CIFAR-10 images as .png files into a directory called cifar10_png on your local disk.
        $ pip install cifar2png
        $ cifar2png cifar10 cifar10_png
    
    3. Copy the files to Cloud Storage. This operation can take a few minutes depending on the amount of data. Here we only copy the test images.
        $ BUCKET_NAME="your_bucket_name"
        $ gsutil -m -q cp -r cifar10_png/test gs://${BUCKET_NAME}/cifar10_png/test
    
    4. Create an empty Vertex AI image dataset. Your project identifier is displayed by clicking on the project name at the top left of the console.
        $ LOCATION="your_region"  # a Vertex AI region, for example us-central1
        $ PROJECT_ID="your_project_id"
        $ curl -X POST "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/datasets" \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json; charset=utf-8" \
        -d '{"display_name": "<replace_by_your_dataset_name>", "metadata_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/metadata/image_1.0.0.yaml"}'
    
    5. Keep the dataset identifier returned in the response.
    6. Import the images into your dataset.
        $ DATASET_ID="your_dataset_id"
        $ curl -X POST "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/datasets/${DATASET_ID}:import" \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json; charset=utf-8" \
        -d '{"import_configs": [{"gcs_source": {"uris": ["gs://<replace_by_your_bucket_name>/cifar10_png/test"]}, "import_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml"}]}'
    
    7. Go to Vertex AI > Datasets and wait 5-10 minutes for the import to finish. More information about creating datasets for image classification is available in the Google Cloud documentation.
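
Since the question also asks for a Python script, here is a minimal sketch of the same workflow in Python. It rests on a few assumptions that are not in the steps above: as far as I know, the import request expects `uris` to name an import file (CSV or JSON Lines) listing each image with its label rather than a bare directory of images, and cifar2png writes one sub-directory per class (for example `test/airplane/`). The helper names `list_pngs`, `build_import_csv` and `create_image_dataset` are made up for illustration. The last function uses the `google-cloud-aiplatform` SDK and is not executed here.

```python
import csv
import io
from pathlib import Path, PurePosixPath


def list_pngs(local_dir):
    """Return paths such as 'airplane/0001.png', relative to local_dir."""
    root = Path(local_dir)
    return [p.relative_to(root).as_posix() for p in sorted(root.glob("*/*.png"))]


def build_import_csv(relative_paths, gcs_prefix):
    """Build CSV text with one 'gs://.../<label>/<file>,<label>' row per image."""
    prefix = gcs_prefix.rstrip("/")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for rel in relative_paths:
        # Assumption: the parent directory name is the class label.
        label = PurePosixPath(rel).parent.name
        writer.writerow([f"{prefix}/{rel}", label])
    return buf.getvalue()


def create_image_dataset(project_id, location, display_name, csv_uri):
    """Create the dataset and import the CSV in one call (needs credentials)."""
    # Imported lazily so the helpers above work without the SDK installed.
    from google.cloud import aiplatform  # pip install google-cloud-aiplatform

    aiplatform.init(project=project_id, location=location)
    return aiplatform.ImageDataset.create(
        display_name=display_name,
        gcs_source=csv_uri,
        import_schema_uri=(
            aiplatform.schema.dataset.ioformat.image.single_label_classification
        ),
    )
```

To use it, write the output of `build_import_csv(list_pngs("cifar10_png/test"), "gs://your_bucket_name/cifar10_png/test")` to a file such as `import.csv`, copy that file to the bucket with gsutil, and then either call `create_image_dataset` with the CSV's `gs://` URI or put that URI in the `uris` list of the curl import request.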