google-cloud-automl google-cloud-vertex-ai

Training a Google Cloud AutoML model on multiple datasets


I would like to train an AutoML model on GCP's Vertex AI using multiple datasets. I would like to keep the datasets separate, since they come from different sources and I may also want to train on them separately. Is that possible? Or will I need to create a single dataset containing both? It looks like I can only select one dataset in the web UI.


Solution

  • It is possible via the Vertex AI API, as long as your sources are in Google Cloud Storage. Provide a list of training data files in JSON or CSV format that follow the best practices for formatting training data.

    See the sample code for creating and importing datasets. See the documentation for the code reference and further details.

    from typing import List, Union

    from google.cloud import aiplatform


    def create_and_import_dataset_image_sample(
        project: str,
        location: str,
        display_name: str,
        src_uris: Union[str, List[str]],  # example: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
        sync: bool = True,
    ):
        aiplatform.init(project=project, location=location)

        # Create one dataset and import every listed file into it.
        ds = aiplatform.ImageDataset.create(
            display_name=display_name,
            gcs_source=src_uris,
            import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
            sync=sync,
        )

        # Block until creation and import have finished (relevant when sync=False).
        ds.wait()

        print(ds.display_name)
        print(ds.resource_name)
        return ds
    

    NOTE: The links provided are for Vertex AI AutoML Image. If you follow them, there are options for other AutoML products such as Text, Tabular, and Video.
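
Since `src_uris` in the sample above accepts either a single string or a list, it may help to normalize and sanity-check the list before sending it to the API. The following is only a minimal sketch: `normalize_gcs_sources` and `ALLOWED_SUFFIXES` are hypothetical names, not part of the Vertex AI SDK, and the accepted file extensions are an assumption based on the JSON/CSV requirement mentioned in the answer.

```python
from typing import List, Union

# Assumption: import files are CSV or JSON / JSON Lines; adjust for your data type.
ALLOWED_SUFFIXES = (".csv", ".json", ".jsonl")


def normalize_gcs_sources(src_uris: Union[str, List[str]]) -> List[str]:
    """Return a de-duplicated list of gs:// URIs, preserving input order.

    Hypothetical helper (not part of the SDK): raises ValueError if an entry
    is not a Cloud Storage URI or does not look like an import file.
    """
    uris = [src_uris] if isinstance(src_uris, str) else list(src_uris)
    if not uris:
        raise ValueError("at least one source URI is required")

    result: List[str] = []
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage URI: {uri!r}")
        if not uri.lower().endswith(ALLOWED_SUFFIXES):
            raise ValueError(f"unexpected import file type: {uri!r}")
        if uri not in result:  # skip duplicates, keep the first occurrence
            result.append(uri)
    return result
```

The returned list could then be passed as `src_uris` to `create_and_import_dataset_image_sample`, for example `normalize_gcs_sources(["gs://bucket/file1.csv", "gs://bucket/file2.csv"])`, where the bucket and file names are the placeholders from the sample.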