
Training spaCy model as a Vertex AI Pipeline "Component"


I am trying to train a spaCy model by turning my training code into a Vertex AI Pipeline component. My current code is:

from typing import NamedTuple
from kfp.v2.dsl import component

@component(
    packages_to_install=[
        "setuptools",
        "wheel", 
        "spacy[cuda113,transformers,lookups]",
    ],
    base_image="gcr.io/deeplearning-platform-release/base-cu113",
    output_component_file="train.yaml"
)
def train(train_name: str, dev_name: str) -> NamedTuple("output", [("model_path", str)]):
    """
    Trains a spacy model
    
    Parameters:
    ----------
    train_name : Name of the spaCy "train" set, used for model training.
    dev_name: Name of the spaCy "dev" set, used for evaluation during training.
    
    Returns:
    -------
    output : Destination path of the saved model.
    """
    import spacy
    import subprocess
    
    spacy.require_gpu()  # <=== IMAGE FAILS TO BE COMPILED HERE
    
    # NOTE: The remaining code has already been tested and proven to be functional.
    #       It has been edited since the project is private.
    
    # Presets for training
    subprocess.run(["python", "-m", "spacy", "init", "fill-config", "gcs/secret_path_to_config/base_config.cfg", "config.cfg"])

    # Training model
    location = "gcs/secret_model_destination_path/TestModel"
    subprocess.run(["python", "-m", "spacy", "train", "config.cfg",
                    "--output", location,
                    "--paths.train", "gcs/secret_bucket/secret_path/{}.spacy".format(train_name),
                    "--paths.dev", "gcs/secret_bucket/secret_path/{}.spacy".format(dev_name),
                    "--gpu-id", "0"])
    
    return (location,)

The Vertex AI logs display the following as the main cause of the failure:

[Screenshot of the Vertex AI log output; image not included]

The libraries install successfully, yet I suspect that some library or setting is still missing (I have run into this before); I just don't know how to make it compatible with Python-based Vertex AI components. Note that using a GPU is mandatory for my code.

Any ideas?


Solution

  • After some trial and error, I think I have figured out what my code was missing. The train component definition was actually correct (apart from some minor tweaks relative to what was originally posted); however, the pipeline was missing the GPU definition. Below is a dummy example that trains an NER model using spaCy and orchestrates everything via a Vertex AI Pipeline:

    from kfp.v2 import compiler
    from kfp.v2.dsl import pipeline, component, Dataset, Input, Output, OutputPath, InputPath
    from datetime import datetime
    from google.cloud import aiplatform
    from typing import NamedTuple
    
    
    # Component definition
    
    @component(
        packages_to_install=[
            "setuptools",
            "wheel", 
            "spacy[cuda113,transformers,lookups]",
        ],
        base_image="gcr.io/deeplearning-platform-release/base-cu113",
        output_component_file="generate.yaml"
    )
    def generate_spacy_file(train_path: OutputPath(), dev_path: OutputPath()):
        """
        Generates a small, dummy 'train.spacy' & 'dev.spacy' file
        
        Returns:
        -------
        train_path : Relative location in GCS, for the "train.spacy" file.
        dev_path: Relative location in GCS, for the "dev.spacy" file.
        """
        import spacy
        from spacy.training import Example
        from spacy.tokens import DocBin
    
        td = [    # Train (dummy) dataset, in spaCy v2 format: (text, {"entities": [(start, end, label)]})
                  ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
                  ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
                  ("I recently ordered a book from Amazon", {"entities": [(31, 37, "ORG")]}),
                  ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
                  ("I ordered this from ShopClues", {"entities": [(20, 29, "ORG")]}),
                  ("Fridge can be ordered in Amazon ", {"entities": [(0, 6, "PRODUCT")]}),
                  ("I bought a new Washer", {"entities": [(15, 21, "PRODUCT")]}),
                  ("I bought a old table", {"entities": [(15, 20, "PRODUCT")]}),
                  ("I bought a fancy dress", {"entities": [(17, 22, "PRODUCT")]}),
                  ("I rented a camera", {"entities": [(11, 17, "PRODUCT")]}),
                  ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
                  ("I rented a screwdriver from our neighbour", {"entities": [(11, 22, "PRODUCT")]}),
                  ("I repaired my computer", {"entities": [(14, 22, "PRODUCT")]}),
                  ("I got my clock fixed", {"entities": [(9, 14, "PRODUCT")]}),
                  ("I got my truck fixed", {"entities": [(9, 14, "PRODUCT")]}),
        ]
        
        dd = [    # Development (dummy) dataset (CV), in spaCy v2 format
                  ("Flipkart started it's journey from zero", {"entities": [(0, 8, "ORG")]}),
                  ("I recently ordered from Max", {"entities": [(24, 27, "ORG")]}),
                  ("Flipkart is recognized as leader in market", {"entities": [(0, 8, "ORG")]}),
                  ("I recently ordered from Swiggy", {"entities": [(24, 30, "ORG")]})
        ]
    
        
        # Converting Train & Development datasets, from 'spaCy V2' to 'spaCy V3'
        nlp = spacy.blank("en")
        db_train = DocBin()
        db_dev = DocBin()
    
        for text, annotations in td:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            db_train.add(example.reference)
            
        for text, annotations in dd:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            db_dev.add(example.reference)
        
        db_train.to_disk(train_path + ".spacy")  # <== Obtaining and storing "train.spacy"
        db_dev.to_disk(dev_path + ".spacy")      # <== Obtaining and storing "dev.spacy"
        
    
    # ----------------------- ORIGINALLY POSTED CODE -----------------------
    
    @component(
        packages_to_install=[
            "setuptools",
            "wheel", 
            "spacy[cuda113,transformers,lookups]",
        ],
        base_image="gcr.io/deeplearning-platform-release/base-cu113",
        output_component_file="train.yaml"
    )
    def train(train_path: InputPath(), dev_path: InputPath(), output_path: OutputPath()):
        """
        Trains a spacy model
        
        Parameters:
        ----------
        train_path : Relative location in GCS, for the "train.spacy" file.
        dev_path: Relative location in GCS, for the "dev.spacy" file.
        
        Returns:
        -------
        output_path : Destination path of the saved model.
        """
        import spacy
        import subprocess
        
        spacy.require_gpu()  # <=== NOW SUCCEEDS, BECAUSE THE STEP RUNS ON A GPU MACHINE!
    
        # Presets for training
        subprocess.run(["python", "-m", "spacy", "init", "fill-config", "gcs/secret_path_to_config/base_config.cfg", "config.cfg"])
    
        # Training model
        subprocess.run(["python", "-m", "spacy", "train", "config.cfg",
                        "--output", output_path,
                        "--paths.train", "{}.spacy".format(train_path),
                        "--paths.dev", "{}.spacy".format(dev_path),
                        "--gpu-id", "0"])
    
    # ----------------------------------------------------------------------
        
    
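    # Pipeline definition
    
    # NOTE: PIPELINE_ROOT was not defined in the original snippet.
    #       The value below is a placeholder; point it to your own GCS location.
    PIPELINE_ROOT = "gs://your-bucket/pipeline_root"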
    
    @pipeline(
        pipeline_root=PIPELINE_ROOT,
        name="spacy-dummy-pipeline",
    )
    def spacy_pipeline():
        """
        Builds a custom pipeline
        """
        # Generating dummy "train.spacy" + "dev.spacy"
        train_dev_sets = generate_spacy_file()
        # With the output of the previous component, train a spaCy model
        model = train(
            train_dev_sets.outputs["train_path"],
            train_dev_sets.outputs["dev_path"]
        
        # ------ !!! THIS SECTION DOES THE TRICK !!! ------
        ).add_node_selector_constraint(
            label_name="cloud.google.com/gke-accelerator",
            value="NVIDIA_TESLA_T4"
        ).set_gpu_limit(1).set_memory_limit('32G')
        # -------------------------------------------------
    
    # Pipeline compilation   
    
    compiler.Compiler().compile(
        pipeline_func=spacy_pipeline, package_path="pipeline_spacy_job.json"
    )
    
    
    # Pipeline run
    
    TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
    
    run = aiplatform.PipelineJob(  # Include your own naming here
        display_name="spacy-dummy-pipeline",
        template_path="pipeline_spacy_job.json",
        job_id="ml-pipeline-spacydummy-small-{0}".format(TIMESTAMP),
        parameter_values={},
        enable_caching=True,
    )
    
    
    # Pipeline gets submitted
    
    run.submit()
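
A side note on the dummy datasets in generate_spacy_file: each (start, end) pair is a character offset into the text, with an exclusive end, and spaCy warns about (and ignores) entity spans that do not line up with token boundaries. A quick way to eyeball the offsets before building the DocBin is plain string slicing. The following is only a sketch; entity_spans is a hypothetical helper name, not part of spaCy or of the pipeline above.

```python
from typing import Dict, List, Tuple

# One annotated sample in the spaCy v2 style used above.
Sample = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def entity_spans(samples: List[Sample]) -> List[Tuple[str, str]]:
    """Return (substring, label) for every annotated entity, using plain slicing."""
    spans = []
    for text, annotations in samples:
        for start, end, label in annotations["entities"]:
            spans.append((text[start:end], label))
    return spans


if __name__ == "__main__":
    demo = [
        ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
        ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
    ]
    for substring, label in entity_spans(demo):
        # Each printed substring should be exactly the entity text.
        print(repr(substring), label)
```

If a printed substring is cut off, or includes a neighbouring space or letter, the offsets for that sample need adjusting.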
    

    Now for the explanation. According to Google's documentation:

    By default, the component will run on as a Vertex AI CustomJob using an e2-standard-4 machine, with 4 core CPUs and 16GB memory.
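
Whether a step actually has a GPU can be checked from inside the component before calling spacy.require_gpu(). The sketch below is an illustration and was not part of the original pipeline. visible_gpus is a hypothetical helper, and it assumes that the nvidia-smi tool is on the PATH whenever a GPU is attached, which is typical for CUDA images but not guaranteed.

```python
import shutil
import subprocess
from typing import List


def visible_gpus() -> List[str]:
    """Return one line per GPU listed by `nvidia-smi -L`, or [] if none is visible.

    Hypothetical helper: assumes nvidia-smi is present whenever a GPU is attached.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tooling on the PATH, so treat this as "no GPU".
        return []
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # Each GPU is reported on its own line, starting with "GPU <index>:".
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print("GPUs visible:", gpus if gpus else "none")
```

On the default machine described above, this would be expected to report no GPU; on a step that was given an accelerator, it should list the attached card.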

    Therefore, when the train component runs on that default machine, spacy.require_gpu() fails because no GPU is available as a resource. The same documentation page, however, lists all the available machine settings for both CPU and GPU. In my case, as you can see, I set the train component to run with ONE (1) NVIDIA_TESLA_T4 GPU, and I also increased the memory to 32GB. With these modifications, the resulting pipeline looks as follows:

    [Screenshot of the resulting pipeline; image not included]

    And as you can see, the pipeline now runs successfully and trains (and eventually outputs) a functional spaCy model. From here, you can tweak this code to fit your own needs.
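
One tweak worth considering: subprocess.run only raises on failure when check=True is passed, so as written, a failed "spacy train" call would still let the component finish normally. The sketch below shows one possible way to assemble the same command and make a non-zero exit code fail the step. build_train_command and run_checked are hypothetical helper names, and using sys.executable instead of "python" is my own choice, to make sure the interpreter that has spaCy installed is the one being used.

```python
import subprocess
import sys
from typing import List


def build_train_command(
    config: str, output: str, train: str, dev: str, gpu_id: int = 0
) -> List[str]:
    """Assemble the same `spacy train` invocation used in the train component."""
    return [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output,
        "--paths.train", train,
        "--paths.dev", dev,
        "--gpu-id", str(gpu_id),
    ]


def run_checked(command: List[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the command; inside the component you would call run_checked(cmd).
    cmd = build_train_command("config.cfg", "model_out", "train.spacy", "dev.spacy")
    print(" ".join(cmd))
```

Inside the train component, the two subprocess.run calls could be replaced with run_checked(...), so that a training error surfaces as a failed pipeline step instead of an empty model directory.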

    I hope this helps anyone who might be interested.

    Thank you.