google-cloud-platform google-bigquery google-cloud-vertex-ai automl

Can I trace back the version of the data my model was trained on in Vertex AI?


Let's suppose I have a table in BigQuery and I create a dataset in Vertex AI based on it. I train my model. A while later, the data gets updated several times in BigQuery.

But can I simply go to my model and get redirected to the exact version of the data it was trained on?

Using time travel, I can still access the historical data in BigQuery. But I couldn't find a way to start from my model, figure out which version of the data it was trained on, and look at that data.


Solution

  • The Vertex AI page for creating a dataset from BigQuery carries this statement:

    The selected BigQuery table will be associated with your dataset. Making changes to the referenced BigQuery table will affect the dataset before training.

    So there is no copy or clone of the table prepared automatically for you.

    1. Another fact is that you usually don't need the whole base table to create the dataset; you probably select a subset based on date or other WHERE conditions. The point is that you filter your base table, and your new dataset is only a subset of it.

    The recommended way is to create a BigQuery dataset dedicated to your Vertex AI table sources; let's call it vertex_ai_dataset. In this dataset you store every table that backs a Vertex AI dataset. Make sure to version these tables and never update them in place.

    So BASETABLE -> SELECT -> WRITE AS vertex_ai_dataset.dataset_for_model_v1 (use the latter in Vertex AI).

    1. Another option is to SNAPSHOT the base table whenever you issue a TRAIN action. But be aware that these snapshots need to be maintained and cleaned up as well.

      CREATE SNAPSHOT TABLE dataset_to_store_snapshots.mysnapshotname CLONE dataset.basetable;

    Other parameters and a guide are here.

    1. You could also automate this by observing the Vertex AI training event (it should be documented here) and using Eventarc to start a workflow in Cloud Workflows that automatically creates a BigQuery table snapshot for you.
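To make the first two options concrete, here is a minimal Python sketch that only builds the BigQuery statements. Treat it as an illustration under assumptions, not a definitive implementation: the helper names (`versioned_table_sql`, `snapshot_sql`, `run`), the `event_date` filter column, and the 90-day retention are all made up for the example. The `FOR SYSTEM_TIME AS OF` clause only works while the training timestamp is still inside the table's time travel window (seven days by default), so the snapshot has to be taken soon after training.

```python
from datetime import datetime, timedelta, timezone


def versioned_table_sql(base_table, target_dataset, name, version, where):
    """Option 1: materialize a filtered, versioned copy of the base table."""
    if version < 1:
        raise ValueError("version must be >= 1")
    target = f"{target_dataset}.{name}_v{version}"
    # Deliberately no OR REPLACE / IF NOT EXISTS: the statement fails if this
    # version already exists, so a table a model was trained on is never overwritten.
    return (
        f"CREATE TABLE `{target}` AS\n"
        f"SELECT *\n"
        f"FROM `{base_table}`\n"
        f"WHERE {where};"
    )


def snapshot_sql(base_table, snapshot_dataset, trained_at, keep_days=90):
    """Option 2: snapshot the base table as it was at training time."""
    if trained_at.tzinfo is None:
        raise ValueError("trained_at must be timezone-aware")
    trained_at = trained_at.astimezone(timezone.utc)
    fmt = "%Y-%m-%d %H:%M:%S+00:00"
    # Hypothetical naming scheme: <table>_<UTC training timestamp>.
    suffix = trained_at.strftime("%Y%m%d_%H%M%S")
    table_name = base_table.split(".")[-1]
    # The expiration keeps old snapshots from piling up (assumed retention).
    expires = trained_at + timedelta(days=keep_days)
    return (
        f"CREATE SNAPSHOT TABLE `{snapshot_dataset}.{table_name}_{suffix}`\n"
        f"CLONE `{base_table}`\n"
        f"FOR SYSTEM_TIME AS OF TIMESTAMP '{trained_at.strftime(fmt)}'\n"
        f"OPTIONS (expiration_timestamp = TIMESTAMP '{expires.strftime(fmt)}');"
    )


def run(sql):
    """Execute a statement with the google-cloud-bigquery client library."""
    # Imported lazily so the builders above work without the library installed.
    # Assumes default credentials and a default project are configured.
    from google.cloud import bigquery

    bigquery.Client().query(sql).result()  # blocks until the job finishes


if __name__ == "__main__":
    print(versioned_table_sql(
        base_table="dataset.basetable",
        target_dataset="vertex_ai_dataset",
        name="dataset_for_model",
        version=1,
        where="event_date < DATE '2023-01-01'",  # hypothetical filter column
    ))
    print(snapshot_sql(
        base_table="dataset.basetable",
        snapshot_dataset="dataset_to_store_snapshots",
        trained_at=datetime.now(timezone.utc),
    ))
```

Whichever option you pick, writing the resulting table or snapshot name into the model's description or labels when you train gives you the link back from the model to its data that Vertex AI does not keep for you.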