google-cloud-platform, google-cloud-dataprep

How do I run Google Dataprep jobs automatically?


Is there a way to trigger a Google Dataprep flow via the API?

I need to run around 30 different flows every day. Every day the source dataset changes, and the result has to be appended to a Google BigQuery table. Is there a way to automate this process? The source files are .xls files. I can upload them to Cloud Storage and write a Cloud Function that copies them wherever Dataprep needs them. The problem is that it seems impossible to replace the source dataset in a Dataprep flow. If so, then what's the point of scheduled runs and the new Job Run API?
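
For context, the kind of Cloud Function I have in mind would look something like this sketch (bucket and folder names are placeholders):

    # Sketch of a background Cloud Function triggered by a file landing in an
    # "incoming" bucket; it copies the file to where the Dataprep flow reads from.
    from google.cloud import storage

    DATAPREP_BUCKET = "my-dataprep-input"  # placeholder
    DATAPREP_FOLDER = "daily-source"       # placeholder

    def on_upload(event, context):
        name = event["name"]
        if not name.endswith(".xls"):
            return  # ignore anything that is not a source spreadsheet
        client = storage.Client()
        src_bucket = client.bucket(event["bucket"])
        dst_bucket = client.bucket(DATAPREP_BUCKET)
        src_bucket.copy_blob(src_bucket.blob(name), dst_bucket,
                             DATAPREP_FOLDER + "/" + name)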


Solution

  • There are several ways to do this. You will probably end up combining the parameterization and scheduling features to run scheduled jobs that pick up the new files every time.

    Depending on your use case, you can, for example, do one of the following:

    Importing a directory

    If you set up a directory that only contains one Excel file (see the screenshot below), you can use the + button to import that directory as the input dataset. Every time you run a job, the files present in that directory will be processed.

    [Screenshot: importing a directory]

    You can now schedule the job, create an output destination and you should be all set.
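
    If you automate the upload side, a small script can swap the new file into that directory before each scheduled run. Here is a minimal sketch using the google-cloud-storage client; the bucket and folder names are placeholders, not anything Dataprep requires:

    # Hedged sketch: replace the single file in the input directory before the
    # scheduled Dataprep run, so each job processes exactly one fresh file.
    from google.cloud import storage

    def replace_input_file(local_path, bucket_name="my-dataprep-input",
                           folder="daily-source"):
        client = storage.Client()
        bucket = client.bucket(bucket_name)

        # Remove whatever is currently in the directory...
        for blob in list(bucket.list_blobs(prefix=folder + "/")):
            blob.delete()

        # ...then upload today's file so the directory holds a single file again.
        bucket.blob(folder + "/source.xls").upload_from_filename(local_path)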

    Using date time parameters

    Let's assume you are in a situation where you add a new file every day with the date in the file name. In Cloud Storage, for example, it would look like this:

    [Screenshot: dated files in the Cloud Storage UI]

    You can use the Parameterize button in the Dataprep file browser and set up the following parameter:

    [Screenshot: date/time parameter]

    This should select the file from the previous day:

    [Screenshot: preview of the parameterization]

    You can then import the dataset and schedule the flow. If your schedule runs every day, it will pick up the new file each time.
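
    On the upload side, this convention just means writing each day's file with the date in its name. A minimal sketch, assuming a yyyy-mm-dd naming scheme like the one above (the bucket and file names are placeholders):

    from datetime import date
    from google.cloud import storage

    # Upload today's export with the date in its name. With the date/time
    # parameter set to "yesterday", the scheduled run on the following day
    # will pick this file up.
    def upload_daily_file(local_path):
        client = storage.Client()
        bucket = client.bucket("my-dataprep-input")  # placeholder bucket
        blob_name = "daily/transactions-{:%Y-%m-%d}.xls".format(date.today())
        bucket.blob(blob_name).upload_from_filename(local_path)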

    Using variables

    Alternatively, you can define a variable in the file path of your dataset.

    [Screenshot: variable for the folder name]

    You can then use the JobGroup API to override that variable.

    POST /v4/jobGroups
    
    {
      "wrangledDataset": {
        "id": datasetId
      },
      "runParameters": {
        "overrides": {
          "data": [
            {
              "key": "folder-name",
              "value": "new folder name"
            }
          ]
        }
      }
    }
    

    Note that for this to work, your files need to have the same structure. See https://cloud.google.com/dataprep/docs/html/Create-Dataset-with-Parameters_118228628#structuring-your-data for more details.
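
    For reference, the same jobGroups request can be sent from a script. Here is a minimal Python sketch, assuming the api.clouddataprep.com endpoint and an access token enabled for the Dataprep API (the dataset id, variable name and token are placeholders):

    import requests

    def run_job(dataset_id, folder_name, token):
        payload = {
            "wrangledDataset": {"id": dataset_id},
            "runParameters": {
                "overrides": {
                    "data": [{"key": "folder-name", "value": folder_name}]
                }
            },
        }
        resp = requests.post(
            "https://api.clouddataprep.com/v4/jobGroups",
            json=payload,
            headers={"Authorization": "Bearer " + token},
        )
        resp.raise_for_status()
        return resp.json()  # contains the id of the launched job group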

    Using a wildcard parameter should also be possible as an alternative to the first method.