google-bigquery google-cloud-dataflow google-cloud-dataprep

Dataprep is leaving Datasets/Tables behind in BigQuery


I am using Google Cloud Dataprep to process data stored in BigQuery. I am having an issue where Dataprep/Dataflow creates a new dataset with a name starting with "temp_dataset_beam_job_".

It seems to create the temporary dataset for both failed and successful Dataflow jobs that Dataprep launches. This is an issue, as BigQuery becomes messy very quickly with all these leftover datasets.

This has not been an issue in the past.

A similar issue has been described in this GitHub thread: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609

Is there any way to avoid creating these temporary datasets, or to have them created in a Cloud Storage folder instead?


Solution

  • I wrote a cleanup script that I run on Cloud Run (see this article), triggered by Cloud Scheduler.

    Below is the script:

    #!/bin/bash
    
    PROJECT={PROJECT_NAME}
    
    # list datasets whose names contain "temp_dataset_beam"
    # optional: write the list of matching datasets to Cloud Storage as a report
    obj="gs://{BUCKET_NAME}/maintenance-report-$(date +%s).txt"
    bq ls --max_results=100 | grep "temp_dataset_beam" | gsutil -q cp -J - "${obj}"
    
    datasets=$(bq ls --max_results=100 | grep "temp_dataset_beam")
    
    for dataset in $datasets
    do
      echo "${PROJECT}:${dataset}"
      # WARNING: uncomment the line below to actually delete datasets
      # --recursive is needed because the temp datasets may still contain tables
      # bq rm --recursive=true --dataset=true --force=true "${PROJECT}:${dataset}"
    done
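
    For reference, here is a minimal sketch of how the script could be triggered on a schedule, assuming the Cloud Run service exposes it behind an HTTP endpoint. The job name, schedule, SERVICE_URL and SERVICE_ACCOUNT below are placeholders, not values from my setup:

    # hypothetical example: invoke the Cloud Run service daily at 03:00
    # SERVICE_URL and SERVICE_ACCOUNT must be replaced with your own values
    gcloud scheduler jobs create http bq-temp-dataset-cleanup \
      --schedule="0 3 * * *" \
      --uri="${SERVICE_URL}" \
      --http-method=GET \
      --oidc-service-account-email="${SERVICE_ACCOUNT}"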