I am using Google Cloud Dataprep to process data stored in BigQuery. I am having an issue where Dataprep/Dataflow creates a new dataset with a name starting with "temp_dataset_beam_job_".
It seems to create the temporary dataset for both failed and successful Dataflow jobs that Dataprep launches. This is a problem because BigQuery becomes messy very quickly with all these datasets.
This has not been an issue in the past.
A similar issue has been described in this GitHub thread: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609
Is there any way to stop these temporary datasets from being created, or to have them created in a Cloud Storage folder instead?
I ended up writing a cleanup script that I run in Cloud Run (see this article), triggered by Cloud Scheduler (a sketch of the scheduler setup follows the script).
Below is the script:
#!/bin/bash
PROJECT={PROJECT_NAME}

# Get the list of datasets whose names contain "temp_dataset_beam".
# Note: raise --max_results if your project has more than 100 datasets,
# otherwise some temporary datasets may be missed.
datasets=$(bq ls --max_results=100 | grep "temp_dataset_beam")

# Optional: write the list to Cloud Storage as a maintenance report.
obj="gs://{BUCKET_NAME}/maintenance-report-$(date +%s).txt"
echo "${datasets}" | gsutil -q cp -J - "${obj}"

for dataset in $datasets
do
    echo "${PROJECT}:${dataset}"
    # WARNING: Uncomment the line below to remove datasets.
    # Add --recursive if the temporary datasets may still contain tables.
    # bq rm --dataset=true --force=true "${PROJECT}:${dataset}"
done
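For completeness, here is a minimal sketch of how the script can be triggered on a schedule. It assumes the script is exposed by a Cloud Run service; the job name, cron schedule, service URL, and service account below are hypothetical placeholders, not values from my actual setup:

# Create a Cloud Scheduler job that invokes the Cloud Run service daily at 03:00.
# All names and URLs are placeholders for illustration.
gcloud scheduler jobs create http bq-temp-dataset-cleanup \
  --schedule="0 3 * * *" \
  --uri="https://bq-cleanup-xyz-uc.a.run.app/" \
  --http-method=GET \
  --oidc-service-account-email="scheduler-invoker@{PROJECT_NAME}.iam.gserviceaccount.com"

Using an OIDC service account here lets the Cloud Run service stay private (no unauthenticated invocations) while still being callable by the scheduler.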