google-cloud-storage, google-cloud-dataprep

ETL with Dataprep - Union Dataset


I'm a newcomer to GCP; I'm learning every day and loving the platform. I'm using GCP's Dataprep to join several CSV files (with the same column structure), clean some of the data, and write the result to BigQuery.

I created a Cloud Storage bucket to hold all 60 CSV files. In Dataprep, can I define a single dataset that is the union of all these files, or do I have to create a dataset for each file?

Thank you very much for your time and attention.


Solution

  • If you have all your files inside a directory in GCS, you can import that directory as a single dataset. The process is the same as importing a single file. You do have to make sure, though, that the column structure is exactly the same across all files in the directory.

    If you create a separate dataset for each file instead, you have more flexibility about their structure when you use the UNION page to concatenate them.

    However, if your use case is just to load all ~60 files into a single table in BigQuery without any transformation, I would suggest simply using a BigQuery load job. You can use a wildcard in the Cloud Storage URI to specify the files you want, as sketched below. BigQuery load jobs are currently free of charge, so this would be a very cost-effective solution compared to using Dataprep.
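
    For illustration, here is a minimal sketch of such a load job using the Python client library for BigQuery; the bucket path, dataset, and table names are hypothetical placeholders you would replace with your own:

    ```python
    # Minimal sketch: load every CSV matching a wildcard URI into one table.
    # Assumes the google-cloud-bigquery package and default credentials;
    # "my-bucket", "my_dataset", and "my_table" are placeholder names.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row of each file
        autodetect=True,      # infer the schema from the CSV contents
    )

    # The * wildcard matches all the CSV files in the directory.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/csv-files/*.csv",
        "my_dataset.my_table",
        job_config=job_config,
    )
    load_job.result()  # block until the load job finishes

    table = client.get_table("my_dataset.my_table")
    print(f"Loaded {table.num_rows} rows into my_dataset.my_table.")
    ```

    The same load can be done from the Cloud Console or the bq command-line tool; the key point is that a single wildcard URI pulls in all matching files at once, with no per-file setup.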