google-bigquerygoogle-cloud-storagefederated-queriesfederated-table

How to read every file in a Google Storage bucket from BigQuery and create one table for each file in the bucket?


I have a bucket that has 28 files and I would like to load them in different tables (one table per file in the bucket).

Each file in the bucket looks like this:

file_1_2020.csv
file_2_2020.csv
..
file_28_2020.csv

I would like that the table had the first 6 characters of the filename + _table(they are all csv with headers). For example, file_1_table.

I read in a blog that I could achieve this by creating a federated query that list all files in the bucket. Then I could loop through all names and create the tables I need. The thing is that all documentation I have read on GCP is about federating queries in Cloud SQL, so I don't really know how I can make BigQuery read my bucket.

How can I achieve this? It's not important if it is done with a federated query (but I would like to know, because it's always good to know and learn new things). Also, I would like to have the tables partitioned and clustered by datetime, which is the first column of each file (the schema of each file is datetime:TIMESTAMP,col1:FLOAT,col2:FLOAT)


Solution

  • You can use EXECUTE IMMEDIATE to achieve this:

    Step-1: Create an external table to get list of files in storage bucket

    CREATE EXTERNAL TABLE mydataset.mytable OPTIONS (format = 'CSV', uris = ['gs://bucket_name/*.csv']);
    

    Step-2: Use EXECUTE IMMEDIATE to create tables dynamically

    DECLARE FILE_LIST ARRAY<STRING>;
    DECLARE TABLE_NAME STRING;
    DECLARE I INT64 DEFAULT 1;
    DECLARE CNT INT64 DEFAULT 0;
    SET FILE_LIST = ARRAY(SELECT DISTINCT _FILE_NAME as FILENAME FROM mydataset.mytable);
    SET CNT = ARRAY_LENGTH(FILE_LIST);
    WHILE I <= CNT 
    DO
      SET TABLE_NAME = CONCAT(SUBSTR(REPLACE(SUBSTR(FILE_LIST[ORDINAL(i)], INSTR(FILE_LIST[ORDINAL(i)], '/', -1) + 1) ,'.csv', ''), 1, 6), '_table');
      EXECUTE IMMEDIATE "CREATE EXTERNAL TABLE mydataset." || TABLE_NAME || " OPTIONS (format = 'CSV', uris = ['" || FILE_LIST[ORDINAL(I)] || "'])";
      SET I = I + 1;
    END WHILE;