google-cloud-platform, google-bigquery, gzip, bq

GCP BigQuery Error in Load Operation: Bytes are Missing


I am very new to Google Cloud Platform and I'm trying to create a table in BigQuery from ~60,000 .csv.gz files stored in a GCP bucket.

To do this, I've opened Cloud Shell, and I'm trying the following:

$ bq --location=US mk my_data
$ bq --location=US \
     load --null_marker='' \
     --source_format=CSV --autodetect \
     my_data.my_table gs://my_bucket/*.csv.gz

This throws the following error:

BigQuery error in load operation: Error processing job 'my_job:bqjob_r3eede45779dc9a51_0000017529110a63_1': 
Error while reading data, error message:
FAILED_PRECONDITION: Invalid gzip file: bytes are missing

I don't know how to find which file is causing the problem. I've spot-checked a few, and they are all valid .gz archives whose contents open fine in any CSV reader after decompression, but I don't know how to check all ~60,000 files to find the broken one(s).

Thank you in advance for any help with this!


Solution

  • To find the culprit, you can loop through every object in your bucket and test whether each archive actually decompresses to any data:

    #!/bin/bash
    # List every object in the bucket.
    for f in $(gsutil ls gs://YOUR_BUCKET)
    do
      # Decompress the object and count the bytes of output;
      # an empty or unreadable archive produces zero bytes.
      if [[ $(gsutil cat "$f" | zcat | wc -c) == "0" ]]
      then
        # Process it: print the name, or delete it from the bucket.
        echo "Empty archive: $f"
        gsutil rm "$f"
      fi
    done
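
    Note that this test flags archives that decompress to zero bytes, a common cause of the "bytes are missing" error; for files that are merely truncated, zcat will also print an error to stderr as the script runs, so its output can point to those as well.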
    

    Another option is to download all your files locally, if possible, and process from there:

    gsutil -m cp -R gs://YOUR_BUCKET .
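
    Once the files are local, you can test every archive's integrity directly. This is a minimal sketch, assuming the objects keep their .csv.gz names after download; gzip -t checks each file without extracting it and exits non-zero for truncated or corrupt archives:

    #!/bin/bash
    # Test every downloaded archive; gzip -t verifies integrity
    # without extracting and fails for truncated/corrupt files.
    find . -name '*.csv.gz' | while read -r f
    do
      if ! gzip -t "$f" 2>/dev/null
      then
        echo "Corrupt archive: $f"
      fi
    done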