sqlgoogle-cloud-platformgoogle-bigquery

Why does loading parquet files into Bigquery display gibberish values in the table?


When I load parquet files into Bigquery table, values stored are weird. It seems to be the encoding of BYTES fields or whatever else.

Screenshot showing values such as 'MDE=' in fields

Here's the format of the create fields

Screenshot showing all but one field to be of type BYTES

So when I read the table with casted fields, I get the readable values.

Screenshot of SELECT query using cast() to get readable values

I found the solution here

My question is why is BigQuery behaving like this?


Solution

  • According to this GCP documentation, there are some parquet data types that can be converted into multiple BigQuery data types. A workaround is to add the data type that you want to parse to BigQuery.

    For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:

    optional int32 date_col (DATE);

    And another way is to add the schema to the bq load command:

    bq load --source_format=PARQUET --noreplace --noautodetect --parquet_enum_as_string=true --decimal_target_types=STRING [project]:[dataset].[tables] gs://[bucket]/[file].parquet  Column_name:Data_type