google-bigquery

BigQuery streaming inserts using Dataflow with null fields


We are using BigQuery streaming inserts from Dataflow via the predefined Dataflow job template.

I ran into some peculiarities when using this with nullable and repeated fields.

For instance, with the schema

name   STRING, NULLABLE

Attempting to insert the row {name: null}

fails with the error:

generic::invalid_argument: This field is not a record.","location":"name","message":"This field is not a record."

This is not a big deal, since it's easy enough to simply drop null fields before inserting, and similarly for empty arrays.
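A minimal sketch of that workaround, assuming the google-cloud-bigquery Python client rather than the Dataflow template itself (the table name and helper function are illustrative):

    from google.cloud import bigquery

    def drop_empty_fields(row):
        # Omit null values and empty repeated fields entirely instead of
        # sending them in the streaming insert payload.
        return {k: v for k, v in row.items() if v is not None and v != []}

    client = bigquery.Client()
    # {"name": None} becomes {} -- the name column is simply NULL for this row.
    rows = [drop_empty_fields({"name": None})]
    errors = client.insert_rows_json("my-project.my_dataset.my_table", rows)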

However, now if our schema is:

name   STRING, REPEATED

and we want to insert ["a", "b", null, "c"], we get a similar error referencing the third element (the null).


Solution

  • To provide a row with a null value for a NULLABLE field, simply omit the field from the row you are inserting. For your second example, a REPEATED field (an ARRAY in SQL terms) cannot contain a null element. To model an array of NULLABLE STRING, you can instead use a REPEATED RECORD containing a single STRING field named value, or equivalently an ARRAY<STRUCT<value STRING>> in SQL terms, as shown in the sketch below.
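As an illustration only, here is a sketch of that modeling with the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders, and the value subfield name is just a convention:

    from google.cloud import bigquery

    client = bigquery.Client()

    # name modeled as a REPEATED RECORD with a single nullable STRING subfield,
    # i.e. ARRAY<STRUCT<value STRING>> in SQL terms.
    schema = [
        bigquery.SchemaField(
            "name",
            "RECORD",
            mode="REPEATED",
            fields=[bigquery.SchemaField("value", "STRING", mode="NULLABLE")],
        ),
    ]
    table = client.create_table(
        bigquery.Table("my-project.my_dataset.my_table", schema=schema)
    )

    # The array ["a", "b", null, "c"] is written as records, with the null
    # element represented by a record whose value field is omitted.
    row = {"name": [{"value": "a"}, {"value": "b"}, {}, {"value": "c"}]}
    errors = client.insert_rows_json(table, [row])
    assert errors == []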