I have a DAG airflow task where it goes into a specific bucket and creates an external table from all the files inside together. I am getting the error (CSV table references column position 19, but line starting at position:325 contains only 19 columns.) when trying to query the external table in bigquery. I've looked at the specific file it is mentioning and the last field in the first row (and others) is left blank when it should be 0 if it is nothing, or a number if there is something. How can I go about working around this?
Example records from file:
I've scrolled over so you can see the last columns (column 19 being Total Pieces) all the other records have a number in this field or even a 0. The first record has nothing, as well as some other records. I can imagine this might happen again in other files too, so i can't manually just fix them.
create_external_table = BigQueryUpsertTableOperator(
task_id=f"create_external_{TABLE}_table",
dataset_id=DATASET,
project_id=INGESTION_PROJECT_ID,
table_resource={
"tableReference": {"tableId": f"{TABLE}_external"},
"externalDataConfiguration": {
"sourceFormat": "CSV",
"allow_quoted_newlines": True,
"autodetect": True,
"sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
},
"labels": labeler.get_labels_bigquery_table_v2(
target_project=INGESTION_PROJECT_ID,
target_dataset=DATASET,
target_table=f"{TABLE}_external",
),
},
)
the option for the task "allow_jagged_rows" allows for records with missing data in the last column to ingestion fine when creating an external table. Or any table for that matter, my new task looks like the following:
create_external_table = BigQueryUpsertTableOperator(
task_id=f"create_external_{TABLE}_table",
dataset_id=DATASET,
project_id=INGESTION_PROJECT_ID,
table_resource={
"tableReference": {"tableId": f"{TABLE}_external"},
"externalDataConfiguration": {
"sourceFormat": "CSV",
"allow_quoted_newlines": True,
"allow_jagged_rows":True, ##Added option
"autodetect": True,
"sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
},
"labels": labeler.get_labels_bigquery_table_v2(
target_project=INGESTION_PROJECT_ID,
target_dataset=DATASET,
target_table=f"{TABLE}_external",
),