csv google-bigquery google-cloud-storage directed-acyclic-graphs external-tables

Getting "Error: Missing close quote character (")" when querying an external table created with BigQueryUpsertTableOperator from a GCS bucket


I have a task that reads all files from a specific bucket path. The files all follow the same schema and format; the only difference is that each one holds a single day's data, so the file names look like blahblah-(date).csv. I am using a BigQueryUpsertTableOperator task with a wildcard to bring all of these files into the same external table. The problem is that when I query this table in BigQuery, I receive this error:

Error while reading table: irm-eap-edp-ingestion-nonprod.route4me.routes_external, error message: Error detected while parsing row starting at position: 44648. Error: Missing close quote character ("). File: gs://ingestion-bucket/data/blahblah-2023-06-28.csv

The error points to the file for this date. I downloaded that file and opened it in Google Sheets to look for unclosed quotes, and found that someone had entered a value containing an apostrophe: BLR 28th June'23 (Original Route: XN). So it looks like this ' quote is opened but never closed, and I believe this is what causes the problem.

How can I edit these files automatically to encase specific columns in full quotes? Or how else can I keep quotes from causing this problem?
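One way to encase fields in quotes automatically is to re-emit each file through Python's csv module with QUOTE_ALL. This is only a minimal sketch: the function name requote_csv is hypothetical, it quotes every column rather than specific ones, and it assumes the file already parses with Python's csv reader (a file with genuinely unbalanced double quotes would still need manual repair).

```python
import csv
import io


def requote_csv(text: str) -> str:
    """Re-emit CSV text with every field wrapped in double quotes.

    Hypothetical helper: assumes `text` parses with Python's default
    CSV dialect (comma delimiter, double-quote as the quote character).
    """
    # newline="" leaves line endings untouched so the csv module handles them.
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# An apostrophe is ordinary data for the default dialect, so it survives as-is.
sample = "id,note\n1,BLR 28th June'23 (Original Route: XN)\n"
requoted = requote_csv(sample)
```

With every field quoted, apostrophes and commas inside a value are unambiguous data, although embedded line breaks still require the reader to allow quoted newlines.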

I have now specified "allowQuotedNewlines": True in my task, since I was already getting this error without it. However, I am still running into the problem.

This is my current DAG task that creates the external table from the GCS bucket.

    # Create (or update) the external table over every CSV under the bucket path
    create_external_table = BigQueryUpsertTableOperator(
        task_id=f"create_external_{TABLE}_table",
        dataset_id=DATASET,
        project_id=INGESTION_PROJECT_ID,
        table_resource={
            "tableReference": {"tableId": f"{TABLE}_external"},
            "externalDataConfiguration": {
                "sourceFormat": "CSV",
                "allowQuotedNewlines": True,
                "autodetect": True,
                "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
            },
            "labels": labeler.get_labels_bigquery_table_v2(
                target_project=INGESTION_PROJECT_ID,
                target_dataset=DATASET,
                target_table=f"{TABLE}_external",
            ),
        },
    )

EDIT: I downloaded the file itself and tried to create a table from it manually by uploading it. The first attempt failed. Then I selected "Allow quoted newlines" in the advanced settings, and the table was created fine and I am able to query it. Am I using the wrong flag in my task ("allowQuotedNewlines": True), or is it not taking effect at all?
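The console result suggests the file contains quoted fields that span more than one line. A rough way to confirm that locally is sketched below; the function name records_with_embedded_newlines is hypothetical, and the sample text is made up for illustration rather than taken from the real file.

```python
import csv
import io


def records_with_embedded_newlines(text: str) -> list[int]:
    """Return 1-based record numbers whose fields contain a line break.

    Hypothetical helper: assumes the default CSV dialect
    (comma delimiter, double-quote as the quote character).
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    hits = []
    for record_number, row in enumerate(reader, start=1):
        if any("\n" in field or "\r" in field for field in row):
            hits.append(record_number)
    return hits


# Made-up sample: the second record has a quoted field spanning two lines.
sample = 'id,note\n1,"first line\nsecond line"\n2,plain\n'
found = records_with_embedded_newlines(sample)
```

A non-empty result means the file needs the quoted-newlines setting to be in effect when the table is read.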


Solution

  • I found out I was using the wrong option name: instead of "allowQuotedNewlines" it should be "allow_quoted_newlines". The process is now working as intended.
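For reference, the table resource with the renamed key might look like the sketch below. The helper build_external_table_resource and its sample arguments are hypothetical and only build the dictionary; the labels entry from the question is left out because it depends on a project-specific helper.

```python
def build_external_table_resource(table: str, dataset: str, bucket: str) -> dict:
    """Build the table_resource dict with the snake_case key from the fix.

    Hypothetical helper: mirrors the question's configuration, minus labels.
    """
    return {
        "tableReference": {"tableId": f"{table}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            # Key name as reported in the solution above.
            "allow_quoted_newlines": True,
            "autodetect": True,
            "sourceUris": [f"gs://{bucket}/{dataset}_data/*.csv"],
        },
    }


# Placeholder values for illustration only.
resource = build_external_table_resource("routes", "route4me", "ingestion-bucket")
```

The resulting dictionary would be passed as table_resource= to BigQueryUpsertTableOperator. Note that the BigQuery REST reference documents this setting as allowQuotedNewlines nested under csvOptions inside externalDataConfiguration, so that placement may be worth trying if the snake_case key does not take effect in your environment; the original post did not test it.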