Is there a library/connector available to import Delta Lake files stored in Google Cloud Storage (GCS) directly to BigQuery?
I have managed to write to BigQuery tables using a Spark DataFrame as an intermediary, but I can't find any direct connector or BigQuery library that does this without transitioning through Spark DataFrames.
I tried the official spark-bigquery-connector, but its documentation is thin on how to point to a specific BigQuery project, so I couldn't get further than loading the Delta Lake files from GCS into a DataFrame.
Using Javier's comment, I managed to write to BigQuery, but this solution isn't optimal: however much I tune the Spark job, it will never be as direct as a Google BigQuery library that handles the ingestion under the hood.
Not finding any direct solution, I ended up using spark-bigquery-connector to ingest Delta files as follows:
import io.delta.tables.DeltaTable

// Assumes an active SparkSession `spark` with Delta Lake configured;
// the path below is a placeholder for the Delta table's GCS location
val myDeltaTable = DeltaTable.forPath(spark, "PATH_TO_MY_DELTA_TABLE_IN_GCS")

myDeltaTable.toDF
  .write
  .format("bigquery")
  .mode("overwrite")
  .option("parentProject", "MY_PARENT_PROJECT")
  .option("project", "MY_PROJECT")
  .option("temporaryGcsBucket", "A_STAGING_BUCKET_IN_GCS") // Mandatory: staging area for the indirect write
  .option("partitionField", "MY_PARTITION_FIELD")
  .save("DATASET.TABLE")
There is no way to ingest a Delta Lake table in GCS into BigQuery without going through some intermediary.
You could set up a GCE VM that reads the Delta Lake table from GCS using the Delta Standalone connector and then writes to BigQuery (either through the streaming API or by writing out a supported format like Parquet and importing it). However, this is essentially doing by hand the same thing Spark would be doing.
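
For the Parquet-and-import variant, here is a minimal sketch (not a definitive implementation) assuming the Delta Standalone library (io.delta:delta-standalone), the BigQuery Java client, and the GCS Hadoop connector are on the classpath; the bucket, project, dataset and table names are placeholders. The idea is that Delta data files are already Parquet, so the Delta log only needs to be read to find which files belong to the current snapshot, and a BigQuery load job can then ingest them straight from GCS:

import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import io.delta.standalone.DeltaLog
import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, JobInfo, LoadJobConfiguration, TableId}

object DeltaToBigQueryLoad {
  def main(args: Array[String]): Unit = {
    // Placeholder GCS path of the Delta table (requires the GCS Hadoop connector)
    val tablePath = "gs://MY_BUCKET/path/to/delta-table"

    // Read the Delta transaction log and list only the Parquet files of the current snapshot
    val deltaLog = DeltaLog.forTable(new Configuration(), tablePath)
    val parquetUris = deltaLog.snapshot().getAllFiles.asScala.map { addFile =>
      val p = addFile.getPath
      // Paths in the log are usually relative to the table root
      if (p.startsWith("gs://")) p else s"$tablePath/$p"
    }

    // Issue a BigQuery load job over those Parquet files (placeholder project/dataset/table)
    val bigquery = BigQueryOptions.newBuilder().setProjectId("MY_PROJECT").build().getService
    val loadConfig = LoadJobConfiguration
      .newBuilder(TableId.of("DATASET", "TABLE"), parquetUris.asJava)
      .setFormatOptions(FormatOptions.parquet())
      .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE) // overwrite, mirroring mode("overwrite")
      .build()

    val job = bigquery.create(JobInfo.of(loadConfig)).waitFor()
    if (job.getStatus.getError != null) sys.error(job.getStatus.getError.toString)
  }
}

Nothing is actually downloaded to the VM here: the load job reads the Parquet files directly from GCS, and the Delta Standalone part only resolves which files make up the current table version (without it, deleted or compacted files would be loaded as well).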