python · apache-spark · google-bigquery · dbt · dataproc

Set Spark configuration when running python in dbt for BigQuery


Making some progress on a proof of concept for a Python dbt model in GCP (BigQuery). I've built a Dataproc cluster for Spark and can execute the model, but I'm getting an error in the model that requires a Spark configuration change. Specifically, I need to set the following:

"spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED" "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED" "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED" "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED"

I'm uncertain where/how to set these Spark configuration options. Do they belong in the profiles.yml file, can I set them programmatically in the Python model itself, or do they go somewhere else?

I tried setting the following in the profiles.yml file:

    analytics_profile:
      outputs:
        dev:
          server_side_parameters:
            "spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED"
            "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED"
            "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED"
            "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED"
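As an aside, since the model runs on a Dataproc cluster, these settings can also be baked in at cluster creation time via cluster properties (the `spark:` prefix routes a property into `spark-defaults.conf`). A hedged sketch, assuming the `gcloud` CLI and a placeholder cluster name and region:

```shell
# Hypothetical cluster name/region; the --properties flag accepts
# comma-separated prefix:key=value pairs, where the "spark:" prefix
# writes the setting into the cluster's spark-defaults.conf.
gcloud dataproc clusters create my-dbt-cluster \
  --region=us-central1 \
  --properties='spark:spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED,spark:spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED,spark:spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED,spark:spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED'
```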


Solution

  • It should have been obvious, but I didn't connect the dots that the Spark config is part of the session object that gets passed into the model function. Setting the options on that object fixed the error I was getting:

    session.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
    session.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
    session.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    session.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
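Putting it together, the calls above live inside the model function itself, since dbt passes the Spark session in as the second argument. A minimal sketch of such a model file; `upstream_model` is a placeholder name, and the exact `dbt.ref` usage is an assumption about the surrounding project:

```python
# Hypothetical dbt Python model file (e.g. models/my_model.py).
# dbt calls model(dbt, session); on Dataproc, `session` is the
# SparkSession, so Spark settings can be applied directly on it.

# All four parquet datetime-rebase settings from the error message.
REBASE_CONFS = {
    "spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
}


def model(dbt, session):
    # Apply the Spark config before any parquet read/write happens.
    for key, value in REBASE_CONFS.items():
        session.conf.set(key, value)

    # "upstream_model" is a placeholder for a real upstream dbt model.
    df = dbt.ref("upstream_model")
    return df
```

Setting the options at the top of the model function means they take effect before the first parquet read or write, which is what matters for the rebase-mode errors.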