python, apache-spark, pyspark, azure-databricks

How to add Spark config to DatabricksSession


I used to work with a custom Spark object, defined as follows:

from pyspark.sql import SparkSession

spark_builder = SparkSession.builder.appName(settings.project_name)
# Merge the defaults with the caller-provided overrides (overrides win)
config = {**self.DEFAULT_CONFIG, **spark_config}
for key, value in config.items():
    spark_builder.config(key, value)
self._spark = spark_builder.getOrCreate()

This is part of a bigger object. Here, self.DEFAULT_CONFIG and spark_config are both Python dicts containing Spark configs, e.g. {"spark.driver.extraJavaOptions": "-Xss32M"}.
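For concreteness, a minimal sketch of what those two dicts might contain (the names come from the question; the timezone entry is just an illustrative placeholder):

DEFAULT_CONFIG = {
    "spark.sql.session.timeZone": "UTC",  # illustrative default
}
spark_config = {
    "spark.driver.extraJavaOptions": "-Xss32M",  # override from the question
}
# The merge lets spark_config override DEFAULT_CONFIG key by key:
merged = {**DEFAULT_CONFIG, **spark_config}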

I'm trying to switch to using DatabricksSession instead.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/install

They do something like this in the doc:

from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
    host       = f"https://{retrieve_workspace_instance_name()}",
    token      = retrieve_token(),
    cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

This works fine for me. But I want to reproduce the "config" mechanism that I had previously.

I tried to add my custom config to the Config object, which does not raise any error (the Config signature accepts kwargs), but when I check my config using:

spark.conf.get("spark.driver.extraJavaOptions")

I do not see my "custom" config.

I also tried to use a config method, but the DatabricksSession builder does not have one:

AttributeError: 'Builder' object has no attribute 'config'

Any idea on how I could do that?


Solution

  • I think you misunderstand what DatabricksSession is. It's NOT a SparkSession. It's a façade over the remote/real SparkSession.

Creating a DatabricksSession and creating a SparkSession are separate steps.

    DatabricksSession.builder.getOrCreate() creates a DatabricksSession façade object. That's why it only needs the info (Config() object in your code) to be able to find the cluster where the actual SparkSession is running (or will run once it launches the cluster).

So the creation of the SparkSession happens on the cluster, and it uses the Spark config as configured in the cluster config. You can change it:

      • in the cluster configuration UI (Advanced options > Spark config), or
      • programmatically via the Clusters API, as sketched below.

    Both of these would require a cluster restart.
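
    For the programmatic route, a minimal sketch using the Databricks Python SDK (the same databricks-sdk package the question imports Config from); the cluster id here is a placeholder, and clusters.edit() replaces the whole cluster spec, so real code should carry over every field it wants to keep:

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()  # resolves host/token from env or .databrickscfg
    cluster = w.clusters.get(cluster_id="0123-456789-abcdef")  # placeholder id

    # Merge the extra Spark config into whatever the cluster already has
    new_conf = dict(cluster.spark_conf or {})
    new_conf["spark.driver.extraJavaOptions"] = "-Xss32M"

    # edit() replaces the spec, so copy over the fields you want to keep
    w.clusters.edit(
        cluster_id=cluster.cluster_id,
        spark_version=cluster.spark_version,
        node_type_id=cluster.node_type_id,
        num_workers=cluster.num_workers,
        spark_conf=new_conf,  # applied once the cluster restarts
    )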


    If you're looking to change config at runtime (without restarting the SparkSession, recreating the SparkContext, or restarting the cluster), then spark.conf.set() is the way to go.

    I'm not sure, but IIRC there are config items that you can change at runtime and some that you can not. E.g. many SQL configs (spark.sql.*) are runtime-mutable, while JVM options such as spark.driver.extraJavaOptions are fixed when the driver JVM starts, so they can only be set through the cluster config.
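
    A minimal sketch of the runtime path, assuming an already-working Databricks Connect setup (host, token and cluster id resolved from the environment, or passed via .sdkConfig() as in the question); spark.sql.shuffle.partitions stands in for any runtime-mutable config:

    from databricks.connect import DatabricksSession

    spark = DatabricksSession.builder.getOrCreate()

    # Runtime-mutable configs (mostly spark.sql.*) can be set on the live session:
    spark.conf.set("spark.sql.shuffle.partitions", "64")
    print(spark.conf.get("spark.sql.shuffle.partitions"))  # -> 64

    # JVM options like spark.driver.extraJavaOptions cannot be changed this
    # way: the driver JVM on the cluster is already running, so such settings
    # only take effect through the cluster config (with a restart).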