I used to work with a custom Spark object, defined as follows:
```python
from pyspark.sql import SparkSession

spark_builder = SparkSession.builder.appName(settings.project_name)
config = {**self.DEFAULT_CONFIG, **spark_config}
for key, value in config.items():
    spark_builder = spark_builder.config(key, value)
self._spark = spark_builder.getOrCreate()
```
This is part of a bigger object. Here, `self.DEFAULT_CONFIG` and `spark_config` are both Python dicts containing Spark configs, e.g. `{"spark.driver.extraJavaOptions": "-Xss32M"}`.
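For clarity, the unpacking order means `spark_config` takes precedence over `self.DEFAULT_CONFIG` on duplicate keys; a quick illustration with made-up values:

```python
DEFAULT_CONFIG = {"spark.driver.extraJavaOptions": "-Xss16M",
                  "spark.sql.shuffle.partitions": "200"}
spark_config = {"spark.driver.extraJavaOptions": "-Xss32M"}

# Later keys win, so the caller's value overrides the default.
config = {**DEFAULT_CONFIG, **spark_config}
# {'spark.driver.extraJavaOptions': '-Xss32M', 'spark.sql.shuffle.partitions': '200'}
```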
I'm trying to switch to using `DatabricksSession` instead:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/install
In the doc they do something like this:
```python
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
    host=f"https://{retrieve_workspace_instance_name()}",
    token=retrieve_token(),
    cluster_id=retrieve_cluster_id(),
)
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
```
This works fine for me. But I want to reproduce the "config" mechanism I had previously. I tried adding my custom config to the `Config` object, which does not raise any error (the `Config` signature accepts kwargs), but when I check the config with `spark.conf.get("spark.driver.extraJavaOptions")` I do not see my custom value.
I also tried to use a `config` method, but the `DatabricksSession` builder does not have one:
`AttributeError: 'Builder' object has no attribute 'config'`
Any idea how I could do that?
I think you misunderstand what `DatabricksSession` is. It's NOT a `SparkSession`. It's a façade over the remote/real `SparkSession`.
Creation of a `DatabricksSession` and of a `SparkSession` are separate things. `DatabricksSession.builder.getOrCreate()` creates the `DatabricksSession` façade object. That's why it only needs the information (the `Config()` object in your code) required to find the cluster where the actual `SparkSession` is running (or will run once the cluster launches).
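To illustrate the point, the façade can be built from just the connection parameters; a minimal sketch, assuming the `builder.remote()` API from the Databricks Connect docs (the `retrieve_*` helpers are the same placeholders as in your question):

```python
from databricks.connect import DatabricksSession

# Nothing Spark-conf-like is involved here: the builder only takes the
# coordinates of the remote cluster that hosts the real SparkSession.
spark = DatabricksSession.builder.remote(
    host=f"https://{retrieve_workspace_instance_name()}",
    token=retrieve_token(),
    cluster_id=retrieve_cluster_id(),
).getOrCreate()
```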
So the creation of the `SparkSession` happens on the cluster, and it uses the Spark config as configured in the cluster configuration. You can change that config in the cluster UI (your cluster → Edit → Advanced options → Spark config) or programmatically via the Clusters API, as sketched below. Both of these require a cluster restart.
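For the API route, a minimal sketch assuming the `databricks-sdk` package and a hypothetical cluster id; `clusters.edit` replaces the whole cluster spec, so the existing fields are read back first:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads host/token from env vars or ~/.databrickscfg

# Read the current cluster spec, merge in the extra Spark conf, resubmit.
info = w.clusters.get(cluster_id="0123-456789-abcdefgh")  # hypothetical id
spark_conf = dict(info.spark_conf or {})
spark_conf["spark.driver.extraJavaOptions"] = "-Xss32M"

w.clusters.edit(
    cluster_id=info.cluster_id,
    cluster_name=info.cluster_name,
    spark_version=info.spark_version,
    node_type_id=info.node_type_id,
    num_workers=info.num_workers,
    spark_conf=spark_conf,
)  # editing a running cluster restarts it
```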
If you're looking to change config at runtime (without restarting the `SparkSession`, recreating the `SparkContext`, or restarting the cluster): I'm not sure, but IIRC there are config items that you can change and some that you cannot. E.g. `spark.driver.extraJavaOptions` or `spark.master` require recreating the context, but `my.some.app.property` may be changeable at runtime using `spark.conf.set('my.some.app.property', 'new-value')` (where `spark` is the one created via `spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()`).
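Put together, a minimal sketch of that runtime path, reusing the `config` object from your question (the property name is just an illustrative placeholder):

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()  # config as above

# App-level properties can typically be set and read back at runtime:
spark.conf.set("my.some.app.property", "new-value")
print(spark.conf.get("my.some.app.property"))  # -> new-value

# Static/JVM-startup options such as spark.driver.extraJavaOptions cannot be
# modified this way; Spark refuses to change them on a running session.
```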