I need to create a Hudi table from a PySpark DataFrame in an Azure Databricks notebook and save it to Azure Data Lake Storage Gen2. Here's my approach:
spark.sparkContext.setSystemProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.sparkContext.setSystemProperty("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
tableName = "my_hudi_table"
basePath = <<path_to_save_data>>
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}
This creates a PySpark DataFrame, but when I try to save it as a Hudi table I get the following error:
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
org.apache.hudi.exception.HoodieException: hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer
However, if I run the following command:
print(spark.sparkContext.getConf().get("spark.serializer"))
org.apache.spark.serializer.KryoSerializer
Note that I haven't done:
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
because that gives the error:
AnalysisException: Cannot modify the value of a Spark config: spark.serializer
The cluster I'm working on uses Databricks Runtime version 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12), and I have installed the jar file "hudi_spark3_1_2_bundle_2_12_0_10_1.jar". I don't think it is an issue with the Data Lake, since I can save the data in other formats (Parquet, CSV). Am I missing something?
Any help would be appreciated. Thanks in advance.
"Cannot modify the value of a Spark config"
In a Databricks notebook, these configs are not updatable from code, and trying to change them throws this error. You will have to set the Hudi-related Spark configs at the cluster level, before the Spark session is created.
More generally, the SparkSession is a singleton: you cannot modify its config at runtime, and you would have to stop and restart it for new values to be taken into account.
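As a minimal sketch (assuming the Hudi bundle jar is already attached to the cluster as a library), the two settings from your code could go into the cluster's Spark config box under Advanced options, one "key value" pair per line:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension

After the cluster restarts, the session is created with these values and the Hudi write should no longer fail on the serializer check.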
By the way, spark.sparkContext.setSystemProperty is not meant for setting those configs: it sets a Java system property on the driver, not the Spark configuration the session was created with.
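For comparison, outside a Databricks notebook (for example a standalone PySpark job where you control session creation yourself), a minimal sketch of supplying these configs at build time could look like the following; the app name is just a placeholder, and the Hudi bundle would still need to be on the classpath (e.g. via --jars or --packages):

from pyspark.sql import SparkSession

# These configs must be supplied before the session exists; they cannot be changed afterwards.
spark = (
    SparkSession.builder
    .appName("hudi-example")  # placeholder name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)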
Reference: https://kb.databricks.com/en_US/scala/cannot-modify-spark-serializer