I'm running a simple ETL PySpark job on Dataproc 2.2 with the job property `spark.jars.packages` set to `io.delta:delta-core_2.12:2.4.0`. All other settings are left at their defaults. I have the following config:
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    # Enable the Delta Lake SQL extensions
    .set(
        "spark.sql.extensions",
        "io.delta.sql.DeltaSparkSessionExtension",
    )
    # Use the Delta catalog as the session catalog
    .set(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    # Disable the vectorized Parquet reader
    .set(
        "spark.sql.parquet.enableVectorizedReader",
        "false",
    )
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```
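For reference, the property is passed at submission time; the submit command looks roughly like this (the bucket path, cluster name, and region are placeholders for my actual values):

```
gcloud dataproc jobs submit pyspark gs://my-bucket/historical.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties=spark.jars.packages=io.delta:delta-core_2.12:2.4.0
```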
Getting the following error:
```
Traceback (most recent call last):
  File "/tmp/job-b0fc313a/historical.py", line 71, in <module>
    etl(args.source_uri, args.target_uri)
  File "/tmp/job-b0fc313a/historical.py", line 53, in etl
    hist_df.write.format("delta").mode("overwrite").save(target_uri)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1463, in save
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o89.save.
: java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.types.StructType.toAttributes()'
```
I tried other `io.delta:delta-core_2.x:x.x.0` versions, to no avail. I've read that this error usually stems from a Scala version mismatch, but Dataproc 2.2 runs Scala 2.12, which matches the `_2.12` artifact.
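To rule out the Scala theory, you can check what the cluster actually runs from PySpark (a quick diagnostic sketch; the exact version strings vary by image):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Dataproc 2.2 images ship Spark 3.5.x
print("Spark:", spark.version)
# Scala version of the JVM-side Spark build, read through the py4j gateway
print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionString())
```

This reports Scala 2.12.x, so the Scala side does match the `_2.12` artifact.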
The actual incompatibility is with Spark, not Scala: Dataproc 2.2 ships Spark 3.5, while `delta-core` 2.4.0 is built against Spark 3.4, and Spark 3.5 removed `StructType.toAttributes()` (it moved into an internal utility class), which is exactly the method the `NoSuchMethodError` complains about. Changing the property `spark.jars.packages` from `io.delta:delta-core_2.12:2.4.0` to `io.delta:delta-spark_2.12:3.2.0` fixed it. Note that starting with Delta Lake 3.0 the Maven artifact is named `delta-spark` rather than `delta-core`, and the Delta 3.2 line is built for Spark 3.5.
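For completeness, here is the working setup after the change (a minimal sketch; the DataFrame and target URI are placeholders standing in for the real ETL):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Job submitted with the property:
#   spark.jars.packages=io.delta:delta-spark_2.12:3.2.0
conf = (
    SparkConf()
    .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .set(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .set("spark.sql.parquet.enableVectorizedReader", "false")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Placeholder data standing in for hist_df
hist_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
hist_df.write.format("delta").mode("overwrite").save("gs://my-bucket/delta/table")
```

With the Spark 3.5-compatible artifact on the classpath, the Delta write completes without the `NoSuchMethodError`.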