apache-spark | pyspark | google-cloud-dataproc | apache-hudi

Apache Hudi on Dataproc


Is there any guide to deploying Apache Hudi on a Dataproc cluster? I'm trying to deploy it via the Hudi Quick Start Guide, but I can't get it to work.

Spark 3.1.1

Python 3.8.13

Debian 5.10.127 x86_64

Launch command:

pyspark --jars gs://bucket/artifacts/hudi-spark3.1.x_2.12-0.11.1.jar,gs://bucket/artifacts/spark-avro_2.12-3.1.3.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'

Then I try:

dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'JavaPackage' object is not callable
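
As far as I understand, this error means py4j could not resolve the class, so the Hudi classes never made it into the driver JVM. A minimal sanity check I added myself (not part of the quick start) to confirm whether the bundle jar is actually on the classpath:

# Hypothetical sanity check: try to load a Hudi class through py4j.
# If this fails with ClassNotFoundException, the bundle jar never reached
# the driver classpath, which explains the 'JavaPackage' error above.
try:
    sc._jvm.java.lang.Class.forName("org.apache.hudi.QuickstartUtils")
    print("Hudi classes are visible to the driver JVM")
except Exception as err:
    print("Hudi classes not found:", err)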

Edit 1:

pyspark --jars gs://bucket/artifacts/hudi-spark3.1.x_2.12-0.11.1.jar,gs://bucket/artifacts/spark-avro_2.12-3.1.3.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

This throws a configuration error:

WARN org.apache.spark.sql.SparkSession: Cannot use org.apache.spark.sql.hudi.HoodieSparkSessionExtension to configure session extensions. java.lang.ClassNotFoundException: org.apache.spark.sql.hudi.HoodieSparkSessionExtension.

and I also get the same error when trying sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator().
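
One thing worth checking at this point (my own addition, not from the guide) is whether the jars passed with --jars were actually registered with the session:

# Hypothetical check: list the jars Spark registered for this session.
# The Hudi and spark-avro jars passed via --jars should both show up here.
print(sc.getConf().get("spark.jars", ""))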

Edit 2:

I was using the wrong .jar; this edit fixes the first problem.

Correct pyspark call:

pyspark --jars gs://dev-dama-stg-spark/artifacts/hudi-spark3.1-bundle_2.12-0.12.1.jar,gs://dev-dama-stg-spark/artifacts/spark-avro_2.12-3.1.3.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

However, I hit new errors after creating the table and the Hudi options:

22/12/01 22:26:04 WARN org.apache.hudi.common.config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/12/01 22:26:04 WARN org.apache.hudi.common.config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
22/12/01 22:26:05 WARN org.apache.hudi.metadata.HoodieBackedTableMetadata: Metadata table was not found at path file:/tmp/hudi_trips_cow/.hoodie/metadata
22/12/01 22:26:07 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2) (... 2): java.io.FileNotFoundException: File file:/tmp/hudi_trips_cow does not exist
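
For context, the part of the quick start I am running looks roughly like this (adapted from the Hudi 0.12 quick start guide; table name, options and sample data generator follow the guide):

# Adapted from the Hudi quick start: generate sample trips and write them
# as a copy-on-write table at basePath.
tableName = "hudi_trips_cow"
basePath = "file:///tmp/hudi_trips_cow"
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
df = spark.read.json(sc.parallelize(inserts, 2))

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)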

Any clues...?


Solution

  • Found the solution myself.

    First, to launch pyspark correctly, include the Hudi Spark bundle and spark-avro as jars. In my case I also want to include some JDBC jars to connect to my on-premise services:

    pyspark --jars gs://bucket/artifacts/hudi-spark3.1-bundle_2.12-0.12.1.jar,gs://bucket/artifacts/spark-avro_2.12-3.1.3.jar,gs://bucket/artifacts/mssql-jdbc-11.2.1.jre8.jar,gs://bucket/artifacts/ngdbc-2.12.9.jar \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
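
    For completeness, the same session settings can also go in a standalone script. A minimal sketch (my own, assuming the jars are still supplied at submit time, for example through the Dataproc job's jar settings):

    # Minimal sketch of the equivalent session setup in a script. The Hudi
    # bundle and spark-avro jars still have to be supplied at submit time
    # (for example via the Dataproc job's jar settings).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hudi-on-dataproc")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.extensions",
                "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .getOrCreate()
    )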
    

    Then follow the Hudi quick start guide; the only thing to change is from this:

    basePath = "file:///tmp/hudi_trips_cow"
    

    to this:

    basePath = "gs://bucket/tmp/hudi_trips_cow"
    

    With this configuration I was able to run Hudi correctly on Dataproc.
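
    To double-check, you can read the table back from GCS, again roughly following the quick start (column names come from its sample data generator):

    # Read the Hudi table back from GCS to confirm the write worked
    # (adapted from the quick start; basePath as defined above).
    tripsDF = spark.read.format("hudi").load(basePath)
    tripsDF.createOrReplaceTempView("hudi_trips_snapshot")
    spark.sql("SELECT uuid, fare, ts FROM hudi_trips_snapshot").show()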

    If I find new information I will post it here to keep this as a short guide.