Tags: java, apache-spark, oozie, orc, hivecontext

Spark job that uses HiveContext failing in Oozie


In one of our pipelines we are doing aggregation using Spark (Java), and the pipeline is orchestrated using Oozie. The pipeline writes the aggregated data to an ORC file using the following lines:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

// sc, aggregateddatainrdd, schema and output are defined earlier in the pipeline.
HiveContext hc = new HiveContext(sc);
DataFrame modifiedFrame = hc.createDataFrame(aggregateddatainrdd, schema);

modifiedFrame.write().format("org.apache.spark.sql.hive.orc").partitionBy("partition_column_name").save(output);

When the Spark action in the Oozie job is triggered, it throws the following exception:

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, org.apache.hadoop.hive.shims.HadoopShims.isSecurityEnabled()Z
java.lang.NoSuchMethodError: org.apache.hadoop.hive.shims.HadoopShims.isSecurityEnabled()Z

However, the same job succeeds after rerunning the workflow multiple times.

All the necessary JARs are in place, both at compile time and at run time.

This is my first Spark app, and I am not able to understand the issue.

Could someone help me understand the issue better and suggest a possible solution?


Solution

  • "the same is getting succeeded after rerunning the workflow multiple times"

    Sounds like you have compiled/bundled your Spark job against a Hadoop client in a version different from the one running on the cluster; as a result, there are conflicting JARs on the CLASSPATH, and your job fails randomly depending on which JAR is picked up first.

    To be sure, choose one Oozie job that succeeded and one that failed, get the "external ID" of the action (which is labeled job_*******_**** but refers to the YARN ID application_******_****), and inspect the YARN logs for both jobs. You should see a difference in the actual order of JARs in the Java CLASSPATH.
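
    If the log output is hard to read, another way to check is to probe from inside the JVM itself: print the classpath and ask which JAR a suspect class was actually loaded from. A minimal sketch follows; the class name comes from the stack trace above, and ClasspathProbe is just a hypothetical wrapper, in practice you would drop the two println calls into your driver code:

        public class ClasspathProbe {
            public static void main(String[] args) throws ClassNotFoundException {
                // Print the JVM classpath in the order its entries are searched.
                System.out.println(System.getProperty("java.class.path"));

                // Print which JAR actually supplied the class named in the stack trace.
                // (getCodeSource() can return null for JDK bootstrap classes, not the case here.)
                Class<?> shims = Class.forName("org.apache.hadoop.hive.shims.HadoopShims");
                System.out.println(shims.getProtectionDomain().getCodeSource().getLocation());
            }
        }

    Run from a failed and a successful container, the second line should point at two different JARs.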

    If that's indeed the case, then try a combination of the following (wired into the workflow as sketched below):

      • oozie.launcher.mapreduce.user.classpath.first = true in the Oozie action configuration (for the launcher, and thus the driver)
      • spark.yarn.user.classpath.first = true in the Spark configuration (for the executors)

    You can guess what user.classpath.first implies...!
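
    For illustration, here is roughly where those two switches would sit in the Spark action of your workflow.xml. This is only a sketch: the action name, class, jar and schema version below are placeholders, not taken from your actual workflow.

        <action name="spark-aggregation">
            <spark xmlns="uri:oozie:spark-action:0.1">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- make the Oozie launcher (and thus the driver) prefer your JARs -->
                    <property>
                        <name>oozie.launcher.mapreduce.user.classpath.first</name>
                        <value>true</value>
                    </property>
                </configuration>
                <master>yarn-cluster</master>
                <name>aggregation</name>
                <class>com.example.AggregationJob</class>
                <jar>${appJar}</jar>
                <!-- make the Spark executors prefer your JARs -->
                <spark-opts>--conf spark.yarn.user.classpath.first=true</spark-opts>
            </spark>
            <ok to="end"/>
            <error to="fail"/>
        </action>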


    But it might not work if the conflicting JARs are actually not in the Hadoop client but in the Oozie ShareLib. From YARN's point of view, Oozie is the "client": you cannot set a precedence between what Oozie ships from its ShareLib and what it ships from your Spark job.

    In that case you would have to use the proper dependencies in your Java project, and match the Hadoop version you will be running against -- that's just common sense, don't you think?!?
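
    As a sketch of what that means in a Maven build (the artifact names and the Scala 2.10 suffix are assumptions for a Spark 1.x job; the version properties must be aligned with what the cluster actually runs), declare the Spark and Hadoop dependencies with provided scope so your application JAR does not bundle its own conflicting copies:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>${spark.version}</version>  <!-- match the cluster's Spark -->
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version> <!-- match the cluster's Hadoop -->
            <scope>provided</scope>
        </dependency>

    With provided scope, those classes are compiled against but not packaged, so at run time the job uses whatever the cluster (and the Oozie ShareLib) supplies, which removes one source of classpath races.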