hadoop-streaming, google-cloud-dataproc

Hadoop Streaming jar not found when submitting Google Dataproc Hadoop Job?


When I try to submit a Hadoop MapReduce job programmatically (from a Java application using the google-cloud-dataproc client library), the job fails immediately. Submitting the exact same job through the UI works fine.

I've tried SSHing into the Dataproc cluster to confirm the file exists, checking its permissions, and changing the jar reference. Nothing has worked so far.

The error I'm getting:

Exception in thread "main" java.lang.ClassNotFoundException: file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:18)
Job output is complete

When I clone the failed job in the console and look at the REST equivalent, this is what I see:

POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "reference": {
      "projectId": "project-id",
      "jobId": "jobDoesNotWork"
    },
    "placement": {
      "clusterName": "cluster-name",
      "clusterUuid": "uuid"
    },
    "submittedBy": "service-account@project.iam.gserviceaccount.com",
    "jobUuid": "uuid",
    "hadoopJob": {
      "args": [
        "-Dmapred.reduce.tasks=20",
        "-Dmapred.output.compress=true",
        "-Dmapred.compress.map.output=true",
        "-Dstream.map.output.field.separator=,",
        "-Dmapred.textoutputformat.separator=,",
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-Dmapreduce.input.fileinputformat.split.minsize=268435456",
        "-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
        "-mapper",
        "/bin/cat",
        "-reducer",
        "/bin/cat",
        "-inputformat",
        "org.apache.hadoop.mapred.lib.CombineTextInputFormat",
        "-outputformat",
        "org.apache.hadoop.mapred.TextOutputFormat",
        "-input",
        "gs://input/path/",
        "-output",
        "gs://output/path/"
      ],
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
    }
  }
}

When I submit the job through the console, it works. The REST equivalent of that job:

POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "reference": {
      "projectId": "project-id,
      "jobId": "jobDoesWork"
    },
    "placement": {
      "clusterName": "cluster-name,
      "clusterUuid": ""
    },
    "submittedBy": "user_email_account@email.com",
    "jobUuid": "uuid",
    "hadoopJob": {
      "args": [
        "-Dmapred.reduce.tasks=20",
        "-Dmapred.output.compress=true",
        "-Dmapred.compress.map.output=true",
        "-Dstream.map.output.field.separator=,",
        "-Dmapred.textoutputformat.separator=,",
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-Dmapreduce.input.fileinputformat.split.minsize=268435456",
        "-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
        "-mapper",
        "/bin/cat",
        "-reducer",
        "/bin/cat",
        "-inputformat",
        "org.apache.hadoop.mapred.lib.CombineTextInputFormat",
        "-outputformat",
        "org.apache.hadoop.mapred.TextOutputFormat",
        "-input",
        "gs://input/path/",
        "-output",
        "gs://output/path/"
      ],
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
    }
  }
}

I SSH'ed into the cluster and confirmed that the file is, in fact, present. The only difference I can see between the two requests is the "submittedBy" field. One works, one doesn't. I'm guessing this is a permissions issue, but I can't tell where the permissions are being pulled from in each scenario. In both cases, the Dataproc cluster was created with the same service account.

Looking at permissions for that jar on the cluster I see:

-rw-r--r-- 1 root root  133856 Nov 27 20:17 hadoop-streaming-2.8.4.jar
lrwxrwxrwx 1 root root      26 Nov 27 20:17 hadoop-streaming.jar -> hadoop-streaming-2.8.4.jar

I tried changing the mainJarFileUri from the explicitly versioned jar to the symlink (since the link has open permissions), but I didn't really expect that to work. And it didn't.

Does anyone with more Dataproc experience have any idea what's going on here, and how I can resolve it?


Solution

  • One common mistake that's easy to make in code is to call setMainClass when you intended to call setMainJarFileUri, or vice versa. The java.lang.ClassNotFoundException you received indicates that Dataproc tried to submit that jarfile string as a class name rather than a jarfile, so somewhere along the way Dataproc thought you set main_class. Double-check your code to see whether this is the bug you encountered; a sketch of the correct call appears below.

    The reason using "clone job" in the GUI hides this problem is that the GUI tries to be more user-friendly by offering a single text box for setting either main_class or main_jar_file_uri, and it infers which one you meant by looking at the file extension (a toy illustration of this inference follows the sketch below). So if you submit a job with a jarfile URI in the main_class field and it fails, then click clone and submit the new job, the GUI will recognize that the new job specified a jarfile name, and will correctly set the main_jar_file_uri field in the JSON request instead of main_class.
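
    A minimal sketch of the correct client-library call, assuming the google-cloud-dataproc Java client (v1); the project ID, region, cluster name, and paths are the placeholders from the question:

    import com.google.cloud.dataproc.v1.HadoopJob;
    import com.google.cloud.dataproc.v1.Job;
    import com.google.cloud.dataproc.v1.JobControllerClient;
    import com.google.cloud.dataproc.v1.JobControllerSettings;
    import com.google.cloud.dataproc.v1.JobPlacement;
    import java.util.Arrays;

    public class SubmitStreamingJob {
        public static void main(String[] args) throws Exception {
            String region = "us-east1";

            // The job controller is regional, so point the client at the
            // regional endpoint.
            JobControllerSettings settings = JobControllerSettings.newBuilder()
                .setEndpoint(region + "-dataproc.googleapis.com:443")
                .build();

            HadoopJob hadoopJob = HadoopJob.newBuilder()
                // Correct: a jarfile URI belongs in main_jar_file_uri.
                .setMainJarFileUri("file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar")
                // Buggy variant that reproduces the ClassNotFoundException above,
                // because the URI string is then submitted as a class name:
                // .setMainClass("file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar")
                .addAllArgs(Arrays.asList(
                    "-Dmapred.reduce.tasks=20",  // ...plus the other -D flags from the question
                    "-mapper", "/bin/cat",
                    "-reducer", "/bin/cat",
                    "-input", "gs://input/path/",
                    "-output", "gs://output/path/"))
                .build();

            Job job = Job.newBuilder()
                .setPlacement(JobPlacement.newBuilder().setClusterName("cluster-name"))
                .setHadoopJob(hadoopJob)
                .build();

            try (JobControllerClient client = JobControllerClient.create(settings)) {
                client.submitJob("project-id", region, job);
            }
        }
    }

    Note that main_class and main_jar_file_uri live in a single oneof in the HadoopJob proto, so setting one silently clears the other; a mixed-up call therefore only surfaces at runtime, as the ClassNotFoundException above.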
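
    For illustration only, the extension-based inference described above might look something like this; this is a hypothetical sketch of the idea, not the actual console source:

    import com.google.cloud.dataproc.v1.HadoopJob;

    public class DriverFieldInference {
        // Hypothetical: map the console's single "Main class or jar" text box
        // onto the HadoopJob driver oneof by inspecting the file extension.
        static HadoopJob.Builder assignDriver(HadoopJob.Builder job, String mainClassOrJar) {
            if (mainClassOrJar.endsWith(".jar")) {
                return job.setMainJarFileUri(mainClassOrJar); // looks like a jarfile URI
            }
            return job.setMainClass(mainClassOrJar);          // otherwise treat as a class name
        }
    }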