Tags: amazon-web-services, hadoop, hdfs, elastic-map-reduce

Clear data from HDFS on AWS EMR in Hadoop 1.0.3


For various reasons I'm running some jobs on EMR with AMI 2.4.11/Hadoop 1.0.3. I'm trying to run a cleanup of HDFS after my jobs by adding an additional EMR step. Using boto:

    from boto.emr.step import JarStep

    # emr_conn is an existing boto.emr.connection.EmrConnection and
    # cluster_id is the id of the running job flow.
    step = JarStep(
        'HDFS cleanup',
        'command-runner.jar',
        action_on_failure='CONTINUE',
        step_args=['hadoop', 'dfs', '-rmr', '-skipTrash', 'hdfs:/tmp'])
    emr_conn.add_jobflow_steps(cluster_id, [step])

However, it regularly fails with nothing in stderr in the EMR console. What confuses me is that if I SSH into the master node and run the same command:

    hadoop dfs -rmr -skipTrash hdfs:/tmp

it succeeds with exit code 0 and a message that everything was deleted. All the normal Hadoop commands seem to work as documented. Does anyone know if there's an obvious reason for this? An issue with the Amazon distribution? Undocumented behavior in certain commands?
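
The step's terminal state can at least be read back through boto. A minimal sketch, assuming the same emr_conn and cluster_id as above and boto2's describe_jobflow call with its emrobject attribute names:

    # Print each step's name and terminal state (e.g. COMPLETED,
    # FAILED, CANCELLED) for the job flow.
    jobflow = emr_conn.describe_jobflow(cluster_id)
    for s in jobflow.steps:
        print('%s: %s' % (s.name, s.state))

This at least distinguishes a step that actually FAILED from one that was CANCELLED because an earlier step failed.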

Note: I have other jobs running on Hadoop 2, and there the documented command:

    hdfs dfs -rm -r -skipTrash hdfs:/tmp

works as one would expect both as a step and as a command.


Solution

  • My solution generally was to upgrade everything to Hadoop 2, in which case this works:

            JarStep(
                '%s: HDFS cleanup' % self.job_name,
                'command-runner.jar',
                action_on_failure='CONTINUE',
                step_args=['hdfs', 'dfs', '-rm', '-r', '-skipTrash', path]
            )
    

    With Hadoop 1, this was the best I could get, and it worked pretty well:

            JarStep(
                '%s: HDFS cleanup' % self.job_name,
                'command-runner.jar',
                action_on_failure='CONTINUE',
                step_args=['hadoop', 'fs', '-rmr', '-skipTrash',
                           'hdfs:/tmp/mrjob']
            )
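
    Note the switch from hadoop dfs to hadoop fs: fs is the generic
    FileSystem shell, and in Hadoop 2 the dfs form is deprecated in
    favor of hdfs dfs. Putting the two variants together, a sketch of
    picking the right step arguments by Hadoop version (the
    make_cleanup_step name and its arguments are mine, not part of the
    original code):

            from boto.emr.step import JarStep

            def make_cleanup_step(job_name, path, hadoop2=True):
                # Illustrative helper combining the two variants above.
                if hadoop2:
                    args = ['hdfs', 'dfs', '-rm', '-r', '-skipTrash', path]
                else:
                    args = ['hadoop', 'fs', '-rmr', '-skipTrash', path]
                return JarStep(
                    '%s: HDFS cleanup' % job_name,
                    'command-runner.jar',
                    action_on_failure='CONTINUE',
                    step_args=args)

    Either way the step is submitted as in the question, e.g.
    emr_conn.add_jobflow_steps(cluster_id, [make_cleanup_step(name, path)]).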