Tags: hadoop, hadoop-streaming, hortonworks-data-platform

Hadoop streaming with Python on Windows


I'm using Hortonworks HDP for Windows and have it successfully configured with a master and 2 slaves.

I'm using the following command:

bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -files file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py -mapper "python mapper.py" -reducer "python reduce.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27
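For context, both scripts just follow the normal streaming contract: read lines on stdin, write tab-separated key/value pairs on stdout. The minimal word-count-style sketch below shows the shape of the scripts; it isn't the actual log-parsing code.

mapper.py:

#!/usr/bin/env python
import sys

# Emit one key<TAB>value pair per token read from stdin.
for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write('%s\t1\n' % word)

reducer.py:

#!/usr/bin/env python
import sys

# Streaming delivers input sorted by key, so counts can be
# accumulated until the key changes.
current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current_key:
        if current_key is not None:
            sys.stdout.write('%s\t%d\n' % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    sys.stdout.write('%s\t%d\n' % (current_key, count))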

The mapper runs through fine, but the log reports that the reduce.py file wasn't found. From the exception, it looks like the Hadoop TaskRunner is creating the reducer's symlink pointing at the mapper.py file.

When I check the job configuration file, I noticed that mapred.cache.files is set to:

hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/mapper.py#mapper.py

It looks like, although the reducer.py file is being shipped with the job, it isn't being recorded in the configuration correctly and can't be found when the reducer tries to run.
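For comparison, when both scripts are distributed correctly, mapred.cache.files should hold one comma-separated entry per file. The reducer entry below is an assumption, extrapolated from the mapper's entry:

hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/mapper.py#mapper.py,hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/reducer.py#reducer.py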

I think my command is correct; I've tried using -file parameters instead, but then neither file is found.
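For reference, the -file form takes one flag per local script instead of a comma-separated -files list, so the equivalent invocation would look something like this (note the -reducer command has to reference the same file name that gets shipped, i.e. reducer.py):

bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -file d:\dev\python\mapper.py -file d:\dev\python\reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27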

Can anyone see or know of an obvious reason?

Please note, this is on Windows.

EDIT: I've just run it locally and it worked, so it looks like my problem may be with the copying of the files around the cluster.

I'd still welcome input!


Solution

  • Well, that's embarrassing... my first question and I answer it myself.

    I found the problem by renaming the Hadoop conf file to force default settings, which meant the local job tracker was used (a config sketch forcing this explicitly follows below).

    The job ran properly, and that gave me the room to work out what the problem was: it looks like communication around the cluster isn't as complete as it needs to be.
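    For anyone who wants to force the local runner deliberately rather than by renaming the conf file, a minimal mapred-site.xml sketch (Hadoop 1.x property names) would be:

    <?xml version="1.0"?>
    <configuration>
      <!-- "local" selects the in-process LocalJobRunner instead of the cluster JobTracker -->
      <property>
        <name>mapred.job.tracker</name>
        <value>local</value>
      </property>
    </configuration>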