kuberneteshadoopmrjob

Run Python mrjob in a Kubernetes on Hadoop Cluster


I'm exploring this python package mrjob to run MapReduce jobs in python. I've tried running it in the local environment and it works perfectly.

I have Hadoop 3.3 runs on Kubernetes (GKE) cluster. So I also managed to run mrjob successfully in the name-node pod from inside.

Now, I've got a Jupyter Notebook pod running in the same Kubernetes cluster (same namespace). I wonder whether I can run mrjob MapReduce jobs from the Jupyter Notebook.

The problem seems to be that I don't have $HADOOP_HOME defined in the Jupyter Notebook environment. So based on the documentation I created a config file called mrjob.conf as follows;

runners:
 hadoop:
  cmdenv:
    PATH: <pod name>:/opt/hadoop

However mrjob is still unable to detect hadoop bin and gives the below error

FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'

So is there a way in which I can configure mrjob to run with my existing Hadoop installation on the GKE cluster? I've tried searching for similar examples but was unable to find one.


Solution

  • mrjob is a wrapper around hadoop-streaming, therefore requires Hadoop binaries to be installed on the server(s) where the code will run (pods here, I guess); including the Juptyer pod that submits the application.

    IMO, it would be much easier for you to deploy PySpark/PyFlink/Beam applications in k8s than hadoop-streaming since you don't "need" Hadoop in k8s to run such distributed processes.

    Beam would be recommended since it is compatible with GCP DataFlow