hadoop, hadoop-yarn, resourcemanager, hdp

What is the main constraint on running larger YARN jobs and how do I increase it?


What is the main constraint on running larger YARN jobs (Hadoop version HDP-3.1.0.0 (3.1.0.0-78)) and how do I increase it? Basically, I want to run more sqoop jobs (all of which are pretty large) concurrently.

I am currently assuming that I need to increase the Resource Manager heap size, since that is what I see going up on the Ambari dashboard when I run YARN jobs. How do I add more resources to the RM heap, and why does the RM heap appear to be such a small fraction of the total RAM available to YARN across the cluster?

Looking in Ambari: YARN cluster memory is 55 GB, but RM heap is only 900 MB. Could anyone with more experience tell me what the difference is, which of the two is the limiting factor in running more YARN applications, and (again) how to increase it? Is there anything else I should be looking at? Are there any docs explaining this in more detail?


Solution

  • A convenient way to tune your YARN and MapReduce memory settings is to use the yarn-utils.py script.

    Download the Companion Files:

    wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
    
    tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
    

    Executing the YARN Utility Script:

    You can execute the yarn-utils.py Python script by providing the number of available cores, the available memory (in GB), the number of disks, and whether or not HBase is installed.

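    As a hedged example, assuming a worker node with 16 cores, 64 GB of RAM, 4 data disks, and HBase installed (these values are placeholders, not measurements from your cluster), the invocation looks like this:

    # -c cores, -m memory in GB, -d disks, -k whether HBase is installed
    python yarn-utils.py -c 16 -m 64 -d 4 -k True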

    If you have a heterogeneous Hadoop cluster, then you have to create configuration groups based on the nodes' specifications. If you need more info on that, let me know and I will update my answer accordingly.
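    For reference, the script's output is a list of recommended values for the standard YARN and MapReduce memory properties, which on an Ambari-managed cluster you would apply through the YARN and MapReduce2 config screens rather than by editing the XML files directly. A minimal sketch of the kind of properties involved (the numbers are illustrative placeholders, not output for your hardware):

    # Placeholder values for illustration only -- use the numbers
    # yarn-utils.py prints for your own nodes.
    yarn.nodemanager.resource.memory-mb=49152
    yarn.scheduler.minimum-allocation-mb=6144
    yarn.scheduler.maximum-allocation-mb=49152
    mapreduce.map.memory.mb=6144
    mapreduce.reduce.memory.mb=6144
    mapreduce.map.java.opts=-Xmx4915m
    mapreduce.reduce.java.opts=-Xmx4915m
    yarn.app.mapreduce.am.resource.mb=6144
    yarn.app.mapreduce.am.command-opts=-Xmx4915m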