apache-spark, hadoop, hdfs, hadoop-yarn, hdp

HDP + Ambari + YARN node labels and HDFS


We have a Hadoop cluster (HDP 2.6.4 with Ambari and 5 datanode machines).

We are running a Spark Streaming application (Spark 2.1 on Hortonworks 2.6.x).

Currently the Spark Streaming application runs on all datanode machines.

As some may know, with YARN node labels we can restrict the Spark Streaming application to run only on the first 2 datanode machines.

So if, for example, we configure YARN node labels on the first 2 datanode machines, the Spark application will not run on the other 3 datanode machines, because those nodes do not carry the label.
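For illustration, a minimal sketch of how the first 2 datanode machines could be labeled and the Spark Streaming job pinned to them (the label name `spark_nodes`, the queue `spark_queue`, the hostnames `node1`/`node2` and the application file are hypothetical; node labels must first be enabled in yarn-site.xml, and the queue must be granted access to the label in the Capacity Scheduler configuration):

```
# Create the label (assumes node labels are enabled and a label store is configured)
yarn rmadmin -addToClusterNodeLabels "spark_nodes(exclusive=true)"

# Attach the label to the first 2 datanode machines (hostnames are hypothetical)
yarn rmadmin -replaceLabelsOnNode "node1=spark_nodes node2=spark_nodes"

# Submit the Spark Streaming job so that both the ApplicationMaster and the
# executors only get containers on NodeManagers carrying that label
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue spark_queue \
  --conf spark.yarn.am.nodeLabelExpression=spark_nodes \
  --conf spark.yarn.executor.nodeLabelExpression=spark_nodes \
  my_streaming_app.py
```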

My question is: is it also possible, via YARN node labels, to disable HDFS on the last 3 datanode machines (in order to avoid any HDFS replicas on those 3 datanodes)?

reference - http://crazyadmins.com/configure-node-labels-on-yarn/


Solution

  • You can decommission a DataNode. If you do this, then by definition it is no longer part of HDFS, meaning you're basically halting HDFS services and removing them from the cluster, which is not the same as limiting which jobs get run on them (e.g. via YARN Node Labels). A rough sketch of the decommissioning steps is included after this answer.

    Node Labels control which NodeManagers run code; they are not directly related to DataNodes.

    You could have NodeManagers running outside of DataNodes, but that defeats the purpose of HDFS's "moving compute to the data" feature, and thus causes jobs to run slower.
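If the goal is to stop HDFS from placing any replicas on the last 3 datanode machines, decommissioning them is the mechanism for that, not node labels. A minimal sketch, assuming the standard dfs.hosts.exclude mechanism and hypothetical hostnames node3, node4, node5 (in Ambari this is normally done through the host's "Decommission DataNode" action rather than by hand, and the exclude file path depends on dfs.hosts.exclude in your hdfs-site.xml):

```
# On the NameNode host: add the last 3 datanodes to the exclude file
# referenced by dfs.hosts.exclude in hdfs-site.xml (path is an assumption)
echo "node3" >> /etc/hadoop/conf/dfs.exclude
echo "node4" >> /etc/hadoop/conf/dfs.exclude
echo "node5" >> /etc/hadoop/conf/dfs.exclude

# Tell the NameNode to re-read the include/exclude lists; it will start
# re-replicating blocks from the decommissioning nodes onto the remaining ones
hdfs dfsadmin -refreshNodes

# Watch the decommissioning progress
hdfs dfsadmin -report
```

Once decommissioning finishes, no replicas remain on those 3 machines. Note, however, that with only 2 DataNodes left in a 5-node cluster, the default replication factor of 3 can no longer be fully satisfied.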