[SOLVED] HDFS replication factor and minimum number of data nodes in the HDFS cluster

We have a Hadoop cluster with only 2 data node machines.

In the HDFS configuration we set Block replication to 3:

Block replication=3

Is it OK to set Block replication=3 when we have only two data nodes in the cluster?

From my understanding, when we set Block replication=3 with 2 data node machines in the HDFS cluster, it means that one machine should hold 2 replicas and the other machine 1 replica. Am I correct here?

Solution

The whole purpose of the replication factor is fault tolerance. For example, with a replication factor of 3, if we lose one DataNode, two more copies of the data are still available in the cluster. In your case, though, the two nodes will not split the copies 2 and 1: HDFS does not place more than one replica of the same block on the same DataNode, so with 2 DataNodes each block gets one replica on node-a and one on node-b. The third replica cannot be placed, and the blocks are reported as under-replicated until a third DataNode joins the cluster. If we lose node-a or node-b, the data is still available on the other node. So Block replication=3 is not harmful here, but it gives no more fault tolerance than a replication factor of 2, which is already the most a 2-node cluster can satisfy.
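To make the arithmetic concrete, here is a minimal Python sketch. It is not HDFS code; it only models the rule that a DataNode holds at most one replica of a given block, so the number of replicas that can actually exist is capped by the number of live DataNodes. The function name replica_status and the dictionary keys are made up for illustration.

```python
def replica_status(replication_factor: int, live_datanodes: int) -> dict:
    """Model how many replicas of a single block can exist at once.

    Assumption (simplified model, not real HDFS code): a DataNode stores
    at most one replica of a given block, so the actual replica count is
    capped by the number of live DataNodes.
    """
    actual = min(replication_factor, live_datanodes)
    return {
        "target": replication_factor,            # configured Block replication
        "actual": actual,                        # replicas that can be placed
        "missing": replication_factor - actual,  # under-replicated by this many
        # The block stays readable as long as one replica survives.
        "survivable_node_failures": max(actual - 1, 0),
    }


if __name__ == "__main__":
    # Compare the 2-node cluster from the question with larger clusters.
    for nodes in (2, 3, 5):
        print(nodes, "datanodes:", replica_status(3, nodes))
```

In this model, Block replication=3 on 2 DataNodes and Block replication=2 on 2 DataNodes both end up with 2 actual replicas and tolerate the loss of 1 node; the only difference is that the first case leaves 1 replica missing.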

Again, this explanation is specific to your case. The concept makes more sense when you picture a cluster with more than 2 nodes.

A detailed explanation is in the Hadoop docs: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication