hdfs distcp

distcp causing skewness in HDFS


I have a folder (around 2 TB in size) in HDFS, which was created using the save method from Apache Spark. Its blocks are almost evenly distributed across the nodes (I checked this using hdfs fsck).

When I distcp this folder (intra-cluster) and run hdfs fsck on the destination folder, the result is highly skewed: a few nodes store a lot of blocks, whereas the others store very few. This skew in HDFS is causing performance issues.

We also tried moving the data from source to destination using mv (intra-cluster), and this time the destination was not skewed, that is, the data was evenly distributed.

Is there any way to reduce the skew in HDFS when using distcp?


Solution

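  • The number of mappers in the distcp job was equal to the number of heavily loaded nodes. This is consistent with HDFS's default block placement, which writes the first replica of each block to the DataNode on which the writing client (here, the mapper) runs.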

    So I used the -m option to increase the number of mappers in distcp to the number of machines in the cluster, and the output was much less skewed.

    An added benefit: the distcp job completed much faster than it did before.
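
For illustration, the fix above can be sketched as a small shell snippet that builds the distcp command with one map task per machine. The paths, the node count of 20, and the helper name build_distcp_cmd are placeholders I made up, not values from my cluster. Note that -m sets the maximum number of simultaneous copies, so the job may still use fewer mappers if there are only a few files to copy.

```shell
# Hypothetical values: replace with your own paths and cluster size.
SRC=/data/source
DST=/data/dest
NUM_NODES=20   # assumed number of machines (DataNodes) in the cluster

# Build the distcp command line; -m caps the number of map tasks.
build_distcp_cmd() {
  printf 'hadoop distcp -m %s %s %s' "$1" "$2" "$3"
}

CMD=$(build_distcp_cmd "$NUM_NODES" "$SRC" "$DST")
echo "$CMD"

# On a real cluster, run the printed command, then inspect the result with:
#   hdfs fsck "$DST" -files -blocks -locations
```

The snippet only prints the command so that it can be reviewed before it is run. Once the copy has finished, the fsck command in the comment lists the block locations of the destination, which is how the distribution across nodes can be compared with the source.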