I have a folder (around 2 TB in size) in HDFS, which was created using the `save` method from Apache Spark. It is almost evenly distributed across nodes (I checked this using `hdfs fsck`).
When I `distcp` this folder (intra-cluster) and run `hdfs fsck` on the destination folder, it turns out to be highly skewed: a few nodes hold a lot of blocks while the rest hold very few. This skew in HDFS is causing performance issues.
We tried moving the data using `mv` from source to destination (intra-cluster), and this time there was no skew in the destination: the data was evenly distributed.
Is there any way to reduce this skew in HDFS when using `distcp`?
The number of mappers in the `distcp` job was equal to the number of nodes that ended up heavily loaded.
So I increased the number of mappers in `distcp`, using the `-m` option, to the number of machines in the cluster, and the output was much less skewed.
An added benefit: the `distcp` job also completed much faster than it used to.