Tags: hadoop, copy, hdfs, distcp

Efficient copy method in Hadoop


Is there a faster or more efficient way of copying files across HDFS than distcp? I tried both the regular hadoop fs -cp and distcp, and both give roughly the same transfer rate, around 50 MB/s.

I have 5 TB of data split into files of 500 GB each, which I have to copy to a new location on HDFS. Any thoughts?

Edit: The original distcp run spawns only one mapper, so I added the -m 100 option to increase the number of mappers:

hadoop distcp -D mapred.job.name="Gigafiles distcp" -pb -i -m100 "/user/abc/file1" "/xyz/aaa/file1"

But it still spawns only one mapper, not 100. Am I missing something here?
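A likely explanation, assuming the legacy distcp behavior: distcp splits its work at file granularity, assigning at most one map task per file, so copying a single file path yields a single mapper no matter what -m is set to. Pointing distcp at the parent directory containing all ten 500 GB files would let the files be spread across mappers. A hedged sketch (the directory paths are illustrative, not the poster's actual layout):

```shell
# Copy the whole source directory so each file can get its own mapper.
# -pb preserves block size, -i ignores failures, -m caps the map count.
hadoop distcp -D mapred.job.name="Gigafiles distcp" \
  -pb -i -m 100 /user/abc /xyz/aaa
```

With ten input files, distcp can use at most ten mappers here regardless of -m 100; the option is an upper bound, not a guarantee.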


Solution

  • I was able to solve this with a Pig script that reads the data from path A, converts it to Parquet (the desired storage format anyway), and writes it to path B. The process took close to 20 minutes on average per 500 GB file. Thank you for the suggestions.
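The approach above can be sketched as a minimal Pig script. This is not the poster's actual script; the paths, the tab-delimited schema, and the parallelism setting are illustrative assumptions:

```pig
-- Assumed parallelism so the copy is spread across many tasks.
SET default_parallel 100;

-- Read from the source path; PigStorage with a tab delimiter is a
-- placeholder assumption about the input format.
raw = LOAD '/user/abc/file1' USING PigStorage('\t');

-- Write to the destination path in Parquet format, using the
-- ParquetStorer from the parquet-pig library.
STORE raw INTO '/xyz/aaa/file1' USING org.apache.parquet.pig.ParquetStorer();
```

Running this as a MapReduce job gives the same parallel read/write that distcp was expected to provide, while converting the data to the target storage format in the same pass.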