hadoop hdfs nfs distributed-system distcp

Copy Files from NFS or Local FS to HDFS


I am trying to copy a large number of files (100k+, total size 2 TB) from NFS to HDFS. What is an efficient way to do it?

I have tried the options below after mounting the NFS share on the edge node:

  1. hdfs dfs -put : it fails with a memory error and the transfer is also slow.
  2. distcp : it fails with the following exception:

    org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException:

However, the file exists.

I have also tried the same with a local file, without using the NFS-mounted location. I was aware that one caveat of distcp is that the destination has to be distributed. Does that apply to the source as well? Or is this a bug, and is there a workaround for it?

distcp command:

hadoop distcp file:/home/<user>/t1/f1.dat hdfs://<hdfs-ip>:8020/user/<user>/t1

Error:

Error: java.io.IOException: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException: File file:/home/<user>/t1/f1.dat does not exist
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:224)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:796)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException: File file:/home/<user>/t1/f1.dat does not exist
        ... 10 more

Solution

  • In order for distcp to work, the source file must be accessible from every worker node in the cluster, because distcp runs as a MapReduce job and its map tasks may execute on any node. You can achieve this either by mounting the shared NFS location at the same path on every node, or by physically copying the file to the local file system of every node (see the sketch at the end of this answer).

    Alternatively, hdfs dfs -put (or -copyFromLocal) could still work if you increase the heap size of the Hadoop client:

    $ export HADOOP_CLIENT_OPTS="-Xmx4096m $HADOOP_CLIENT_OPTS"
    

    But, as you said, the transfer will be slower than with distcp.
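
    For reference, here is a minimal sketch of the distcp route. The NFS server name, export path, and mount point (/mnt/nfs_data) are placeholders, and the sketch assumes the share is mounted at the same path on every worker node:

    # On every worker node (mount point and export path are examples only):
    $ sudo mount -t nfs <nfs-server>:/export/data /mnt/nfs_data

    # From the edge node, copy the whole directory into HDFS;
    # -m caps the number of parallel map tasks doing the copy:
    $ hadoop distcp -m 64 file:///mnt/nfs_data hdfs://<hdfs-ip>:8020/user/<user>/t1

    With the source visible at the same path on every node, each map task can read its assigned files directly from the mount, which avoids the FileNotFoundException above.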