I am trying to copy a large number of files (100k+, 2 TB in total) from NFS to HDFS. What is an efficient way to do it?
I have tried the options below after mounting the NFS share on an edge node.
distcp: fails with an error caused by:
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException:
However, the file exists.
I have tried the same on a local file, without using the NFS-mounted location. I am aware that one caveat of distcp is that the destination has to be distributed. Does that apply to the source as well? Or is it a bug, and is there a workaround?
distcp command:
hadoop distcp file:/home/<user>/t1/f1.dat hdfs://<hdfs-ip>:8020/user/<user>/t1
Error:
Error: java.io.IOException: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException: File file:/home/<user>/t1/f1.dat does not exist
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:224)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:796)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException: File file:/home/<user>/t1/f1.dat does not exist
... 10 more
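distcp runs as a MapReduce job, so the copy is done by map tasks on the worker nodes, and a file: path is resolved against the local file system of whichever node runs the task. For distcp to work, the local file must therefore be accessible from every worker node in the cluster, either through a mount point for the shared NFS location on every node or by physically copying it to the local file system of every node.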
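As a rough sketch of the first option, assuming the NFS share is mounted at the same path on every worker node: the function name distcp_from_nfs, the DRY_RUN switch and the /mnt/nfs/t1 mount point are made up for illustration, and only the hadoop distcp call itself comes from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a directory from an NFS mount into HDFS with distcp.
# Assumption: the source path is mounted at the SAME location on every worker
# node, because the map tasks open file:// paths on the node they run on.
distcp_from_nfs() {
  local src="$1"    # absolute path on the NFS mount, e.g. /mnt/nfs/t1 (hypothetical)
  local dest="$2"   # HDFS destination URI or path
  # With DRY_RUN set, the command is only printed instead of being executed.
  ${DRY_RUN:+echo} hadoop distcp "file://${src}" "${dest}"
}

# Example (placeholders as in the question):
#   distcp_from_nfs /mnt/nfs/t1 "hdfs://<hdfs-ip>:8020/user/<user>/t1"
```

Running it once with DRY_RUN=1 prints the command, which makes it easy to check the file:// URI before submitting the job.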
Alternatively, hdfs dfs -put (or -copyFromLocal) could still work if you increase the heap size of the Hadoop client:
$ export HADOOP_CLIENT_OPTS="-Xmx4096m $HADOOP_CLIENT_OPTS"
But as you said, the transfer will be slower than with distcp.
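If the NFS share can only be mounted on the edge node, one way to claw back some throughput might be to run several -put clients side by side. The sketch below is an assumption, not something tested on your cluster: parallel_put, DRY_RUN and the default of 4 jobs are hypothetical, the destination directory is assumed to exist already, and each top-level entry of the source directory is uploaded by its own client.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upload each top-level entry of a local directory to HDFS
# using several "hdfs dfs -put" clients in parallel.
# Assumption: the destination directory already exists in HDFS.
parallel_put() {
  local src="$1"         # local (NFS-mounted) source directory
  local dest="$2"        # existing HDFS destination directory
  local jobs="${3:-4}"   # number of concurrent clients (assumed default)
  # One put per top-level file or subdirectory; -f overwrites existing targets.
  # With DRY_RUN set, the commands are only printed instead of being executed.
  find "$src" -mindepth 1 -maxdepth 1 -print0 |
    xargs -0 -P "$jobs" -I{} ${DRY_RUN:+echo} hdfs dfs -put -f {} "$dest"
}

# Example (placeholders as in the question):
#   parallel_put /home/<user>/t1 /user/<user>/t1 8
```

Each client is a separate JVM, so the HADOOP_CLIENT_OPTS heap setting applies to every one of them; the edge node needs enough memory and network bandwidth for the number of jobs you pick.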