Tags: hadoop, google-cloud-platform, hdfs, gsutil, distcp

Hadoop distcp copy from on-prem to GCP shows strange behavior


When I use the distcp command as

hadoop distcp /a/b/c/d  gs://gcp-bucket/a/b/c/

where d is a folder on HDFS containing subfolders, the result depends on whether the target folder already exists.

If folder c already exists on GCP, it copies d (and its subfolders) from HDFS into c on GCP. If folder c does not exist on GCP, it creates c on GCP and copies the subfolders of d (but not d itself) into c.

So if e is a subfolder of d on HDFS and folder c exists on GCP, then the output of the following command:

hadoop distcp /a/b/c/d  gs://gcp-bucket/a/b/c/

will be

gs://gcp-bucket/a/b/c/d

If e is a subfolder of d on HDFS and folder c does not exist on GCP, then the output of the same command

hadoop distcp /a/b/c/d  gs://gcp-bucket/a/b/c/

will be

gs://gcp-bucket/a/b/c/e

Why is the output of the second command not the same as the output of the first? Both commands are identical.


Solution

  • There are no subdirectories on Cloud Storage. Instead there is a flat namespace where all the objects are hosted.

    The hierarchical view that one sees is provided by the gsutil tool, which makes naming work the way users would expect. So when one copies a file named your-file to the target gs://[BUCKET]/path/to/target/, the Cloud Storage service interprets this as an object named gs://[BUCKET]/path/to/target/your-file.

    In your case, when "folder c" doesn't exist and you copy into this "subdirectory", the first run of the command creates the following object:

    gs://gcp-bucket/a/b/c/e

    If "folder c" exists, then "folder d" and all its contents (including d itself) are copied under subdirectory c.

    Your observation:

    "If folder c is already there on GCP, then it copies d (and its subfolders) from HDFS into c on GCP, but if folder c is not there on GCP, then it creates folder c on GCP and copies the subfolders of d (but not d itself) into c."

    is correct, and this behavior is expected.

    You may find more details on the rules that are applied, and on how subdirectories work, in the Cloud Storage documentation.
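
The rule described above can be sketched as a toy model in Python. This is only an illustration of the behavior, not the actual DistCp or gsutil implementation: the bucket is modeled as a flat set of object names, a "folder" is treated as existing only when some object name starts with its prefix, and the function names (`prefix_exists`, `copy_dir`) and file names (`part-0000`, `existing.txt`) are hypothetical.

```python
import posixpath


def prefix_exists(bucket, prefix):
    """A 'folder' exists only if some object name starts with prefix + '/'."""
    prefix = prefix.rstrip("/") + "/"
    return any(name.startswith(prefix) for name in bucket)


def copy_dir(source_files, src_dir, bucket, dest):
    """Toy model of the recursive-copy rule (an assumption, not DistCp's code)."""
    src_dir = src_dir.rstrip("/")
    dest = dest.rstrip("/")
    if prefix_exists(bucket, dest):
        # Destination "folder" exists: the source directory is placed inside it.
        base = posixpath.join(dest, posixpath.basename(src_dir))
    else:
        # Destination does not exist: it becomes the copy of the source directory.
        base = dest
    copied = set()
    for path in source_files:
        rel = posixpath.relpath(path, src_dir)
        copied.add(posixpath.join(base, rel))
    # Return a new flat set of object names; the input bucket is not modified.
    return bucket | copied


# Hypothetical HDFS content: one file inside subfolder e of /a/b/c/d.
hdfs_files = ["/a/b/c/d/e/part-0000"]

# Case 1: nothing under a/b/c/ yet, so d's contents land directly in c.
print(sorted(copy_dir(hdfs_files, "/a/b/c/d", set(), "a/b/c/")))
# ['a/b/c/e/part-0000']

# Case 2: a/b/c/ already holds an object, so d itself is created inside c.
print(sorted(copy_dir(hdfs_files, "/a/b/c/d", {"a/b/c/existing.txt"}, "a/b/c/")))
# ['a/b/c/d/e/part-0000', 'a/b/c/existing.txt']
```

The same command therefore yields two different layouts purely because of what is already stored under the destination prefix, which matches the two outcomes reported in the question.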