amazon-s3 emr distcp s3distcp

S3DistCp order of concatenation of files


I am trying to use the S3DistCp tool on AWS EMR to merge multiple files (1.txt, 2.txt, 3.txt) into a single gzip file. I am using the --groupBy flag. So far, the output appears to be the source files concatenated in reverse order by name.

So the resulting file contains the contents of 3.txt, then 2.txt, and then 1.txt.

Is this by design? Is there a way to concatenate the files in the order in which they were created (by creation time)?


Solution

  • Yes, this appears to be by design, and it seems to have been the behavior since s3-dist-cp was launched. Every s3-dist-cp job builds a manifest file from the files found at the --src location.

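    To work around the issue, you can:

    1. Create a manifest file using --outputManifest.
    2. Edit this file so that its entries are in the order you want (for example, reversed).
    3. Provide the edited file to the copy operation with --copyFromManifest (which is used together with --previousManifest to name the manifest file).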
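Step 2 above can be sketched in Python. This is only a rough illustration under a few assumptions: the manifest has already been downloaded from S3 to a local path, it is a gzip-compressed text file with one entry per line, and reordering those lines is what changes the concatenation order. The function name `reorder_manifest` and the file names are hypothetical and not part of s3-dist-cp.

```python
import gzip


def reorder_manifest(src_path, dst_path, sort_key=None):
    """Rewrite a gzip-compressed, line-oriented manifest in a new order.

    Assumption: each non-empty line is one manifest entry.
    With sort_key=None the entries are reversed; otherwise they are
    sorted by sort_key(line).
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        # Drop blank lines and trailing newlines; keep entry text as-is.
        entries = [line.rstrip("\n") for line in src if line.strip()]

    if sort_key is None:
        entries.reverse()
    else:
        entries.sort(key=sort_key)

    with gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for entry in entries:
            dst.write(entry + "\n")
    return entries


if __name__ == "__main__":
    # Hypothetical local file names; upload the result back to S3 afterwards.
    reorder_manifest("manifest.gz", "manifest-reordered.gz")
```

To order by creation time instead of simply reversing, one option would be to pass a `sort_key` that looks up each entry's timestamp (for example, from an S3 listing gathered separately); that lookup is left out here because it depends on how the timestamps are obtained.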

    https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html