
aws ls to list only files inside folders with a specific pattern and extension


I have a bucket with several folders, each containing several files. I just want to list the .tsv files inside those folders whose names match the pattern "*report_filtered_count10".

I can do:

aws s3 ls s3://<my bucket> --recursive

but not:

aws s3 ls s3://<my bucket>/*/*report_filtered_count10.tsv --recursive

My workaround has been to use aws s3 sync and then ls locally, like this:

aws s3 sync s3://<my bucket>/ tsv_files --exclude '*' --include '*report_filtered_count10.tsv'
ls tsv_files/*/*.tsv

... and then parse the output.

I also tried the --exclude/--include syntax from aws s3 sync with aws s3 ls, but it does not work either:

aws s3 ls s3://<my bucket>/ --exclude '*' --include '*report_filtered_count10.tsv'

Unknown options: --exclude,*,--include,*report_filtered_count10.tsv

Any idea how to do this simple task? My desired output would be:

s3://<my bucket>/folder1/file1_report_filtered_count10.tsv
s3://<my bucket>/folder2/file2_report_filtered_count10.tsv
s3://<my bucket>/folder3/file3_report_filtered_count10.tsv
s3://<my bucket>/folder4/file4_report_filtered_count10.tsv
s3://<my bucket>/folder5/file5_report_filtered_count10.tsv
...

Solution

  • In my experience, the filtering capabilities of aws s3 ls are quite limited, and aws s3api provides more flexibility. For a task like this one, I would use the aws s3api list-objects-v2 command combined with grep and awk.

    You can use aws s3api list-objects-v2 to get detailed information about the objects, which allows for more complex filtering. That looks something like this:

    aws s3api list-objects-v2 --bucket <your-bucket> --query 'Contents[?ends_with(Key, `.tsv`)]' --output text
    

    This command lists all .tsv files, but it does not filter them by your specific pattern yet.

    Then you can select only the keys and pipe them to grep to keep the files that contain the pattern "report_filtered_count10". Note the brackets in .[Key]: with --output text they make the CLI print one key per line. With plain .Key all keys come out on a single tab-separated line, so grep could not filter them individually:

    aws s3api list-objects-v2 --bucket <your-bucket> --query 'Contents[?ends_with(Key, `.tsv`)].[Key]' --output text | grep 'report_filtered_count10'
    

    And then, to prepend the S3 bucket URL to each line of the output, you can use awk:

    aws s3api list-objects-v2 --bucket <your-bucket> --query 'Contents[?ends_with(Key, `.tsv`)].[Key]' --output text | grep 'report_filtered_count10' | awk '{print "s3://<your-bucket>/" $0}'
    

    So the full command will look like this:

    aws s3api list-objects-v2 --bucket <your-bucket> --query 'Contents[?ends_with(Key, `.tsv`)].[Key]' --output text | grep 'report_filtered_count10' | awk '{print "s3://<your-bucket>/" $0}'
    
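    Since aws s3 ls --recursive already works for you, the same kind of grep filter can also be applied to its output. A minimal sketch, assuming the usual four-column listing (date, time, size, key) and keys without whitespace; filter_ls_keys is a made-up helper name:

    ```shell
    # Hypothetical helper: reads `aws s3 ls --recursive` output on stdin and
    # prints only the keys that end in report_filtered_count10.tsv.
    # Assumption: keys contain no whitespace, because awk splits fields on it.
    filter_ls_keys() {
        awk '{print $4}' | grep 'report_filtered_count10\.tsv$'
    }

    # Usage (needs AWS credentials; <your-bucket> is a placeholder):
    #   aws s3 ls "s3://<your-bucket>" --recursive | filter_ls_keys
    ```

    This prints bare keys, so the awk step above is still needed if you want full s3:// URLs.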

    This command should produce the output you expect, in the format:

    s3://<your-bucket>/folder1/file1_report_filtered_count10.tsv
    s3://<your-bucket>/folder2/file2_report_filtered_count10.tsv
    s3://<your-bucket>/folder3/file3_report_filtered_count10.tsv
    ...
    
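    If you would rather not hard-code the bucket name inside the awk program, it can be passed in as a variable. A small sketch; prefix_keys and demo-bucket are hypothetical names:

    ```shell
    # Hypothetical helper: prefix every key read from stdin with s3://<bucket>/.
    # $1 = bucket name, handed to awk with -v instead of being typed into the program.
    prefix_keys() {
        awk -v bucket="$1" '{print "s3://" bucket "/" $0}'
    }

    # Usage with a made-up bucket name:
    #   printf '%s\n' 'folder1/file1_report_filtered_count10.tsv' | prefix_keys demo-bucket
    ```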

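    Note that --query is evaluated by the AWS CLI on the client side, after S3 has returned the listing, so it tidies up the output but does not reduce the number of objects that are listed. If the files of interest share a common key prefix, adding --prefix narrows the listing on the server side.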
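
    As a sketch of how the pieces could be combined: --prefix is sent to S3 and limits which keys are listed, while --query is applied by the CLI afterwards and can match the whole suffix in one go. The function name and its arguments are made up for illustration, and the None handling is a precaution for the case where nothing matches:

    ```shell
    # Hypothetical wrapper: print keys under a prefix that end in a given suffix.
    # $1 = bucket, $2 = key prefix (e.g. folder1/), $3 = key suffix
    list_keys_with_suffix() {
        aws s3api list-objects-v2 \
            --bucket "$1" \
            --prefix "$2" \
            --query "Contents[?ends_with(Key, '$3')].[Key]" \
            --output text |
            grep -v '^None$'   # drop the literal None the CLI may print for an empty result
    }

    # Usage (needs AWS credentials; <your-bucket> is a placeholder):
    #   list_keys_with_suffix "<your-bucket>" "folder1/" "report_filtered_count10.tsv"
    ```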

    Hope this helps or at least you can take a cue from it.