I have a bucket with several folders, each containing several files. I just want to list the .tsv
files inside those folders whose names contain the pattern "*report_filtered_count10".
I can do:
aws s3 ls s3://<my bucket> --recursive
but not:
aws s3 ls s3://<my bucket>/*/*report_filtered_count10.tsv --recursive
My workaround has been to use aws s3 sync and then ls locally, like this:
aws s3 sync s3://<my bucket>/ tsv_files --exclude '*' --include '*report_filtered_count10.tsv'
ls tsv_files/*/*.tsv
... and then parse the output.
I also tried this aws s3 sync syntax with aws s3 ls, but it does not work either:
aws s3 ls s3://<my bucket>/ --exclude '*' --include '*report_filtered_count10.tsv'
Unknown options: --exclude,*,--include,*report_filtered_count10.tsv
Any idea how to do this simple task? My desired output would be:
s3://<my bucket>/folder1/file1_report_filtered_count10.tsv
s3://<my bucket>/folder2/file2_report_filtered_count10.tsv
s3://<my bucket>/folder3/file3_report_filtered_count10.tsv
s3://<my bucket>/folder4/file4_report_filtered_count10.tsv
s3://<my bucket>/folder5/file5_report_filtered_count10.tsv
...
In my experience, the aws s3 ls command is quite limited in its filtering capabilities, and aws s3api provides more flexibility. For a task like this one, I'd use the aws s3api list-objects-v2 command combined with grep and awk.
You can use aws s3api list-objects-v2 to get detailed information about the objects, which allows for more complex filtering. It looks something like this:
aws s3api list-objects-v2 --bucket <your-bucket> --query 'Contents[?ends_with(Key, `.tsv`)]' --output text
This command lists all .tsv files, but it does not filter them by your specific pattern yet.
Then you can pipe the output to grep to keep only the keys that contain "report_filtered_count10". Select [Key] rather than Key in the query: with --output text, a plain list of keys is printed as a single tab-separated line, while [Key] prints one key per line, which is what grep and awk need:
aws s3api list-objects-v2 --bucket <your-bucket> --query 'Contents[?ends_with(Key, `.tsv`)].[Key]' --output text | grep 'report_filtered_count10'
To prepend the S3 bucket URL to each line of the output, add awk. The full command then looks like this:
aws s3api list-objects-v2 --bucket <your-bucket> --query 'Contents[?ends_with(Key, `.tsv`)].[Key]' --output text | grep 'report_filtered_count10' | awk '{print "s3://<your-bucket>/" $0}'
This command should produce the output you expect, in the format:
s3://<your-bucket>/folder1/file1_report_filtered_count10.tsv
s3://<your-bucket>/folder2/file2_report_filtered_count10.tsv
s3://<your-bucket>/folder3/file3_report_filtered_count10.tsv
...
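If you want to try the grep and awk steps without touching a bucket, here is a minimal sketch of them as a shell function. The name to_s3_urls and its arguments are my own, not part of the AWS CLI. It reads keys on stdin and first turns tabs into newlines, because --output text can put several keys on one tab-separated line depending on the shape of the --query result. It assumes keys do not themselves contain tabs or newlines.

```shell
# Hypothetical helper (not an AWS CLI feature): read S3 keys on stdin,
# keep those containing PATTERN, and print them as s3:// URLs.
# Usage: to_s3_urls BUCKET PATTERN
to_s3_urls() {
  tr '\t' '\n' |                              # one key per line
    grep -F -- "$2" |                         # fixed-string match on the key
    awk -v prefix="s3://$1/" '{ print prefix $0 }'
}

# Local demonstration with made-up keys, tab-separated on a single line:
printf 'folder1/file1_report_filtered_count10.tsv\tfolder1/other.tsv\tfolder2/file2_report_filtered_count10.tsv\n' |
  to_s3_urls my-bucket report_filtered_count10
# prints:
# s3://my-bucket/folder1/file1_report_filtered_count10.tsv
# s3://my-bucket/folder2/file2_report_filtered_count10.tsv
```

With a real bucket you would pipe the output of the list-objects-v2 command into to_s3_urls instead of the printf line.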
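Note that --query in aws s3api list-objects-v2 filters the results on the client side, after the CLI has retrieved the listing. It saves some local post-processing, but it does not reduce what S3 returns. To narrow the listing on the server side, add --prefix.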
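Since --query is a JMESPath expression, the name match can probably be folded into it as well, which removes the need for grep. The sketch below is one way to do that, under a few assumptions: the function name list_matching_keys is made up, the pattern and suffix contain no single quotes (they are pasted into the expression), and the awk step drops empty lines and the literal None that the CLI can print when a result is null, for example when there are no objects at all.

```shell
# Hypothetical helper: list keys that contain PATTERN and end with SUFFIX,
# printed as s3:// URLs. Assumes PATTERN and SUFFIX contain no single quotes.
# Usage: list_matching_keys BUCKET PATTERN SUFFIX
list_matching_keys() {
  aws s3api list-objects-v2 \
    --bucket "$1" \
    --query "Contents[?contains(Key, '$2') && ends_with(Key, '$3')].[Key]" \
    --output text |
    # Skip blank lines and the "None" placeholder, then add the bucket URL.
    awk -v prefix="s3://$1/" '$0 != "" && $0 != "None" { print prefix $0 }'
}

# Example call (needs AWS credentials and a real bucket name):
#   list_matching_keys my-bucket report_filtered_count10 .tsv
```

The filtering still happens on your machine, so for a very large bucket it is worth adding --prefix to the aws call if all the files live under a common folder.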
Hope this helps or at least you can take a cue from it.