amazon-web-services, amazon-s3, boto3, cyberduck

Need to export the path/url of each file in Amazon S3 server


I have an Amazon S3 account with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an Excel sheet that contains the path/URL of each file in each bucket.

For example, if I have a bucket called b1 containing a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt. This needs to be done for every one of the 50,000 files.

I have tried using S3 browsers like Expandrive and Cyberduck, but they require you to select each and every file to copy its URL. I also explored the boto3 library in Python, but I did not come across any built-in functions to get the file URLs.

I am looking for any tool I can use, or even a script I can execute, to get all the URLs. Thanks.


Solution

  • What you should do is have another look at the boto3 documentation, as it is exactly what you are looking for. It is fairly simple to do what you are asking, but it may take a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.

    In boto3 for S3, the method you are looking for is list_objects_v2(). This will give you the 'Key' (object path) of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can target it just the same way you would access keys/values in a dict. For example, list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object, as in the sketch just below.
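
    Here is a minimal sketch of that first step, assuming a hypothetical bucket name of 'im-a-bucket' and that your AWS credentials are already configured for boto3:

    import boto3

    s3_client = boto3.client('s3')

    # list_objects_v2 returns a dict; 'Contents' holds one dict per object
    response = s3_client.list_objects_v2(Bucket='im-a-bucket')

    # 'Key' is the object path within the bucket, e.g. 'f1.txt' or 'subfolder/f1.txt'
    print(response['Contents'][0]['Key'])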

    Once you have that working, the next step is to loop and collect all the values. You can either use a for loop to do this, or there is an awesome Python package I regularly use called jmespath - https://jmespath.org/

    Here is how you can retrieve all object paths up to 1000 objects in one line.

    import boto3
    import jmespath

    bucket_name = 'im-a-bucket'
    s3_client = boto3.client('s3')

    # jmespath extracts every 'Key' from the 'Contents' list in a single expression
    bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))


    Now, since your buckets may have more than 1000 objects, you will need to use a paginator. Have a look at this question to understand it: How to get more than 1000 objects from S3 by using list_objects_v2?

    Basically, only 1000 objects can be returned per call. To overcome this, you use a paginator, which treats the 1000-object limit as a page and lets you iterate over pages in a for loop until you have collected all the results you are looking for, as sketched below.
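
    Here is a minimal sketch of the paginator approach, again assuming the hypothetical bucket name 'im-a-bucket':

    import boto3

    bucket_name = 'im-a-bucket'
    s3_client = boto3.client('s3')

    # the paginator issues repeated list_objects_v2 calls behind the scenes,
    # yielding one page of up to 1000 objects per iteration
    paginator = s3_client.get_paginator('list_objects_v2')

    object_paths = []
    for page in paginator.paginate(Bucket=bucket_name):
        for obj in page.get('Contents', []):
            object_paths.append(obj['Key'])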

    Once you get this working for one bucket, store the result in a variable (it will be a list) and repeat for the rest of the buckets. Once you have all this data, you can copy-paste it into an Excel sheet or use Python to write it out, as sketched below. (I haven't tested the code snippets, but they should work.)
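
    As an untested sketch of that last step, the following loops over every bucket in the account, prefixes each key with the bucket name (b1/f1.txt style), and writes the result to a CSV file that Excel can open. The output filename s3_object_paths.csv is just an illustrative choice:

    import csv
    import boto3

    s3_client = boto3.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')

    rows = []
    # list_buckets() returns every bucket visible to your credentials
    for bucket in s3_client.list_buckets()['Buckets']:
        name = bucket['Name']
        for page in paginator.paginate(Bucket=name):
            for obj in page.get('Contents', []):
                rows.append(f"{name}/{obj['Key']}")

    # one path per row; Excel opens CSV files directly
    with open('s3_object_paths.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for path in rows:
            writer.writerow([path])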