Tags: python, python-3.x, amazon-web-services, web-scraping

Download files from a public S3 bucket


I'm trying to download some files from a public S3 bucket as part of the Google Analytics course. However, I am not getting the links returned in my request. I'm not sure whether I need to use boto3 or a different package, since it's a public URL with visible links. Reading the Boto3 docs, I am not 100% sure how I would list the zip files that are linked on the page. Sorry, I'm fairly new at this.

So far, this is what I've gotten:

    import requests
    from bs4 import BeautifulSoup

    # Fetch the bucket's index page and parse the returned HTML
    r = requests.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')  # explicit parser avoids a bs4 warning

    # Collect the href of every anchor tag on the page
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))

The request to the URL returns a 200; however, the hrefs from the 'a' tags are coming up empty, so links stays an empty list. I am trying to get all of the hrefs so I can loop over them and download each zip file with urllib.request, appending /filename to the base URL.
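
For reference, the index.html page builds its file table with JavaScript, so the raw HTML that requests receives contains no 'a' tags to scrape. A public bucket's listing can instead be fetched as XML straight from the S3 REST endpoint (ListObjectsV2); a minimal sketch, assuming the divvy-tripdata bucket from the URL above:

    import requests
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE = 'https://divvy-tripdata.s3.amazonaws.com'
    NS = '{http://s3.amazonaws.com/doc/2006-03-01/}'  #namespace S3 uses in list responses

    #ask the bucket itself for its object listing (the same data the page's JS fetches)
    r = requests.get(BASE, params={'list-type': '2'})
    root = ET.fromstring(r.content)

    #every <Key> element is one object (file) in the bucket
    for el in root.iter(f'{NS}Key'):
        key = el.text
        if key.endswith('.zip'):
            urllib.request.urlretrieve(f'{BASE}/{key}', key)  #save next to the script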

Any help would be greatly appreciated and thank you in advance!


Solution

Thank you for your comments. While the AWS CLI worked just fine, I wanted to bake this into my Python script for future ease of access. As such, I was able to figure out how to download the zip files using boto3.
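
For completeness, the CLI route mentioned above is a one-liner; a sketch assuming the divvy-tripdata bucket, where --no-sign-request skips credentials for a public bucket:

    aws s3 sync s3://divvy-tripdata ./data --no-sign-request --exclude "*" --include "*.zip"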

This solution uses boto3's lower-level library, botocore, to bypass authentication by setting the signature version to UNSIGNED in the client config. I found out about this through a GitHub project called s3-key-listener, which will "List all keys in any public Amazon s3 bucket, option to check if each object is public or private. Saves result as a .csv file".

    #Install boto3 (botocore ships with it)
    !pip install boto3

    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config
    import os #for building the download path

    def get_s3_public_data(bucket='divvy-tripdata'):
        #create the s3 client with unsigned credentials (UNSIGNED works for a public bucket)
        client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

        #get the list of 'Contents' objects from the s3 bucket
        list_files = client.list_objects(Bucket=bucket)['Contents']

        #make sure the download directory exists before writing into it
        os.makedirs('./data', exist_ok=True)

        for key in list_files:
            if key['Key'].endswith('.zip'):
                print(f'downloading... {key["Key"]}') #print file name
                client.download_file(
                    Bucket=bucket,      #bucket name
                    Key=key['Key'],     #key is the file name
                    Filename=os.path.join('./data', key['Key']) #local file path
                )
            #non-zip files (e.g. index.html) are skipped

    get_s3_public_data()
    

This connects to the S3 bucket and fetches the zip files for me. Hope this helps anyone else dealing with a similar issue.
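
One caveat: list_objects returns at most 1,000 keys per call. For buckets larger than that, a paginator handles the continuation tokens for you; a minimal sketch with the same UNSIGNED config:

    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

    #walk every page of results instead of only the first 1,000 keys
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='divvy-tripdata'):
        for obj in page.get('Contents', []):
            print(obj['Key'])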