python python-3.x amazon-web-services web-scraping

Download files from a public S3 bucket

I'm trying to download some files from a public S3 bucket as part of the Google Analytics course. However, my request does not return the links. I'm not sure whether I need boto3 or a different API package, since it's a public URL with visible links. After reading the boto3 docs, I'm still not sure how to list the zip files that are linked on the page. Sorry, I'm fairly new at this.

So far, this is what I've gotten:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))

The request to the URL returns a 200; however, the `links` list built from the 'a' tags comes back empty. I am trying to collect all of the hrefs so I can loop over them and download each zip file with urllib.request, appending /filename to the base URL.

Any help would be greatly appreciated and thank you in advance!
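
A likely explanation for the empty list, offered as an assumption rather than something confirmed in this thread: the page at index.html appears to build its file table with JavaScript in the browser, so the HTML that `requests` receives contains no 'a' tags to find. If the bucket allows anonymous listing, its root URL returns an XML listing instead. The sketch below uses only the standard library; the helper names `parse_keys` and `list_public_keys` are hypothetical, and it reads only the first page of results (S3 returns up to 1,000 keys per response).

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListBucketResult responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def parse_keys(xml_data, suffix=".zip"):
    """Return the object keys in an S3 listing that end with `suffix`."""
    root = ET.fromstring(xml_data)
    keys = [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]
    return [k for k in keys if k and k.endswith(suffix)]


def list_public_keys(bucket_url, suffix=".zip"):
    """Fetch the first page of a public bucket listing (hypothetical helper)."""
    with urllib.request.urlopen(f"{bucket_url}/?list-type=2") as resp:
        return parse_keys(resp.read(), suffix)


# Offline demonstration on a hand-written sample response.
SAMPLE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>2020-trips.zip</Key></Contents>
  <Contents><Key>index.html</Key></Contents>
</ListBucketResult>"""

zip_keys = parse_keys(SAMPLE)
print(zip_keys)
```

Under the same assumption, `list_public_keys("https://divvy-tripdata.s3.amazonaws.com")` would return the zip keys, and each one could then be fetched with `urllib.request.urlretrieve(f"{bucket_url}/{key}", key)`.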

Solution

Thank you for your comments. While the AWS CLI worked just fine, I wanted to bake this into my Python script for future reference and ease of access. I was able to figure out how to download the zip files using boto3.

This solution uses botocore, the lower-level library underneath boto3, to skip authentication by setting the signature version to UNSIGNED in the client config. I found out about this through a GitHub project called s3-key-listener, which can "List all keys in any public Amazon s3 bucket, option to check if each object is public or private. Saves result as a .csv file".

    # Install boto3 (notebook syntax; this also installs botocore)
    !pip install boto3

    import os  # used to create and join the download directory

    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    def get_s3_public_data(bucket='bucket_name'):
        # Create the S3 client with unsigned requests
        # (no credentials are needed for a public bucket)
        client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

        # Make sure the download directory exists before writing to it
        os.makedirs('./data', exist_ok=True)

        # Get the list of 'Contents' entries from the S3 bucket
        list_files = client.list_objects(Bucket=bucket)['Contents']

        for key in list_files:
            # Skip anything that is not a zip file
            if key['Key'].endswith('.zip'):
                print(f'downloading... {key["Key"]}')  # print file name
                client.download_file(
                    Bucket=bucket,   # bucket name
                    Key=key['Key'],  # key is the file name
                    Filename=os.path.join('./data', key['Key']),  # storage file path
                )

    # Bucket name taken from the URL in the question
    get_s3_public_data(bucket='divvy-tripdata')

This connects to the S3 bucket and downloads the zip files for me. I hope this helps anyone else dealing with a similar issue.
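
One caveat worth noting: a single `list_objects` call returns at most 1,000 keys, so a larger bucket would be listed only partially. Below is a hedged sketch of the same zip filter using boto3's `list_objects_v2` paginator. The names `make_anonymous_client` and `iter_zip_keys` are hypothetical and not part of the solution above; the client is passed in as a parameter so the listing logic can be tried without network access.

```python
def make_anonymous_client():
    """Build an unsigned S3 client for public buckets (requires boto3)."""
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def iter_zip_keys(client, bucket, suffix=".zip"):
    """Yield every key ending in `suffix`, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # 'Contents' is absent from a page that holds no objects.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                yield obj["Key"]
```

Usage would look like `for key in iter_zip_keys(make_anonymous_client(), "divvy-tripdata"): ...`, with the download call placed inside the loop.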