I'm trying to download some files from a public S3 bucket as part of the Google Analytics course, but I'm not getting the links returned in my request. I'm not sure whether I need boto3 or a different package, since it's a public URL with visible links. Reading the boto3 docs, I'm not 100% sure how I would list the zip files that appear as links on the page. Sorry, I'm fairly new at this.
So far, this is what I've gotten:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # parse the returned HTML
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))  # collect every href on the page
The request to the URL returns a 200, but the links list ends up empty because no hrefs are found in the 'a' tags. I'm trying to collect all of the hrefs so I can loop over them and download each zip file with urllib.request, using the base URL plus /filename for each zip file.
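For reference, the download loop I had in mind would be roughly something like this (just a sketch assuming links actually gets populated; ./data is simply where I plan to save the files):

import os
import urllib.request

base_url = 'https://divvy-tripdata.s3.amazonaws.com'
os.makedirs('./data', exist_ok=True)  # make sure the local folder exists

for href in links:  # hrefs collected from the soup.find_all('a') loop above
    if href and href.endswith('.zip'):
        # fetch each zip from the base URL and save it under ./data
        urllib.request.urlretrieve(f'{base_url}/{href}', os.path.join('./data', href))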
Any help would be greatly appreciated and thank you in advance!
Thank you for your comments. While the AWS CLI worked just fine, I wanted to bake this into my Python script for ease of access in the future. As such, I was able to figure out how to download the zip files using boto3.
This solution uses boto3's lower-level library, botocore, to bypass authentication by configuring the client as 'UNSIGNED'. I found out about this through another GitHub project called s3-key-listener, which will "List all keys in any public Amazon s3 bucket, option to check if each object is public or private. Saves result as a .csv file".
# Install boto3 (this also includes botocore)
!pip install boto3

import boto3
from botocore import UNSIGNED
from botocore.client import Config
import os  # for building the download file paths

def get_s3_public_data(bucket='divvy-tripdata'):
    # create the s3 client with unsigned requests (UNSIGNED works for public buckets)
    client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    # make sure the local download folder exists
    os.makedirs('./data', exist_ok=True)
    # get the list of 'Contents' objects from the s3 bucket
    list_files = client.list_objects(Bucket=bucket)['Contents']
    for key in list_files:
        if key['Key'].endswith('.zip'):
            print(f'downloading... {key["Key"]}')  # print file name
            client.download_file(
                Bucket=bucket,       # bucket name
                Key=key['Key'],      # key is the file name
                Filename=os.path.join('./data', key['Key'])  # local file path
            )
        else:
            pass  # if it's not a zip file, do nothing

get_s3_public_data()
This connects to the s3 bucket and fetches the zip files for me. Hope this helps anyone else dealing with a similar issue.
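One caveat: list_objects returns at most 1,000 keys per call, which was plenty here, but for a larger public bucket you would want a paginator. A rough sketch using the same unsigned client setup (bucket name assumed to be 'divvy-tripdata' as above):

import boto3
from botocore import UNSIGNED
from botocore.client import Config

client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# the paginator transparently handles the 1,000-keys-per-call limit
paginator = client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='divvy-tripdata'):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.zip'):
            print(obj['Key'])  # or call client.download_file(...) as above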