I am trying to create an automated Python script that goes to a webpage like this, finds the link at the bottom of the body text (anchor text "here"), and downloads the PDF that loads after clicking said download link. I am able to retrieve the HTML from the original and find the download link, but I don't know how to get the link to the PDF from there. Any help would be much appreciated. Here's what I have so far:
import urllib3
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Open page and locate href for bill text
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.findAll('a', href=True, text=['HERE', 'here', 'Here']):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]
# Open download link to get PDF
html = urlopen(links2[0])
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]
At this point, the list of links I get does not include the PDF I am looking for. Is there any way to grab it without hardcoding the link to the PDF in the code (that would defeat the purpose of what I am trying to do here)? Thanks!
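This looks for the a element with the text "here", then follows the trail: the download link responds with a Refresh header whose url= part points at the actual PDF, which is why the PDF never shows up among the anchors on that page.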
import requests
from bs4 import BeautifulSoup
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
user_agent = {'User-agent': 'Mozilla/5.0'}
s = requests.Session()
r = s.get(url, headers=user_agent)
soup = BeautifulSoup(r.content, 'html.parser')
for a in soup.select('a'):
    if a.text == 'here':
        # First hop: the download link found in the press release
        href = a['href']
        r = s.get(href, headers=user_agent)
        print(r.status_code, r.reason)
        print(r.headers)
        # Second hop: the Refresh header carries the real PDF URL after 'url='
        _, dl_url = r.headers['refresh'].split('url=', 1)
        r = s.get(dl_url, headers=user_agent)
        print(r.status_code, r.reason)
        print(r.headers)
        file_bytes = r.content # here's your PDF; you can write it out to a file
        break # stop after the first matching link
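If you want the last two steps to be a little more forgiving, here is a minimal sketch of how the Refresh value could be parsed and the bytes written to disk. The helper names parse_refresh_url and save_bytes are hypothetical (they are not part of requests or BeautifulSoup), and the sketch assumes the header has the usual "seconds; url=target" shape, possibly with different casing, quotes, or a relative target.

```python
import re
from pathlib import Path
from urllib.parse import urljoin


def parse_refresh_url(refresh_value, base_url):
    """Return the absolute target URL from a Refresh header value, or None.

    Hypothetical helper; assumes a value such as '0; url=https://host/file.pdf'.
    """
    # Match 'url=' in any casing and capture everything after it.
    match = re.search(r"url\s*=\s*(.+)", refresh_value, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop surrounding whitespace and optional quotes around the target.
    target = match.group(1).strip().strip("'\"")
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(base_url, target)


def save_bytes(data, path):
    """Write raw bytes (e.g. r.content) to path and return the byte count."""
    return Path(path).write_bytes(data)
```

In the loop above, this would replace the split with something like dl_url = parse_refresh_url(r.headers['refresh'], r.url), and after the final request you could call save_bytes(r.content, 'bill.pdf'), where 'bill.pdf' is just an example filename. Checking that dl_url is not None before requesting it avoids a confusing error if the site ever stops sending the header.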