The request returns status 200, but the downloaded file is blank. How can I fix this?
import requests
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
r = requests.get(link, stream=True, headers=HEADERS)
with open(output_filename, 'wb') as f:
    f.write(r.content)
The HTML page contains a form. You can either submit that form, or extract the key (an encoded string) from the page and request the PDF from a new URL built with it.
import requests
import re
from bs4 import BeautifulSoup as bs
url = "url1"
# get key from html page -> <form> -> <script> {encodedString = dfsfsdfs}</script>
with requests.Session() as session:
    res = session.get(url)
    soup = bs(res.text, 'html.parser')
    # string= is not a CSS filter, so use find() to match the script by its text
    script = soup.select_one('form#SummaryForm').find('script', string=re.compile('encodedString'))
    key = re.findall(r"encodedString = '([^']*)'", script.string)[0]
    # fid from url string
    fid = re.findall(r'fundid=(\d+)', url)[0]
    # The download link was found in devtools -> Network tab, filtered by Fetch/XHR.
    link = f'new_url_to_pdf?key={key}&fid={fid}'
    output_filename = 'file.pdf'
    # reuse the session so cookies set by the first request are sent along
    r = session.get(link)
    with open(output_filename, 'wb') as f:
        f.write(r.content)
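A 200 status only means the server answered; the body can still be an HTML page instead of the PDF, which is what produces a blank or unreadable file. Here is a minimal sketch of a sanity check to run before saving, assuming the download is a regular PDF (those start with the bytes %PDF-). The helper names looks_like_pdf and save_pdf are made up for illustration and are not part of requests.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # signature at the start of every regular PDF file


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    # Tolerate leading whitespace, which some servers emit before the header.
    return data[:1024].lstrip().startswith(PDF_MAGIC)


def save_pdf(data: bytes, output_filename: str) -> int:
    """Write data to output_filename only if it looks like a PDF.

    Returns the number of bytes written; raises ValueError otherwise.
    """
    if not looks_like_pdf(data):
        # Show the start of the body; it is usually an HTML or error page.
        raise ValueError(f"response is not a PDF; it starts with {data[:80]!r}")
    return Path(output_filename).write_bytes(data)
```

Calling save_pdf(r.content, output_filename) in place of the plain open/write fails loudly when the server returned something other than the PDF, and the error message shows the first bytes of what actually came back.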
The alternatives are Selenium and requests-html. Since you need very little interaction with the browser, you can try requests-html.
from requests_html import HTMLSession
url = 'url2'
script = """
window.addEventListener('load', function () {
    // click the first visible iframe (assumes the page loads jQuery as $)
    document.querySelectorAll('iframe:not([style*="display: none"])')[0].click();
    var e = $.Event("keydown");
    e.which = 83; // S
    e.ctrlKey = true; // CTRL
    $(document).trigger(e);
})
"""
session = HTMLSession()
res = session.get(url)
res.html.render(sleep=2, keep_page=True, script=script)
It will automatically download Chromium the first time you call render(), so the first run might take a while; later runs will be faster.
Edit 2:
requests-html doesn't work here due to the presence of an iframe.