python, web-scraping, web, python-requests

PDF downloads as blank from link even after adding headers to the request


The request returns a 200 response, but the downloaded file is still blank. Please help me solve this.

import requests
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}

r = requests.get(link, stream=True, headers=HEADERS)
with open(output_filename, 'wb') as f:
    f.write(r.content)

Solution

  • There is a form in the HTML page. You can either submit that form, or extract the key/encoded string from it and request the PDF from a new link/URL.

    import re
    import requests
    from bs4 import BeautifulSoup as bs

    url = "url1"

    with requests.Session() as session:
        # The key is embedded in the HTML page: <form id="SummaryForm"> -> <script> encodedString = '...' </script>
        res = session.get(url)
        soup = bs(res.text, 'html.parser')
        script = soup.find('form', id='SummaryForm').find('script', string=re.compile('encodedString'))
        key = re.findall(r"encodedString = '([^']*)'", script.text)[0]

        # fid comes from the query string of the original url
        fid = re.findall(r'fundid=(\d+)', url)[0]

        # The download link shows up as an XHR request: open devtools -> Network tab -> filter by Fetch/XHR.
        link = f'new_url_to_pdf?key={key}&fid={fid}'
        output_filename = 'file.pdf'
        r = session.get(link)
        with open(output_filename, 'wb') as f:
            f.write(r.content)

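    A side note on the download snippet from the question: passing stream=True only helps if the body is then written in chunks; calling r.content still loads the whole response into memory. Below is a minimal sketch of a chunked download with an error check, using the same placeholder link and output filename as above:

    import requests

    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
    link = 'new_url_to_pdf?key=...&fid=...'  # placeholder, as above
    output_filename = 'file.pdf'

    with requests.get(link, stream=True, headers=HEADERS) as r:
        r.raise_for_status()  # fail loudly instead of silently saving an error page
        with open(output_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
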
    Alternatives to this are Selenium and requests-html. Since you need very little interaction with the browser, you can try requests-html.

    from requests_html import HTMLSession

    url = 'url2'

    # JS injected into the rendered page: click the visible iframe, then fire a
    # Ctrl+S keydown to trigger the save dialog (assumes jQuery is present on the page).
    script = """
    window.addEventListener('load', function () {
        document.querySelectorAll('iframe:not([style*="display: none"])')[0].click();
        e = $.Event("keydown");
        e.which = 83;     // S
        e.ctrlKey = true; // CTRL
        $(document).trigger(e);
    })
    """

    session = HTMLSession()
    res = session.get(url)
    res.html.render(sleep=2, keep_page=True, script=script)
    

    requests-html automatically downloads Chromium the first time you run the code, so the first run may take a little while; subsequent runs will be faster.

    Edit 2:
    requests-html doesn't work here due to the presence of an iframe (see the Selenium sketch below).
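
    Since Selenium is the remaining alternative, here is a minimal, untested sketch of that route. Unlike requests-html, Selenium can switch into the iframe before interacting with it. The iframe selector is the same one used in the script above; the download directory and the final click inside the iframe are placeholders you would adapt to the actual page.

    import os
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    url = 'url2'  # same placeholder as above
    download_dir = os.path.abspath('downloads')  # hypothetical local folder

    options = webdriver.ChromeOptions()
    # Save PDFs straight to disk instead of opening them in Chrome's built-in viewer.
    options.add_experimental_option("prefs", {
        "download.default_directory": download_dir,
        "plugins.always_open_pdf_externally": True,
    })

    driver = webdriver.Chrome(options=options)
    driver.get(url)

    # Switch into the visible iframe, then locate and click the download control inside it.
    frame = driver.find_element(By.CSS_SELECTOR, 'iframe:not([style*="display: none"])')
    driver.switch_to.frame(frame)
    # ... find and click the page's download button here ...

    driver.quit()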