Tags: python, .net, cookies, scrapy

How to download a file whose URL is generated from cookies in Scrapy


I'm trying to download a file whose download link is generated from certain cookies. I have a PDF file that is shown in a viewer, and this viewer has a download button. When I click that icon, a temporary download link is generated from the hidden_document_field_id input value in the HTML.

The temporary download link in this case is therefore the concatenation of three parts:

1. The base URL (https://onlineservices.miami-dadeclerk.com/officialrecords/)
2. The input value (DocumentHandler.axd/docs/304e6a24-0dbe-489d-b8a1-9a947d447136/rev1)
3. The word download

Full link: https://onlineservices.miami-dadeclerk.com/officialrecords/DocumentHandler.axd/docs/304e6a24-0dbe-489d-b8a1-9a947d447136/rev1/download
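
For reference, a minimal sketch of that concatenation in Python (the document path is just the example value shown above):

    from urllib.parse import urljoin

    base = "https://onlineservices.miami-dadeclerk.com/officialrecords/"
    doc_path = "DocumentHandler.axd/docs/304e6a24-0dbe-489d-b8a1-9a947d447136/rev1"  # hidden_document_field_id value
    full_link = urljoin(base, doc_path) + "/download"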

This link is generated based on certain cookies, such as the session cookie, which means it will not work for you unless you have my cookies.

I have tried to download the file using Scrapy, but I get a 500 Internal Server Error. I don't know what is happening; I have already set all the cookies used by this website:

import scrapy

from ..items import MyItem  # adjust to wherever MyItem is defined in your project


class TestSpider(scrapy.Spider):
    name = "test_spider"

    def start_requests(self):
        # The first request only establishes the basic session cookies
        url = "https://onlineservices.miami-dadeclerk.com/officialrecords/StandardSearch.aspx"
        yield scrapy.Request(url=url, callback=self.med)

    def med(self, response):
        # Request the viewer page, adding the cookie that is not set automatically
        yield scrapy.Request(
            url="https://onlineservices.miami-dadeclerk.com/officialrecords/CFNDetailsHTML5.aspx?QS=5p8%2fNlBjKYBarc%2fJA16mTghonf9CxQ8L9b1X0TFjFkhkowtaD%2b8z7w%3d%3d",
            callback=self.parse,
            cookies={'AspxAutoDetectCookieSupport': '1'},
        )

    def parse(self, response):
        headers = response.request.headers  # request headers, including the Cookie header
        print(headers)
        start_link = "https://onlineservices.miami-dadeclerk.com/officialrecords/"
        body = response.css('#hidden_document_field_id::attr(value)').get()
        end_link = "/download"
        full_link = start_link + body + end_link
        item = MyItem()
        item["file_urls"] = [full_link]  # the FilesPipeline downloads this URL
        yield item

The code is very short: a simple start_requests is issued first in order to obtain the basic cookies, then the med request adds the AspxAutoDetectCookieSupport cookie (which for some reason is not set at the beginning), and finally parse builds the full link.
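
For reference, a way to double-check which cookies Scrapy actually sends (instead of printing the request headers) is the built-in COOKIES_DEBUG setting; a minimal sketch on top of the spider above:

    class TestSpider(scrapy.Spider):
        name = "test_spider"
        # COOKIES_DEBUG makes Scrapy's cookies middleware log every
        # Cookie header sent and Set-Cookie header received
        custom_settings = {"COOKIES_DEBUG": True}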

My cookies (the request headers printed in parse):

{b'Referer': [b'https://onlineservices.miami-dadeclerk.com/officialrecords/StandardSearch.aspx?AspxAutoDetectCookieSupport=1'],
 b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
 b'Accept-Language': [b'en'],
 b'User-Agent': [b'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'],
 b'Accept-Encoding': [b'gzip,deflate'],
 b'Cookie': [b'AspxAutoDetectCookieSupport=1; NSC_JOohzzemcpqd5cxccsdmkpe5tar0zcM=ffffffff09303a5345525d5f4f58455e445a4a42378b; AspxAutoDetectCookieSupport=1; ASP.NET_SessionId=2kudtndycb15ffk2fsjtqyer']}

P.S.: I'm not looking to fix my code. P.S.: I realized the website backend is built with .NET.

I'm looking for a way to download the file using Scrapy and the viewer link.


Solution

  • The error could hint that something else is missing in order to fully process your request:

    With requests.get from Python's requests library you can pass a params argument and a headers argument next to cookies. You can retrieve these additional arguments by copying the download request as cURL from your browser's developer tools and translating the cURL command into a Python request. In most cases this has worked for me: the copied cURL reproduces the network interaction of the whole browser session.

    Then, you would get something like:

        import requests

        def parse(self, response):
            cookies = {"ASP.NET_SessionId": "..."}  # your session id (this site is ASP.NET, not PHP)
            headers = {...}  # e.g. "User-Agent", "Accept", "Accept-Language", "Referer", "DNT", "Connection"
            ...

        def med(self, response):
            # params, like headers and cookies, comes from the translated cURL command
            requests.get(url="...", cookies=cookies, headers=headers, params=params)

    Instructions to obtain the cURL command and translate it into arguments for Python's requests library: https://curl.trillworks.com/
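
    For example, with the URL, headers, and cookies from the question, the translated request might look like the sketch below (it will only succeed with the session id of a live session):

        import requests

        cookies = {
            "AspxAutoDetectCookieSupport": "1",
            "ASP.NET_SessionId": "2kudtndycb15ffk2fsjtqyer",  # replace with your live session id
        }
        headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Referer": "https://onlineservices.miami-dadeclerk.com/officialrecords/StandardSearch.aspx?AspxAutoDetectCookieSupport=1",
        }
        url = ("https://onlineservices.miami-dadeclerk.com/officialrecords/"
               "DocumentHandler.axd/docs/304e6a24-0dbe-489d-b8a1-9a947d447136/rev1/download")

        response = requests.get(url, cookies=cookies, headers=headers)
        with open("document.pdf", "wb") as f:
            f.write(response.content)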

    Hope this helps even though it changes your code (I am not sure how to implement it in Scrapy; one possible translation is sketched below).
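
    For the Scrapy side, one possible sketch of the same idea: request the download from inside the spider, so it runs in the same cookie session that Scrapy's cookies middleware already tracks (the save_pdf callback and the output filename are hypothetical):

        def parse(self, response):
            # Rebuild the download link exactly as in the question
            start_link = "https://onlineservices.miami-dadeclerk.com/officialrecords/"
            body = response.css('#hidden_document_field_id::attr(value)').get()
            full_link = start_link + body + "/download"
            # Downloading via a regular Request keeps the session cookies;
            # the Referer header mimics the browser's request
            yield scrapy.Request(
                url=full_link,
                headers={"Referer": response.url},
                callback=self.save_pdf,
            )

        def save_pdf(self, response):
            # Hypothetical callback: write the raw PDF bytes to disk
            with open("document.pdf", "wb") as f:
                f.write(response.body)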