asp.net, scrapy, dopostback

Scraping a website that uses the __doPostBack method with hidden URLs


I am new to Scrapy. I am trying to scrape an ASP.NET website that contains various profiles. It has a total of 259 pages. To navigate between pages, there are several links at the bottom (1, 2, 3, and so on). These links use __doPostBack:

href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$RepeaterPaging$ctl00$Pagingbtn','')"

For each page, only the bold part changes. How do I use Scrapy to iterate over the pages and extract the information? The form data is as follows:

__EVENTTARGET: ctl00%24ContentPlaceHolder1%24RepeaterPaging%24ctl01%24Pagingbtn
__EVENTARGUMENT: 
__VIEWSTATE: %2FwEPDwUKMTk1MjIxNTU1Mw8WAh4HdG90cGFnZQKDAhYCZg9kFgICAw9kFgICAQ9kFgoCAQ8WAh4LXyFJdGVtQ291bnQCFBYoZg9kFgJmDxUFCDY0MzMuanBnCzggR2VtcyBMdGQuCzggR2VtcyBMdGQuBDY0MzMKOTgyOTEwODA3MGQCAQ9kFgJmDxUFCDMzNTkuanBnCDkgSmV3ZWxzCDkgSmV3ZWxzBDMzNTkKOTg4NzAwNzg4OGQCAg9kFgJmDxUFCDc4NTEuanBnD0EgLSBTcXVhcmUgR2Vtcw9BIC0gU3F1YXJlIEdlbXMENzg1MQo5OTI5NjA3ODY4ZAIDD2QWAmYPFQUIMTg3My5qcGcLQSAmIEEgSW1wZXgLQSAmIEEgSW1wZXgEMTg3Mwo5MzE0Njk1ODc0ZAIED2QWAmYPFQUINzc5Ni5qcGcTQSAmIE0gR2VtcyAmIEpld2VscxNBICYgTSBHZW1zICYgSmV3ZWxzBDc3OTYKOTkyOTk0MjE4NWQCBQ9kFgJmDxUFCDc2NjYuanBnDEEgQSBBICBJbXBleAxBIEEgQSAgSW1wZXgENzY2Ngo4MjkwNzkwNzU3ZAIGD2QWAmYPFQUINjM2OC5qcGcaQSBBIEEgJ3MgIEdlbXMgQ29ycG9yYXRpb24aQSBBIEEgJ3MgIEdlbXMgQ29ycG9yYXRpb24ENjM2OAo5ODI5MDU2MzM0ZAIHD2QWAmYPFQUINjM2OS5qcGcPQSBBIEEgJ3MgSmV3ZWxzD0EgQSBBICdzIEpld2VscwQ2MzY5Cjk4MjkwNTYzMzRkAggPZBYCZg8VBQg3OTQ3LmpwZwxBIEcgIFMgSW1wZXgMQSBHICBTIEltcGV4BDc5NDcKODk0Nzg2MzExNGQCCQ9kFgJmDxUFCDc4ODkuanBnCkEgTSBCIEdlbXMKQSBNIEIgR2VtcwQ3ODg5Cjk4MjkwMTMyODJkAgoPZBYCZg8VBQgzNDI2LmpwZxBBIE0gRyAgSmV3ZWxsZXJ5EEEgTSBHICBKZXdlbGxlcnkEMzQyNgo5MzE0NTExNDQ0ZAILD2QWAmYPFQUIMTgyNS5qcGcWQSBOYXR1cmFsIEdlbXMgTi4gQXJ0cxZBIE5hdHVyYWwgR2VtcyBOLiBBcnRzBDE4MjUKOTgyODAxMTU4NWQCDA9kFgJmDxUFCDU3MjYuanBnC0EgUiBEZXNpZ25zC0EgUiBEZXNpZ25zBDU3MjYAZAIND2QWAmYPFQUINzM4OS5qcGcOQSBSYXdhdCBFeHBvcnQOQSBSYXdhdCBFeHBvcnQENzM4OQBkAg4PZBYCZg8VBQg1NDcwLmpwZxBBLiBBLiAgSmV3ZWxsZXJzEEEuIEEuICBKZXdlbGxlcnMENTQ3MAo5OTI4MTA5NDUxZAIPD2QWAmYPFQUIMTg5OS5qcGcSQS4gQS4gQS4ncyBFeHBvcnRzEkEuIEEuIEEuJ3MgRXhwb3J0cwQxODk5Cjk4MjkwNTYzMzRkAhAPZBYCZg8VBQg0MDE5LmpwZwpBLiBCLiBHZW1zCkEuIEIuIEdlbXMENDAxOQo5ODI5MDE2Njg4ZAIRD2QWAmYPFQUIMzM3OS5qcGcPQS4gQi4gSmV3ZWxsZXJzD0EuIEIuIEpld2VsbGVycwQzMzc5Cjk4MjkwMzA1MzZkAhIPZBYCZg8VBQgzMTc5LmpwZwxBLiBDLiBSYXRhbnMMQS4gQy4gUmF0YW5zBDMxNzkKOTgyOTY2NjYyNWQCEw9kFgJmDxUFCDc3NTEuanBnD0EuIEcuICYgQ29tcGFueQ9BLiBHLiAmIENvbXBhbnkENzc1MQo5ODI5MTUzMzUzZAIDDw8WAh4HRW5hYmxlZGhkZAIFDw8WAh8CaGRkAgcPPCsACQIADxYEHghEYXRhS2V5cxYAHwECCmQBFgQeD0hvcml6b250YWxBbGlnbgsqKVN5c3RlbS5XZWIuVUkuV2ViQ29ud
HJvbHMuSG9yaXpvbnRhbEFsaWduAh4EXyFTQgKAgAQWFGYPZBYCAgEPDxYKHg9Db21tYW5kQXJndW1lbnQFATAeBFRleHQFATEeCUJhY2tDb2xvcgoAHwJoHwUCCGRkAgEPZBYCAgEPDxYEHwYFATEfBwUBMmRkAgIPZBYCAgEPDxYEHwYFATIfBwUBM2RkAgMPZBYCAgEPDxYEHwYFATMfBwUBNGRkAgQPZBYCAgEPDxYEHwYFATQfBwUBNWRkAgUPZBYCAgEPDxYEHwYFATUfBwUBNmRkAgYPZBYCAgEPDxYEHwYFATYfBwUBN2RkAgcPZBYCAgEPDxYEHwYFATcfBwUBOGRkAggPZBYCAgEPDxYEHwYFATgfBwUBOWRkAgkPZBYCAgEPDxYEHwYFATkfBwUCMTBkZAINDw8WAh8HBQ1QYWdlIDEgb2YgMjU5ZGRkfEDzDJt%2FoSnSGPBGHlKDPRi%2Fbk0%3D
__EVENTVALIDATION: %2FwEWDALTg7oVAsGH9qQBAsGHisMBAsGHjuEPAsGHotEBAsGHpu8BAsGHupUCAsGH%2FmACwYeS0QICwYeW7wIC%2FLHNngECkI3CyQtVVahoNpNIXsQI6oDrxjKGcAokIA%3D%3D

I looked at multiple solutions and posts suggesting to inspect the parameters of the POST call and reuse them, but I am not able to make sense of the parameters shown above.
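For reference, the values above are URL-encoded: %24 decodes to $, %2F to /, and so on. Decoding __EVENTTARGET, for example, recovers the control name that is passed to __doPostBack:

```python
from urllib.parse import unquote

# __EVENTTARGET exactly as captured from the POST body, still URL-encoded
raw = "ctl00%24ContentPlaceHolder1%24RepeaterPaging%24ctl01%24Pagingbtn"

print(unquote(raw))
# ctl00$ContentPlaceHolder1$RepeaterPaging$ctl01$Pagingbtn
```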


Solution

  • In short, all you need to do is send __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION.

    It's worth mentioning that when you extract the names, the actual XPath may differ from what you copy from Chrome:

    Actual xpath: //*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()
    Chrome version: //*[@id="aspnetForm"]/div[3]/section/div/div/div[1]/div/h3/text()
    

    Update: For pages beyond 5, you should update __VIEWSTATE and __EVENTVALIDATION every time, and use "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn" as the __EVENTTARGET to get the next page.

    The 00 part of __EVENTTARGET is an index relative to the current page, for example:

     1  2  3  4  5  6  7  8  9 10
    00 01 02 03 04 05 06 07 08 09
                   ^^
    To get page 7: use index 06
    ------------------------------
     2  3  4  5  6  7  8  9 10 11
    00 01 02 03 04 05 06 07 08 09
                   ^^
    To get page 8: use index 06
    ------------------------------
    12 13 14 15 16 17 18 19 20 21
    00 01 02 03 04 05 06 07 08 09
                   ^^
    To get page 18: use index 06
    ------------------------------
    current page: ^^
    

    The other parts of __EVENTTARGET remain the same, which means the current page is encoded in __VIEWSTATE (and possibly __EVENTVALIDATION; not sure, but it doesn't matter). We can extract them and send them back to show the server we are now at page 10, 100, and so on.

    To get the next page, we can use a fixed __EVENTTARGET: ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn.

    Of course, you can use ctl00$ContentPlaceHolder1$RepeaterPaging$ctl07$Pagingbtn to skip ahead two pages.
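The slot arithmetic above can be sketched as a small helper (a sketch only; the function names are illustrative, not part of the original spider — the control-name pattern comes from the form data above):

```python
# Pattern taken from the site's paging buttons; %02d fills in the slot index.
TARGET = "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl%02d$Pagingbtn"

def first_page_target(page):
    """Slot for the very first response: visible pages 1-10 occupy
    slots 00-09, so page N uses slot N-1 (reliable for pages 1-5)."""
    assert 1 <= page <= 5
    return TARGET % (page - 1)

def next_page_target(steps=1):
    """Once past page 5, the current page always sits at slot 05,
    so a page `steps` ahead uses slot 05 + steps."""
    assert 1 <= steps <= 4
    return TARGET % (5 + steps)

print(first_page_target(4))  # ...ctl03$Pagingbtn (page 4, as in the demo below)
print(next_page_target(1))   # ...ctl06$Pagingbtn (next page)
print(next_page_target(2))   # ...ctl07$Pagingbtn (two pages ahead)
```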


    Here is a demo (updated):

    # SO Debug Spider
    # OUTPUT: 2018-07-22 10:54:31 [SOSpider] INFO: ['Aadinath Gems & Jewels']
    # The first person of page 4 is Aadinath Gems & Jewels
    #
    # OUTPUT: 2018-07-23 10:52:07 [SOSpider] ERROR: ['Ajay Purohit']
    # The first person of page 12 is Ajay Purohit
    
    import scrapy
    
    class SOSpider(scrapy.Spider):
      name = "SOSpider"
      url = "http://www.jajaipur.com/Member_List.aspx"
    
      def start_requests(self):
        yield scrapy.Request(url=self.url, callback=self.parse_form_0_5)
    
      def parse_form_0_5(self, response):
        selector = scrapy.Selector(response=response)
        VIEWSTATE = selector.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        EVENTVALIDATION = selector.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()
    
        # It's fine to use this method from page 1 to page 5
        formdata = {
          # change the target page here: ctl03 requests page 4
          "__EVENTTARGET": "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl03$Pagingbtn",
          "__EVENTARGUMENT": "",
          "__VIEWSTATE": VIEWSTATE,
          "__EVENTVALIDATION": EVENTVALIDATION,
        }
        yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse_0_5)
    
        # After page 5, you should try this
        # get page 6
        formdata["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl05$Pagingbtn"
        yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse, meta={"PAGE": 6})
    
      def parse(self, response):
        # use response.meta to decide when to stop
        currPage = response.meta["PAGE"]
        if currPage == 15:
          return
    
        # extract names here
        selector = scrapy.Selector(response=response)
        names = selector.xpath('//*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()').extract()
        self.logger.error(names)
    
        # parse VIEWSTATE and EVENTVALIDATION again, 
        # which contain current page
        VIEWSTATE = selector.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        EVENTVALIDATION = selector.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()
    
        # get next page
        formdata = {
          # ctl06 is one page ahead, ctl07 is two pages ahead, ...
          "__EVENTTARGET": "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn",
          "__EVENTARGUMENT": "",
          "__VIEWSTATE": VIEWSTATE,
          "__EVENTVALIDATION": EVENTVALIDATION,
        }
        yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse, meta={"PAGE": currPage+1})
    
      def parse_0_5(self, response):
        selector = scrapy.Selector(response=response)
        # only extract the names
        names = selector.xpath('//*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()').extract()
        self.logger.error(names)