
Following the Next page link through JavaScript in Scrapy (Python) with Splash?


My intention is to follow the Next link whose target is href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT')". As an example I am using [this url][1]. At the bottom of that page there is a Next link, and if you inspect its HTML you will see it is written as href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT')", while other links have # as their href. I am just trying to collect those href values, even when they are #.

    from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

    from scrapy.http import Request
    from scrapy.selector import Selector

    def parse(self, response):
        selector = Selector(response)

        # One request per numeric job id found in the listing.
        for job_id in selector.css('span.PSEDITBOX_DISPONLY').re(r'.*>(\d+)<.*'):
            abc = 'xxxx'
            yield Request(abc, callback=self.parse_listing_page, dont_filter=True)

        # Try to follow the Next link by passing it back to parse().
        nav_links = selector.css('div#win0divHRS_APPL_WRK_HRS_LST_NEXT a').extract()
        print(nav_links)
        for nav_link in nav_links:
            yield Request(urljoin('xxx', nav_link), self.parse, dont_filter=True)

When I run the code above I get "HTTP status code is not handled or not allowed", that is, no results. Can anyone tell me how to follow Next through that href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT')" function, and why the result is empty? I am also seeing something weird in the HTML: for example, on one of the pages reached through Next the anchor tag is "<a id="HRS_APPL_WRK_HRS_LST_NEXT" class="PSHYPERLINK" href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT');" tabindex="74" ptlinktgt="pt_replace" name="HRS_APPL_WRK_HRS_LST_NEXT"></a>"

Thanks in advance

Output:

[u'<a name="HRS_APPL_WRK_HRS_LST_NEXT" id="HRS_APPL_WRK_HRS_LST_NEXT" ptlinktgt="pt_replace" tabindex="74" href="javascript:submitAction_win0(document.win0,\'HRS_APPL_WRK_HRS_LST_NEXT\');" class="PSHYPERLINK">Next</a>']


Solution

  • Scrapy does not execute JavaScript by itself, but there are a few tools you can use to deal with JavaScript-driven pages.

    1. Splash - Splash is a JavaScript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python using Twisted and Qt.
    2. Scrapyjs - This library provides Scrapy-JavaScript integration through two different mechanisms: a Scrapy download handler and a Scrapy downloader middleware.
    3. SpiderMonkey - Executes arbitrary JavaScript code from Python. Allows you to reference arbitrary Python objects and functions in the JavaScript VM.
    4. spynner - Spynner is a stateful programmatic web browser module for Python. It is based upon PyQt and WebKit. It supports JavaScript, AJAX, and every other technology that WebKit is able to handle (Flash, SVG, ...). Spynner takes advantage of jQuery, a powerful JavaScript library that makes interaction with pages and event simulation really easy.
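
To make the Splash option more concrete, here is a minimal sketch of one way it could be used for the Next link from the question. It pulls the JavaScript call out of the javascript: href and builds the JSON body for a POST to Splash's /execute endpoint, where a small Lua script loads the page, runs that call with splash:runjs and returns the re-rendered HTML. The helper names (js_call_from_href, splash_execute_payload), the js_call argument name, the example URL and the fixed wait times are all assumptions made for illustration, not something taken from the question's site.

```python
import json
import re

# Matches hrefs such as
# javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT');
JS_HREF = re.compile(r"^javascript:\s*(?P<call>\w+\(.*\))\s*;?\s*$")

# Lua script for Splash's /execute endpoint: load the page, run the
# JavaScript behind the link, wait, then return the re-rendered HTML.
# The fixed waits are guesses and may need tuning for a real site.
LUA_CLICK_NEXT = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(1.0))
  assert(splash:runjs(splash.args.js_call))
  assert(splash:wait(2.0))
  return {html = splash:html()}
end
"""


def js_call_from_href(href):
    """Return the JavaScript call inside a javascript: href, or None."""
    match = JS_HREF.match(href.strip())
    return match.group("call") if match else None


def splash_execute_payload(page_url, href):
    """Build the JSON body for a POST to Splash's /execute endpoint."""
    js_call = js_call_from_href(href)
    if js_call is None:
        raise ValueError("not a javascript: link: %r" % href)
    # Extra keys in the body ("url", "js_call") show up as splash.args in Lua.
    return {"lua_source": LUA_CLICK_NEXT, "url": page_url, "js_call": js_call}


if __name__ == "__main__":
    # Hypothetical page URL; the href is the one shown in the question.
    href = "javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT');"
    payload = splash_execute_payload("http://example.com/jobs", href)
    print(json.dumps(payload, indent=2))
```

The payload would be sent as JSON to a running Splash instance, and the html field of the response could then be parsed with a Scrapy Selector. Each /execute call starts from a fresh page load, so this sketch only reaches the second page; going further would probably mean repeating the runjs and wait steps inside the Lua script.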