Tags: python, reactjs, scrapy, screen-scraping, scrapy-splash

Scrapy-splash not rendering dynamic content from a certain react-driven site


I am curious whether Splash can get the dynamic job content from this page: https://nreca.csod.com/ux/ats/careersite/4/home?c=nreca#/requisition/182

For Splash to receive the URL fragment, you have to use a SplashRequest. For it to handle the cookies set by JavaScript, I have had to use a Lua script. Below are my environment, script, and Scrapy code.
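To illustrate why the fragment matters here: the part after `#` is never sent to the server by a plain HTTP client; it is only interpreted by the page's JavaScript router. A minimal sketch using only the standard library (the variable names are illustrative, not part of my spider):

```python
from urllib.parse import urldefrag

JOB_URL = "https://nreca.csod.com/ux/ats/careersite/4/home?c=nreca#/requisition/182"

# urldefrag splits a URL into the part sent over HTTP and the client-side fragment.
sent_to_server, fragment = urldefrag(JOB_URL)

print(sent_to_server)  # what a plain GET actually requests
print(fragment)        # what only the browser-side router sees
```

The requisition ID lives entirely in the fragment, so the rendering browser has to be given the full URL for the job to load.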

The site seems to render in three steps:

  1. Basically empty HTML with a script tag.
  2. The script from step 1 runs, generates the site header/footer, and retrieves another script.
  3. The script from step 2 runs and, in conjunction with a cookie set by JavaScript, retrieves the dynamic content (the job I want to scrape).

If you do a simple GET on the URL (e.g. in Postman), you will see only the step 1 content. With Splash I only get the result of step 2 (header/footer). I do see the JS cookies in response.cookiejar.

I cannot get the dynamic job content (step 3) to render.

Environment:

scrapy 1.3.3 scrapy-splash 0.72 settings

    script = """
        function main(splash)
          splash:init_cookies(splash.args.cookies)
          assert(splash:go{
            splash.args.url,
            headers=splash.args.headers,
            http_method=splash.args.http_method,
            body=splash.args.body,
            })
          assert(splash:wait(15))

          local entries = splash:history()
          local last_response = entries[#entries].response
          return {
            url = splash:url(),
            headers = last_response.headers,
            http_status = last_response.status,
            cookies = splash:get_cookies(),
            html = splash:html(),
          }
        end
    """

    return SplashRequest('https://nreca.csod.com/ux/ats/careersite/4/home?c=nreca#/requisition/182', 
        self.parse_detail, 
        endpoint='execute',
        cache_args=['lua_source'],
        args={
            'lua_source': script,
            'wait': 10,  # note: not read by the script above, which hard-codes splash:wait(15)
            'headers': {'User-Agent': 'Mozilla/5.0'}
        },
    )

Solution

  • This is most likely caused by Splash running in private browsing mode by default, which blocks access to window.localStorage. Pages that rely on localStorage then tend to throw JavaScript exceptions and stop rendering. Try starting Splash with the --disable-private-mode option, or see this documentation entry: http://splash.readthedocs.io/en/stable/faq.html#disable-private-mode.
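    If restarting Splash with that flag is not convenient, the same FAQ entry describes turning private mode off per request from the Lua script through the `splash.private_mode_enabled` attribute. Below is a minimal sketch of the question's script with that change, assuming a Splash version that supports the attribute. The names `SCRIPT_NO_PRIVATE_MODE` and `build_splash_args` are hypothetical helpers, and the script reads the wait time from `splash.args.wait` instead of hard-coding it.

```python
# Sketch only: the question's Lua script with private mode disabled per request.
SCRIPT_NO_PRIVATE_MODE = """
function main(splash)
  -- Turn off private mode so the page can use window.localStorage.
  -- Set here, before navigating with splash:go.
  splash.private_mode_enabled = false
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  -- Wait time comes from the 'wait' arg supplied by the spider.
  assert(splash:wait(splash.args.wait))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""


def build_splash_args(wait=15, user_agent="Mozilla/5.0"):
    """Return the dict to pass as SplashRequest(..., args=...) (hypothetical helper)."""
    return {
        "lua_source": SCRIPT_NO_PRIVATE_MODE,
        "wait": wait,
        "headers": {"User-Agent": user_agent},
    }
```

    In the spider this would be used as `SplashRequest(url, self.parse_detail, endpoint='execute', cache_args=['lua_source'], args=build_splash_args())`.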