
Render a page that uses a frameset


I'm using Scrapy + Splash to crawl sites for my university. Some of the pages are ancient and use techniques I'm not familiar with. I noticed that a few sites are not rendered completely. All of the incomplete pages use <frameset> instead of a traditional <body>. In the Splash GUI, the page appears to render completely (I can see the snapshot), but the returned HTML doesn't contain the content from the frame src URLs. Here's some code that illustrates the issue:

import scrapy
from scrapy_splash import SplashRequest

class Frameset(scrapy.Spider):

    name = 'frameset'

    def start_requests(self):
        yield SplashRequest(
            'http://www.cs.odu.edu/~cs411/Summer03/AquaTrac/',
            endpoint = 'render.json',
            args = { 
                'iframes': 1,
                'html': 1,
                'timeout': 10, 
            }   
        )   

        ##yield scrapy.Request(
        ##    'http://www.cs.odu.edu/~cs411/Summer03/AquaTrac/',
        ##    meta = {
        ##        'splash': {
        ##            'endpoint': 'render.json',
        ##            'args': {
        ##                'iframes': 1,
        ##                'html': 1,
        ##                'timeout': 5,
        ##            }
        ##        }
        ##    }
        ##) 

    def parse(self, response):
        print(response.xpath('//html').extract())

The page renders properly, but this is all the HTML that is returned:

<html><head><title>« AquaTrac »</title>
</head><frameset rows="120,2,25,2,*,2,25" framespacing="0" frameborder="NO" border="0">
<frame name="banner" scrolling="no" noresize="" src="banner.htm">
<frame name="space" scrolling="no" noresize="" src="about:blank">
<frame name="links" scrolling="no" noresize="" src="links.htm">
<frame name="space" scrolling="no" noresize="" src="about:blank">
<frame name="main" scrolling="auto" noresize="" src="main.htm">
<frame name="space" scrolling="no" noresize="" src="about:blank">
<frame name="info" scrolling="no" noresize="" src="info.htm">
</frameset>
</html>

If possible, I want to get all the HTML in one request instead of making a separate request for each frame src. If you use developer mode in Chrome or Firefox, you'll see the entire HTML, including the content from each frame src. Judging by the snapshot Splash generates, Splash should have the entire HTML as well. Is there a way to get all the HTML in a single request using Splash and Scrapy?


Solution

  • Use the render.json endpoint with the iframes option, then read each frame's HTML from the childFrames list in the decoded JSON response:

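    def start_requests(self):
        # root_url is assumed to be defined on the spider
        yield SplashRequest(
            self.root_url,
            self.parse,  # the callback must match the method defined below
            endpoint='render.json',
            args={
                'iframes': 1,
                'html': 1,
                'timeout': 90,
            },
        )

    def parse(self, response):
        # response.data holds the decoded render.json result
        for frame in response.data['childFrames']:
            frame_html = frame['html']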
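
If any of the frames contain frames of their own, the loop above only reaches the first level. The following is a minimal sketch of a recursive walk. It assumes each entry in childFrames may carry its own childFrames list alongside html. The helper name collect_frame_html and the sample dictionary are made up for illustration; in a spider you would pass response.data instead.

```python
def collect_frame_html(frame):
    """Return the HTML of every frame nested under `frame`, depth-first."""
    collected = []
    # .get() tolerates frames without a childFrames key (assumed optional)
    for child in frame.get("childFrames", []):
        html = child.get("html")
        if html:
            collected.append(html)
        # Descend into frames nested inside this frame
        collected.extend(collect_frame_html(child))
    return collected


# Hand-written stand-in for response.data, not real Splash output
sample = {
    "html": "<frameset>...</frameset>",
    "childFrames": [
        {"html": "<html>banner</html>", "childFrames": []},
        {"html": "<html>main</html>", "childFrames": [
            {"html": "<html>nested</html>", "childFrames": []},
        ]},
    ],
}
frames = collect_frame_html(sample)
```

Inside parse, this would be called as collect_frame_html(response.data), which leaves the top-level frameset markup in response.data['html'] and returns the frame documents as a flat list.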