python, web-scraping, scrapy, scrapy-splash

How can I handle pagination with Scrapy and Splash, if the href of the button is javascript:void(0)


I am trying to scrape the names and links of universities from this website: https://www.topuniversities.com/university-rankings/world-university-rankings/2021. I ran into a problem with pagination: the href of the button that leads to the next page is javascript:void(0), so I cannot reach the next page with scrapy.Request() or response.follow(). Is there any way to handle pagination like this?

[screenshot of the website]

[screenshot of the a tag and its href]

The URL of this website has no query parameters, and it stays the same when the next-page button is clicked, so I cannot handle pagination by altering the URL either.

The code snippet below can only fetch the names and links of the universities on the first and second pages:

import scrapy
from scrapy_splash import SplashRequest


class UniSpider(scrapy.Spider):
    name = 'uni'
    allowed_domains = ['www.topuniversities.com']

    script = """
    function main(splash, args)
      splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
      splash.private_mode_enabled = false
      assert(splash:go(args.url))
      assert(splash:wait(3))

      return {
        html = splash:html()
      }
    end
    """

    next_page = """
    function main(splash, args)
        splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
        splash.private_mode_enabled = false
        assert(splash:go(args.url))
        assert(splash:wait(3))

        local btn = assert(splash:jsfunc([[
          function() {
            document.querySelector("#alt-style-pagination a.page-link.next").click()
          }
        ]]))
        assert(splash:wait(2))
        btn()

        splash:set_viewport_full()
        assert(splash:wait(3))

        return {
          html = splash:html()
        }
    end
    """

    def start_requests(self):
        yield SplashRequest(
            url="https://www.topuniversities.com/university-rankings/world-university-rankings/2021",
            callback=self.parse, endpoint="execute",
            args={"lua_source": self.script})

    def parse(self, response):
        for uni in response.css("a.uni-link"):
            uni_link = response.urljoin(uni.css("::attr(href)").get())
            yield {
                "name": uni.css("::text").get(),
                "link": uni_link
            }

        yield SplashRequest(
            url=response.url,
            callback=self.parse, endpoint="execute",
            args={"lua_source": self.next_page}
        )


Solution

  • You don't need Splash for this simple website.

    Try loading the following link instead:

    https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt

    It contains all the universities: the website loads this JSON file only once and then displays the information with client-side pagination.

    Here is a short version of the code (not using Scrapy):

    from requests import get
    from json import loads
    from lxml.html import fromstring
    
    url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt"
    html = get(url)
    
    ## another approach for loading the json
    # jdata = loads(html.content.decode())
    
    jdata = html.json()
    for x in jdata['data']:
        core_id = x['core_id']
        country = x['country']
        city = x['city']
        guide = x['guide']
        nid = x['nid']
        title = x['title']
        logo = x['logo']
        score = x['score']
        rank_display = x['rank_display']
        region = x['region']
        stars = x['stars']
        recm = x['recm']
        dagger = x['dagger']
    
        ## the 'title' field is an HTML snippet like '<a href="...">Name</a>',
        ## so parse it to get the plain-text name (the link you asked about
        ## is the href of that same a tag)
        soup = fromstring(title)
        title = soup.xpath(".//a/text()")[0]
    
        print(title)
    

    The above code prints the 'title' of each university; try saving it in a CSV/Excel file along with the other available columns. The result looks like:

    Massachusetts Institute of Technology (MIT) 
    Stanford University
    Harvard University
    California Institute of Technology (Caltech)
    University of Oxford
    ETH Zurich - Swiss Federal Institute of Technology
    University of Cambridge
    Imperial College London