python web-crawler urllib2 dynamic-pages

How to crawl a dynamic web page whose API URL returns null?


I have a task to crawl all Pulitzer Prize winners, and I found a page that has everything I want: https://www.pulitzer.org/prize-winners-by-year/2018.

But I ran into the following problems:

Problem 1: How do I crawl a dynamic page? I use Python's urllib2.urlopen to get the page's content, but for this dynamic page it doesn't return the real content.

Problem 2: I then found an API URL in the browser's devtools: https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json. But when I send a GET request to it with urllib2.urlopen, I always get null. Why does this happen, and how can I handle it?

If this is too basic a question, please suggest some keywords so that I can look them up on Google.

Thanks in advance!


Solution

  • One way to handle this is to create a session with the requests module. The session carries over the session details (such as cookies) that the next API call needs. You also have to pass one more header, Referer, which tells the API which year you are looking for.

    import requests

    # A session keeps the cookies from the first request for the API call.
    s = requests.Session()
    url = "https://www.pulitzer.org/prize-winners-by-year/2017"
    resp1 = s.get(url)  # visit the year page first
    # The Referer header tells the API which year page the call comes from.
    headers = {'Referer': url}
    api = "https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json"
    data = s.get(api, headers=headers)
    

    Now you can extract the data from the response in data, for example by calling data.json().
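If you would rather stay close to urllib2 (which became urllib.request in Python 3), the same idea can be sketched with the standard library only: a cookie-aware opener plays the role of the session, and the Referer header is set by hand. The helper names build_opener_with_cookies, build_api_request and fetch_winners are made up for illustration, and the pairing of a year with an API id (166 here, taken from the question) is an assumption. Whether the site still behaves this way is not guaranteed.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE = "https://www.pulitzer.org"


def build_opener_with_cookies():
    """Return an opener that keeps cookies between requests, like a session."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )


def build_api_request(year, api_id):
    """Build the API request with the Referer header set to the year page.

    The year/api_id pairing is an assumption; look up the real id in devtools.
    """
    referer = f"{BASE}/prize-winners-by-year/{year}"
    api_url = f"{BASE}/cache/api/1/winners/year/{api_id}/raw.json"
    return urllib.request.Request(api_url, headers={"Referer": referer})


def fetch_winners(year, api_id, timeout=30):
    """Visit the year page first, then call the API with the same cookies."""
    opener = build_opener_with_cookies()
    # First request: lets the server set whatever cookies it wants.
    opener.open(f"{BASE}/prize-winners-by-year/{year}", timeout=timeout).close()
    # Second request: same cookie jar, plus the Referer header.
    with opener.open(build_api_request(year, api_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_winners(2017, 166) would mirror the requests example above. If it still returns null, comparing the request headers shown in devtools with the ones sent here is a reasonable next step.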