python web-scraping beautifulsoup python-requests mechanicalsoup

Data Scraping from pogdesign.co.uk/cat/


I am trying to scrape some data from http://www.pogdesign.co.uk/cat/.

I want to get the channel and air time of each program, but by default they are not displayed. They appear only after I manually configure the settings and save them.

From the Network tab in Chrome's developer tools, it appears that clicking 'Save Settings' sends a POST request with the relevant form parameters (e.g. 's_networks': 'on'), followed by a GET request that retrieves the HTML page with the channel and air time displayed.

I tried to emulate this process (a POST request followed by a GET request) with both Python's requests package and the mechanicalsoup package.

requests:

import requests

s = requests.Session()
s.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
s.get('http://www.pogdesign.co.uk/cat/')

mechanicalsoup:

import mechanicalsoup

mcs = mechanicalsoup.Browser()
res_post = mcs.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
res_get = mcs.get('http://www.pogdesign.co.uk/cat/')

Yet the response I receive does not contain the channel and the air-time data.

The only difference I noticed is that the browser's POST request returns status code 302, whereas the POST request sent from my Python code returns 200.
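One possible explanation for the 302 versus 200 difference is that requests follows redirects by default, so the status code on the returned response belongs to the final page, while the intermediate 302 is kept in `response.history`. The sketch below illustrates this against a small local stand-in server. The `DemoHandler` class and the local URL are hypothetical and only imitate a POST that redirects; they say nothing about how the real site behaves.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: POST answers 302, GET answers 200."""

    def do_POST(self):
        # Consume the form body, then redirect back to the same page.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(302)
        self.send_header("Location", "/cat/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = b"<html>calendar</html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port for the local demo server.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/cat/"

with requests.Session() as s:
    s.trust_env = False  # ignore proxy environment variables for the local demo
    # Default behaviour: the redirect is followed automatically.
    followed = s.post(url, data={"s_networks": "on"})
    # With allow_redirects=False the redirect response itself is returned.
    raw = s.post(url, data={"s_networks": "on"}, allow_redirects=False)

followed_status = followed.status_code
history_statuses = [r.status_code for r in followed.history]
raw_status = raw.status_code

server.shutdown()
server.server_close()

print(followed_status, history_statuses, raw_status)
```

If the real site behaves the same way, checking `res.history` or passing `allow_redirects=False` on the POST would show whether the server actually answered with a 302.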


Solution

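  • This is most likely because the site ties the saved settings to cookies that identify the user. Try the following code, which sends the cookies from your browser session together with the settings form data: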

    import requests
    
    s = requests.Session()
    data = {
        "style": 3,
        "timezone": "GMT",
        "s_numbers": "on",
        "s_epnames": "on",
        "s_airtimes": "on",
        "s_popups": "on",
        "s_wunwatched": "on",
        "s_sortbyname": "on",
        "s_weekstyle": "on",
        "s_24hr": "on",
        # Note: requests omits form fields whose value is None, so this key is not sent.
        "settings": None
    }
    # Copy these cookie values from your browser's developer tools.
    cookies = {
        "CAT_UID": "",
        "PHPSESSID": "",
        "_ga": "",
        "_gid": "",
        "_gat": ""
    }
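    # Send the same cookies with both requests so the saved settings apply to the GET.
    post = s.post('http://www.pogdesign.co.uk/cat/', data=data, cookies=cookies)
    text = post.text
    get = s.get('http://www.pogdesign.co.uk/cat/', cookies=cookies)
    text1 = get.text  # HTML that should now include the channel and air-time data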
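As a variation, the cookies can be stored on the session once instead of being passed to every call. The sketch below wraps the POST-then-GET sequence in a helper and runs it against a local stand-in server that remembers settings per `PHPSESSID` cookie. The helper name `fetch_with_settings`, the `StandInHandler` class, the cookie value and the response strings are all hypothetical. The server only imitates the cookie-bound behaviour described above and is not the real site.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

SAVED = set()  # session ids whose settings were "saved" on the stand-in server


class StandInHandler(BaseHTTPRequestHandler):
    """Hypothetical server that remembers settings per PHPSESSID cookie."""

    def _session_id(self):
        # Pull the PHPSESSID value out of the Cookie header, if present.
        for part in self.headers.get("Cookie", "").split(";"):
            name, _, value = part.strip().partition("=")
            if name == "PHPSESSID":
                return value
        return None

    def _reply(self, body):
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Consume the form body and mark this session as configured.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        sid = self._session_id()
        if sid:
            SAVED.add(sid)
        self._reply(b"saved")

    def do_GET(self):
        shown = self._session_id() in SAVED
        self._reply(b"airtimes shown" if shown else b"airtimes hidden")

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_with_settings(url, settings, cookies):
    """POST the settings, then GET the page, sending the same cookies on both."""
    with requests.Session() as s:
        s.trust_env = False  # ignore proxy environment variables for the local demo
        s.cookies.update(cookies)  # stored once, sent with every request
        s.post(url, data=settings)
        return s.get(url).text


server = HTTPServer(("127.0.0.1", 0), StandInHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/cat/"

with_cookie = fetch_with_settings(url, {"s_airtimes": "on"}, {"PHPSESSID": "abc123"})
without_cookie = fetch_with_settings(url, {"s_airtimes": "on"}, {})

server.shutdown()
server.server_close()

print(with_cookie, "/", without_cookie)
```

Against the real site, the same helper would be called with the site's URL, the full settings dictionary and the cookie values copied from the browser.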