python web-scraping wget mechanize

POST URL Encoded vs Line-based text data via Python Requests


I'm trying to scrape some data from a website, but I can't get the POST to work. The server responds as though I never sent the input data ("appnote").

When I examine the POST data, it looks roughly the same, except that the real web form's POST is labeled "URL Encoded" and lists each form input, whereas mine is labeled "Line-based text data".

Here's my code. The search term (appnote) and the Search button (Search) are the most relevant fields:

import requests
import cookielib


jar = cookielib.CookieJar()
url = 'http://www.vivotek.com/faq/'
headers = {'content-type': 'application/x-www-form-urlencoded'}

post_data = {#'__EVENTTARGET':'',
             #'__EVENTARGUMENT':'',
             '__LASTFOCUS':'',
             '__VIEWSTATE':'',
             '__VIEWSTATEGENERATOR':'',
             '__VIEWSTATEENCRYPTED':'',
             '__PREVIOUSPAGE':'',
             '__EVENTVALIDATION':'',
             'ctl00$HeaderUc1$LanguageDDLUc1$ddlLanguage':'en',
             'ctl00$ContentPlaceHolder1$CategoryDDLUc1$DropDownList1':'-1',
             'ctl00$ContentPlaceHolder1$ProductDDLUc1$DropDownList1':'-1',
             'ctl00$ContentPlaceHolder1$Content':'appnote',
             'ctl00$ContentPlaceHolder1$Search':'Search'
            }
response = requests.get(url, cookies=jar)

response = requests.post(url, cookies=jar, data=post_data, headers=headers)

print(response.text)

(The Wireshark captures of both requests were attached as images.)

I also tried it using wget with the same results.


Solution

  • The main problem is that you are not setting the values of the important hidden fields, such as __VIEWSTATE; your code sends them as empty strings.

    For this to work with requests, you need to fetch the page first, parse its HTML, and copy the current values of those inputs into the POST data.

    Here's a solution using the BeautifulSoup HTML parser and requests:

    from bs4 import BeautifulSoup
    import requests
    
    url = 'http://www.vivotek.com/faq/'
    query = 'appnote'
    
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
    
    session = requests.Session()
    response = session.get(url, headers=headers)
    
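    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(response.content, 'html.parser')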
    
    post_data = {'__EVENTTARGET':'',
                 '__EVENTARGUMENT':'',
                 '__LASTFOCUS':'',
                 '__VIEWSTATE': soup.find('input', id='__VIEWSTATE')['value'],
                 '__VIEWSTATEGENERATOR': soup.find('input', id='__VIEWSTATEGENERATOR')['value'],
                 '__VIEWSTATEENCRYPTED': '',
                 '__PREVIOUSPAGE': soup.find('input', id='__PREVIOUSPAGE')['value'],
                 '__EVENTVALIDATION': soup.find('input', id='__EVENTVALIDATION')['value'],
    
                 'ctl00$HeaderUc1$LanguageDDLUc1$ddlLanguage': 'en',
                 'ctl00$ContentPlaceHolder1$CategoryDDLUc1$DropDownList1': '-1',
                 'ctl00$ContentPlaceHolder1$ProductDDLUc1$DropDownList1': '-1',
                 'ctl00$ContentPlaceHolder1$Content': query,
                 'ctl00$ContentPlaceHolder1$Search': 'Search'
                }
    
    response = session.post(url, data=post_data, headers=headers)
    
    soup = BeautifulSoup(response.content, 'html.parser')
    for item in soup.select('a#ArticleShowLink'):
        print(item.text.strip())
    

    This prints the results for the appnote query:

    How to troubleshoot when you can't watch video streaming?
    Recording performance benchmarking tool
    ...
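
The hidden-field step can also be done generically instead of naming each field. What follows is only a minimal sketch, not part of the original answer. It swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies, and it assumes that each hidden input carries a name attribute (that is the key the browser submits). HiddenInputCollector, build_form_body, and SAMPLE_HTML are hypothetical names and data made up for illustration; the sample values are not real __VIEWSTATE contents. The sketch also prints the URL-encoded body, which is the form a browser sends and what the question's capture labeled "URL Encoded".

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<input ... />) are routed here as well.
        if tag != 'input':
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get('type') or '').lower() == 'hidden'
        if is_hidden and attributes.get('name'):
            # A hidden input without a value is submitted as an empty string.
            self.fields[attributes['name']] = attributes.get('value') or ''


def build_form_body(page_html, visible_fields):
    """Merge the page's hidden fields with the fields the user would fill in.

    Returns the combined dict and its application/x-www-form-urlencoded form.
    """
    collector = HiddenInputCollector()
    collector.feed(page_html)
    collector.close()
    post_data = dict(collector.fields)
    post_data.update(visible_fields)
    return post_data, urlencode(post_data)


# Made-up stand-in for the page returned by the initial GET request.
SAMPLE_HTML = """
<form method="post" action="./">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4/abc=" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz+123" />
  <input type="text" name="ctl00$ContentPlaceHolder1$Content" />
</form>
"""

post_data, body = build_form_body(
    SAMPLE_HTML,
    {'ctl00$ContentPlaceHolder1$Content': 'appnote'},
)
print(body)
```

With a real page, response.text from the initial session.get call would take the place of SAMPLE_HTML, and the returned dict could be passed as data= to session.post, which performs the same URL encoding itself.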