python-2.7 web-scraping beautifulsoup

Problems Scraping a Page With Beautiful Soup


I am using Beautiful Soup to try and scrape a page.

I am trying to follow this tutorial.

I am trying to get the contents of the following page after submitting a Stock Ticker Symbol:

http://www.cboe.com/delayedquote/quotetable.aspx

The tutorial is for a page that uses the "GET" method, but my page uses "POST". I wonder if that is part of the problem?

I want to use the first text box, under where it says:

“Enter a Stock or Index symbol below for delayed quotes.”

Relevant code:

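import urllib
import urllib2

# Send a browser-like User-Agent header with the request
user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }

# Name of the symbol text box on the form, mapped to the ticker to look up
values = {'ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol' : 'IBM' }
data = urllib.urlencode(values)
# Passing data makes urllib2 send a POST instead of a GET
request = urllib2.Request("http://www.cboe.com/delayedquote/quotetable.aspx", data, headers)
response = urllib2.urlopen(request)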

The call does not fail, but I do not get a set of options and prices returned to me like I do when I run the page interactively. I get a bunch of garbled HTML.

Thanks in advance!


Solution

  • Ok - I think I figured out the problem (and found another). I decided to switch to 'mechanize' from 'urllib2'. Unfortunately, I kept having problems getting the data. Finally, I realized that there are two 'submit' buttons, so I tried passing the name parameter when submitting the form. That did the trick as far as getting the correct response.

    However, the next problem was that I could not get BeautifulSoup to parse the HTML and find the necessary tags. A brief Google search revealed others having similar problems. So, I gave up on BeautifulSoup and just did a basic regex on the HTML. Not as elegant as BeautifulSoup, but effective.

    Ok - enough speechifying. Here's what I came up with:

    import mechanize
    import re
    
    br = mechanize.Browser()
    url = 'http://www.cboe.com/delayedquote/quotetable.aspx'
    br.open(url)
    br.select_form(name='aspnetForm')
    br['ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol'] = 'IBM'
    # here's the key step that was causing the trouble - pass the name parameter
    # for the button when calling submit
    response = br.submit(name="ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$btnSubmit")
    data = response.read()
    
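    # Capture the price (digits with two decimals) that follows the "Bid" label.
    # re.I is the only flag needed: the pattern has no ^ or $ anchors, so
    # re.MULTILINE has no effect.
    match = re.search(r'Bid</font><span>&nbsp;\s*([0-9]{1,4}\.[0-9]{2})', data, re.I)
    if match:
        print match.group(1)
    else:
        print "There was a problem retrieving the quote"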
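
The regex step above can be tried offline, without hitting the site. What follows is only a sketch under assumptions: `SAMPLE_HTML` and the `extract_bid` helper are hypothetical names, and the fragment merely imitates the `Bid</font><span>&nbsp;` markup that the pattern expects. The live page's HTML may differ or may have changed.

```python
import re

# Hypothetical fragment imitating the markup the pattern targets;
# the price is made up and the real page's HTML may differ.
SAMPLE_HTML = '<font>Bid</font><span>&nbsp;\n 182.35</span>'

# Same pattern as in the answer, compiled once for reuse.
BID_PATTERN = re.compile(
    r'Bid</font><span>&nbsp;\s*([0-9]{1,4}\.[0-9]{2})', re.I)


def extract_bid(html):
    """Return the bid price as a string, or None if the pattern is absent."""
    match = BID_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_bid(SAMPLE_HTML))
```

Returning `None` on a miss, rather than printing an error inside the helper, lets the caller decide how to handle a page whose layout no longer matches the pattern.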