python beautifulsoup scraper

Manipulating BeautifulSoup's ResultSet list object


I am trying to extract two pieces of data: 1) the value of each option element's "value" attribute (e.g. "01000.html" below), and 2) the string inside the <option></option> tags (e.g. "Alabama"). I could find little information on the ResultSet list object that is created when I use

    import urllib2
    from bs4 import BeautifulSoup

    url = 'http://quickfacts.census.gov/qfd/states/'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    state_list = soup.find_all("option")

to extract the list of states from the US Census QFD page's drop-down menu (itself a <select> element with these options).

Big picture: I was trying to loop through all the counties in the US using a simple counter i, but the counties and states are apparently not numbered uniformly. I am therefore trying to build a list of these options so that I can loop through the "value" attributes (which become part of the URL) together with the state names (strings).

state_list

[<option value="01000.html">Alabama</option>,
 <option value="02000.html">Alaska</option>,
 <option value="04000.html">Arizona</option>,
 <option value="05000.html">Arkansas</option>,
 <option value="06000.html">California</option>,
 <option value="08000.html">Colorado</option>,
 <option value="09000.html">Connecticut</option>,
 <option value="10000.html">Delaware</option>,
 <option value="11000.html">District of Columbia</option>,
 <option value="12000.html">Florida</option>,
 <option value="13000.html">Georgia</option>,
 <option value="15000.html">Hawaii</option>,
 <option value="16000.html">Idaho</option>,
 <option value="17000.html">Illinois</option>,
 <option value="18000.html">Indiana</option>,
 <option value="19000.html">Iowa</option>,
 <option value="20000.html">Kansas</option>,
 <option value="21000.html">Kentucky</option>,
 <option value="22000.html">Louisiana</option>,
 <option value="23000.html">Maine</option>,
 <option value="24000.html">Maryland</option>,
 <option value="25000.html">Massachusetts</option>,
 <option value="26000.html">Michigan</option>,
 <option value="27000.html">Minnesota</option>,
 <option value="28000.html">Mississippi</option>,
 <option value="29000.html">Missouri</option>,

(etc...)

Solution

  • Each item in the ResultSet is a Tag. You can read a tag's attributes with dictionary-style indexing (state['value']) and get the text inside the tag with the .text property.

    for state in state_list:
        # state['value'] is e.g. "01000.html"; split(".")[0] keeps "01000"
        print state['value'].split(".")[0], state.text
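
    If the goal is to loop over the state pages afterwards, the same idea can be taken one step further by collecting the pairs into a list and joining each value onto the base URL. Below is a minimal Python 3 sketch of that. It parses a small inline sample instead of fetching the live page, and the names extract_states, SAMPLE_HTML, BASE_URL and state_urls are made up for illustration. It also assumes that each state page lives at the base URL plus the option's value, which is only inferred from the question's remark that the value becomes part of the URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL taken from the question. Whether each state's page really lives
# at BASE_URL + value is an assumption.
BASE_URL = "http://quickfacts.census.gov/qfd/states/"

# A small stand-in for the drop-down menu on the real page.
SAMPLE_HTML = """
<select>
  <option value="01000.html">Alabama</option>
  <option value="02000.html">Alaska</option>
  <option value="04000.html">Arizona</option>
</select>
"""


def extract_states(html):
    """Return a list of (value, name) pairs, one per <option> with a value."""
    soup = BeautifulSoup(html, "html.parser")
    states = []
    for option in soup.find_all("option"):
        # .get() returns None for a missing attribute instead of raising KeyError
        value = option.get("value")
        if value:
            states.append((value, option.get_text(strip=True)))
    return states


states = extract_states(SAMPLE_HTML)

# Map each state name to the full URL built from its option value.
state_urls = {name: urljoin(BASE_URL, value) for value, name in states}

for name, url in state_urls.items():
    print(name, url)
```

    Using option.get("value") rather than option["value"] skips any placeholder option that has no value attribute (for example a "Select a state" entry, if the page has one) instead of failing on it.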