
Python Web Scraping; Beautiful Soup


A similar problem was covered in the post "Python web scraping involving HTML tags with attributes", but I haven't been able to apply the same approach to this web page: http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland?

I'm trying to scrape the values of:

  <td class="price city-2">
      NZ$15.62
      <span style="white-space:nowrap;">(AU$12.10)</span>
  </td>
  <td class="price city-1">
      AU$15.82
  </td>

That is, the text of the "price city-2" and "price city-1" cells (NZ$15.62 and AU$15.82), without the surrounding tags.

This is what I currently have:

import urllib2

from BeautifulSoup import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland?"
page = urllib2.urlopen(url)

soup = BeautifulSoup(page)

price2 = soup.findAll('td', attrs = {'class':'price city-2'})
price1 = soup.findAll('td', attrs = {'class':'price city-1'})

for price in price2:
    print price

for price in price1:
    print price

Ideally, I'd also like comma-separated output that starts with the category, taken from:

<th colspan="3" class="clickable">Food</th>

(extracting 'Food'), followed by the item name, taken from:

<td class="item-name">Daily menu in the business district</td>

(extracting 'Daily menu in the business district'), and then the values for price city-2 and price city-1.

So the printout would be:

Food, Daily menu in the business district, NZ$15.62, AU$15.82

Thanks!


Solution

  • I find BeautifulSoup awkward to use. Here is a version based on the webscraping module:

    from webscraping import common, download, xpath
    
    # download html
    D = download.Download()
    html = D.get('http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland')
    
    # extract data
    items = xpath.search(html, '//td[@class="item-name"]')
    city1_prices = xpath.search(html, '//td[@class="price city-1"]')
    city2_prices = xpath.search(html, '//td[@class="price city-2"]')
    
    # display and format
    for item, city1_price, city2_price in zip(items, city1_prices, city2_prices):
        print item.strip(), city1_price.strip(), common.remove_tags(city2_price, False).strip()
    

    Output:

    Daily menu in the business district AU$15.82 NZ$15.62
    Combo meal in fast food restaurant (Big Mac Meal or similar) AU$7.40 NZ$8.16
    1/2 Kg (1 lb.) of chicken breast AU$6.07 NZ$10.25
    ...
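
For readers who would rather stay with Beautiful Soup, here is a minimal sketch of the comma-separated output the question asks for. It is not tested against the live page. It assumes Python 3 and the bs4 package (the question uses Python 2 and the older BeautifulSoup 3 import), and it assumes that each category header and each item sits in its own <tr>. The sample markup and the helper names first_text and extract_rows are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippets quoted in the question. The <table>
# and <tr> wrappers are an ASSUMPTION about how the real page is laid out.
SAMPLE_HTML = """
<table>
  <tr><th colspan="3" class="clickable">Food</th></tr>
  <tr>
    <td class="item-name">Daily menu in the business district</td>
    <td class="price city-1">
        AU$15.82
    </td>
    <td class="price city-2">
        NZ$15.62
        <span style="white-space:nowrap;">(AU$12.10)</span>
    </td>
  </tr>
</table>
"""


def first_text(cell):
    """Return the first non-empty text fragment of a cell.

    This skips the converted price held in the nested <span>.
    """
    return next(cell.stripped_strings, "")


def extract_rows(html):
    """Return (category, item, city-2 price, city-1 price) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    category = ""
    for tr in soup.find_all("tr"):
        # A header row starts a new category (hypothetical row layout).
        header = tr.select_one("th.clickable")
        if header is not None:
            category = header.get_text(strip=True)
            continue
        item = tr.select_one("td.item-name")
        price1 = tr.select_one("td.price.city-1")
        price2 = tr.select_one("td.price.city-2")
        if item is None or price1 is None or price2 is None:
            continue  # not an item row
        rows.append((category,
                     item.get_text(strip=True),
                     first_text(price2),
                     first_text(price1)))
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(", ".join(row))
```

To try it on the real page, pass the downloaded HTML to extract_rows instead of SAMPLE_HTML. If item names can contain commas, writing the rows with the standard csv module would be safer than joining them by hand.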